
Posts

Showing posts from January, 2020

Spark / Hive - org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text

Exception -

java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text
  at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
  at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:547)
  at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
  at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
  at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)

Analysis - This exception may occur because the underlying ORC file has a column of type DOUBLE, whereas the Hive table declares that column as STRING. The error can be rectified by correcting the data type so that the table definition matches the file schema.
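A quick way to confirm the mismatch is to compare the schema stored in the ORC files with the Hive table definition. A minimal spark-shell sketch, assuming hypothetical table and path names (mydb.my_table and its warehouse directory):

// Compare the Hive metastore schema with the schema embedded in the ORC files.
// Table name and path are placeholders - substitute your own.
val tableSchema = spark.table("mydb.my_table").schema
val fileSchema  = spark.read.orc("/warehouse/mydb.db/my_table").schema
tableSchema.printTreeString()   // column declared as string in the table
fileSchema.printTreeString()    // same column stored as double in the files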

Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary

Exception -

Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
  at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
  at org.apache.spark.sql.execution.vectorized.ColumnVector.getUTF8String(ColumnVector.java:645)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)

Analysis - This might occur because of a data type mismatch between the Hive table and the written Parquet file (here, an integer column in the Parquet file is being read as a string).

Solution - Correct the data type so that the Hive table and the Parquet file schema match.
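As with the ORC case above, the Parquet footer schema can be compared with the Hive table definition to spot the mismatched column. A minimal spark-shell sketch with hypothetical names:

// Placeholders - substitute your own table name and location.
spark.read.parquet("/warehouse/mydb.db/my_parquet_table").printSchema()  // schema written in the Parquet files
spark.table("mydb.my_parquet_table").printSchema()                       // schema declared in the Hive metastore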

Apache Kudu comparison with Hive (HDFS Parquet) with Impala & Spark

Kudu offers tight integration with Apache Impala, making it a good, mutable alternative to using HDFS with Apache Parquet. It provides high availability like other Big Data technologies and a structured data model. Unlike Hadoop, which is schema-on-read, Kudu is schema-on-write; that is, the following should be known beforehand - primary key, data types of columns, partition columns, etc. (see the sketch below). It is considered to bridge the gap between Hive & HBase and can integrate with the Hive Metastore. Note - the following document is prepared without considering any future Cloudera Distribution upgrades, assuming Spark 2.2.0.cloudera2, Hive 1.1.0-cdh5.12.2 and Hadoop 2.6.0-cdh5.12.2. Kudu is supported only by Cloudera, so we assume an ongoing Cloudera cluster. Kudu is considered good for - reporting applications where new data must be immediately available for end users, and time-series applications that must support queries across large amounts of historic data while simultaneously
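To illustrate the schema-on-write point above: the primary key, column types, partitioning and replication all have to be declared before any row is written. A minimal, hedged sketch using the Kudu Java client from Scala - the master address, table name, hash partitioning and replica count are illustrative assumptions:

import org.apache.kudu.{ColumnSchema, Schema, Type}
import org.apache.kudu.client.{CreateTableOptions, KuduClient}
import scala.collection.JavaConverters._

// Everything below is fixed up front - this is what "schema on write" means here.
val client = new KuduClient.KuduClientBuilder("kudu-master-host:7051").build()

val columns = List(
  new ColumnSchema.ColumnSchemaBuilder("id", Type.STRING).key(true).build(),      // primary key column
  new ColumnSchema.ColumnSchemaBuilder("name", Type.STRING).nullable(true).build()
).asJava

val options = new CreateTableOptions()
  .addHashPartitions(List("id").asJava, 4)  // partition column and bucket count
  .setNumReplicas(3)

client.createTable("my_kudu_table", new Schema(columns), options)
client.close()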

Kudu table Backup & Restore

There is no in-built feature in Kudu to support backup, but backup/restore can be done using a Spark job, which supports both full and incremental table data. You can use the KuduBackup Spark job to back up one or more Kudu tables. Common flags/options that you can use while taking a backup:
--rootPath: the root path used to output backup data. It accepts any Spark-compatible path.
--kuduMasterAddresses: comma-separated addresses of the Kudu masters.
<table>…: the list of tables that you want to back up.
Example - the following will take a backup of table1 & table2:
spark-submit \
  --class org.apache.kudu.backup.KuduBackup \
  kudu-backup2_2.11-1.10.0.jar \
  --kuduMasterAddresses master1-host \
  --rootPath hdfs:///kudu-backups \
  table1 table2
You can use the KuduRestore Spark job to restore one or more Kudu tables. Common flags/options that you can use to restore tables:
--rootPath: the root path to the backup data.
--kuduMasterAddresses: comma-separated addresses of the Kudu masters.
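A restore invocation mirrors the backup command above; the jar, master address and root path below are simply reused from that example, and the table name is a placeholder:

spark-submit \
  --class org.apache.kudu.backup.KuduRestore \
  kudu-backup2_2.11-1.10.0.jar \
  --kuduMasterAddresses master1-host \
  --rootPath hdfs:///kudu-backups \
  table1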

Spark integration with Apache Kudu

The following code piece depicts the integration of Kudu and Spark. Refer - https://github.com/dinesh028/engineering/blob/master/resources/samples/Spark-Kudu-integration-code.txt

//spark-shell --packages org.apache.kudu:kudu-spark2_2.11:1.10.0
import org.apache.kudu.client._
import org.apache.kudu.spark.kudu.KuduContext
import collection.JavaConverters._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

val arr = new java.util.ArrayList[Row]()
arr.add(Row("jai", "ganesh"))
val arraySchema = new StructType().add("id", StringType, false).add("name", StringType, true)
val df = spark.createDataFrame(arr, arraySchema)
df.printSchema

val kuduContext = new KuduContext("mymaster.devhadoop.wm.com:7051", spark.sparkContext)

//This will create the table but will not insert any data
kuduContext.createTable("ds.my_test_table"
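The excerpt cuts off mid-call. A minimal completion sketch that continues the snippet above - the hash partitioning and replica count are assumptions for illustration, not the original post's values:

//Finish the createTable call: "id" is the key column; partitioning/replicas are illustrative.
kuduContext.createTable("ds.my_test_table", df.schema, Seq("id"),
  new CreateTableOptions().addHashPartitions(List("id").asJava, 3).setNumReplicas(1))

//Insert the DataFrame rows into the Kudu table, then read them back as a DataFrame.
kuduContext.insertRows(df, "ds.my_test_table")
val readBack = spark.read
  .option("kudu.master", "mymaster.devhadoop.wm.com:7051")
  .option("kudu.table", "ds.my_test_table")
  .format("kudu")
  .load()
readBack.show()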

Error integrating Impala with Kudu

Create Table Failed -

W0106 18:18:54.640544 368440 negotiation.cc:307] Unauthorized connection attempt: Server connection negotiation failed: server connection from 172.136.38.157:35678: unauthenticated connections from publicly routable IPs are prohibited. See --trusted_subnets flag for more information.: 172.136.38.157:35678

After setting up Kudu, we can enable it to work with Impala. We can check the cluster status with - kudu cluster ksck <master>

The cluster doesn't have any matching tables
==================
Errors:
==================
error fetching info from tablet servers: Not found: No tablet servers found
FAILED

Also, the tablet server UI is not opening.

Solution - This error might be because the Kudu service has to know about trusted networks, which we can set under Kudu Service Advanced Configuration Snippet (Safety Valve) for gflagfile, Kudu (Service-Wide):
--trusted_subnets=127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,169.254.0

Machine Learning Part 6

In this blog we will see - Spark Graph Frames, and the use of Graph Frames in Machine Learning to execute algorithms like PageRank.

A graph is made up of vertices and edges that connect them.
1. Vertices are objects
2. Edges are relationships

A regular graph is a graph where each vertex has the same number of edges. A directed graph is a graph where the edges have a direction associated with them. Example –
1. Facebook friends – A is a friend of B, and B is a friend of A.
2. Instagram followers – A, B and C are followers of D, but D may not be a follower of either A, B, or C.
3. Websites – Every page is a node and every linking page is an edge. The PageRank algorithm measures the importance of a page by the number of links to the page and the number of links to each linking page (see the sketch after this list).
4. Recommendation engines – Recommendation algorithms can use graphs where the nodes are the users and products, and their respective attributes, and the edges ar
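A minimal GraphFrames sketch of the ideas above - the vertex/edge data, the package coordinates and the PageRank parameters are illustrative assumptions, not code from the original post:

//spark-shell --packages graphframes:graphframes:0.6.0-spark2.2-s_2.11   (pick the build matching your Spark/Scala version)
import org.graphframes.GraphFrame

// Vertices are objects (users); edges are directed relationships (follower -> followed).
val vertices = spark.createDataFrame(Seq(
  ("A", "Alice"), ("B", "Bob"), ("C", "Carol"), ("D", "Dave")
)).toDF("id", "name")
val edges = spark.createDataFrame(Seq(
  ("A", "D", "follows"), ("B", "D", "follows"), ("C", "D", "follows")
)).toDF("src", "dst", "relationship")

val g = GraphFrame(vertices, edges)

// PageRank: vertices with more incoming links from important vertices rank higher.
val ranks = g.pageRank.resetProbability(0.15).maxIter(10).run()
ranks.vertices.select("id", "pagerank").show()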