
Posts

Spark / Hive - org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text

Exception -

java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.DoubleWritable cannot be cast to org.apache.hadoop.io.Text
    at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
    at org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:547)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:426)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)

Analysis - This exception typically occurs because the underlying ORC file stores the column as double, whereas the Hive table declares the same column as string.

Solution - Correct the column data type so the Hive table definition matches the ORC file.
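If the ORC data is genuinely double, one way to apply that fix is to change the Hive column definition. A minimal sketch using beeline, where the database, table and column names (my_db.my_table, amount) and the HiveServer2 URL are placeholders:

# Align the Hive column type with the type actually stored in the ORC files (double).
# Placeholder names and JDBC URL - adjust to your environment.
beeline -u "jdbc:hive2://hiveserver2-host:10000" \
  -e "ALTER TABLE my_db.my_table CHANGE COLUMN amount amount DOUBLE;"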

Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary

Exception -

Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
    at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:44)
    at org.apache.spark.sql.execution.vectorized.ColumnVector.getUTF8String(ColumnVector.java:645)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)

Analysis - This usually happens because of a data type mismatch between the Hive table and the written Parquet file; in this trace, an integer-encoded Parquet column is being read as a string.

Solution - Correct the data type so the Hive table and the Parquet file match.
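One way to confirm the mismatch is to compare the schema written in the Parquet files with what the Hive metastore declares before altering anything. A rough sketch, where the file path, table name and column are placeholders and parquet-tools availability depends on the cluster:

# Inspect the schema actually written in the Parquet file (placeholder path).
parquet-tools schema /path/to/part-00000.parquet

# Compare with the Hive table definition (placeholder name).
hive -e "DESCRIBE my_db.my_table;"

# If, say, the Parquet column is int32 but Hive declares it as string,
# align the Hive definition with the file:
hive -e "ALTER TABLE my_db.my_table CHANGE COLUMN id id INT;"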

Apache Kudu comparison with Hive (HDFS Parquet) using Impala & Spark

Need -
- Tight integration with Apache Impala, making it a good, mutable alternative to using HDFS with Apache Parquet.
- High availability like other Big Data technologies.
- Structured data model. Unlike Hadoop, which is schema on read, Kudu is schema on write: the primary key, column data types, partition columns, etc. must be known beforehand (a sketch follows below).
- It is considered to bridge the gap between Hive and HBase.
- It can integrate with the Hive metastore.

Note - This document is prepared without considering any future Cloudera Distribution upgrades. We assume Spark 2.2.0.cloudera2, Hive 1.1.0-cdh5.12.2 and Hadoop 2.6.0-cdh5.12.2. Kudu is supported only by Cloudera, so we assume an ongoing Cloudera cluster.

Kudu is considered good for -
- Reporting applications where new data must be immediately available for end users.
- Time-series applications that must support queries across large amounts of historic data while simultaneously…
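Because the data model is schema on write, the primary key, column types and partitioning all have to be declared when the table is created. A minimal sketch of what that looks like through the Impala integration, with a hypothetical metrics table (names and partition count are illustrative only):

# Hypothetical table; Impala's Kudu integration requires the primary key
# and partitioning to be declared up front.
impala-shell -q "
  CREATE TABLE metrics (
    host  STRING,
    ts    BIGINT,
    value DOUBLE,
    PRIMARY KEY (host, ts)
  )
  PARTITION BY HASH (host) PARTITIONS 4
  STORED AS KUDU;"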

Kudu table Backup & Restore

There is no built-in feature in Kudu to take backups, but backup/restore can be done with a Spark job, which supports both full and incremental table data.

You can use the KuduBackup Spark job to back up one or more Kudu tables. Common flags/options you can use while taking a backup:
--rootPath: The root path used to output backup data. It accepts any Spark-compatible path.
--kuduMasterAddresses: Comma-separated addresses of the Kudu masters.
<table>…: The list of tables that you want to back up.

Example - the following backs up table1 and table2:

spark-submit \
  --class org.apache.kudu.backup.KuduBackup \
  kudu-backup2_2.11-1.10.0.jar \
  --kuduMasterAddresses master1-host \
  --rootPath hdfs:///kudu-backups \
  table1 table2

You can use the KuduRestore Spark job to restore one or more Kudu tables. Common flags/options that you can use to restore tables:
--rootPath: The root path to the backup data.
--kuduMasterAddresses: Comma-separated addresses of the Kudu masters.
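A restore invocation mirrors the backup job; a sketch assuming the same jar, master address and backup root as the example above:

# Restore table1 and table2 from the backups written under hdfs:///kudu-backups.
spark-submit \
  --class org.apache.kudu.backup.KuduRestore \
  kudu-backup2_2.11-1.10.0.jar \
  --kuduMasterAddresses master1-host \
  --rootPath hdfs:///kudu-backups \
  table1 table2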