Spark HBase Connector (SHC) - Unsupported Primitive datatype null

While writing a Spark DataFrame to an HBase table, you may observe the following exception -

Caused by: java.lang.UnsupportedOperationException: PrimitiveType coder: unsupported data type null
        at org.apache.spark.sql.execution.datasources.hbase.types.PrimitiveType.toBytes(PrimitiveType.scala:61)
        at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$org$apache$spark$sql$execution$datasources$hbase$HBaseRelation$$convertToPut$1$1.apply(HBaseRelation.scala:213)
        at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$org$apache$spark$sql$execution$datasources$hbase$HBaseRelation$$convertToPut$1$1.apply(HBaseRelation.scala:209)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

There are suggestions to upgrade the SHC-Core JAR file, but that did not work for us. Instead, it started giving the following error -

Caused by: org.apache.spark.sql.execution.datasources.hbase.InvalidRegionNumberE
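For context, the write that triggers this exception follows the usual SHC pattern, sketched below. The catalog JSON, table name, and column mapping here are hypothetical, and df stands for whichever DataFrame you are writing. The exception is thrown when a null value reaches PrimitiveType.toBytes, so one workaround we have seen (not necessarily the fix this post ultimately settles on) is to replace nulls, for example with na.fill, before calling save().

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Hypothetical catalog: row key plus one column family / column
val catalog = """{
  "table":{"namespace":"default", "name":"my_table"},
  "rowkey":"key",
  "columns":{
    "col0":{"cf":"rowkey", "col":"key",  "type":"string"},
    "col1":{"cf":"cf1",    "col":"col1", "type":"string"}
  }
}"""

df.na.fill("")                                          // avoid nulls reaching PrimitiveType.toBytes
  .write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
               HBaseTableCatalog.newTable -> "5"))      // region count used when the table is created
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()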

Spark reading Parquet table gives Null records whereas it works from Hive

Spark reading a Parquet table gives NULL records, whereas the same table works from Hive. Reading the Parquet table from Hive works -

select * from Table_Parq limit 7;
1
2
3
4
5
6
7

Whereas the same query does not work from Spark -

select * from Table_Parq limit 7;
NULL
NULL
NULL
NULL
NULL
NULL
NULL

This can happen when the Parquet file schema differs from the Hive Metastore schema, for example when the column names are in a different case.

Solution -

Read the Parquet file directly from HDFS instead of the Hive table, or,
Set the following properties -

set spark.sql.caseSensitive=false;
set spark.sql.hive.convertMetastoreParquet=false;
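To make both options concrete, here is a minimal spark-shell sketch. The table name Table_Parq and the two properties come from the post itself; the HDFS path is hypothetical.

// Option 1: read the Parquet files directly from HDFS (hypothetical path)
val dfFromHdfs = spark.read.parquet("hdfs:///path/to/table_parq")
dfFromHdfs.show(7)

// Option 2: relax case sensitivity and disable Spark's built-in Parquet
// reader for Hive metastore tables, then query the Hive table again
spark.sql("set spark.sql.caseSensitive=false")
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("select * from Table_Parq limit 7").show()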

SQL to find Unit whose value equals and not in

Say, we have table_1 like -

unit | req_type
A    |   1
A    |   2
B    |   2
B    |   3
D    |   2
E    |   2
E    |   4

We have to write a SQL such that we select units with Req_Type = 2, and that unit should not have any other Req_Type from a given list, e.g. (1, 3).

Solution -

We can add another column such that -

case when req_type in (1,3) then 1 else 0 end col1

unit | req_type | col1
A    |   1      |   1
A    |   2      |   0
B    |   2      |   0
B    |   3      |   1
D    |   2      |   0
E    |   2      |   0
E    |   4      |   0

Then we can group by unit and take sum(col1) -

unit | c
A    | 1
B    | 1
D    | 0
E    | 0

Then we can select all units for which the sum is 0.

Complete SQL -

select unit
from (select unit, sum(col1) c
      from (select unit, req_type,
                   case when req_type in (1,3) then 1 else 0 end col1
            from table_1) t1
      group by unit) t2
where c = 0
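To sanity-check the query, the sketch below rebuilds table_1 as a temporary view in spark-shell from the sample rows above and runs the complete SQL; with this data it returns D and E. The view and column names follow the post, everything else is just scaffolding for the test.

import spark.implicits._

// Sample rows from the post, registered as a temporary view named table_1
Seq(("A", 1), ("A", 2), ("B", 2), ("B", 3), ("D", 2), ("E", 2), ("E", 4))
  .toDF("unit", "req_type")
  .createOrReplaceTempView("table_1")

spark.sql("""
  select unit
  from (select unit, sum(col1) c
        from (select unit, req_type,
                     case when req_type in (1,3) then 1 else 0 end col1
              from table_1) t1
        group by unit) t2
  where c = 0
""").show()   // expected output: D and E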

Non-Equi Left Join with Hive & Spark

A non-equi left join is a join between two tables in which the join condition uses operators other than equal to "=". For example -

Equi Join - "Table1 join Table2 on (T1.c1=T2.c1)"
Non-Equi Join - "Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1 < T1.c2)"

Hive does not support non-equi joins. Spark does support them, but we have seen that it only works when there is a combination of equi and non-equi conditions. For example -

Table1 join Table2 on (T1.c3=T2.c3 and T2.c1 > T1.c1 and T2.c1 < T1.c2)

But if you have only a non-equi condition, then the Spark job fails after some time with a "Future Timeout" exception -

Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1 < T1.c2)

In that scenario, we have to identify an additional column on which an equi condition can be specified. It can be a hash value column or something else that matches between Table 1 and Table 2. For example - we had 2 tables and wanted to check if IP is between sta
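As a sketch of this pattern in Spark SQL (the table and column names below just follow the examples above; c3 stands for whatever shared equi column, such as a hash or bucket value, you can add to both tables):

// Purely non-equi join condition - the shape that timed out for us
val slowJoin = spark.sql("""
  select *
  from Table1 t1
  left join Table2 t2
    on t2.c1 > t1.c1 and t2.c1 < t1.c2
""")

// Same join with an additional equi condition on a shared column c3,
// which gives Spark an equi key to hash / sort-merge on
val fasterJoin = spark.sql("""
  select *
  from Table1 t1
  left join Table2 t2
    on t1.c3 = t2.c3
   and t2.c1 > t1.c1 and t2.c1 < t1.c2
""")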