Spark HBase Connector (SHC) - Unsupported Primitive datatype null

While writing a Spark DataFrame to an HBase table, you may observe the following exception -

Caused by: java.lang.UnsupportedOperationException: PrimitiveType coder: unsupported data type null
        at org.apache.spark.sql.execution.datasources.hbase.types.PrimitiveType.toBytes(PrimitiveType.scala:61)
        at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$org$apache$spark$sql$execution$datasources$hbase$HBaseRelation$$convertToPut$1$1.apply(HBaseRelation.scala:213)
        at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$org$apache$spark$sql$execution$datasources$hbase$HBaseRelation$$convertToPut$1$1.apply(HBaseRelation.scala:209)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

There are suggestions to upgrade the SHC-Core JAR file, but that did not work for us. Instead, it started giving the following error -

Caused by: org.apache.spark.sql.execution.datasources.hbase.InvalidRegionNumberE
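For context, the write that triggers this exception follows the usual SHC pattern, sketched below. The catalog JSON, table name, and column mapping here are hypothetical, and df stands for whichever DataFrame you are writing. The exception is thrown when a null value reaches PrimitiveType.toBytes, so one workaround we have seen (not necessarily the fix this post ultimately settles on) is to replace nulls, for example with na.fill, before calling save().

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Hypothetical catalog: row key plus one column family / column
val catalog = """{
  "table":{"namespace":"default", "name":"my_table"},
  "rowkey":"key",
  "columns":{
    "col0":{"cf":"rowkey", "col":"key",  "type":"string"},
    "col1":{"cf":"cf1",    "col":"col1", "type":"string"}
  }
}"""

df.na.fill("")                                          // avoid nulls reaching PrimitiveType.toBytes
  .write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
               HBaseTableCatalog.newTable -> "5"))      // region count used when the table is created
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()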

Spark reading Parquet table gives Null records whereas it works from Hive

Spark reading a Parquet table gives NULL records, whereas the same table works from Hive. Reading the Parquet table from Hive works -

select * from Table_Parq limit 7;
1
2
3
4
5
6
7

Whereas the same query does not work from Spark -

select * from Table_Parq limit 7;
NULL
NULL
NULL
NULL
NULL
NULL
NULL

This can happen when the Parquet file schema differs from the Hive Metastore schema, for example when the column names are in a different case.

Solution -

Read the Parquet file directly from HDFS instead of the Hive table, or,
Set the following properties -

set spark.sql.caseSensitive=false;
set spark.sql.hive.convertMetastoreParquet=false;
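To make both options concrete, here is a minimal spark-shell sketch. The table name Table_Parq and the two properties come from the post itself; the HDFS path is hypothetical.

// Option 1: read the Parquet files directly from HDFS (hypothetical path)
val dfFromHdfs = spark.read.parquet("hdfs:///path/to/table_parq")
dfFromHdfs.show(7)

// Option 2: relax case sensitivity and disable Spark's built-in Parquet
// reader for Hive metastore tables, then query the Hive table again
spark.sql("set spark.sql.caseSensitive=false")
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("select * from Table_Parq limit 7").show()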

SQL to find Unit whose value equals and not in

Say, we have table_1 like -

unit | req_type
A    |   1
A    |   2
B    |   2
B    |   3
D    |   2
E    |   2
E    |   4

We have to write a SQL such that we select units with Req_Type = 2, and that unit should not have any other Req_Type from a given list, e.g. (1, 3).

Solution -

We can add another column such that -

case when req_type in (1,3) then 1 else 0 end col1

unit | req_type | col1
A    |   1      |   1
A    |   2      |   0
B    |   2      |   0
B    |   3      |   1
D    |   2      |   0
E    |   2      |   0
E    |   4      |   0

Then we can group by unit and take sum(col1) -

unit | c
A    | 1
B    | 1
D    | 0
E    | 0

Then we can select all units for which the sum is 0.

Complete SQL -

select unit
from (select unit, sum(col1) c
      from (select unit, req_type,
                   case when req_type in (1,3) then 1 else 0 end col1
            from table_1) t1
      group by unit) t2
where c = 0
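To sanity-check the query, the sketch below rebuilds table_1 as a temporary view in spark-shell from the sample rows above and runs the complete SQL; with this data it returns D and E. The view and column names follow the post, everything else is just scaffolding for the test.

import spark.implicits._

// Sample rows from the post, registered as a temporary view named table_1
Seq(("A", 1), ("A", 2), ("B", 2), ("B", 3), ("D", 2), ("E", 2), ("E", 4))
  .toDF("unit", "req_type")
  .createOrReplaceTempView("table_1")

spark.sql("""
  select unit
  from (select unit, sum(col1) c
        from (select unit, req_type,
                     case when req_type in (1,3) then 1 else 0 end col1
              from table_1) t1
        group by unit) t2
  where c = 0
""").show()   // expected output: D and E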

Non-Equi Left Join with Hive & Spark

A non-equi left join is a join between two tables in which the join condition uses operators other than equal to "=". For example -

Equi Join - "Table1 join Table2 on (T1.c1=T2.c1)"
Non-Equi Join - "Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1 < T1.c2)"

Hive does not support non-equi joins. Spark does support them, but we have seen that it only works when there is a combination of equi and non-equi conditions. For example -

Table1 join Table2 on (T1.c3=T2.c3 and T2.c1 > T1.c1 and T2.c1 < T1.c2)

But if you have only a non-equi condition, then the Spark job fails after some time with a "Future Timeout" exception -

Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1 < T1.c2)

In that scenario, we have to identify an additional column on which an equi condition can be specified. It can be a hash value column or something else that matches between Table 1 and Table 2. For example - we had 2 tables and wanted to check if IP is between sta
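As a sketch of this pattern in Spark SQL (the table and column names below just follow the examples above; c3 stands for whatever shared equi column, such as a hash or bucket value, you can add to both tables):

// Purely non-equi join condition - the shape that timed out for us
val slowJoin = spark.sql("""
  select *
  from Table1 t1
  left join Table2 t2
    on t2.c1 > t1.c1 and t2.c1 < t1.c2
""")

// Same join with an additional equi condition on a shared column c3,
// which gives Spark an equi key to hash / sort-merge on
val fasterJoin = spark.sql("""
  select *
  from Table1 t1
  left join Table2 t2
    on t1.c3 = t2.c3
   and t2.c1 > t1.c1 and t2.c1 < t1.c2
""")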