QueryDB

Posts

Showing posts from August, 2022

Spark - Creating Sub Folder while writing to Partitioned Hive Table

We had been writing to a Partitioned Hive Table and realized that data is being written has sub-folder. For ex- Refer Table definition as below - Create table T1 ( name string, address string) Partitioned by (process_date string) stored as parquet location '/mytable/a/b/c/org=employee'; While writing to table HDFS path being written looks something like this - /mytable/a/b/c/org=employee/ process_date=20220812/ org=employee The unnecessary addition of org=employee after process_date partition is because Hive Table has location consisting "=" operator, which Hive uses as syntax to determine partition column. Re-defining Table resolves above problem - Create table T1 ( name string, address string) Partitioned by (process_date string) stored as parquet location '/mytable/a/b/c/employee';

Hive QL Spark SQL - Transform Rows into Columns

For a Structured Tabular Structure it is many a times required to transform Rows into Columns. This blog explains step by step process which can be executed as one SQL to achieve same. Lets try to understand with help of below example: where -in , we want to implement / transform input Table into table structure mentioned as output. INPUT_TABLE topic groupId batchTimeMs Partition offset Count t1 g001 1658173779 0 123 122 t1 g001 1658173779 1 2231 100 t2 g001 1658173779 0 12 11 OUTPUT_TABLE rowkey:key offset:0 count:0 offset:1 count:1 t1:g001:1658173779 123 122 2231 100 t2:g001:1658173779 12 11 NULL NULL FIRST STEP - Concat Topic, GroupID, and BatchTimeMS to create RowKey Create Columns - offsets:0, counts:0, offsets:1, counts:1. Such that Columns has value only when respective partition value matches with column name. SQL as below - select concat_ws(':', topic,groupId,batchTimeMs) as rowkey, case when partition='0' then offset else null en