Skip to main content

Posts

Showing posts from August, 2022

Spark - Creating Sub Folder while writing to Partitioned Hive Table

  We had been writing to a Partitioned Hive Table and realized that data is being written has sub-folder. For ex- Refer Table definition as below -  Create table T1 ( name string, address string) Partitioned by (process_date string) stored as parquet location '/mytable/a/b/c/org=employee'; While writing to table HDFS path being written looks something like this -  /mytable/a/b/c/org=employee/ process_date=20220812/ org=employee The unnecessary addition of   org=employee after  process_date partition is because Hive Table has location consisting "=" operator, which Hive uses as syntax to determine partition column. Re-defining Table resolves above problem -  Create table T1 ( name string, address string) Partitioned by (process_date string) stored as parquet location '/mytable/a/b/c/employee';

Hive QL Spark SQL - Transform Rows into Columns

  For a Structured Tabular Structure it is many a times required to transform Rows into Columns. This blog explains step by step process which can be executed as one SQL to achieve same.  Lets try to understand with help of below example: where -in , we want to implement / transform input Table into table structure mentioned as output. INPUT_TABLE   topic groupId batchTimeMs Partition  offset  Count  t1  g001  1658173779  0 123  122 t1 g001  1658173779  1 2231 100 t2 g001  1658173779  0 12 11 OUTPUT_TABLE   rowkey:key offset:0 count:0     offset:1  count:1  t1:g001:1658173779  123 122 2231 100 t2:g001:1658173779  12 11 NULL NULL   FIRST STEP - Concat Topic, GroupID, and BatchTimeMS to create RowKey  Create Columns - offsets:0, counts:0, offsets:1, counts:1. Such that Columns has value only when respective partition value matches with column name. SQL as below - select concat_ws(':', topic,groupId,batchTimeMs) as rowkey, case when partition='0' then offset else null en