Hive QL Spark SQL - Transform Rows into Columns

For a Structured Tabular Structure it is many a times required to transform Rows into Columns. This blog explains step by step process which can be executed as one SQL to achieve same.

Lets try to understand with help of below example: where -in , we want to implement / transform input Table into table structure mentioned as output.

INPUT_TABLE

topic	groupId	batchTimeMs	Partition	offset	Count
t1	g001	1658173779	0	123	122
t1	g001	1658173779	1	2231	100
t2	g001	1658173779	0	12	11

OUTPUT_TABLE

rowkey:key	offset:0	count:0	offset:1	count:1
t1:g001:1658173779	123	122	2231	100
t2:g001:1658173779	12	11	NULL	NULL

FIRST STEP -

Concat Topic, GroupID, and BatchTimeMS to create RowKey
Create Columns - offsets:0, counts:0, offsets:1, counts:1. Such that Columns has value only when respective partition value matches with column name.
SQL as below -

select concat_ws(':', topic,groupId,batchTimeMs) as rowkey,

case when partition='0' then offset else null end as `offsets:0`,

case when partition='0' then count else null end as `counts:0`,

case when partition='1' then offset else null end as `offsets:1`,

case when partition='1' then count else null end as `counts:1`

FROM INPUT_TABLE

rowkey	offset:0	count:0	offset:1	count:1
t1:g001:1658173779	123	122	NULL	NULL
t1:g001:1658173779	NULL	NULL	2231	100
t2:g001:1658173779	12	11	NULL	NULL

SECOND STEP-

Bring in all values of a ROWKEY in to one row.
SQL as below -

select rowkey as `rowkey:key`,

collect_set(`offsets:0`) as `offsets:0`,

collect_set(`counts:0`) as `counts:0`,

collect_set(`offsets:1`) as `offsets:1`,

collect_set(`counts:1`) as `counts:1` FROM (

select concat_ws(':', topic,groupId,batchTimeMs) as rowkey,

case when partition='0' then offset else null end as `offsets:0`,

case when partition='0' then count else null end as `counts:0`,

case when partition='1' then offset else null end as `offsets:1`,

case when partition='1' then count else null end as `counts:1`

FROM INPUT_TABLE ) T1 group by rowkey

rowkey:key	offset:0	count:0	offset:1	count:1
t1:g001:1658173779	[123, NULL]	[122, NULL]	[2231,NULL]	[100,NULL]
t2:g001:1658173779	[12]	[11]	[NULL]	[NULL]

THIRD STEP -

Select only first value from Array of values.
SQL as below resulting in final desired output -

select rowkey as `rowkey:key`,

collect_set(`offsets:0`) [0] as `offsets:0`,

collect_set(`counts:0`) [0] as `counts:0`,

collect_set(`offsets:1`) [0] as `offsets:1`,

collect_set(`counts:1`) [0] as `counts:1` FROM (

select concat_ws(':', topic,groupId,batchTimeMs) as rowkey,

case when partition='0' then offset else null end as `offsets:0`,

case when partition='0' then count else null end as `counts:0`,

case when partition='1' then offset else null end as `offsets:1`,

case when partition='1' then count else null end as `counts:1`

FROM INPUT_TABLE ) T1 group by rowkey

rowkey:key	offset:0	count:0	offset:1	count:1
t1:g001:1658173779	123	122	2231	100
t2:g001:1658173779	12	11	NULL	NULL

QueryDB

Search This Blog

Hive QL Spark SQL - Transform Rows into Columns

Comments

Post a Comment

Popular posts

Spark MongoDB Connector Not leading to correct count or data while reading

Scala Spark building Jar leads java.lang.StackOverflowError

MongoDB Chunk size many times bigger than configure chunksize (128 MB)

AWS EMR Spark – Much Larger Executors are Created than Requested

Hive Count Query not working