Skip to main content

Posts

Showing posts from July, 2017

Hive Analytics Functions - rank() vs dense_rank() vs percent_rank() vs row_umber() vs cume_dist()

RANK - Rank of each row within partition of result set. DENSE_RANK - Mostly, similar to RANK. But, there will be no gaps in ranking. PERCENT_RANK - Relative Rank of row within group of rows. ROW_NUMBER - Sequential number of row within partition of a result set. CUME_DIST - For row r, the number of rows with value lower than or equal to value of r , divided by number of rows evaluated in partition. Practice - hive> create table test (v string ) row format delimited fields terminated by ','; hive> alter table test add columns (t string); hive> load data local inpath '/root/test' overwrite into table test; test data in local looks like below - a,1 a,2 a,3 a,1 a,2 b,1 c,1 c,2 d,1 e,1 Execute below query and analyze the result - hive> select v, t, rank() over (partition by v ), dense_rank() over (partition by v  ), row_number() over (partition by v  ), percent_rank()over (partition by v  ), cume_dist() over (partition by v  )