SQL to find Units whose value equals a given type and is not in an exclusion list

Say we have table_1 like:

unit | req_type
A    | 1
A    | 2
B    | 2
B    | 3
D    | 2
E    | 2
E    | 4

We have to write SQL that selects units with req_type = 2, where the unit must not have any other req_type from a given list, e.g. (1, 3).

Solution - We can add another column:

case when req_type in (1, 3) then 1 else 0 end col1

unit | req_type | col1
A    | 1        | 1
A    | 2        | 0
B    | 2        | 0
B    | 3        | 1
D    | 2        | 0
E    | 2        | 0
E    | 4        | 0

Then we can group by unit and take sum(col1):

unit | c
A    | 1
B    | 1
D    | 0
E    | 0

Finally, we select all units whose sum is 0. Complete SQL:

select unit
from (select unit, sum(col1) c
      from (select unit, req_type,
                   case when req_type in (1, 3) then 1 else 0 end col1
            from table_1) t
      group by unit) g
where c = 0
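The same result can be written more compactly with conditional aggregation. A minimal variant against the same table_1; it also states the req_type = 2 requirement explicitly, which the nested query above relies on implicitly (every unit in the sample data happens to have a row with req_type = 2):

select unit
from table_1
group by unit
having sum(case when req_type in (1, 3) then 1 else 0 end) = 0   -- no excluded types
   and sum(case when req_type = 2 then 1 else 0 end) > 0         -- has the wanted type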

Non-Equi Left Join with Hive & Spark

A non-equi left join is a join between two tables in which the join condition uses operators other than equals ("="). For example:

Equi join - Table1 join Table2 on (T1.c1 = T2.c1)
Non-equi join - Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1 < T1.c2)

Hive does not support non-equi joins. Spark supports them, but we have seen that they only work when the condition is a combination of equi and non-equi columns. For example:

Table1 join Table2 on (T1.c3 = T2.c3 and T2.c1 > T1.c1 and T2.c1 < T1.c2)

If you have only a non-equi condition, the Spark job fails after some time with a "Future Timeout" exception:

Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1 < T1.c2)

In that scenario, we have to identify an additional column on which an equi condition can be specified. It can be a hash value column or something else that matches between Table1 and Table2. For example, we had two tables and wanted to check whether an IP falls between the start and end of an IP range.
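A minimal sketch of that workaround, assuming hypothetical tables ips(ip_num) and ranges(name, start_num, end_num) in which IPs are already converted to 32-bit integers, and assuming every range stays inside a single /8 block (so an IP's first octet always equals the first octet of its range's start). The derived first octet then serves as the extra equi key:

select i.ip_num, r.name
from (select ip_num,
             floor(ip_num / 16777216) as oct1    -- first octet of the IP
      from ips) i
join (select name, start_num, end_num,
             floor(start_num / 16777216) as oct1 -- first octet of the range start
      from ranges) r
  on i.oct1 = r.oct1                             -- equi condition for the planner
 and i.ip_num >= r.start_num                     -- non-equi range conditions
 and i.ip_num <= r.end_num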

Spark - Data Skew - Join Optimization

Sometimes data is unevenly distributed, leading to data skew. What this means is that one partition has much more data than the others because of the same or related keys. In joins and aggregations, all data for the same key must be co-located and may be processed by one container/executor. This can slow the whole application down.

Solution -

1. Broadcast - If one data set is small, it can be broadcast to every executor, increasing join efficiency. This is governed by the property spark.sql.autoBroadcastJoinThreshold.

2. NULL keys - Identify whether there are too many NULL values; if so, filter them out before joining, process the records with NULL keys separately, and then union them back with the remaining data set.

3. Salting - To understand salting, let's look at the problem with an example:

Table 1
Key
1
1
1

Table 2
Key
1
1

On joining Table1 with Table2, since everything has the same key, all the data is shuffled to the same container (one JVM), which produces 3 * 2 = 6 output rows for key 1. Salting spreads these rows out, as shown in the sketch below.
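For the broadcast approach, the threshold (in bytes) can be set per session, for example:

set spark.sql.autoBroadcastJoinThreshold = 104857600;  -- 100 MB; pick per cluster

For salting, a minimal Spark SQL sketch, assuming hypothetical tables big(key, val1) (the skewed side) and small(key, val2), and a salt range of 3. The skewed side gets a random salt per row, and the small side is exploded across all salt values so every combination can still match; a hot key's rows are then spread over 3 partitions instead of 1:

select b.key, b.val1, s.val2
from (select key, val1,
             cast(floor(rand() * 3) as int) as salt  -- random salt 0..2
      from big) b
join (select key, val2, salt
      from small
      lateral view explode(array(0, 1, 2)) t as salt -- replicate each row 3x
     ) s
  on b.key = s.key
 and b.salt = s.salt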

Spark - Metastore slow, stuck, or hung

We have seen that Spark connects to the Hive Metastore, and sometimes it takes too long to connect because the metastore is slow. Also, we have seen that there is no automatic load balancing between Hive metastores.

Solution - Out of all the available Hive Metastore URIs, we can externally test for a working metastore in a script and then set that URI while invoking the Spark job, as follows:

spark-sql --conf "spark.hadoop.hive.metastore.uris=thrift://myMetaIp:9083" -e "show databases"

Note - This approach can also be used to set other Hive Metastore (or Hadoop) properties externally from Spark.
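A minimal shell sketch of such a probe, assuming hypothetical metastore hosts meta1, meta2, and meta3 on the default thrift port 9083. It picks the first host that accepts a TCP connection; note that a TCP check only confirms the port is reachable, not that the metastore service itself is healthy:

for host in meta1 meta2 meta3; do
  # -z: probe without sending data, -w 2: two-second timeout
  if nc -z -w 2 "$host" 9083; then
    METASTORE_URI="thrift://$host:9083"
    break
  fi
done
spark-sql --conf "spark.hadoop.hive.metastore.uris=$METASTORE_URI" -e "show databases"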