Posts

Non-Equi Left Join with Hive & Spark

A non-equi left join is a join between two tables in which the join condition uses operators other than equals ("="). For example -

Equi join - Table1 join Table2 on (T1.c1 = T2.c1)
Non-equi join - Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1 < T1.c2)

Hive does not support non-equi joins. Spark does support them, but we have seen that they only work when the condition combines equi and non-equi columns. For example -

Table1 join Table2 on (T1.c3 = T2.c3 and T2.c1 > T1.c1 and T2.c1 < T1.c2)

If you have only a non-equi condition, the Spark job fails after some time with a "Future Timeout" exception -

Table1 join Table2 on (T2.c1 > T1.c1 and T2.c1 < T1.c2)

In that scenario, we have to identify an additional column on which an equi condition can be specified. It can be a hash-value column, or anything else that matches between Table1 and Table2. For example - we had two tables and wanted to check whether an IP address falls between the start and end IPs of a range.
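To make this concrete, here is a minimal Spark/Scala sketch of the workaround. The data and the "first octet" bucket column used as the extra equi key are made up for illustration, and the bucketing assumes a range never crosses a first-octet boundary -

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("NonEquiJoin").getOrCreate()
    import spark.implicits._

    // Made-up data: events carry an IP as a number, ranges have start/end IPs.
    val events = Seq((167772161L, "e1"), (167772200L, "e2")).toDF("ip", "event")
    val ranges = Seq((167772160L, 167772415L, "netA")).toDF("ip_start", "ip_end", "net")

    // Pure non-equi condition - in our experience this eventually fails
    // with the "Future Timeout" exception:
    // events.join(ranges, $"ip" >= $"ip_start" && $"ip" <= $"ip_end", "left")

    // Workaround: derive the same bucket column on both sides (here the
    // first octet of the IP) so the join also has an equi condition.
    val e = events.withColumn("bucket", ($"ip" / 16777216L).cast("long"))
    val r = ranges.withColumn("bucket", ($"ip_start" / 16777216L).cast("long"))

    e.join(r, e("bucket") === r("bucket") &&
              $"ip" >= $"ip_start" && $"ip" <= $"ip_end", "left")
     .show()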

Spark - Data Skew - Join Optimization

Sometimes data is unevenly distributed, leading to data skew. This means one partition holds far more data than the others because of identical or related keys. In joins and aggregations, all data for the same key must be co-located and may be processed by a single container/executor, which can make the application slow.

Solutions -

1. Broadcast - if one data set is small, it can be broadcast to every executor, avoiding the shuffle and making the join more efficient. This is governed by the property spark.sql.autoBroadcastJoinThreshold.
2. NULL keys - if there are too many NULL values, filter them out before joining, process the records with NULL keys separately, and then union the two result sets.
3. Salting - to understand salting, let's look at the problem with an example (a sketch follows below) -

Table1 (Key): 1, 1, 1
Table2 (Key): 1, 1

On joining Table1 with Table2, since every row has the same key, all the data is shuffled to the same container/JVM, and the join returns 3 * 2 = 6 rows, all of the form (1, 1).
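Salting can be sketched in Spark/Scala as follows. The data and the salt width (SALT = 4) are made up; the idea is that the skewed side gets a random salt appended to its key, the other side is replicated once per salt value, and the join then runs on the salted key so the hot key's rows spread across up to SALT partitions instead of one -

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("SaltedJoin").getOrCreate()
    import spark.implicits._

    val SALT = 4  // number of salt buckets; tune to the observed skew

    val left  = Seq(1, 1, 1, 2).toDF("key")                  // skewed side
    val right = Seq((1, "a"), (2, "b")).toDF("key", "value")

    // Append a random salt 0..SALT-1 to the skewed side's key.
    val saltedLeft = left.withColumn("salted_key",
      concat($"key".cast("string"), lit("_"),
             (rand() * SALT).cast("int").cast("string")))

    // Replicate the other side once per salt value so every salted key matches.
    val saltedRight = right
      .withColumn("salt", explode(array((0 until SALT).map(lit): _*)))
      .withColumn("salted_key",
        concat($"key".cast("string"), lit("_"), $"salt".cast("string")))
      .drop("key")

    // Rows for the hot key now land in up to SALT different partitions.
    saltedLeft.join(saltedRight, Seq("salted_key")).show()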

Spark - Metastore Slow, Stuck, or Hung

We have seen that Spark connects to the Hive Metastore, and sometimes it takes too long to connect because the Metastore is slow. We have also seen that there is no automatic load balancing between Hive Metastores.

Solution - out of all the available Hive Metastore instances, test for a working one externally in a script, then set its URL when invoking the Spark job as follows -

spark-sql --conf "spark.hadoop.hive.metastore.uris=thrift://myMetaIp:9083" -e "show databases"

Note - this approach can also be used to set other Hive Metastore (or Hadoop) properties externally from Spark.
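The same override can also be applied from code when building the SparkSession. A minimal sketch, reusing the placeholder thrift URI from the command above -

    import org.apache.spark.sql.SparkSession

    // Point Spark at the metastore instance that the external check picked.
    val spark = SparkSession.builder()
      .appName("MetastoreOverride")
      .config("spark.hadoop.hive.metastore.uris", "thrift://myMetaIp:9083")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases").show()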

Set up DbVisualizer, DBeaver, DataGrip, or Other Tools to Access a Secure Kerberos Hadoop Cluster from a Remote Windows Machine

If you are on Windows 10, you may already have utilities like kinit, ktutil, etc. available on your machine. If not, install MIT Kerberos - http://web.mit.edu/kerberos/

Here are the steps -

1. Copy /etc/krb5.conf from any node of your cluster to your local machine. Also copy the *.keytab file from the cluster to your local machine.
2. Rename krb5.conf to krb5.ini.
3. Copy krb5.ini to -
   <Java_home>\jre\lib\security\
   C:\Users\<User_name>\
4. Copy the keytab file to -
   C:\Users\<User_name>\
5. On the Hadoop cluster, get the principal name from the keytab file -

   ktutil
   ktutil:  read_kt keytab
   ktutil:  list
   slot KVNO Principal
   ---- ---- ----------------------------------------------------------------------
      1    1 <username>@HADOOP.domain.COM
   ktutil:  quit

   The entry listed above (<username>@HADOOP.domain.COM) is your principal name.
6. On Windows, execute -

   kinit -k -t C:\Users\<User_name>\*.keytab <username>@HADOOP.domain.COM

   This will generate a ticket in the user's home directory.
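Once kinit has produced a ticket, the tools above connect to HiveServer2 over JDBC. As a quick connectivity test, here is a minimal Scala sketch - the host, port, and service principal in the URL are placeholders for your cluster, and it assumes the Hive JDBC driver (hive-jdbc and its dependencies) is on the classpath -

    import java.sql.DriverManager

    // Placeholder URL - replace host, port, and principal with your cluster's
    // values. The ";principal=..." part makes the driver authenticate with
    // Kerberos using the ticket that kinit generated.
    val url = "jdbc:hive2://hiveserver.host:10000/default;principal=hive/_HOST@HADOOP.domain.COM"

    val conn = DriverManager.getConnection(url)
    val rs = conn.createStatement().executeQuery("show databases")
    while (rs.next()) println(rs.getString(1))
    rs.close()
    conn.close()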

Un-escape HTML/XML character entities

Below are examples of escape entities in XML/HTML and the actual characters they represent -

Escape entity    Actual character
&gt;             >
&lt;             <
&#x00E1;         á (Latin small letter a with acute)

To un-escape these characters, use the Apache Commons Lang library. For example -

    // un-escape HTML4/XML entities with commons-lang3
    println(org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4("&lt; S&#x00E1;chdev Dinesh &gt;"))

    // This will print -
    // < Sáchdev Dinesh >