
Posts

Spark - Data Skew - Join Optimization

Sometimes data is unevenly distributed, leading to data skew: one partition ends up with much more data than the others because many rows share the same (or related) keys. For joins and aggregations, all data for a given key must be co-located so it can be processed by one container/executor, and that single task can slow the whole application down.

Solutions -

Broadcast join - If one data set is small, it can be broadcast to every executor, avoiding a shuffle and increasing join efficiency. This is governed by the property spark.sql.autoBroadcastJoinThreshold.

NULL keys - Identify whether there are too many NULL values in the join key and filter them out before joining. Process the records with NULL keys separately, then union the result with the remaining data set.

Salting - To understand salting, let's first look at the problem with an example -

Table 1, key column: 1, 1, 1
Table 2, key column: 1, 1

On joining Table 1 with Table 2, all rows for this key are shuffled to the same container (one JVM), which returns 3 * 2 = 6 rows, all produced by a single task. A minimal salting sketch is shown below.
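The sketch below shows the salting idea in Spark/Scala. It is only an illustration, not code from the original post: the column names (key, value1, value2), the sample rows, and the salt bucket count are all assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SaltingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("salting-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical data sets: table1 is the large, skewed side; table2 is the other side.
    val table1 = Seq((1, "a"), (1, "b"), (1, "c")).toDF("key", "value1")
    val table2 = Seq((1, "x"), (1, "y")).toDF("key", "value2")

    val saltBuckets = 8 // number of salt values; tune for the degree of skew

    // Add a random salt to the skewed side so rows with the same key spread across partitions.
    val salted1 = table1.withColumn("salt", (rand() * saltBuckets).cast("int"))

    // Replicate each row of the other side once per salt value so the join still matches.
    val saltValues = array((0 until saltBuckets).map(i => lit(i)): _*)
    val salted2 = table2.withColumn("salt", explode(saltValues))

    // Join on (key, salt): each (key, salt) pair is now a smaller, more evenly sized group.
    val joined = salted1.join(salted2, Seq("key", "salt")).drop("salt")
    joined.show()

    // Alternatively, if table2 is small enough, a broadcast join avoids the shuffle entirely:
    // table1.join(broadcast(table2), "key")
  }
}

Salting trades extra rows on the replicated side for an even distribution of the skewed side, so it pays off mainly when one key dominates a very large table.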

Spark - Metastore Slow, Stuck, or Hung

We have seen that Spark connects to the Hive Metastore, and sometimes it takes too long to connect because the Metastore is slow. We have also seen that there is no automatic load balancing between Hive Metastores.

Solution - Out of the available Hive Metastore URIs, test for a working Metastore externally (for example, in a script), then set that URI while invoking the Spark job as follows -

spark-sql --conf "spark.hadoop.hive.metastore.uris=thrift://myMetaIp:9083" -e "show databases"

Note - This approach can also be used to set other Hive Metastore or Hadoop properties externally from Spark. A programmatic equivalent is sketched below.
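The same property can also be set when building the SparkSession, as a rough sketch (thrift://myMetaIp:9083 is a placeholder; substitute whichever Metastore your health-check script found to be responsive):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metastore-override")
  // Point Spark's embedded Hive client at a specific, known-good Metastore.
  .config("spark.hadoop.hive.metastore.uris", "thrift://myMetaIp:9083")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()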

Set up DbVisualizer, DBeaver, DataGrip, or other tools to access a secure Kerberos Hadoop cluster from a remote Windows machine

If you are on Windows 10 you should already have utilities like kinit, ktutil, etc. available on your machine. If not, install MIT Kerberos - http://web.mit.edu/kerberos/

Here are the steps -

1. Copy /etc/krb5.conf from any node of your cluster to your local machine. Also copy the *.keytab file from the cluster to your local machine.
2. Rename krb5.conf to krb5.ini.
3. Copy krb5.ini to -
   <Java_home>\jre\lib\security\
   C:\Users\<User_name>\
4. Copy the keytab file to -
   C:\Users\<User_name>\
5. On the Hadoop cluster, get the principal name from the keytab file -
   ktutil
   ktutil:  read_kt keytab
   ktutil:  list
   slot KVNO Principal
   ---- ---- ---------------------------------------------------------------------
      1    1 <username>@HADOOP.domain.COM
   ktutil:  quit
   The entry listed above (<username>@HADOOP.domain.COM) is your principal name.
6. On Windows, execute:
   kinit -k -t C:\Users\<User_name>\*.keytab <username>@HADOOP.domain.COM
   This will generate a Kerberos ticket for the user, which the database tools can then use to authenticate. The sketch after these steps shows one way to verify the connection from code.
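As a quick way to check that the ticket works before wiring up a GUI tool, here is a minimal Scala sketch that opens a JDBC connection to HiveServer2 using the Kerberos principal in the URL. The host, port, database, and service principal (hive/_HOST@HADOOP.domain.COM) are assumptions; substitute your cluster's values, and make sure the Hive JDBC driver (org.apache.hive:hive-jdbc) and its Hadoop dependencies are on the classpath.

import java.sql.DriverManager

object KerberosJdbcCheck {
  def main(args: Array[String]): Unit = {
    // Load the Hive JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Placeholder URL: host, port, database, and the hive service principal must match your cluster.
    val url = "jdbc:hive2://myHiveServer2Host:10000/default;principal=hive/_HOST@HADOOP.domain.COM"

    // With a valid ticket from kinit, no username/password is needed here.
    val conn = DriverManager.getConnection(url)
    try {
      val rs = conn.createStatement().executeQuery("show databases")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
  }
}

A URL of the same shape (including the ;principal= part) is typically what DbVisualizer, DBeaver, or DataGrip need in their Hive connection settings.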

Un-escape HTML/XML character entities

The characters below are escaped in XML/HTML as character entities. For example -

Escape Character    Actual Character
&gt;                >
&lt;                <
&#x00E1;            Latin small letter a with acute (á)

To un-escape these characters, use the Apache Commons Lang library. For example -

    // line of code
    println(org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4("&lt; S&#x00E1;chdev Dinesh &gt;"))

    // This will result in -
    // < Sáchdev Dinesh >
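If the Commons Lang classes are not already on the classpath, the dependency can be pulled in with something like the following sbt line (the version shown is only an example; any recent commons-lang3 3.x release provides StringEscapeUtils):

// build.sbt
libraryDependencies += "org.apache.commons" % "commons-lang3" % "3.12.0"

Note that newer releases deprecate StringEscapeUtils in commons-lang3 in favour of the separate commons-text library, but the commons-lang3 version above still works as shown.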

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results is bigger than spark.driver.maxResultSize

Exception -

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 122266 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

Cause - This happens when we try to collect a DataFrame/RDD on the driver and the total size of the serialized results exceeds the limit set by spark.driver.maxResultSize.

Solution - Increase the limit, for example -

--conf "spark.driver.maxResultSize=4g"

A programmatic way to set the same property is sketched below.
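For completeness, here is a minimal sketch of setting the same property while building the SparkSession (the 4g value simply mirrors the --conf example above; choose a limit that fits within the driver's memory, and where possible avoid collecting very large results to the driver at all):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("max-result-size-example")
  // Same property as the --conf example; raise it only as far as the driver heap allows.
  .config("spark.driver.maxResultSize", "4g")
  .getOrCreate()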