QueryDB

Posts

Spark - Metastore slow, or stuck or Hanged.

We have seen that Spark connects to Hive Metastore. And, sometimes it takes too long to get connected as Metastore is slow. Also, we have seen that there is no automatic load balancing between Hive Meta stores. Solution - Out of all available Hive Metastore connections- we can externally test a working Metastore in a script then set that URL while invoking Spark Job as follows - spark-sql --conf "spark.hadoop.hive.metastore.uris=thrift://myMetaIp:9083" -e "show databases" Note - This can also be used to set Hive Metastore or other properties externally from Spark.

Setup DbVisualizer or Dbeaver or Data grip or other tools to access Secure Kerberos hadoop Cluster from remote windows machine

If you are on Windows 10 you should already be having utilities like - kinit, kutil, etc available on your machine. If not then install MIT Kerberos - http://web.mit.edu/kerberos/ Here are the steps - Copy /etc/krb5.conf from any node of your cluster to your local machine Also, copy *.keytab file from cluster to your local machine. Rename krb5.conf as krb5.ini Copy krb5.ini to - <Java_home>\jre\lib\security\ C:\Users\<User_name>\ Copy keytab file to - C:\Users\<User_name>\ On Hadoop Cluster get the principal name from keytab file - ktutil ktutil: read_kt keytab ktutil: list slot KVNO Principal ---- ---- --------------------------------------------------------------------- 1 1 <username>@HADOOP.domain.COM ktutil: quit Above highlighted is your principal name. On windows - Execute: kinit -k -t C:\Users\<User_name>\*.keytab <username>@HADOOP.domain.COM This will generate a ticket in user h

Un-escape HTML/ XML character entities

1. Below characters in XML are escape characters. For example – Escape Charters Actual Character > > < < á Latin small letter a with acute (á) T To un-escape characters use Apache Commons library. For example – //line of code println(org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4("< Sáchdev Dinesh >")) //This will result - < Sáchdev Dinesh >

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results is bigger than spark.driver.maxResultSize

Exception - Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 122266 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) Cause - This happens when we try to collect a Dataframe / RDD on driver and the size of data is more than set by property. Solution - Set :- --conf "spark.driver.maxResultSize=4g"

Copy Phoenix (HBase) Table with different name

It can be done in 2 steps - Just create a new Phoenix Table with a different name but same schema as existing Table. Use below HBase command that will eventually execute a MR Job to copy the data - hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name="<Name of new HBase Table>" "<Name of existing HBase Table>"