Posts

Showing posts from March, 2020

Set up DbVisualizer, DBeaver, DataGrip, or other tools to access a secure Kerberos Hadoop cluster from a remote Windows machine

If you are on Windows 10, utilities such as kinit and ktutil should already be available on your machine. If not, install MIT Kerberos: http://web.mit.edu/kerberos/

Here are the steps:

1. Copy /etc/krb5.conf from any node of your cluster to your local machine.
2. Also copy the *.keytab file from the cluster to your local machine.
3. Rename krb5.conf to krb5.ini.
4. Copy krb5.ini to both of these locations:
   <Java_home>\jre\lib\security\
   C:\Users\<User_name>\
5. Copy the keytab file to:
   C:\Users\<User_name>\
6. On the Hadoop cluster, get the principal name from the keytab file:

   ktutil
   ktutil:  read_kt keytab
   ktutil:  list
   slot KVNO Principal
   ---- ---- ---------------------------------------------------------------------
      1    1 <username>@HADOOP.domain.COM
   ktutil:  quit

   The value in the Principal column is your principal name.
7. On Windows, execute: kinit -k -t C:\Users\<User_name>\*.keytab <us...

Un-escape HTML/XML character entities

The sequences below are escaped characters (character entities) in XML. For example:

Escaped characters    Actual character
&gt;                  >
&lt;                  <
&#x00E1;              Latin small letter a with acute (á)

To un-escape these characters, use the Apache Commons Lang library. For example:

    // line of code
    println(org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4("&lt; S&#x00E1;chdev Dinesh &gt;"))

    // This prints:
    < Sáchdev Dinesh >
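To make the mapping concrete, here is a minimal sketch of what un-escaping involves, written against the Scala standard library only. It is not how Apache Commons implements unescapeHtml4: the object UnescapeSketch and the helper unescapeBasic are hypothetical names, and the sketch handles only &lt;, &gt;, &amp; and decimal/hexadecimal numeric references, which is a small subset of what the library covers.

```scala
import scala.util.matching.Regex

object UnescapeSketch {
  // Named entities handled by this sketch (assumption: only these three;
  // the real library knows the full HTML 4 entity table).
  private val named: Map[String, String] =
    Map("lt" -> "<", "gt" -> ">", "amp" -> "&")

  // Matches &name;, &#123; (decimal) and &#x1F; (hexadecimal) references.
  private val entity: Regex = "&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z]+);".r

  // Hypothetical helper: replaces each recognised entity in a single pass.
  // Numeric references outside the valid Unicode range will throw.
  def unescapeBasic(s: String): String =
    entity.replaceAllIn(s, m => {
      val body = m.group(1)
      val decoded =
        if (body.startsWith("#x") || body.startsWith("#X"))
          new String(Character.toChars(Integer.parseInt(body.drop(2), 16)))
        else if (body.startsWith("#"))
          new String(Character.toChars(Integer.parseInt(body.drop(1))))
        else
          named.getOrElse(body, m.matched) // unknown names are left untouched
      // Quote the result so '$' and '\' are not treated as group references.
      Regex.quoteReplacement(decoded)
    })
}
```

Calling UnescapeSketch.unescapeBasic("&lt; S&#x00E1;chdev Dinesh &gt;") returns the same string as the library call above, because &#x00E1; is hexadecimal E1, the code point of á. For real input, prefer the library, which covers the full entity set.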

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results is bigger than spark.driver.maxResultSize

Exception:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 122266 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

Cause: This happens when we collect a DataFrame / RDD on the driver and the total size of the serialized task results exceeds the limit set by the spark.driver.maxResultSize property (1024.0 MB in the trace above).

Solution: Raise the limit when submitting the job: --conf "spark.driver.maxResultSize=4g"
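If the job builds its own session rather than taking settings from the command line, the same property can be set in code. The following is a minimal sketch, assuming Spark 2.x or later on the classpath; the object name, app name, and local master are placeholders, and the property needs to be in place before the SparkContext starts for it to take effect.

```scala
import org.apache.spark.sql.SparkSession

object MaxResultSizeSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: no SparkSession exists yet in this JVM.
    val spark = SparkSession.builder()
      .appName("max-result-size-sketch") // placeholder name
      .master("local[*]")                // placeholder master for a local run
      // Same setting as --conf "spark.driver.maxResultSize=4g".
      .config("spark.driver.maxResultSize", "4g")
      .getOrCreate()

    // Read the value back to confirm what the session is using.
    println(spark.conf.get("spark.driver.maxResultSize"))

    spark.stop()
  }
}
```

Raising the limit also raises the memory the driver may need to hold the results, so where possible it can be preferable to collect less data on the driver, for example by writing the result out from the executors.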