

Set up DbVisualizer, DBeaver, DataGrip, or other tools to access a secure Kerberized Hadoop cluster from a remote Windows machine

If you are on Windows 10, you should already have utilities like kinit, ktutil, etc. available on your machine. If not, install MIT Kerberos - http://web.mit.edu/kerberos/

Here are the steps -

1. Copy /etc/krb5.conf from any node of your cluster to your local machine. Also, copy the *.keytab file from the cluster to your local machine.
2. Rename krb5.conf as krb5.ini.
3. Copy krb5.ini to -
   <Java_home>\jre\lib\security\
   C:\Users\<User_name>\
4. Copy the keytab file to -
   C:\Users\<User_name>\
5. On the Hadoop cluster, get the principal name from the keytab file -

   ktutil
   ktutil:  read_kt keytab
   ktutil:  list
   slot KVNO Principal
   ---- ---- ----------------------------------------------------------------------
      1    1 <username>@HADOOP.domain.COM
   ktutil:  quit

   The principal listed above (<username>@HADOOP.domain.COM) is your principal name.
6. On Windows, execute -

   kinit -k -t C:\Users\<User_name>\*.keytab <username>@HADOOP.domain.COM

   This will generate a Kerberos ticket for the user.
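Once kinit has produced a ticket, DbVisualizer / DBeaver / DataGrip only need a Hive JDBC URL that carries the cluster's service principal. As a rough illustration, the Scala sketch below opens the same kind of connection from code; the HiveServer2 host, the port 10000, and the hive/_HOST@HADOOP.domain.COM principal are assumptions that must be replaced with your cluster's values, the Hive JDBC driver has to be on the classpath, and authentication relies on the ticket created by kinit being visible to the JVM.

import java.sql.DriverManager

object HiveKerberosConnectionCheck {
  def main(args: Array[String]): Unit = {
    // Hive JDBC driver class (from the hive-jdbc jar) - loading it registers the driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Assumed HiveServer2 host/port and service principal - replace with your cluster's values.
    val url = "jdbc:hive2://hiveserver2.hadoop.domain.com:10000/default;" +
              "principal=hive/_HOST@HADOOP.domain.COM"

    // No user/password here - the Kerberos ticket obtained with kinit is used for authentication.
    val conn = DriverManager.getConnection(url)
    val rs   = conn.createStatement().executeQuery("SHOW DATABASES")
    while (rs.next()) println(rs.getString(1))
    conn.close()
  }
}

The same URL string is what goes into the connection dialog of the GUI tools.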

Un-escape HTML/XML character entities

1. The characters below are escape (entity) references in XML. For example -

   Escape Character    Actual Character
   &gt;                >
   &lt;                <
   &#x00E1;            Latin small letter a with acute (á)

2. To un-escape these characters, use the Apache Commons Lang library. For example -

   // line of code
   println(org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4("&lt; S&#x00E1;chdev Dinesh &gt;"))

   // This will result in -
   // < Sáchdev Dinesh >
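For a self-contained version, here is a small sketch that assumes commons-lang3 is on the classpath; unescapeXml is shown alongside unescapeHtml4 because the entities in the table above are also plain XML character references.

import org.apache.commons.lang3.StringEscapeUtils

object UnescapeExample {
  def main(args: Array[String]): Unit = {
    val escaped = "&lt; S&#x00E1;chdev Dinesh &gt;"

    // HTML4 named entities plus numeric character references.
    println(StringEscapeUtils.unescapeHtml4(escaped))  // < Sáchdev Dinesh >

    // XML predefined entities plus numeric character references.
    println(StringEscapeUtils.unescapeXml(escaped))    // < Sáchdev Dinesh >
  }
}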

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results is bigger than spark.driver.maxResultSize

Exception -

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 122266 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

Cause - This happens when we try to collect a DataFrame / RDD on the driver and the total size of the serialized results exceeds the limit set by the spark.driver.maxResultSize property.

Solution - Set:

--conf "spark.driver.maxResultSize=4g"
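The same limit can be raised when the session is built, as in the minimal sketch below (the app name, the 4g value, and the output path are only illustrative); often the better fix is to avoid pulling large results onto the driver at all and write them out from the executors instead.

import org.apache.spark.sql.SparkSession

object MaxResultSizeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("max-result-size-example")          // illustrative app name
      .config("spark.driver.maxResultSize", "4g")  // same effect as --conf on spark-submit
      .getOrCreate()

    // Prefer writing large results out instead of collecting them on the driver.
    val df = spark.range(1000000L).toDF("id")
    df.write.mode("overwrite").parquet("/tmp/max-result-size-example")  // illustrative output path

    spark.stop()
  }
}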

Copy a Phoenix (HBase) table with a different name

It can be done in 2 steps -

1. Create a new Phoenix table with a different name but the same schema as the existing table (a scripted example is sketched below).
2. Use the HBase command below, which runs a MapReduce job to copy the data -

   hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name="<Name of new HBase Table>" "<Name of existing HBase Table>"
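Step 1 can be scripted through the Phoenix JDBC driver; the sketch below is only an illustration - the ZooKeeper quorum zk1,zk2,zk3:2181, the table name NEW_TABLE, and its columns are placeholders, and the column list must match the existing table's schema exactly.

import java.sql.DriverManager

object CreatePhoenixCopyTarget {
  def main(args: Array[String]): Unit = {
    // Phoenix JDBC URL is jdbc:phoenix:<zookeeper quorum> - the quorum below is a placeholder.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")

    // Placeholder schema - replicate the existing table's columns and primary key here.
    conn.createStatement().executeUpdate(
      """CREATE TABLE IF NOT EXISTS NEW_TABLE (
        |  ID   VARCHAR PRIMARY KEY,
        |  COL1 VARCHAR,
        |  COL2 BIGINT
        |)""".stripMargin)

    conn.close()
  }
}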

Access AWS S3 or HCP HS3 (Hitachi) using Hadoop, HDFS, or DistCp

Create a credentials file for the S3 keys -

hadoop credential create fs.s3a.access.key -value <Access_KEY> -provider localjceks://file/$HOME/aws-dev-keys.jceks
hadoop credential create fs.s3a.secret.key -value <Secret_KEY> -provider localjceks://file/$HOME/aws-dev-keys.jceks

Where -
<Access_KEY> - S3 access key
<Secret_KEY> - S3 secret key

Note - this creates a file named aws-dev-keys.jceks in the home directory on the local file system. Put this file on HDFS for distributed access.

To list the stored entries, execute the command below -

hadoop credential list -provider localjceks://file/$HOME/aws-dev-keys.jceks

List files in an S3 bucket with the Hadoop shell -

hdfs dfs -Dhadoop.security.credential.provider.path=jceks://hdfs/myfilelocation/aws-dev-keys.jceks -ls s3a://s3bucketname/

or by passing the keys directly -

hdfs dfs -Dfs.s3a.access.key=<Access_KEY> -Dfs.s3a.secret.key=<Secret_KEY> -ls s3a://aa-daas-ookla/

Note - other hadoop / hdfs commands work similarly.
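The same jceks file can also back access from Spark; the sketch below points the job's Hadoop configuration at the credential provider and reads a path - the HDFS location of the jceks file and the bucket/prefix are placeholders, and hadoop-aws plus the AWS SDK must be on the classpath.

import org.apache.spark.sql.SparkSession

object S3ReadWithJceks {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3a-jceks-example")  // illustrative app name
      .getOrCreate()

    // Point the S3A connector at the jceks file holding fs.s3a.access.key / fs.s3a.secret.key.
    spark.sparkContext.hadoopConfiguration.set(
      "hadoop.security.credential.provider.path",
      "jceks://hdfs/myfilelocation/aws-dev-keys.jceks")  // placeholder HDFS path of the jceks file

    // Placeholder bucket/prefix - list or read whatever your job needs.
    val df = spark.read.text("s3a://s3bucketname/some/prefix/")
    df.show(10, truncate = false)

    spark.stop()
  }
}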