Skip to main content

Posts

Hive Parse JSON with Array Columns and Explode it in to Multiple rows.

 Say we have a JSON String like below -  { "billingCountry":"US" "orderItems":[       {          "itemId":1,          "product":"D1"       },   {          "itemId":2,          "product":"D2"       }    ] } And, our aim is to get output parsed like below -  itemId product 1 D1 2 D2   First, We can parse JSON as follows to get JSON String get_json_object(value, '$.orderItems.itemId') as itemId get_json_object(value, '$.orderItems.product') as product Second, Above will result String value like "[1,2]". We want to convert it to Array as follows - split(regexp_extract(get_json_object(value, '$.orderItems.itemId'),'^\\["(.*)\\"]$',1),'","') as itemId split(regexp_extract(get_json_object(value, '$.orderItems.product'),'^\\["(.*)\\"]$',1),&

Spark Disable BroadCast Join not working in case of BroadcastNestedLoopJoin

We were running an application which was leading to below error -  Caused by: org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1 at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:150) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:154) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:150) Even after setting the below property, which is to disable BroadCast Join. We kept getting above error again. Set spark.sql.autoBroadcastJoinThreshold=-1 On further analysis, we found that it is not a Bug in Spark. You expect the broadcast to stop after you disable the broadcast threshold, by setting spark.sql.autoBroadcastJoinThreshold to -1, but Apache Spark tries to broadc

Spark-JDBC connection with Oracle Fails - java.sql.SQLSyntaxErrorException: ORA-00903: invalid table name

  While connecting Spark with Oracle JDBC, one may observe exception like below -  spark.read.format("jdbc"). option("url", "jdbc:oracle:thin:@//oraclehost:1521/servicename"). option("dbtable", "mytable"). option("user", "myuser").option("driver", "oracle.jdbc.driver.OracleDriver") option("password", "mypassword"). load().write.parquet("/data/out") java.sql.SQLSyntaxErrorException: ORA-00903: invalid table name at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447) at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396) at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:951) at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513) at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227) at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531) at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:208)

Hadoop Jobs - javax.security.sasl.SaslException: GSS initiate failed

  If you see below exception while running Big Data Jobs - Spark, MR, Tez, etc. WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt) javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt) javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] Solution - The simplest way is to generate kerberos Keytab & initialize same in a user session before running job -  Ex -  kinit

TEZ - How to enable Fetch Task instead of MapReduce Job for simple query in Hive

  Certain simple Hive queries can utilize fetch task, which can avoid the overhead of starting MapReduce job. hive.fetch.task.conversion This parameter controls which kind of simple query can be converted to a single fetch task. Value "none" is added in Hive 0.14 to disable this feature Value "minimal" means SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only. hive.fetch.task.conversion.threshold This parameter controls input threshold (in bytes) for applying hive.fetch.task.conversion.