Skip to main content

Posts

Spark Job Failure reading empty gz file, Exception- java.io.EOFException: Unexpected end of input stream

  Spark Job fails to read data from Table which has empty/ corrupt / 0 size .GZ Files with exception as below. Exception -  Caused by: java.io.EOFException: Unexpected end of input stream at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:165) at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105) at java.io.InputStream.read(InputStream.java:101) at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182) at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218) Solution - Remove such 0 size GZ files, or Set following property - --conf spark.sql.files.ignoreCorruptFiles=true

Hive Parse JSON with Array Columns and Explode it in to Multiple rows.

 Say we have a JSON String like below -  { "billingCountry":"US" "orderItems":[       {          "itemId":1,          "product":"D1"       },   {          "itemId":2,          "product":"D2"       }    ] } And, our aim is to get output parsed like below -  itemId product 1 D1 2 D2   First, We can parse JSON as follows to get JSON String get_json_object(value, '$.orderItems.itemId') as itemId get_json_object(value, '$.orderItems.product') as product Second, Above will result String value like "[1,2]". We want to convert it to Array as follows - split(regexp_extract(get_json_object(value, '$.orderItems.itemId'),'^\\["(.*)\\"]$',1),'","') as itemId split(regexp_extract(get_json_object(value, '$.orderItems.product'),'^\\["(.*)\\"]$',1),&

Spark Disable BroadCast Join not working in case of BroadcastNestedLoopJoin

We were running an application which was leading to below error -  Caused by: org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1 at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:150) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:154) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:150) Even after setting the below property, which is to disable BroadCast Join. We kept getting above error again. Set spark.sql.autoBroadcastJoinThreshold=-1 On further analysis, we found that it is not a Bug in Spark. You expect the broadcast to stop after you disable the broadcast threshold, by setting spark.sql.autoBroadcastJoinThreshold to -1, but Apache Spark tries to broadc

Spark-JDBC connection with Oracle Fails - java.sql.SQLSyntaxErrorException: ORA-00903: invalid table name

  While connecting Spark with Oracle JDBC, one may observe exception like below -  spark.read.format("jdbc"). option("url", "jdbc:oracle:thin:@//oraclehost:1521/servicename"). option("dbtable", "mytable"). option("user", "myuser").option("driver", "oracle.jdbc.driver.OracleDriver") option("password", "mypassword"). load().write.parquet("/data/out") java.sql.SQLSyntaxErrorException: ORA-00903: invalid table name at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447) at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396) at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:951) at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513) at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227) at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531) at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:208)

Hadoop Jobs - javax.security.sasl.SaslException: GSS initiate failed

  If you see below exception while running Big Data Jobs - Spark, MR, Tez, etc. WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt) javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt) javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] Solution - The simplest way is to generate kerberos Keytab & initialize same in a user session before running job -  Ex -  kinit