Posts

Solving Jenkins Maven Build Xray Log4J Violations

What is Xray? Xray identifies open-source vulnerabilities when a dependency is downloaded from the cloud through Artifactory, or when an application that uses the vulnerable dependency is downloaded from Artifactory.

Recently, Xray scans started raising violations for my project, which blocked me from downloading build files from the repository. We were facing a problem related to Log4j:

Included in log4j 1.2 is a SocketServer class that is vulnerable to deserialization of untrusted data, which can be exploited to remotely execute arbitrary code when combined with a deserialization gadget while listening to untrusted network traffic for log data. This affects Log4j versions 1.2 up to 1.2.17.

Due to the above violation, we were not able to download the build JAR file from the repository -

    {
      "errors" : [ {
        "status" : 403,
        "message" : "Artifact download request rejected: com/myfile/myjarfile was not downloaded due to the download blocking policy configured in Xray"
      } ]
    }
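The usual remediation is to stop shipping Log4j 1.2.x altogether. Below is a minimal pom.xml sketch of that idea, not the exact fix from this post (the dependency pulling in log4j transitively is hypothetical): swap in reload4j, a drop-in fork of log4j 1.2 that removes the vulnerable SocketServer (CVE-2019-17571), and exclude log4j wherever it arrives transitively.

    <!-- reload4j: API-compatible replacement for log4j 1.2.x without the
         SocketServer deserialization issue (CVE-2019-17571) -->
    <dependency>
      <groupId>ch.qos.reload4j</groupId>
      <artifactId>reload4j</artifactId>
      <version>1.2.25</version>
    </dependency>

    <!-- hypothetical dependency that drags in log4j 1.2 transitively -->
    <dependency>
      <groupId>com.example</groupId>
      <artifactId>some-library</artifactId>
      <version>1.0.0</version>
      <exclusions>
        <exclusion>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

Upgrading to Log4j 2.x is the other common route; either way, the Xray policy stops matching once no log4j 1.2.x artifact appears in the build.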

Spark Kudu - Caused by: org.apache.spark.sql.AnalysisException: Cannot up cast Database.Table.Column from string to bigint as it may truncate

Spark Exception -

    Caused by: org.apache.spark.sql.AnalysisException: Cannot up cast <Database.TableName>.`ColumnName` from string to bigint as it may truncate
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$17.apply(CheckAnalysis.scala:339)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$17.apply(CheckAnalysis.scala:331)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:331)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
        ...
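This error typically surfaces when a DataFrame is converted to a typed Dataset whose case class declares a Long (bigint) field for a column that arrives as a string: Spark refuses the implicit narrowing conversion. A minimal Scala sketch of the usual fix, with a hypothetical case class and column name (not this post's actual schema): cast the column explicitly before calling .as[...].

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    // Hypothetical case class: the field is declared Long (bigint) while
    // the source column arrives as a string.
    case class Record(id: Long)

    object UpcastFix {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("upcast-fix").master("local[*]").getOrCreate()
        import spark.implicits._

        val df = Seq("1", "2").toDF("id") // id arrives as string

        // df.as[Record] would fail with "Cannot up cast `id` from string to
        // bigint as it may truncate". An explicit cast accepts that risk:
        val ds = df.withColumn("id", col("id").cast("bigint")).as[Record]
        ds.show()
      }
    }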

Spark Job Failure reading empty gz file, Exception - java.io.EOFException: Unexpected end of input stream

A Spark job fails to read data from a table that contains empty / corrupt / 0-size .GZ files, with the exception below.

Exception -

    Caused by: java.io.EOFException: Unexpected end of input stream
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:165)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
        at java.io.InputStream.read(InputStream.java:101)
        at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)

Solution - Remove such 0-size GZ files, or set the following property:

    --conf spark.sql.files.ignoreCorruptFiles=true
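The same property can also be set when the session is built instead of on the spark-submit command line. A minimal Scala sketch (the input path is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("read-gz")
      // Skip empty/corrupt .gz parts instead of failing the whole job.
      .config("spark.sql.files.ignoreCorruptFiles", "true")
      .getOrCreate()

    val lines = spark.read.textFile("/data/input/*.gz") // hypothetical path
    println(lines.count())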

Hive Parse JSON with Array Columns and Explode it into Multiple Rows.

Say we have a JSON string like below -

    {
      "billingCountry": "US",
      "orderItems": [
        { "itemId": 1, "product": "D1" },
        { "itemId": 2, "product": "D2" }
      ]
    }

And our aim is to get output parsed like below -

    itemId  product
    1       D1
    2       D2

First, we can parse the JSON as follows to get the JSON strings:

    get_json_object(value, '$.orderItems.itemId') as itemId
    get_json_object(value, '$.orderItems.product') as product

Second, the above results in string values like "[1,2]". We want to convert them to arrays as follows -

    split(regexp_extract(get_json_object(value, '$.orderItems.itemId'),'^\\["(.*)\\"]$',1),'","') as itemId
    split(regexp_extract(get_json_object(value, '$.orderItems.product'),'^\\["(.*)\\"]$',1),'","') as product
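To finish the explode step from the title, the two arrays can be zipped back into rows with posexplode and a position match. A minimal end-to-end HiveQL sketch, assuming a hypothetical table raw_json(value STRING) holding one JSON document per row; note the quote-stripping regex above fits string arrays like ["D1","D2"], so the numeric itemId array [1,2] is split here after just trimming the brackets:

    SELECT itemId, product
    FROM (
      SELECT
        -- numeric array [1,2]: trim the surrounding brackets, then split on ','
        split(regexp_replace(get_json_object(value, '$.orderItems.itemId'), '^\\[|\\]$', ''), ',') AS itemIds,
        -- string array ["D1","D2"]: strip outer quotes/brackets, then split on '","'
        split(regexp_extract(get_json_object(value, '$.orderItems.product'), '^\\["(.*)\\"]$', 1), '","') AS products
      FROM raw_json
    ) t
    LATERAL VIEW posexplode(t.itemIds) i AS ipos, itemId
    LATERAL VIEW posexplode(t.products) p AS ppos, product
    WHERE ipos = ppos;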

Spark Disable Broadcast Join not working in case of BroadcastNestedLoopJoin

We were running an application which was failing with the error below -

    Caused by: org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
        at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:150)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:154)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:150)

Even after setting the property below, which is supposed to disable broadcast joins, we kept getting the error above:

    Set spark.sql.autoBroadcastJoinThreshold=-1

On further analysis, we found that this is not a bug in Spark. You expect broadcasting to stop once you disable the broadcast threshold by setting spark.sql.autoBroadcastJoinThreshold to -1, but Apache Spark still tries to broadcast one side of the join when the query has no equi-join condition: the threshold only governs broadcast hash joins, and for a non-inner join without join keys BroadcastNestedLoopJoin is the only strategy the planner can pick.
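A minimal Scala sketch of the behavior (table names and sizes are hypothetical): a join whose condition has no equality component leaves the planner no alternative to BroadcastNestedLoopJoin, so the threshold setting changes nothing; raising spark.sql.broadcastTimeout or rewriting the query around an equi-join key are the practical ways out.

    import org.apache.spark.sql.SparkSession

    object BnljDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("bnlj-demo").master("local[*]").getOrCreate()
        import spark.implicits._

        // Disables broadcast *hash* joins only.
        spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

        val big   = (1 to 1000000).toDF("id")
        val small = (1 to 10).toDF("bound")

        // No equi-join condition, so the planner can only pick
        // BroadcastNestedLoopJoin for this left outer join.
        val joined = big.join(small, big("id") < small("bound"), "left")
        joined.explain() // shows BroadcastNestedLoopJoin despite threshold = -1

        // Practical mitigation: give the broadcast more time,
        // or rewrite the query around an equi-join key.
        spark.conf.set("spark.sql.broadcastTimeout", "1200")
      }
    }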