
Posts

Spark HBase Connector (SHC) vs HBase-Spark Connector, Cloudera vs Hortonworks

Several integrations for accessing HBase from Spark have been developed over time. The first experimental connector, Spark on HBase, was built by Cloudera Professional Services. Cloudera shipped a derivative of this community version, called hbase-spark, in both CDH 5 and CDH 6. Hortonworks came up with its own implementation, called SHC (Spark HBase Connector), which was supported on HDP. With CDP 7, however, the Spark HBase Connector (SHC) is no longer supported in CDP. Refer to https://docs.cloudera.com/runtime/7.2.0/hbase-overview/topics/hbase-on-cdp.html

Compatibility of the implementations:

Implementation     Spark version   Distribution
hbase-spark        1.6             CDH 5
hbase-spark        2.4             CDH 6
hbase-spark        2.3             HDP 2.6, HDP 3.1
SHC                2.3             HDP 2.6, HDP 3.1
hbase-connectors   2.4             CDP

Reference - https://community.cloudera.com/t5/Community-Articles/HBase-Spark-in-CDP/ta-p/294868
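For orientation, below is a minimal sketch of reading an HBase table as a DataFrame through the hbase-spark / hbase-connectors data source shipped with CDH and CDP. The table name "person" and the column mapping are illustrative only, and the option names are taken, as an assumption, from that connector's documented usage.

import org.apache.spark.sql.SparkSession

object HBaseSparkReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hbase-spark-read")
      .getOrCreate()

    // Read the (illustrative) HBase table "person" as a DataFrame.
    // The mapping exposes the row key as "name" and c:email as "email".
    val df = spark.read
      .format("org.apache.hadoop.hbase.spark")
      .option("hbase.table", "person")
      .option("hbase.columns.mapping", "name STRING :key, email STRING c:email")
      .option("hbase.spark.use.hbasecontext", false)
      .load()

    df.show()
  }
}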

Solving Jenkins Maven Build Xray Log4J Violations

What is Xray?

Xray identifies open-source vulnerabilities when a dependency is downloaded from the cloud through Artifactory, or when an application that uses the vulnerable dependency is downloaded from Artifactory.

Recently, Xray scans started reporting violations for my project, which stopped me from downloading build files from the repository. We were facing problems related to Log4J:

Included in Log4j 1.2 is a SocketServer class that is vulnerable to deserialization of untrusted data, which can be exploited to remotely execute arbitrary code when combined with a deserialization gadget while listening to untrusted network traffic for log data. This affects Log4j versions 1.2 up to 1.2.17.

Due to the above violation, we were not able to download the build jar file from the repository -

{
  "errors" : [ {
    "status" : 403,
    "message" : "Artifact download request rejected: com/myfile/myjarfile was not downloaded due to the download blocking policy configured in Xray

Spark Kudu - Caused by: org.apache.spark.sql.AnalysisException: Cannot up cast Database. Table. Column from string to bigint as it may truncate

Spark Exception -

Caused by: org.apache.spark.sql.AnalysisException: Cannot up cast <Database.TableName>.`ColumnName` from string to bigint as it may truncate
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$17.apply(CheckAnalysis.scala:339)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$17.apply(CheckAnalysis.scala:331)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:331)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$for
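For context, this AnalysisException typically comes from Spark's analyzer when a column read as STRING is bound to a wider numeric type, for example through a typed Dataset via .as[T]. Below is a minimal sketch of one common fix, with hypothetical table, column, and case class names: cast the column explicitly (or align the case class field type) so no lossy up-cast is required.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical target type: orderId is expected as bigint (Long)
case class OrderRow(orderId: Long, product: String)

object UpcastFixSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("upcast-fix").getOrCreate()
    import spark.implicits._

    // Hypothetical source table where order_id is stored as string
    val raw = spark.table("db.my_table")

    // Explicit cast avoids "Cannot up cast ... from string to bigint as it may truncate"
    val ds = raw
      .select(col("order_id").cast("bigint").as("orderId"), col("product"))
      .as[OrderRow]

    ds.show()
  }
}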

Spark Job Failure reading empty gz file, Exception- java.io.EOFException: Unexpected end of input stream

Spark jobs fail to read data from a table that contains empty, corrupt, or zero-byte .gz files, with an exception like the one below.

Exception -

Caused by: java.io.EOFException: Unexpected end of input stream
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:165)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
        at java.io.InputStream.read(InputStream.java:101)
        at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)

Solution - Remove such zero-size .gz files, or set the following property -

--conf spark.sql.files.ignoreCorruptFiles=true
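If removing the files is not practical, the same setting can be applied when building the SparkSession instead of passing --conf on spark-submit. The input path below is illustrative only.

import org.apache.spark.sql.SparkSession

object IgnoreCorruptGzSketch {
  def main(args: Array[String]): Unit = {
    // Enable skipping of corrupt / zero-byte files at session build time
    val spark = SparkSession.builder()
      .appName("ignore-corrupt-gz")
      .config("spark.sql.files.ignoreCorruptFiles", "true")
      .getOrCreate()

    // Truncated or empty .gz files under this (illustrative) path are skipped
    // instead of failing the job with java.io.EOFException.
    val df = spark.read.text("/data/landing/*.gz")
    println(df.count())
  }
}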

Hive Parse JSON with Array Columns and Explode it into Multiple Rows.

Say we have a JSON string like below -

{
  "billingCountry":"US",
  "orderItems":[
    {
      "itemId":1,
      "product":"D1"
    },
    {
      "itemId":2,
      "product":"D2"
    }
  ]
}

And our aim is to get output parsed like below -

itemId  product
1       D1
2       D2

First, we can parse the JSON as follows to get the JSON strings -

get_json_object(value, '$.orderItems.itemId') as itemId
get_json_object(value, '$.orderItems.product') as product

Second, the above results in a string value like "[1,2]". We want to convert it to an array as follows -

split(regexp_extract(get_json_object(value, '$.orderItems.itemId'),'^\\["(.*)\\"]$',1),'","') as itemId
split(regexp_extract(get_json_object(value, '$.orderItems.product'),'^\\["(.*)\\"]$',1),'","') as product
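As a final step, the two parallel arrays need to be exploded into rows so each itemId lines up with its product. Below is a minimal Spark (2.4+) DataFrame sketch of that step, with the arrays hard-coded as hypothetical sample data; in Hive the same pairing can be done with posexplode.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ExplodeJsonArraysSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explode-json-arrays").getOrCreate()
    import spark.implicits._

    // Parallel arrays, as produced by the split(regexp_extract(...)) expressions above
    val parsed = Seq((Seq("1", "2"), Seq("D1", "D2"))).toDF("itemIds", "products")

    // arrays_zip pairs the arrays element by element; explode turns each pair into a row
    val exploded = parsed
      .withColumn("pair", explode(arrays_zip($"itemIds", $"products")))
      .select($"pair.itemIds".as("itemId"), $"pair.products".as("product"))

    exploded.show()
    // itemId  product
    // 1       D1
    // 2       D2
  }
}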