
Posts

Showing posts from 2021

Spark Hive ORC Exception Caused by: java.util.concurrent.ExecutionException: com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.

Exception -

Caused by: java.util.concurrent.ExecutionException: com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1865)
        ... 17 more
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
        at com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:99)

Reason - You might receive the above error while performing SQL operations with Spark or Hive. It occurs because some ORC files on HDFS are corrupt, or were written with an unsupported ORC version.

Solution - Identify and remove the corrupt or incorrect files from HDFS. Or, with Spark, you can ignore such files by setting the following property -

set spark.sql.hive.convertMetastoreOrc=true
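A minimal spark-shell (Scala) sketch of setting that property before querying, so Spark uses its native ORC reader instead of the Hive OrcInputFormat shown in the stack trace; the table name below is a hypothetical placeholder.

// Set before reading the Hive ORC table so Spark uses its built-in ORC reader
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
val df = spark.sql("SELECT * FROM my_db.my_orc_table")  // hypothetical table name
df.show()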

Log4J JNDI Vulnerability

This post is an extension of https://querydb.blogspot.com/2021/09/solving-jenkins-maven-build-xray-log4j.html. Apart from the fix discussed there, it is required to upgrade Log4J to 2.15.0 or above because of the JNDI attack. Refer to the figure below to understand the deserialization of untrusted data, which can be exploited to remotely execute arbitrary code. Certain posts suggest setting the property log4j2.formatMsgNoLookups. But this is a serious vulnerability; you shouldn't rely on such workarounds and should upgrade the Log4j jars instead. Refer to https://logging.apache.org/log4j/2.x/security.html: "A new CVE (CVE-2021-45046, see above) was raised for this. Other insufficient mitigation measures are: setting system property log4j2.formatMsgNoLookups or environment variable LOG4J_FORMAT_MSG_NO_LOOKUPS to true for releases >= 2.10, or modifying the logging configuration to disable message lookups

Run Kafka Console Consumer with Secured Kafka

1) Create jaas.conf

KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  doNotPrompt=true
  useTicketCache=false
  principal="principalName@domain"
  useKeyTab=true
  serviceName="kafka"
  keyTab="my.keytab"
  client=true;
};
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  doNotPrompt=true
  useTicketCache=false
  principal="principalName@domain"
  useKeyTab=true
  serviceName="kafka"
  keyTab="my.keytab"
  client=true;
};

2) Create consumer.properties

sasl.mechanism=GSSAPI
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
ssl.truststore.location=truststore.jks
ssl.truststore.password=changeit
group.id=consumer-group-name2

3) Execute the following -

> export KAFKA_OPTS="-Djava.security.auth.login.config=/path/to/your/jaas.conf"
> sh kafka-console-consumer.sh --bootstrap-server kafkabroker.charter.com:6668 --topic TopicName --new-consumer --from-beginning --consumer.config /path/to/consumer.properties

SSH Issue: no matching key exchange method found

While doing SSH, we received the below error -

Unable to negotiate with 22.33.18.90 port 22: no matching key exchange method found. Their offer: diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1

Solution - Update the ssh command as below -

ssh -oKexAlgorithms=+diffie-hellman-group1-sha1 username@22.33.18.90

Or, permanently add the below to ~/.ssh/config -

Host 22.33.18.90
  KexAlgorithms +diffie-hellman-group1-sha1

Talend TSSH Component Issue - Cannot negotiate, proposals do not match

Talend TSSH Component failing with an error as below -

java.io.IOException: There was a problem while connecting to localhost:22
        at ch.ethz.ssh2.Connection.connect(Connection.java:805)
        at ch.ethz.ssh2.Connection.connect(Connection.java:595)
        ......
Caused by: java.io.IOException: Key exchange was not finished, connection is closed.
        at ch.ethz.ssh2.transport.KexManager.getOrWaitForConnectionInfo(KexManager.java:78)
        at ch.ethz.ssh2.transport.TransportManager.getConnectionInfo(TransportManager.java:281)
        at ch.ethz.ssh2.Connection.connect(Connection.java:761)
        ... 5 more
Caused by: java.io.IOException: Cannot negotiate, proposals do not match.
        at ch.ethz.ssh2.transport.ClientKexManager.handleMessage(ClientKexManager.java:123)
        at ch.ethz.ssh2.transport.TransportManager.receiveLoop(TransportManager.java:941)
        at ch.ethz.ssh2.transport.TransportManager$1.run(TransportManager.java:510)
        at java.lang.Thread.run

WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED

While trying SFTP/SSH, one may observe the below error:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the RSA key sent by the remote host is SHA256:7nYKseap6dAOZ2jb+ExnHUluNtqbz46AUXw14NCO1hk.
Please contact your system administrator.
Add correct host key in /home/myuser/.ssh/known_hosts to get rid of this message.
Offending RSA key in /home/myuser/.ssh/known_hosts:42
RSA host key for mysftp.sftp.com has changed and you have requested strict checking.
Host key verification failed.
Couldn't read packet: Connection reset by peer

Solution - Do one of the following -

Use ssh-keygen to delete the invalid key:
ssh-keygen -R "your server hostname"

Spark HBase Connector CDP Issue - java.lang.ClassNotFoundException: org.apache.hadoop.hbase.spark.SparkSQLPushDownFilter

We wrote Spark code to read data from HBase using the HBase connector as below -

val sql = spark.sqlContext

val df = sql.read.format("org.apache.hadoop.hbase.spark")
  .option("hbase.columns.mapping",
    "name STRING :key, email STRING c:email, " +
      "birthDate DATE p:birthDate, height FLOAT p:height")
  .option("hbase.table", "person")
  .option("hbase.spark.use.hbasecontext", false)
  .load()

df.createOrReplaceTempView("personView")

val results = sql.sql("SELECT * FROM personView")
results.show()

The above code works fine. But, if we add a where clause to the SQL above, it gives an error as below -

val results = sql.sql("SELECT * FROM personView where name='Jaiganesh'")
results.show()

Error -

Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.spark.SparkSQLPushDownFilter

Spark HBase Connector (SHC) vs HBase-Spark Connector, Cloudera vs Hortonworks

Several integrations for accessing HBase from Spark have been developed over time. The first experimental connector was developed by Cloudera Professional Services and was called Spark on HBase. Cloudera included a derivative of this community version (called hbase-spark) in both CDH 5 and CDH 6. Hortonworks also came up with an implementation called SHC (Spark HBase Connector). SHC was supported by HDP and CDP. But with CDP 7, the Spark HBase Connector (SHC) is no longer supported in CDP. Refer https://docs.cloudera.com/runtime/7.2.0/hbase-overview/topics/hbase-on-cdp.html

Refer below for compatibility -

Implementation      Spark    Distribution
hbase-spark         1.6      CDH 5
hbase-spark         2.4      CDH 6
hbase-spark         2.3      HDP 2.6, HDP 3.1
SHC                 2.3      HDP 2.6, HDP 3.1
hbase-connectors    2.4      CDP

Reference - https://community.cloudera.com/t5/Community-Articles/HBase-Spark-in-CDP/ta-p/294868

Solving Jenkins Maven Build Xray Log4J Violations

What is Xray - Xray identifies open source vulnerabilities when downloading a dependency from the cloud through Artifactory, or when downloading an application from Artifactory that uses the vulnerable dependency.

Recently, Xray scans started reporting violations for my project, which stopped me from downloading build files from the repository. We were facing problems related to Log4J:

"Included in Log4j 1.2 is a SocketServer class that is vulnerable to deserialization of untrusted data, which can be exploited to remotely execute arbitrary code when combined with a deserialization gadget when listening to untrusted network traffic for log data. This affects Log4j versions up to 1.2.17."

Due to the above violation, we were not able to download the build jar file from the repository -

{
  "errors" : [ {
    "status" : 403,
    "message" : "Artifact download request rejected: com/myfile/myjarfile was not downloaded due to the download blocking policy configured in Xray

Spark Kudu - Caused by: org.apache.spark.sql.AnalysisException: Cannot up cast Database. Table. Column from string to bigint as it may truncate

Spark Exception -

Caused by: org.apache.spark.sql.AnalysisException: Cannot up cast <Database.TableName>.`ColumnName` from string to bigint as it may truncate
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$17.apply(CheckAnalysis.scala:339)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$17.apply(CheckAnalysis.scala:331)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:331)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:127)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$for
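The excerpt cuts off before any fix, so as a hedged, generic illustration (not necessarily the post's Kudu setup): this class of error typically appears when a DataFrame column arrives as a string but the target Dataset field is a Long, and an explicit cast avoids the refused implicit up cast. The case class and data below are hypothetical; this is a spark-shell sketch.

import spark.implicits._
import org.apache.spark.sql.functions.col

// Hypothetical case class: "id" is a Long, but the source column is a string
case class Record(id: Long, name: String)

val raw = Seq(("1", "a"), ("2", "b")).toDF("id", "name")

// raw.as[Record] would fail with: Cannot up cast `id` from string to bigint as it may truncate
val fixed = raw.withColumn("id", col("id").cast("bigint")).as[Record]
fixed.show()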

Spark Job Failure reading empty gz file, Exception- java.io.EOFException: Unexpected end of input stream

Spark job fails to read data from a table which has empty/corrupt/0-size .GZ files, with an exception as below.

Exception -

Caused by: java.io.EOFException: Unexpected end of input stream
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:165)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
        at java.io.InputStream.read(InputStream.java:101)
        at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:182)
        at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:218)

Solution - Remove such 0-size GZ files, or set the following property -

--conf spark.sql.files.ignoreCorruptFiles=true
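A minimal Scala sketch of setting the same property from within a Spark session instead of on spark-submit; the table name is a hypothetical placeholder.

// Skip corrupt or truncated files (including 0-byte .gz) instead of failing the job
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val df = spark.sql("SELECT * FROM my_db.my_gz_table")  // hypothetical table name
df.count()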

Hive Parse JSON with Array Columns and Explode it in to Multiple rows.

Say we have a JSON string like below -

{
  "billingCountry":"US",
  "orderItems":[
    {
      "itemId":1,
      "product":"D1"
    },
    {
      "itemId":2,
      "product":"D2"
    }
  ]
}

And our aim is to get output parsed like below -

itemId  product
1       D1
2       D2

First, we can parse the JSON as follows to get the JSON string values -

get_json_object(value, '$.orderItems.itemId') as itemId
get_json_object(value, '$.orderItems.product') as product

Second, the above results in string values like "[1,2]". We want to convert them to arrays as follows -

split(regexp_extract(get_json_object(value, '$.orderItems.itemId'),'^\\["(.*)\\"]$',1),'","') as itemId
split(regexp_extract(get_json_object(value, '$.orderItems.product'),'^\\["(.*)\\"]$',1),'","') as product
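For comparison, a hedged Spark (Scala) sketch of the same parsing using from_json and explode instead of the post's Hive string functions; the schema and column names are assumptions based on the sample JSON above, and this runs in spark-shell.

import spark.implicits._
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Schema matching the sample JSON shown above
val schema = new StructType()
  .add("billingCountry", StringType)
  .add("orderItems", ArrayType(new StructType()
    .add("itemId", IntegerType)
    .add("product", StringType)))

val json = """{"billingCountry":"US","orderItems":[{"itemId":1,"product":"D1"},{"itemId":2,"product":"D2"}]}"""
val df = Seq(json).toDF("value")

// Explode the parsed array so each order item becomes its own row
val items = df
  .select(explode(from_json(col("value"), schema).getField("orderItems")).as("item"))
  .select(col("item.itemId"), col("item.product"))

items.show()  // two rows: (1, D1) and (2, D2)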

Spark Disable BroadCast Join not working in case of BroadcastNestedLoopJoin

We were running an application which was leading to the below error -

Caused by: org.apache.spark.SparkException: Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
        at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:150)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:154)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:150)

Even after setting the below property, which is supposed to disable broadcast joins, we kept getting the above error.

Set spark.sql.autoBroadcastJoinThreshold=-1

On further analysis, we found that it is not a bug in Spark. You expect the broadcast to stop after you disable the broadcast threshold by setting spark.sql.autoBroadcastJoinThreshold to -1, but Apache Spark still tries to broadcast the table when the plan uses a BroadcastNestedLoopJoin.
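A brief Scala sketch of the two settings named in the error message; raising the timeout is only a stopgap if the plan still chooses a BroadcastNestedLoopJoin, which is the point of this post.

// Disable size-based broadcast joins
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
// Or give the broadcast more time than the 300-second default
spark.conf.set("spark.sql.broadcastTimeout", "1200")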

Spark-JDBC connection with Oracle Fails - java.sql.SQLSyntaxErrorException: ORA-00903: invalid table name

While connecting Spark with Oracle over JDBC, one may observe an exception like below -

spark.read.format("jdbc").
  option("url", "jdbc:oracle:thin:@//oraclehost:1521/servicename").
  option("dbtable", "mytable").
  option("user", "myuser").
  option("driver", "oracle.jdbc.driver.OracleDriver").
  option("password", "mypassword").
  load().write.parquet("/data/out")

java.sql.SQLSyntaxErrorException: ORA-00903: invalid table name
        at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447)
        at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
        at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:951)
        at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513)
        at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)
        at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
        at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:208)

Hadoop Jobs - javax.security.sasl.SaslException: GSS initiate failed

If you see the below exception while running Big Data jobs - Spark, MR, Tez, etc. -

WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)

Solution - The simplest way is to generate a Kerberos keytab and initialize it in the user session before running the job. For example -

kinit

TEZ - How to enable Fetch Task instead of MapReduce Job for simple query in Hive

Certain simple Hive queries can utilize a fetch task, which avoids the overhead of starting a MapReduce job.

hive.fetch.task.conversion
This parameter controls which kinds of simple query can be converted to a single fetch task. The value "none" was added in Hive 0.14 to disable this feature. The value "minimal" means SELECT *, FILTER on partition columns (WHERE and HAVING clauses), and LIMIT only.

hive.fetch.task.conversion.threshold
This parameter controls the input threshold (in bytes) for applying hive.fetch.task.conversion.

Validate Emails using Python

Validate emails using Python - https://pypi.org/project/email-validator/

Installation

pip install email-validator

Usage

The below script reads a list of emails from the file "test.emails", then loops over and validates each email in the file.

#!/usr/bin/python
from email_validator import validate_email, EmailNotValidError

filename = "/home/dinesh/setuptools-7.0/test.emails"
total_count = 0
valid_count = 0
invalid_count = 0

with open(filename, "r") as a_file:
    for line in a_file:
        stripped_line = line.strip()
        print(stripped_line)
        total_count = total_count + 1
        try:
            # Validate.
            valid = validate_email(stripped_line)
            valid_count = valid_count + 1
            # Update with the normalized form.
            # email = valid.email
        except EmailNotValidError as e:
            # Email is not valid; exception message is human-readable
            print(str(e))
            invalid_count = invalid_count + 1

print("Total Count " + str(total_count))
print("Valid Count " + str(valid_count))
print("Invalid Count " + str(invalid_count))

Spark HBase Connector - Don't support IN Clause

We came across a scenario using "shc-core-1.1.0.3.1.5.0-152.jar". A Spark DataFrame was created on one of the HBase tables. We queried this DataFrame like "select * from df where col in ('A', 'B', 'C')" and found that the filter on col is not working. But, if we re-write the same SQL like "select * from df where col = 'A' or col = 'B' or col = 'C'" then it works.
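A hedged Scala sketch of generating the OR-based rewrite from a list of values, assuming a temp view named df and a column named col as in the post's example.

val values = Seq("A", "B", "C")
// Build "col = 'A' or col = 'B' or col = 'C'" rather than an IN list
val predicate = values.map(v => s"col = '$v'").mkString(" or ")
val results = spark.sql(s"select * from df where $predicate")
results.show()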

Copy code of Git Repo in to a different Git Repo with History Commits

1) Git clone
git clone <url to Source repo> temp-dir

2) Check different branches
git branch -a

3) Checkout all the branches that you want to copy
git checkout branch-name

4) Fetch all the tags
git fetch --tags

5) Clear the link to the Source repo
git remote rm origin

6) Link your local repository to your newly created NEW repository
git remote add origin <url to NEW repo>

7) Push all your branches and tags with these commands:
git push origin --all
git push --tags

8) Above steps complete the copy from the Source repo to the New repo.

Spark -Teradata connection Issues

Exception:

Caused by: java.lang.NullPointerException
        at com.teradata.tdgss.jtdgss.TdgssConfigApi.GetMechanisms(Unknown Source)
        at com.teradata.tdgss.jtdgss.TdgssManager.<init>(Unknown Source)
        at com.teradata.tdgss.jtdgss.TdgssManager.<clinit>(Unknown Source)

Brief: tdgssconfig.jar can't be found on the classpath. Please add it to the classpath.

Exception:

java.sql.SQLException: [Teradata Database] [TeraJDBC 15.10.00.33] [Error 3707] [SQLState 42000] Syntax error, expected something like a name or a Unicode delimited identifier or an 'UDFCALLNAME' keyword or '(' between the 'FROM' keyword and the 'SELECT' keyword.

Brief: Normally Spark JDBC expects the dbtable property to be a table name, so internally it prepends "select * from" to it, like:
select * from <Table Name>
But if we specify SQL instead of a table name, then internally the SQL will become something like:
select * from select ... ;
Above m
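The excerpt cuts off before its fix; as a hedged, commonly used pattern (not necessarily what the post goes on to recommend), wrapping the SQL in a parenthesized derived table with an alias keeps Spark's generated "select * from ..." valid. The host, database, credentials, and query below are hypothetical placeholders.

val query = "(select col1, col2 from my_db.my_table where col1 > 10) t"  // derived table with an alias
val df = spark.read.format("jdbc")
  .option("url", "jdbc:teradata://teradatahost/DATABASE=my_db")
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("dbtable", query)
  .option("user", "myuser")
  .option("password", "mypassword")
  .load()
df.show()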

Splunk Data to Hadoop Ingestion

One of the approaches to get data from Splunk to Hadoop is to use the REST API provided by Splunk, such that data is periodically ingested into the Hadoop Data Lake. A simple command like below can help in such a scenario:

curl -u '<username>:<password>' \
  -k https://splunkhost:8089/services/search/jobs/export \
  -d search="search index=myindex | head 10" \
  -d output_mode=raw \
  | hdfs dfs -put -f - <HDFS_DIR>

The above command will get the top 10 rows from the Splunk index "myindex" and ingest them into the Hadoop Data Lake.

Sqoop Import: New Line Character in one of the column value

Sometimes data produced by Sqoop import may contain newline characters. This may cause failures when reading the data back correctly. To resolve this, follow either of the below solutions:

Specify the following options with Sqoop:
--map-column-java <Column name that contains New Line>=String
--hive-drop-import-delims

Or, update the Sqoop SQL and select the column with a regex replacement, like:
regexp_replace(<Column name that contains New Line>, '[[:space:]]+', ' ')

SASL Exception on HDP Sandbox while running Pig action via Oozie

While running Pig scripts via Oozie, you might face a SASL exception (even though Kerberos might be disabled). To resolve this, just comment out the following lines in hive-site.xml and then upload it to "oozie.wf.application.path" -

<!--property>
       <name>hive.metastore.kerberos.keytab.file</name>
       <value>/etc/security/keytabs/hive.service.keytab</value>
</property>
<property>
       <name>hive.metastore.kerberos.principal</name>
       <value>hive/_HOST@EXAMPLE.COM</value>
</property-->

Hive Partitioned view Errors: IndexOutOfBoundsException

Setup:
Created table t1: create table t1 (name string) PARTITIONED BY (c1 string)
Created view v1: create view v1 PARTITIONED ON (c1) as select * from t1
Inserted data into t1 with partition c1=A
Did an ALTER VIEW to add the partition to v1

Noticed that:
show partitions v1 - works fine
show create table v1 - doesn't show that the view is partitioned
select * from v1 - gives error "FAILED: IndexOutOfBoundsException"
select * from v1 where c1 like '%%' - gives error "FAILED: IndexOutOfBoundsException"
select * from v1 where c1='A' - works fine if the partition column is specified

HDP Sandbox : Oozie web console is disabled.

If you see the below message:

"To enable Oozie web console install the Ext JS library. Refer to Oozie Quick Start documentation for details."

Solution:
1) wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip
2) Stop the Oozie service from Ambari
3) Copy the zip to the path /usr/hdp/current/oozie-client/libext
4) Regenerate the war file by executing: /usr/hdp/current/oozie-server/bin/oozie-setup.sh prepare-war
5) Start Oozie again

Beeline (Hive) - Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Exception -

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
        at org.apache.hive.beeline.BeeLine.getConsoleReader(BeeLine.java:905)
        at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:792)

Issue - https://issues.apache.org/jira/browse/HIVE-10836
The issue is caused by a large Beeline history file (<user_home>/.beeline/history).

Analysis -
1. If we do beeline -e "select 1;" then nothing goes to .beeline/history
2. If we do beeline -f test.sql then nothing goes to .beeline/history
3. If we do cat test.sql | beeline then data goes into .beeline/history
4. If we do beeline <<< "select 124;" then data goes into .beeline/history

Solution - Clean up the history file.

Spark Error - missing part 0 of the schema, 2 parts are expected

Exception -

Caused by: org.apache.spark.sql.AnalysisException: Could not read schema from the hive metastore because it is corrupted. (missing part 0 of the schema, 2 parts are expected).;

Analysis -
Check the table definition. In TBLPROPERTIES, you might find entries like these -
spark.sql.sources.schema.numPartCols
spark.sql.sources.schema.numParts
spark.sql.sources.schema.part.0
spark.sql.sources.schema.part.1
spark.sql.sources.schema.part.2
spark.sql.sources.schema.partCol.0
spark.sql.sources.schema.partCol.1
That is what the error seems to say: part 1 is defined but part 0 is missing.

Solution -
Drop and re-create the table. If the table was partitioned, then all partitions would have been removed, so do either of the below -
msck repair table <db_name>.<table_name>
alter table <db_name>.<table_name> add partition (<partition_name>)