
Posts

Hadoop Distcp Error: Duplicate files in input path

One may face the following error while copying data from one cluster to another using Distcp.

Command:
hadoop distcp -i {src} {tgt}

Error:
org.apache.hadoop.tools.CopyListing$DuplicateFileException: File would cause duplicates.

Ideally, two files in the same directory can't share a name. So what is likely happening in your case is that you are copying a partitioned table from one cluster to the other, and two differently named partitions contain a file with the same name. The solution is to correct the source path {src} in your command so that it points to the directory above the partitioned subdirectories, not to the files. For example, consider:

/a/partcol=1/file1.txt
/a/partcol=2/file1.txt

If you use {src} as "/a/*/*", you will get the error "File would cause duplicates." But if you use {src} as "/a", the copy will succeed.
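As a concrete sketch (the namenode URIs hdfs://src-nn and hdfs://tgt-nn below are hypothetical placeholders), the two invocations look like this:

    # Fails: the glob flattens both partitions, so distcp sees two source entries named file1.txt
    hadoop distcp -i hdfs://src-nn/a/*/* hdfs://tgt-nn/a

    # Works: copying the parent directory preserves the partcol=1 and partcol=2 subdirectories
    hadoop distcp -i hdfs://src-nn/a hdfs://tgt-nn/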

Hive SQL (using Tez as execution engine) not giving results on empty partitions

Hive SQL (using Tez as the execution engine) gives incorrect or unexpected results on empty partitions. We hit this issue on Hive version 3.1.0.3.1.5.0-152 (HDP version 3.1.5.0-152). To replicate the issue:

-- Create an external table
1) create external table test_tbl (name string) partitioned by (company string, processdate string) stored as orc location '/my/some/random/location';

-- Add a partition
2) alter table test_tbl add partition (company='aquaifer', processdate='20220101');

-- Execute the following SQLs, which return no records
3) select max(company), processdate from test_tbl group by processdate;
4) select max(processdate) from test_tbl;

The same SQLs (#3 and #4 above), when executed with Spark, return a count of '0' and '20220101' respectively. So, as a workaround, we started using "spark-sql" instead of "hive/beeline". We didn't find a fix on the Hive side for this inconsistency, and the following bug…
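A minimal sketch of the workaround, running the same reproduction queries through spark-sql instead of beeline (table and columns as defined above):

    spark-sql -e "select max(company), processdate from test_tbl group by processdate;"
    spark-sql -e "select max(processdate) from test_tbl;"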

Spark Hive ORC Exception Caused by: java.util.concurrent.ExecutionException: com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.

Exception:

Caused by: java.util.concurrent.ExecutionException: com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1865)
        ... 17 more
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.
        at com.google.protobuf.InvalidProtocolBufferException.invalidWireType(InvalidProtocolBufferException.java:99)

Reason: You might receive the above error while performing SQL operations using Spark or Hive. It occurs because there may be corrupt ORC files, or files written with an unsupported ORC version, on HDFS.

Solution: Identify and remove the corrupt or incorrect files from HDFS. Or, with Spark, you can ignore such files by setting the following property:

set spark.sql.hive.convertMetastoreOrc=true
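A minimal sketch of applying that property from the spark-sql command line (my_orc_tbl below is a hypothetical table name):

    spark-sql --conf spark.sql.hive.convertMetastoreOrc=true -e "select count(*) from my_orc_tbl;"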

Log4J JNDI Vulnerability

This post is an extension of https://querydb.blogspot.com/2021/09/solving-jenkins-maven-build-xray-log4j.html. Apart from the fix discussed there, it is required to upgrade Log4J to 2.15.0 or above because of the JNDI attack, in which deserialization of untrusted data can be exploited to remotely execute arbitrary code. There are posts that suggest setting the property log4j2.formatMsgNoLookups. But that still leaves a serious vulnerability; you shouldn't contemplate these workarounds, and should upgrade the Log4j jars instead. Refer to https://logging.apache.org/log4j/2.x/security.html:

"A new CVE (CVE-2021-45046, see above) was raised for this. Other insufficient mitigation measures are: setting system property log4j2.formatMsgNoLookups or environment variable LOG4J_FORMAT_MSG_NO_LOOKUPS to true for releases >= 2.10, or modifying the logging configuration to disable message lookups…"
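A quick way to check whether any jar on a host bundles the vulnerable class (a minimal shell sketch; JndiLookup.class is the class abused by the exploit, and the search root / is an assumption to adjust for your filesystem):

    # List every jar that contains JndiLookup.class
    find / -name '*.jar' 2>/dev/null | while read -r jar; do
      unzip -l "$jar" 2>/dev/null | grep -q 'JndiLookup.class' && echo "$jar"
    done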

Run Kafka Console Consumer with Secured Kafka

1) Create jaas.conf:

KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  doNotPrompt=true
  useTicketCache=false
  principal="principalName@domain"
  useKeyTab=true
  serviceName="kafka"
  keyTab="my.keytab"
  client=true;
};
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  doNotPrompt=true
  useTicketCache=false
  principal="principalName@domain"
  useKeyTab=true
  serviceName="kafka"
  keyTab="my.keytab"
  client=true;
};

2) Create consumer.properties:

sasl.mechanism=GSSAPI
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
ssl.truststore.location=truststore.jks
ssl.truststore.password=changeit
group.id=consumer-group-name2

3) Execute the following:

export KAFKA_OPTS="-Djava.security.auth.login.config=/path/to/your/jaas.conf"
sh kafka-console-consumer.sh --bootstrap-server kafkabroker.charter.com:6668 --topic TopicName --new-consumer --from-beginning --consumer.config /path/to/consumer.properties
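The same jaas.conf also works for a quick produce test against the secured cluster (a hedged sketch; producer.properties is assumed to mirror the consumer.properties above minus group.id, and kafka-console-producer.sh ships in the same bin directory):

    export KAFKA_OPTS="-Djava.security.auth.login.config=/path/to/your/jaas.conf"
    sh kafka-console-producer.sh --broker-list kafkabroker.charter.com:6668 --topic TopicName --producer.config /path/to/producer.properties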