Posts

(AWS EMR) Spark Error - org.apache.kafka.common.TopicPartition; class invalid for deserialization

  A Spark Kafka integration job fails with the error below:

    Caused by: java.io.InvalidClassException: org.apache.kafka.common.TopicPartition; class invalid for deserialization
      at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169)
      at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:885)

  This happens because the CLASSPATH contains two or more different versions of kafka-clients-*.jar. For example, one may come in as a dependency of "spark-sql-kafka", while another version may be present on the cluster by default.

  In our case, AWS EMR shipped "/usr/lib/hadoop-mapreduce/kafka-clients-0.8.2.1.jar", while we provided the following on the spark-submit classpath:

    spark-sql-kafka-0-10_2.12-2.4.4.jar
    kafka-clients-2.4.0.jar

  We tried removing "kafka-clients-2.4.0.jar" from spark-submit --jars, but that led to the same error. So we finally had to remove the EMR-provided jar, "kafka-clients-0.8.2.1.jar", to fix the issue.
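  As a rough illustration (the paths, versions, and the job class below are examples, not from the original post), one way to confirm and resolve the duplicate-jar situation on an EMR node:

    # List every copy of kafka-clients present on the node
    sudo find / -name "kafka-clients-*.jar" 2>/dev/null

    # Move the stale EMR-provided jar out of the way so only one version remains
    sudo mv /usr/lib/hadoop-mapreduce/kafka-clients-0.8.2.1.jar /tmp/

    # Submit with the single kafka-clients version that matches spark-sql-kafka
    # (com.example.MyKafkaJob and my-kafka-job.jar are placeholders)
    spark-submit \
      --jars spark-sql-kafka-0-10_2.12-2.4.4.jar,kafka-clients-2.4.0.jar \
      --class com.example.MyKafkaJob \
      my-kafka-job.jar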

Spark error- java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)

  Exception trace:

    23/09/08 19:37:39 dispatcher-event-loop-0 ERROR YarnClusterScheduler: Lost executor 1 on Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)
      at org.apache.spark.network.sasl.SaslMessage.decode(SaslMessage.java:69)
      at org.apache.spark.network.sasl.SaslRpcHandler.doAuthChallenge(SaslRpcHandler.java:80)
      at org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:59)
      at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:180)

  Reason -
  In our case this error appeared when:

    spark.shuffle.service.enabled=true
    spark.dynamicAllocation.enabled=true

  Solution -
  Set both of the following to false:

    spark.shuffle.service.enabled=false
    spark.dynamicAllocation.enabled=false

  Or, set the following property - spark.authe
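  A minimal sketch of how these settings can be applied at submit time (the class and jar names are placeholders, not from the original post):

    # Disable the external shuffle service and dynamic allocation for this job
    # (com.example.MyJob and my-job.jar are placeholders)
    spark-submit \
      --conf spark.shuffle.service.enabled=false \
      --conf spark.dynamicAllocation.enabled=false \
      --class com.example.MyJob \
      my-job.jar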

Logstash connect to Kerberos authenticated Hive Service

  Normally, one can write a Logstash pipeline like the one below to create a JDBC connection to Hive:

    input {
      jdbc {
        jdbc_driver_library => "hive-jdbc-2.0.0.jar,hive2.jar,hive-common-2.3.1.jar,hadoop-core-1.2.1-0.jar"
        jdbc_driver_class => "org.apache.hive.jdbc.HiveDriver"
        jdbc_connection_string => ""
      }
    }
    output {
      # Publish out on the command line
      stdout { codec => json }
    }

  But you will run into problems if the Hive JDBC connection requires Kerberos authentication. For that, set the following JVM options. Note that these can be set either in the config/jvm.options file or via the LS_JAVA_OPTS environment variable, which additively overrides the JVM settings. Refer - https://www.elastic.co/guide/en/logstash/current/jvm-settings.html

    -Djava.security.auth.login.config=<JAAS_config_file_path>   (required)
    -Djava.security.krb5.conf=<path to krb5.conf>   (if it is not in the default location under /etc/)

  If krb5.conf is not specified then y
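  As a rough sketch only (the file names, keytab path, principal, and the JAAS entry name "Client" are assumptions; the entry name your driver expects may differ), the pieces typically fit together like this:

    # Write a minimal JAAS file for a keytab-based Kerberos login
    cat > hive_jaas.conf <<'EOF'
    Client {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      keyTab="/etc/security/keytabs/logstash.keytab"
      principal="logstash@EXAMPLE.COM"
      storeKey=true
      useTicketCache=false;
    };
    EOF

    # Point Logstash's JVM at the JAAS file (and a non-default krb5.conf, if any)
    export LS_JAVA_OPTS="-Djava.security.auth.login.config=$PWD/hive_jaas.conf -Djava.security.krb5.conf=/etc/krb5.conf"
    bin/logstash -f hive_pipeline.conf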

Generate or Create a Keytab File (Kerberos)

  Steps are as below:

  1. Run ktutil to launch the command-line utility.
  2. Type the command:
         addent -password -p $user@$REALM -k 1 -e $encryptionType
     Replace the highlighted keywords:
     $user - name of the user
     $REALM - the Kerberos realm, i.e. the domain over which a Kerberos authentication server has the authority to authenticate a user, host, or service
     $encryptionType - type of encryption, e.g. aes256-cts, des3-cbc-sha1-kd, RC4-HMAC, arcfour-hmac-md5, des-hmac-sha1, des-cbc-md5, etc. You can add one or more entries for different encryption types.
  3. When prompted, enter the password for the Kerberos principal user.
  4. Type the following command to write a keytab file:
         wkt $user.keytab
  5. Type 'q' to quit the utility.
  6. Verify the keytab was created and has the right user entry by executing:
         klist -ekt $PWD/$user.keytab
  7. Initialize the keytab, i.e. generate a ticket, by executing:
         kinit $user@$REALM -kt $PWD/$user.keytab
  8. Display the list of currently cached Kerber
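  For illustration, a condensed transcript of the same flow, with a hypothetical user "alice", realm "EXAMPLE.COM", and aes256-cts encryption standing in for $user, $REALM, and $encryptionType:

    $ ktutil
    ktutil:  addent -password -p alice@EXAMPLE.COM -k 1 -e aes256-cts
    Password for alice@EXAMPLE.COM:
    ktutil:  wkt alice.keytab
    ktutil:  q

    # Verify the keytab entries, then obtain a ticket and list the ticket cache
    $ klist -ekt $PWD/alice.keytab
    $ kinit alice@EXAMPLE.COM -kt $PWD/alice.keytab
    $ klist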

Spark Hadoop EMR Cross Realm Access HBase & Kafka

  We had an on-premise Hadoop cluster which included Kafka, HBase, HDFS, Spark, YARN, etc. We planned to migrate our big data jobs and data to AWS EMR while still keeping Kafka on the on-premise CDP cluster. After spawning EMR on AWS, we tried running a Spark job connecting to Kafka on the on-premise cluster. We set up all VPC connections and opened firewall ports between the two clusters. But since EMR and CDP (on-premise) had different KDC servers and principals, our attempts to connect to Kafka (on-premise) from EMR kept failing. Note that one can set the following property to see Kerberos logs:

    -Dsun.security.krb5.debug=true

  The two easiest options for us were:

  Set up a cross-realm Kerberos trust, so that the EMR principal is trusted by the on-premise KDC server to use the Kafka service. Refer - https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system-level_authentication_guide/using_trusts

  Set up cross-realm trust using the same AD accounts and domain. Refer https://docs.aws.amazon.com/emr/la
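  For reference, a sketch of how the Kerberos debug flag can be passed to both the driver and the executors at submit time (the class and jar names are placeholders, not from the original post):

    # Enable verbose Kerberos logging on driver and executors
    # (com.example.KafkaConsumerJob and my-spark-kafka-job.jar are placeholders)
    spark-submit \
      --conf "spark.driver.extraJavaOptions=-Dsun.security.krb5.debug=true" \
      --conf "spark.executor.extraJavaOptions=-Dsun.security.krb5.debug=true" \
      --class com.example.KafkaConsumerJob \
      my-spark-kafka-job.jar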