
Posts

HBase Performance Optimization

Please refer to the earlier posts in this series:

- First, merging regions to reduce the region count per Region Server - https://querydb.blogspot.com/2023/03/hbase-utility-merging-regions-in-hbase.html
- Second, bulk-deleting columns in HBase - https://querydb.blogspot.com/2019/11/hbase-bulk-delete-column-qualifiers.html

In this article, we discuss options to further optimize HBase:

- Use COMPRESSION => 'SNAPPY' for column families, and invoke a major compaction right after setting the property. In our case this reduced table size by about 70% while giving the same read and write performance.
- Once the regions and tables are compressed, re-invoke the Merge Region utility to further reduce the number of regions per server.
- Set the region split policy: SPLIT_POLICY => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'
- Enable request throttling by setting hbase.quota.enabled to true.

Our HBase cluster is used by real-time APIs as well as analytical Spark and MR jobs. Analytical workloads crea…
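As a sketch, the compression and split-policy changes above can be applied from the HBase shell. The table name 'tbl' and column family 'cf' below are placeholders, not names from the original post:

```
# hbase shell (placeholder table 'tbl', column family 'cf')
alter 'tbl', {NAME => 'cf', COMPRESSION => 'SNAPPY'}
# force a rewrite of existing store files so the new compression takes effect
major_compact 'tbl'
# pin the split policy for this table
alter 'tbl', {METADATA => {'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'}}
```

hbase.quota.enabled, by contrast, is a cluster-side setting configured in hbase-site.xml rather than per table.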

(AWS EMR) Spark Error - org.apache.kafka.common.TopicPartition; class invalid for deserialization

A Spark-Kafka integration job fails with the error below:

Caused by: java.io.InvalidClassException: org.apache.kafka.common.TopicPartition; class invalid for deserialization
  at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169)
  at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:885)

This happens when the CLASSPATH contains two or more different versions of kafka-clients-*.jar. For example, one may be a dependency jar pulled in with "spark-sql-kafka", while another version is present by default on the cluster.

In our case, AWS EMR shipped "/usr/lib/hadoop-mapreduce/kafka-clients-0.8.2.1.jar", but we provided the following on the spark-submit classpath:

spark-sql-kafka-0-10_2.12-2.4.4.jar
kafka-clients-2.4.0.jar

We tried removing "kafka-clients-2.4.0.jar" from spark-submit --jars, but that led to the same error. So we finally had to remove the EMR-provided jar "kafka-clients-0.8.2.1.jar" to fix the issue.
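One way to spot such a conflict is to list every jar visible to the job and flag artifacts that appear in more than one version. A minimal sketch, stripping directory and version from each path and printing duplicated artifact names; the paths are the ones from this incident and would be replaced by your actual classpath entries:

```shell
# Flag artifacts present in more than one version on the classpath.
# Strip the directory, then the trailing "-<version>.jar", and report duplicates.
printf '%s\n' \
  /usr/lib/hadoop-mapreduce/kafka-clients-0.8.2.1.jar \
  ./jars/kafka-clients-2.4.0.jar \
  ./jars/spark-sql-kafka-0-10_2.12-2.4.4.jar |
  sed 's|.*/||; s|-[0-9][0-9.]*\.jar$||' | sort | uniq -d
```

Any artifact printed (here, kafka-clients) is present more than once and is a candidate for the InvalidClassException above.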

Spark error- java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)

Exception trace:

23/09/08 19:37:39 dispatcher-event-loop-0 ERROR YarnClusterScheduler: Lost executor 1 on Unable to create executor due to Unable to register with external shuffle server due to: java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)
  at org.apache.spark.network.sasl.SaslMessage.decode(SaslMessage.java:69)
  at org.apache.spark.network.sasl.SaslRpcHandler.doAuthChallenge(SaslRpcHandler.java:80)
  at org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:59)
  at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:180)

Reason - in our case this error appeared when:

spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true

Solution - set both of the following to false:

spark.shuffle.service.enabled=false
spark.dynamicAllocation.enabled=false

Or, set the following property - spark.authe…
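As a sketch, the first workaround (disabling both features) can be passed directly on the spark-submit command line; the application jar and remaining arguments are elided:

```
spark-submit \
  --conf spark.shuffle.service.enabled=false \
  --conf spark.dynamicAllocation.enabled=false \
  ...
```

Note that with dynamic allocation disabled, the job runs with a fixed executor count (e.g. whatever --num-executors specifies), so resource usage should be sized accordingly.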

Logstash connect to Kerberos authenticated Hive Service

Normally, one can write syntax like below to create a JDBC connection with Hive:

input {
  jdbc {
    jdbc_driver_library => "hive-jdbc-2.0.0.jar,hive2.jar,hive-common-2.3.1.jar,hadoop-core-1.2.1-0.jar"
    jdbc_driver_class => "org.apache.hive.jdbc.HiveDriver"
    jdbc_connection_string => ""
  }
}
output {
  # Publish out on the command line
  stdout { codec => json }
}

But you will run into problems if the Hive JDBC connection requires Kerberos authentication. For that, set the following JVM options. These can be set either in the config/jvm.options file or via the LS_JAVA_OPTS environment variable, which overrides the jvm.options settings. Refer - https://www.elastic.co/guide/en/logstash/current/jvm-settings.html

-Djava.security.auth.login.config=<JAAS_config_file_path> (required)
-Djava.security.krb5.conf=<path to krb5.conf> (if it is not in the default location under /etc/)

If krb5.conf is not specified then y…
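For reference, a minimal JAAS configuration file for keytab-based login might look like the sketch below. The entry name, keytab path, and principal are all assumptions for illustration; the entry name your Hive JDBC driver actually looks up depends on the driver, so check its documentation:

```
/* Hypothetical jaas.conf; entry name, keytab path, and principal are examples */
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  keyTab="/etc/security/keytabs/logstash.keytab"
  principal="logstash@EXAMPLE.COM"
  doNotPrompt=true;
};
```

This file's path is what -Djava.security.auth.login.config should point to.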

Generate or Create a Keytab File (Kerberos)

Steps as below:

1. Run ktutil to launch the command-line utility.
2. Type the command:
   addent -password -p $user@$REALM -k 1 -e $encryptionType
   Replace the highlighted keywords:
   $user - name of the user
   $REALM - the Kerberos realm, i.e. the domain over which a Kerberos authentication server has the authority to authenticate a user, host, or service
   $encryptionType - type of encryption, e.g. aes256-cts, des3-cbc-sha1-kd, RC4-HMAC, arcfour-hmac-md5, des-hmac-sha1, des-cbc-md5, etc. You can add one or more entries for different encryption types.
3. When prompted, enter the password for the Kerberos principal user.
4. Type the following command to write a keytab file:
   wkt $user.keytab
5. Type 'q' to quit the utility.
6. Verify the keytab is created and has the right user entry. Execute:
   klist -ekt $PWD/$user.keytab
7. Initialize the keytab, i.e. generate a ticket. Execute:
   kinit $user@$REALM -kt $PWD/$user.keytab
8. Display the list of currently cached Kerber…
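Put together, a session for a hypothetical principal alice@EXAMPLE.COM (example user and realm, not from the original post) looks like the transcript below:

```
$ ktutil
ktutil:  addent -password -p alice@EXAMPLE.COM -k 1 -e aes256-cts
Password for alice@EXAMPLE.COM:
ktutil:  wkt alice.keytab
ktutil:  q
$ klist -ekt $PWD/alice.keytab
$ kinit alice@EXAMPLE.COM -kt $PWD/alice.keytab
$ klist
```

The final klist should show a valid ticket-granting ticket for alice@EXAMPLE.COM, confirming the keytab works without a password prompt.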