

Showing posts from 2023

Spark: Decimal columns are shown in scientific notation instead of numbers

This issue relates to SPARK-25177: when a DataFrame decimal-type column has a scale higher than 6, zero values are shown in scientific notation. Solution - one can use the format_number function to convert the scientific notation into a String, as shown below.
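A minimal sketch of this approach in Scala (for spark-shell; the table and column names are hypothetical, and format_number(col, d) is the built-in Spark SQL function that renders a number as a string with d decimal places):

    import org.apache.spark.sql.functions.{col, format_number}

    // df is assumed to have a DecimalType column "amount" with scale > 6
    val df = spark.table("my_db.my_table")

    // format_number returns a plain string such as "0.00000000" (with comma grouping
    // for large values), so zero values are no longer rendered in scientific notation
    val formatted = df.withColumn("amount_str", format_number(col("amount"), 8))
    formatted.show(false)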

How indexes work in SQL

 

Hive (Spark) SQLs not working, failing with a table lock error

We had a situation recently wherein Hive and Spark jobs stopped working, failing with a table lock error. On debugging the Hive Metastore logs, we found the following error -
ORA-00001: unique constraint (SDL_HIVE.NOTIFICATION_LOG_EVENT_ID) violated
We figured out that an Oracle unique index was throwing this exception on inserting entries into the NOTIFICATION_LOG table. Hive's metastore database primarily tracks partition locations and compaction work for the tables. Importantly, it also tracks the changes that occur in the DB so that any federation/backup Hive clusters can successfully receive the data. This particular list of changes lives in the NOTIFICATION_LOG table, and the ID column in this table is incremented using the NOTIFICATION_SEQUENCE table. Somehow, newly generated EVENT_ID(s) already had entries in the NOTIFICATION_LOG table; thus, we observed this failure. To solve the above issue, we took a backup of the NOTIFICATION_LOG table and then truncated the NOTIFICATION_LOG table.

HBase Performance Optimization- Page4

Refer to the previous article @ https://querydb.blogspot.com/2023/11/hbase-performance-optimization-page3.html
This article is about optimizing an API (REST/SOAP web service) built on top of HBase. We had a Spring Boot application deployed in a Docker container. On further analysis, we found that our Spring Boot version was 2.0.4.RELEASE, which by default ships Tomcat 8.5.32 and HttpCore 4.4.10. There was a reported bug for these versions which leads to many TCP connections in CLOSE_WAIT status, eventually making the API unresponsive. Refer -
java - tomcat with https get a CLOSE_WAIT when directly TCP connect - Stack Overflow
60035 – java.io.IOException: Too many open files (apache.org)
downloads.apache.org/httpcomponents/httpcore/RELEASE_NOTES-4.4.x.txt
Solution - upgrade Spring Boot to 2.7.17, the last of the 2.x releases; it is compatible with Java 8 and includes httpcore 4.4.16 and Tomcat 9.0.82. There were also too many HTTP connections to the API, leading to Connection Timed Out…

HBase Performance Optimization- Page3

Refer to the previous article here @ https://querydb.blogspot.com/2023/10/hbase-performance-optimization-page2.html
It was determined that for Spark jobs using org.apache.hadoop.hive.hbase.HBaseStorageHandler, the following were set by default -
"cacheBlocks":true
"caching":-1
As we have frequent scans, most of the HBase memory cache was occupied by analytical tables. Also, having caching as "-1" means there will be an RPC call for every row; for example, if the ABC table has 30 million records, that leads to the same number of calls for each scan. Finally, we were able to figure out the solution: set the following properties for the Hive on HBase table -
alter table T1 set TBLPROPERTIES('hbase.scan.cacheblock'='false');
alter table T1 set TBLPROPERTIES('hbase.scan.cache'='1000');
By setting the above properties, scan data won't be cached, and the number of RPC calls to HBase is reduced. For example, the ABC table with 30 million reco…
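To put rough numbers on the caching change described above: with 'hbase.scan.cache'='1000', a full scan of that 30-million-row table needs on the order of 30,000,000 / 1,000 = 30,000 RPC round trips, instead of roughly 30 million (one per row) with the default caching of -1.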

Spark job fails with Parquet column cannot be converted error

Exception -
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://mylake/day_id=20231026/part-00009-94b5fdf9-bb52-4774-8d88-82e9c529f77f-c000.snappy.parquet. Column: [ACCOUNT_ID], Expected: string, Found: FIXED_LEN_BYTE_ARRAY
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
  ....
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:250)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:536)
  ....
Cause - the vectorized Parquet reader is decoding the decimal type column to a binary format.
Solution - one can test either of the solutions below: read the Parquet file directly from HDFS instead of the Hive table, or, if…
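A minimal sketch of the first workaround, reading the Parquet files directly from HDFS rather than through the Hive table (the path is the partition directory taken from the error message and is only illustrative):

    // Reading the files directly lets Spark take the schema from the Parquet footers,
    // bypassing the Hive table's column definitions
    val df = spark.read.parquet("hdfs://mylake/day_id=20231026")
    df.printSchema()
    df.show(5, false)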

HBase Performance Optimization- Page2

Refer to Page 1 of this article @ https://querydb.blogspot.com/2023/10/hbase-performance-optimization.html
Normally, we run multiple workloads on the cluster. This includes analytical queries as well as API calls, and it involves read as well as write traffic. HBase provides the following mechanisms for managing the performance of a cluster handling multiple workloads: Quotas, Request Queues, and Multiple-Typed Queues.
Quotas: HBASE-11598 introduces RPC quotas, which allow you to throttle requests based on the following limits -
- Limit overall network throughput and number of RPC requests
- Limit amount of storage used for tables or namespaces
- Limit number of tables for each namespace or user
- Limit number of regions for each namespace
For this to work, set the hbase.quota.enabled property in the hbase-site.xml file to true, then enter the command to set the limit of the quota, the type of quota, and the entity to which to apply the quota. The command and its syntax are: $hbase_shell> set_quota…

Spark Custom Kafka Partitioner

A custom partitioner can be implemented by extending org.apache.kafka.clients.producer.Partitioner. It can be used with the Spark SQL Kafka data source by setting the property "kafka.partitioner.class", for example:
df.write.format("kafka").option("kafka.partitioner.class", "com.mycustom.ipartitioner")
We implemented one such custom partitioner extending org.apache.kafka.clients.producer.RoundRobinPartitioner. The complete source code is available @ https://github.com/dinesh028/engineering/blob/master/Kafka/com/aquaifer/producer/KeyPartitioner.scala
This partitioner reads a configuration file which maps a Kafka key to a primary-key name. The value in Kafka is a JSON message which has a primary key with a unique value; the idea is to partition messages based on this unique value, so that messages with the same primary-key value go into the same partition. Once the configurations are loaded, for each byte array message: convert it to a String JSON, parse the JSON, get the uniqu…
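A minimal, hypothetical sketch of such a partitioner (it implements the base org.apache.kafka.clients.producer.Partitioner interface rather than extending RoundRobinPartitioner as the linked KeyPartitioner.scala does; the class name, the extractPrimaryKey helper, and the hashing scheme are illustrative assumptions):

    import java.util.{Map => JMap}
    import org.apache.kafka.clients.producer.Partitioner
    import org.apache.kafka.common.Cluster

    class JsonKeyPartitioner extends Partitioner {

      override def configure(configs: JMap[String, _]): Unit = {
        // e.g. load the Kafka-key -> primary-key-name mapping file here
      }

      override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                             value: AnyRef, valueBytes: Array[Byte], cluster: Cluster): Int = {
        val numPartitions = cluster.partitionsForTopic(topic).size()
        // Parse the JSON payload and pick out the configured primary-key value,
        // then hash it so equal primary keys always land in the same partition
        val pk = extractPrimaryKey(topic, new String(valueBytes, "UTF-8"))
        (pk.hashCode & Integer.MAX_VALUE) % numPartitions
      }

      // Placeholder: real code would parse the JSON and look up the key name
      // configured for this topic
      private def extractPrimaryKey(topic: String, json: String): String = json

      override def close(): Unit = ()
    }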

Fix - HBase Master UI - hbck.jsp - NullPointerException

At times the HBase Master UI hbck report shows a NullPointerException - https://hmaster:16010/hbck.jsp
This page displays two reports: the HBCK Chore Report and the CatalogJanitor Consistency Issues report. Only the report titles show if there are no problems to list. Note some conditions are transitory as regions migrate. See below for how to run the reports. ServerNames will be links if the server is live, italic if dead, and plain if unknown.
Solution - if this page displays a NullPointerException, then execute -
echo "hbck_chore_run" | hbase shell
If the page still displays a NullPointerException and not the report, then execute -
echo "catalogjanitor_run" | hbase shell

Hue Oracle Database - Delete non-standard user entries from Parent and Child Tables

We recently implemented authentication for Hue (https://gethue.com/); before that, folks were allowed to authenticate with any kind of username and use the Hue Web UI. After implementing authentication, it was found that the Hue database contained entries for old, unwanted garbage users which were still able to use the service and bypass authentication. Thus, it was required to delete such user entries from the table HUE.AUTH_USER. But there were many associated child constraints on the table, which made deleting an entry violate constraints. So it was required to find all child tables associated with HUE.AUTH_USER, first delete entries from the child tables, and then delete entries from the parent HUE.AUTH_USER. We used the following query to find all child tables, constraints, and associated columns -
select a.table_name parent_table_name, b.r_constraint_name parent_constraint, c.column_name parent_column, b.table_name child_table, b.constraint_name as child_constraint, d.column_name child_column…

HBase Performance Optimization

Please refer to the first blog in this series, on reducing regions per Region Server - https://querydb.blogspot.com/2023/03/hbase-utility-merging-regions-in-hbase.html - and the second, on deleting columns in HBase - https://querydb.blogspot.com/2019/11/hbase-bulk-delete-column-qualifiers.html
In this article, we discuss options to further optimize HBase:
- Use COMPRESSION=>'SNAPPY' for column families, and invoke a major compaction right after setting the property. This will reduce the size of tables by 70% while giving the same read and write performance.
- Once the size of regions and tables is compressed, re-invoke the Merge Region utility to reduce the number of regions per server.
- Set the region split policy as SPLIT_POLICY=>'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'
- Enable the request throttle by setting hbase.quota.enabled to true.
Our HBase cluster is used by real-time APIs as well as analytical Spark and MR jobs. Analytical workloads crea…

(AWS EMR) Spark Error - org.apache.kafka.common.TopicPartition; class invalid for deserialization

A Spark Kafka integration job leads to the error below -
Caused by: java.io.InvalidClassException: org.apache.kafka.common.TopicPartition; class invalid for deserialization
  at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169)
  at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:885)
That is because the CLASSPATH might have two or more different versions of kafka-clients-*.jar. For example, one may be a dependency jar of "spark-sql-kafka", and another version might be present by default on the cluster. In our case, AWS EMR had "/usr/lib/hadoop-mapreduce/kafka-clients-0.8.2.1.jar", but we provided the following on the spark-submit classpath -
spark-sql-kafka-0-10_2.12-2.4.4.jar
kafka-clients-2.4.0.jar
We tried removing "kafka-clients-2.4.0.jar" from the spark-submit --jars, but that led to the same error. So we were finally required to remove the EMR-provided jar - "kafka-clients-0.8.2.1.jar" - to fix…
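A small diagnostic sketch (not from the original post) that can help confirm which jar a conflicting class is actually loaded from, for example in spark-shell on the cluster:

    // Prints the jar that org.apache.kafka.common.TopicPartition was loaded from, which
    // quickly reveals whether the old cluster-provided kafka-clients jar is winning
    val location = classOf[org.apache.kafka.common.TopicPartition]
      .getProtectionDomain
      .getCodeSource
      .getLocation
    println(location)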

Spark error- java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)

Exception trace -
23/09/08 19:37:39 dispatcher-event-loop-0 ERROR YarnClusterScheduler: Lost executor 1 on Unable to create executor due to Unable to register with external shuffle server due to : java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)
at org.apache.spark.network.sasl.SaslMessage.decode(SaslMessage.java:69)
at org.apache.spark.network.sasl.SaslRpcHandler.doAuthChallenge(SaslRpcHandler.java:80)
at org.apache.spark.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:59)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:180)
Reason - in our case, this error was coming when -
spark.shuffle.service.enabled=true
spark.dynamicAllocation.enabled=true
Solution - set the following to false -
spark.shuffle.service.enabled=false
spark.dynamicAllocation.enabled=false
Or, set the following property - spark.authe…
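A minimal sketch of applying the first solution when building the session (property names and values are taken from the excerpt above; passing the same settings via spark-submit --conf is equivalent):

    import org.apache.spark.sql.SparkSession

    // Disabling the external shuffle service and dynamic allocation, as described above,
    // avoids the SASL handshake with the external shuffle server altogether
    val spark = SparkSession.builder()
      .appName("my-job")
      .config("spark.shuffle.service.enabled", "false")
      .config("spark.dynamicAllocation.enabled", "false")
      .getOrCreate()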

Logstash connect to Kerberos authenticated Hive Service

Normally, one can write syntax like the below to create a JDBC connection with Hive -
input {
  jdbc {
    jdbc_driver_library => "hive-jdbc-2.0.0.jar,hive2.jar,hive-common-2.3.1.jar,hadoop-core-1.2.1-0.jar"
    jdbc_driver_class => "org.apache.hive.jdbc.HiveDriver"
    jdbc_connection_string => ""
  }
}
output {
  # Publish out on the command line
  stdout { codec => json }
}
But you will have a problem if you need to do Kerberos authentication for the Hive JDBC connection. For this, set the following JVM options. Note that these can be set either within the config/jvm.options file or via the LS_JAVA_OPTS variable, which additively overrides JVM settings. Refer - https://www.elastic.co/guide/en/logstash/current/jvm-settings.html
-Djava.security.auth.login.config=<JAAS_config_file_path> (required)
-Djava.security.krb5.conf=<path to krb5.conf> (if it is not in the default location under /etc/)
If krb5.conf is not specified then y…

Generate or Create a Keytab File (Kerberos)

Steps as below -
- Run ktutil to launch the command-line utility.
- Type the command: addent -password -p $user@$REALM -k 1 -e $encryptionType
  Replace the highlighted keywords:
  $user - name of the user.
  $REALM - the Kerberos realm, i.e. the domain over which a Kerberos authentication server has the authority to authenticate a user, host, or service.
  $encryptionType - type of encryption, like aes256-cts, des3-cbc-sha1-kd, RC4-HMAC, arcfour-hmac-md5, des-hmac-sha1, des-cbc-md5, etc. You can add one or more entries for different types of encryption.
- When prompted, enter the password for the Kerberos principal user.
- Type the following command to write a keytab file: wkt $user.keytab
- Type 'q' to quit the utility.
- Verify the keytab is created and has the right user entry; execute: klist -ekt $PWD/$user.keytab
- Initialize the keytab, or generate a ticket; execute: kinit $user@$REALM -kt $PWD/$user.keytab
- Display the list of currently cached Kerberos…

Spark Hadoop EMR Cross Realm Access HBase & Kafka

We had an on-premise Hadoop cluster which included Kafka, HBase, HDFS, Spark, YARN, etc. We planned to migrate our big data jobs and data to AWS EMR while still keeping Kafka on the on-premise CDP cluster. After spawning EMR on AWS, we tried running a Spark job connecting to Kafka on the on-premise cluster. We did set up all the VPC connections and opened 2 firewall ports between the two clusters. But since EMR and CDP (on-premise) had different KDC servers and principals, connecting to Kafka (on-premise) from EMR kept failing. Note, one can set the following property to see Kerberos logs - -Dsun.security.krb5.debug=true
The two easiest options for us were -
- Set up a cross-realm Kerberos trust, such that the EMR principal can authenticate against the on-premise KDC server to use the Kafka service. Refer - https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system-level_authentication_guide/using_trusts
- Set up cross-realm trust using the same AD accounts and domain. Refer https://docs.aws.amazon.com/emr/la…

Spark MongoDB Write Error - com.mongodb.MongoBulkWriteException: Bulk write operation error on server 'E11000 duplicate key error collection:'

One may see the following error or exception while running Spark 2.4 with mongo-spark-connector_2.11-2.4.0.jar and mongo-java-driver-3.9.0.jar.
Exception -
User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 6.0 failed 4 times, most recent failure: Lost task 2.3 in stage 6.0 (TID 238, nc0020.hadoop.mycluster.com, executor 2): com.mongodb.MongoBulkWriteException: Bulk write operation error on server vondbd0008.mymachine.com:27017. Write errors: [BulkWriteError{index=0, code=11000, message='E11000 duplicate key error collection: POC1_DB.MyCollection index: _id_ dup key: { _id: "113442141" }', details={ }}].
at com.mongodb.connection.BulkWriteBatchCombiner.getError(BulkWriteBatchCombiner.java:177)
at com.mongodb.connection.BulkWriteBatchCombiner.throwOnError(BulkWriteBatchCombiner.java:206)
at com.mongodb.connection.BulkWriteBatchCombiner.getResult(BulkWriteBatchCombiner.java:147)
at com.mongodb.operation.BulkWrite…

Python pyodbc - Error - [unixODBC][Oracle][ODBC][Ora]ORA-12162: TNS:net service name is incorrectly specified

One may encounter the errors below while connecting to Oracle using pyodbc with Python 3 -
[unixODBC][Driver Manager]Can't open lib 'Oracle ODBC driver for Oracle 19' : file not found (0) (SQLDriverConnect)
[unixODBC][Oracle][ODBC][Ora]ORA-12162: TNS:net service name is incorrectly specified\n (12162) (SQLDriverConnect)
RuntimeError: Unable to set SQL_ATTR_CONNECTION_POOLING attribute
The solution to fix the above errors is to -
- Make the following entry in /etc/odbcinst.ini
    [Oracle ODBC driver for Oracle 19]
    Description=Oracle ODBC driver for Oracle 19
    Driver=$ORACLE_HOME/lib/libsqora.so.19.1
    FileUsage=1
    Driver Logging=7
    UsageCount=1
- Don't use the following connect string -
    import pyodbc
    myhost='<>'
    myservicename='<>'
    myuserid='<>'
    mypassword='<>'
    …

CVE-2022-33891 Apache Spark Command Injection Vulnerability

Please refer - https://spark.apache.org/security.html
The command injection occurs because Spark checks the group membership of the user passed in the ?doAs parameter by using a raw Linux command. If an attacker sends reverse shell commands using ?doAs, there is also a high chance of granting Apache Spark server access to the attacker's machine.
Vulnerability description - The Apache Spark UI offers the possibility to enable ACLs via the configuration option spark.acls.enable. With an authentication filter, this checks whether a user has access permissions to view or modify the application. If ACLs are enabled, a code path in HttpSecurityFilter can allow someone to perform impersonation by providing an arbitrary user name. A malicious user might then be able to reach a permission check function that will ultimately build a Unix shell command based on their input, and execute it. This will result in arbitrary shell command execution as the user Spark is currently running as. V…

Hive Metastore ER Diagram

 

Hadoop Distcp to HCP or AWS S3a leading to Error - com.amazonaws.SdkClientException: Unable to execute HTTP request: sun.security.validator.ValidatorException: PKIX path building failed

Running Hadoop Distcp to copy data from S3a resulted in the below error -
com.amazonaws.SdkClientException: Unable to execute HTTP request: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Stack trace:
com.amazonaws.SdkClientException: Unable to execute HTTP request: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1114) ~[aws-java-sdk-core-1.11.280.jar!/:?]
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1064) ~[aws-java-sdk-core-1.11.280.jar!/:?]
To debug this error, turn SSL debug logging on: -Djavax.net.debug=all, or -Djavax.net.debug=ssl. The above parameters c…

Spark Exception - Filtering is supported only on partition keys of type string

We got this exception while performing SQL on a Hive table with a partition column of type BIGINT. Ex -
select * from mytable where cast(rpt_date AS STRING) >= date_format(date_sub(current_date(),60),'yyyyMMdd')
Exception -
Caused by: java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:862)
  ... 101 more
Caused by: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
  at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_partitions_by…

PrestoDB (Trino) SQL Error - java.lang.UnsupportedOperationException: Storage schema reading not supported

We faced the following error while querying, via Trino, a Hive table defined on top of the AVRO file format.
Error - java.lang.UnsupportedOperationException: Storage schema reading not supported
The solution is to set the following property in the Hive Metastore -
metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader

PrestoDB (Trino) SQL Error - ORC ACID file should have 6 columns, found Y

We faced this error while querying a Hive table using Trino -
Error - SQL Error [16777223]: Query failed (#20230505_155701_00194_n2vdp): ORC ACID file should have 6 columns, found 17
This was happening because the table being queried was a Hive managed (internal) table, which by default in the CDP (Cloudera) distribution is ACID compliant. For a Hive table to be ACID compliant, the underlying file format should be ORC, and there were a few changes to the ORC file structure, such as the root column being a struct with 6 nested columns (which enclose the data and the type of operation), something like below -
struct<
  operation: int,
  originalTransaction: bigInt,
  bucket: int,
  rowId: bigInt,
  currentTransaction: bigInt,
  row: struct<...>
>
For more ORC ACID related internals, please take a look here: https://orc.apache.org/docs/acid.html
Now, the problem in our case was that though the Hive table was declared Internal…

Spark Exception - java.lang.NullPointerException

java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.orc.OrcColumnVector.getUTF8String(OrcColumnVector.java:167)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage46.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:109)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
  at…

Fixing hbck Inconsistencies

Execute 'hbck_chore_run' in the hbase shell to generate a new sub-report.
Hole issue:
- Verify whether the region exists in both HDFS and meta.
- If it is not in HDFS, it is data loss or it was already cleared by the cleaner_chore.
- If it is not in meta, we can use the hbck2 jar's reportMissingInMeta option to find the missing records in meta,
- then use the addFsRegionsInMeta option to add the missing records back to meta,
- then restart the Active Master, which then assigns those regions.
Orphan regions: refer https://community.cloudera.com/t5/Support-Questions/Hbase-Orphan-Regions-on-Filesystem-shows-967-regions-in-set/td-p/307959
- Do an "ls" to check for "recovered.edits"; if there is no HFile, it means the region was splitting and it failed.
- Replay using WALPlayer:
hbase org.apache.hadoop.hbase.mapreduce.WALPlayer hdfs://bdsnameservice/hbase/data/Namespace/Table/57ed0b774aef9158cfda87c945a0afae/recovered.edits/0000000000001738473 Namespace:Table
- Move the orphan region to some temporary location and clean up…

HBase Utility - Merging Regions in HBase

A growing HBase cluster, and the difficulty of getting physical hardware, is something that every enterprise deals with... And sometimes the amount of ingest starts putting pressure on components before new hosts can be added. As I write this post, our cluster was running with 46 nodes and each node having 600 regions per server, which is bad for performance. You can use the following formula to estimate the number of regions for a RegionServer:
(regionserver_memory_size) * (memstore_fraction) / ((memstore_size) * (num_column_families))
Reference: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.0/hbase-data-access/content/memstore-size-regionservers.html
We noticed that HBase doesn't have an automatic process to reduce and merge regions. Over time, many small or empty regions form on the cluster, which degrades performance. While researching how to cope with this problem, we came across the following scripts -
https://appsintheopen.com/posts/51-merge-empty-hbase-regions
https…
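As an illustrative plug-in of hypothetical (but typical default) values: a 16 GB RegionServer heap with a memstore fraction of 0.4, a 128 MB memstore flush size, and 1 column family gives (16384 * 0.4) / (128 * 1) ≈ 51 regions per server, far below the ~600 regions per server we were running with.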

Refresh Multiple lines or Single line in Linux Shell

The code below gives an example of refreshing multiple lines using tput:
while :; do
    echo "$RANDOM"
    echo "$RANDOM"
    echo "$RANDOM"
    sleep 0.2
    tput cuu1 # move cursor up by one line
    tput el   # clear the line
    tput cuu1
    tput el
    tput cuu1
    tput el
done
The code below gives an example of refreshing or reprinting the same line on STDOUT:
while true; do echo -ne "`date`\r"; done

Which one should I use - PrestoDB or Trino ?

The first thing to understand is why to use Presto or Trino at all. We had been running two clusters, a Hortonworks (HDP) variant and a Cloudera (CDP) variant. Hive tables built on HDP were mostly ORC, whereas the tables that existed for us on CDP were mostly Parquet. We wanted to add ad-hoc querying functionality to our cluster, and we came across Apache Impala as an excellent tool for this purpose. But only CDP supported Apache Impala, and Impala was limited to working with Parquet, Kudu, and HBase; before CDP 6.* there was no support for the ORC file format with Impala. Thus we came to know about PrestoDB, which was built at Facebook and is an excellent distributed SQL engine for ad-hoc querying. It not only supports ORC but has connectors for multiple data sources. A bit of Presto history -
- Developed at Facebook (2012)
- Supported by the Presto Foundation, established by the Linux Foundation (2019)
- The original developers and the Linux Foundation got into a conflict over naming & branding, and did a hard fork o…