
Posts

Hive (Spark) SQLs not working, failing with error of table lock

We had a situation recently where Hive and Spark jobs stopped working, failing with a table lock error. On debugging the Hive Metastore logs, we found the following error:

ORA-00001: unique constraint (SDL_HIVE.NOTIFICATION_LOG_EVENT_ID) violated

It turned out that an Oracle unique index was raising this exception when new entries were inserted into the NOTIFICATION_LOG table. Hive's metastore database primarily tracks partition locations and compaction work for tables. Importantly, it also tracks the changes that occur in the database so that any federated/backup Hive clusters can receive the data. This list of changes lives in the NOTIFICATION_LOG table, and its EVENT_ID column is incremented using the NOTIFICATION_SEQUENCE table. Somehow, newly generated EVENT_ID values already had entries in the NOTIFICATION_LOG table, which caused the failure. To resolve the issue, we took a backup of the NOTIFICATION_LOG table and then truncated it, as sketched below.
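A minimal sketch of that remediation in Oracle SQL, assuming the metastore schema is SDL_HIVE (as in the error message); the backup table name is our own illustrative choice:

-- Back up the existing notification events (backup table name is illustrative)
CREATE TABLE SDL_HIVE.NOTIFICATION_LOG_BKP AS SELECT * FROM SDL_HIVE.NOTIFICATION_LOG;

-- Remove the conflicting rows so newly generated EVENT_IDs no longer violate the unique index
TRUNCATE TABLE SDL_HIVE.NOTIFICATION_LOG;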

HBase Performance Optimization - Page 4

Refer to the previous article at https://querydb.blogspot.com/2023/11/hbase-performance-optimization-page3.html. This article is more about optimizing an API (REST/SOAP web service) built on top of HBase. We had a Spring Boot application deployed in a Docker container. On further analysis, we found that:

Our Spring Boot version was 2.0.4.RELEASE, which by default ships with Tomcat 8.5.32 and HttpCore 4.4.10. There was a reported bug in these versions that leaves many TCP connections in CLOSE_WAIT status, which eventually makes the API unresponsive (a quick check for this symptom is sketched below). Refer to:
java - tomcat with https get a CLOSE_WAIT when directly TCP connect - Stack Overflow
60035 – java.io.IOException: Too many open files (apache.org)
downloads.apache.org/httpcomponents/httpcore/RELEASE_NOTES-4.4.x.txt

Solution: upgrade Spring Boot to 2.7.17, the last of the 2.x releases; it is compatible with Java 8 and includes HttpCore 4.4.16 and Tomcat 9.0.82.

There were also too many HTTP connections to the API, leading to Connection Timed Out errors.
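Before and after the upgrade, one quick way to confirm the CLOSE_WAIT symptom from inside the container is to count such connections (a sketch; it assumes netstat is available in the container and that the API listens on port 8080, which is a placeholder):

# Count TCP connections stuck in CLOSE_WAIT on the assumed API port
netstat -ant | grep ':8080' | grep CLOSE_WAIT | wc -l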

HBase Performance Optimization - Page 3

Refer to the previous article at https://querydb.blogspot.com/2023/10/hbase-performance-optimization-page2.html. It was determined that for Spark jobs using org.apache.hadoop.hive.hbase.HBaseStorageHandler, the following scan settings were in effect by default:
"cacheBlocks": true
"caching": -1

Because of our frequent scans, most of the HBase block cache was occupied by analytical tables. Also, a caching value of -1 means one RPC call per row; for example, if the ABC table has 30 million records, every scan leads to that many calls. Finally, we were able to figure out a solution: set the following properties on the Hive-on-HBase table -

alter table T1 set TBLPROPERTIES('hbase.scan.cacheblock'='false');
alter table T1 set TBLPROPERTIES('hbase.scan.cache'='1000');

With these properties, scan data is no longer cached in the block cache, and the number of RPC calls to HBase drops. For example, the ABC table with 30 million records now needs on the order of 30,000 RPC calls per scan (one per 1,000 rows) instead of 30 million.
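To confirm that the settings took effect, the table properties can be listed from Hive (a small sketch; T1 is the example table name used above):

-- Should now include hbase.scan.cacheblock=false and hbase.scan.cache=1000
show tblproperties T1;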

Spark job fails with Parquet column cannot be converted error

Exception -

Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://mylake/day_id=20231026/part-00009-94b5fdf9-bb52-4774-8d88-82e9c529f77f-c000.snappy.parquet. Column: [ACCOUNT_ID], Expected: string, Found: FIXED_LEN_BYTE_ARRAY
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
  ....
Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:250)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:536)
  ....

Cause
The vectorized Parquet reader is decoding the decimal type column to a binary format.

Solution
One can test either of the solutions below:
Read the Parquet file directly from HDFS instead of through the Hive table. Or, if …
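A minimal sketch of the first workaround (querying the Parquet files in place rather than through the Hive table), using the directory path that appears in the error message; Spark SQL can read Parquet files directly and infer the schema from the files themselves:

-- Bypass the Hive table definition; Spark derives the column types from the Parquet footers
SELECT ACCOUNT_ID FROM parquet.`hdfs://mylake/day_id=20231026`;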

HBase Performance Optimization - Page 2

Refer to Page 1 of this article at https://querydb.blogspot.com/2023/10/hbase-performance-optimization.html. Normally, we run multiple workloads on the cluster. This includes analytical jobs as well as API calls, and it involves both read and write traffic. HBase provides the following mechanisms for managing the performance of a cluster handling multiple workloads:

Quotas
Request Queues
Multiple-Typed Queues

Quotas
HBASE-11598 introduces RPC quotas, which allow you to throttle requests based on the following limits:
Limit overall network throughput and the number of RPC requests
Limit the amount of storage used per table or namespace
Limit the number of tables for each namespace or user
Limit the number of regions for each namespace

For this to work:
Set the hbase.quota.enabled property in the hbase-site.xml file to true.
Enter the command to set the limit of the quota, the type of quota, and the entity to which it applies. The command and its syntax are: $hbase_shell> set_quota … (illustrative examples below).
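The excerpt breaks off at the set_quota command; for illustration, a few RPC throttle examples in the HBase shell, following the syntax documented for HBASE-11598 (the user, table, and namespace names are placeholders):

hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => '10req/sec'
hbase> set_quota TYPE => THROTTLE, TABLE => 't1', LIMIT => '10M/sec'
hbase> set_quota TYPE => THROTTLE, NAMESPACE => 'ns1', LIMIT => '10req/sec'
hbase> set_quota TYPE => THROTTLE, USER => 'u1', LIMIT => NONE    # remove the user throttle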