
Posts

Spark: Decimal Columns are shown in scientific notation instead of numbers

  This issue relates to SPARK-25177: when a DataFrame decimal-type column has a scale higher than 6, zero values are shown in scientific notation. Solution - one can use the built-in format_number function to convert the scientific notation into a String, as shown below -
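The same behavior can be reproduced outside Spark with Python's decimal module (Java's BigDecimal, which backs Spark's DecimalType, prints the same way): a zero with scale greater than 6 falls back to scientific notation, while fixed-point formatting yields a plain number, which is what format_number does for the column. A minimal sketch, not actual Spark code:

```python
from decimal import Decimal

# A zero value with scale 7: str() produces scientific notation,
# mirroring Spark's DecimalType columns with scale > 6 (SPARK-25177).
zero = Decimal("0.0000000")
print(str(zero))      # 0E-7

# Formatting to a fixed number of decimal places gives a plain string,
# analogous to what format_number(col, 7) returns for display.
plain = f"{zero:.7f}"
print(plain)          # 0.0000000
```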

How indexes work in SQL

 

Hive (Spark) SQL queries failing with a table lock error

  We recently had a situation wherein Hive and Spark jobs stopped working, failing with a table lock error. On debugging the Hive Metastore logs, we found the following error - ORA-00001: unique constraint (SDL_HIVE.NOTIFICATION_LOG_EVENT_ID) violated. We figured out that it is an Oracle unique index that was throwing an exception on inserting entries into the NOTIFICATION_LOG table. Hive's metastore database primarily tracks partition locations and compaction work for the tables. Importantly, it also tracks the changes that occur in the DB so that any federation/backup Hive clusters can successfully receive the data. This particular list of changes lives in the NOTIFICATION_LOG table, and the ID column in this table is incremented using the NOTIFICATION_SEQUENCE table. Somehow, newly generated EVENT_ID(s) already had entries in the NOTIFICATION_LOG table, hence the failures we observed.  Thus, to solve the above issue -  We took a backup of the NOTIFICATION_LOG table. And, Truncat...
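The failure mode is easy to illustrate with a minimal sketch using SQLite in place of Oracle (simplified table and index names follow the Hive metastore schema; this is an illustration of the constraint violation, not the metastore's actual DDL): once the sequence hands out an EVENT_ID that already exists in NOTIFICATION_LOG, the insert violates the unique index, exactly as ORA-00001 reported.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified NOTIFICATION_LOG with a unique index on EVENT_ID,
# standing in for Oracle's SDL_HIVE.NOTIFICATION_LOG_EVENT_ID constraint.
cur.execute("CREATE TABLE NOTIFICATION_LOG (EVENT_ID INTEGER, MESSAGE TEXT)")
cur.execute(
    "CREATE UNIQUE INDEX NOTIFICATION_LOG_EVENT_ID ON NOTIFICATION_LOG (EVENT_ID)"
)

cur.execute("INSERT INTO NOTIFICATION_LOG VALUES (100, 'ADD_PARTITION')")

# A stale NOTIFICATION_SEQUENCE hands out an EVENT_ID that is already
# present, so the insert fails with a unique-constraint error.
try:
    cur.execute("INSERT INTO NOTIFICATION_LOG VALUES (100, 'ALTER_TABLE')")
except sqlite3.IntegrityError as e:
    print("unique constraint violated:", e)
```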

HBase Performance Optimization - Page 4

  Refer to the previous article @ https://querydb.blogspot.com/2023/11/hbase-performance-optimization-page3.html This article is more about optimizing an API (REST/SOAP web service) built on top of HBase. We had a Spring Boot application deployed in a Docker container. On further analysis, we found that -   Our Spring Boot version was 2.0.4.RELEASE, which by default ships Tomcat 8.5.32 and HttpCore 4.4.10. There was a reported bug in these versions that leads to many TCP connections stuck in CLOSE_WAIT status, which eventually makes the API unresponsive.  Refer -  java - tomcat with https get a CLOSE_WAIT when directly TCP connect - Stack Overflow 60035 – java.io.IOException: Too many open files (apache.org) downloads.apache.org/httpcomponents/httpcore/RELEASE_NOTES-4.4.x.txt Solution -  Upgrade the Spring Boot version to 2.7.17, as that is the last release in the 2.x line compatible with Java 8, and it includes httpcore v4.4.16 and Tomcat v9.0.82. There were too many HTTP connec...
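For a Maven build, the upgrade is typically applied by bumping the parent version (a sketch assuming the standard spring-boot-starter-parent setup; a Gradle build would change the plugin version instead). Per the release notes cited above, 2.7.17 transitively brings in httpcore 4.4.16 and Tomcat 9.0.82, which contain the CLOSE_WAIT fixes:

```xml
<!-- pom.xml: pin Spring Boot to 2.7.17, the last 2.x release compatible
     with Java 8; it pulls in httpcore 4.4.16 and Tomcat 9.0.82 -->
<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.7.17</version>
    <relativePath/>
</parent>
```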

HBase Performance Optimization - Page 3

  Refer to the previous article here @ https://querydb.blogspot.com/2023/10/hbase-performance-optimization-page2.html It was determined that for Spark jobs using org.apache.hadoop.hive.hbase.HBaseStorageHandler , the following were set by default - "cacheBlocks":true "caching":-1 As we scan frequently, most of the HBase block cache was occupied by analytical tables. Also, a caching value of -1 means there will be an RPC call for every row; for example, if the ABC table has 30 million records, each scan leads to that many calls. Finally, we were able to figure out a solution: set the following properties on the Hive-on-HBase table - alter table T1 set TBLPROPERTIES('hbase.scan.cacheblock'='false'); alter table T1 set TBLPROPERTIES('hbase.scan.cache'='1000'); With the above properties set, scan data won't be cached, and the number of RPC calls to HBase is reduced. For example, the ABC table with 30 million reco...
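The RPC savings are easy to quantify: with per-row fetching (the behavior the article attributes to caching of -1), a 30-million-row scan costs roughly 30 million RPCs, while hbase.scan.cache = 1000 batches up to 1000 rows per call. A back-of-the-envelope sketch:

```python
import math

rows = 30_000_000        # rows in the example ABC table

# caching of -1 behaves like one row per RPC (per the article)
rpcs_default = rows

# with hbase.scan.cache = 1000, each RPC returns up to 1000 rows
caching = 1000
rpcs_tuned = math.ceil(rows / caching)

print(rpcs_default)      # 30000000
print(rpcs_tuned)        # 30000 -> a 1000x reduction in scan RPCs
```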