For background, see the previous article: https://querydb.blogspot.com/2023/10/hbase-performance-optimization-page2.html
We determined that for Spark jobs reading tables backed by org.apache.hadoop.hive.hbase.HBaseStorageHandler, the following Scan settings were in effect by default:
"cacheBlocks":true
"caching":-1
Because we run frequent scans, most of HBase's in-memory block cache was occupied by analytical tables. Also, caching of "-1" means no scanner caching value is set on the Scan, and in our case every row resulted in a separate RPC call. For example, if table ABC has 30 million records, each full scan leads to roughly 30 million calls.
We eventually found a solution: set the following properties on the Hive-on-HBase table:
alter table T1 set TBLPROPERTIES('hbase.scan.cacheblock'='false');
alter table T1 set TBLPROPERTIES('hbase.scan.cache'='1000');
With these properties set, blocks read by the scan are no longer cached, and the number of RPC calls to HBase drops sharply. For example, a full scan of table ABC with 30 million records needs only about 30,000 RPC calls (30,000,000 rows / 1,000 rows per call).
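The arithmetic behind these numbers can be sketched as below. This is a rough estimate only: the helper name `estimated_scan_rpcs` is hypothetical, and it assumes each RPC returns exactly `caching` rows, ignoring result-size limits, scanner open/close calls, and region boundaries.

```python
def estimated_scan_rpcs(total_rows: int, caching: int) -> int:
    """Rough number of scanner fetch RPCs for one full scan.

    Assumption: every RPC returns exactly `caching` rows
    (the last one may return fewer).
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if caching <= 0:
        raise ValueError("caching must be a positive row count")
    # Integer ceiling division, avoids floating-point rounding.
    return (total_rows + caching - 1) // caching


if __name__ == "__main__":
    rows = 30_000_000  # size of the ABC table from the example
    print("one row per call:   ", estimated_scan_rpcs(rows, 1))
    print("1000 rows per call: ", estimated_scan_rpcs(rows, 1000))
```

The same helper can be used to try other caching values before changing the table property; larger values cut round trips further but increase memory held per fetch on both client and region server.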
The Scan definition printed in the Spark application log confirms the change.
Before setting these properties:
{"startRow":"","stopRow":"","batch":-1,"cacheBlocks":true,"totalColumns":1,"maxResultSize":"-1","families":{"cf1":["name"]},"caching":-1,"maxVersions":1,"timeRange":["0","9223372036854775807"]}
After setting these properties:
{"startRow":"","stopRow":"","batch":-1,"cacheBlocks":false,"totalColumns":1,"maxResultSize":"-1","families":{"cf1":["name"]},"caching":1000,"maxVersions":1,"timeRange":["0","9223372036854775807"]}