A growing HBase cluster and the difficulty of getting physical hardware are things every enterprise deals with.
Sometimes the amount of ingest starts putting pressure on components before new hosts can be added. As I write this post, our cluster is running with 46 nodes, each hosting about 600 regions per RegionServer. This is bad for performance.
You can use the following formula to estimate the number of regions for a RegionServer:
(regionserver_memory_size) * (memstore_fraction) / ((memstore_size) * (num_column_families))
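As a rough illustration of the formula above, here is a minimal Java sketch. The class and method names are hypothetical, and mapping `memstore_fraction` to `hbase.regionserver.global.memstore.size` and `memstore_size` to `hbase.hregion.memstore.flush.size` is an assumption on my part. The 16 GiB heap is an example value, not our cluster's setting.

```java
/** Sketch of the region-count estimate; names here are illustrative only. */
final class RegionCountEstimate {

    /**
     * @param heapBytes          RegionServer heap size in bytes
     * @param memstoreFraction   share of heap reserved for memstores
     *                           (assumed: hbase.regionserver.global.memstore.size)
     * @param memstoreFlushBytes per-region memstore flush size
     *                           (assumed: hbase.hregion.memstore.flush.size)
     * @param columnFamilies     number of column families per table
     * @return estimated number of regions, rounded down
     */
    static long estimateRegions(long heapBytes, double memstoreFraction,
                                long memstoreFlushBytes, int columnFamilies) {
        if (memstoreFlushBytes <= 0 || columnFamilies <= 0) {
            throw new IllegalArgumentException("flush size and column families must be positive");
        }
        double memstoreBudget = heapBytes * memstoreFraction;
        double perRegionCost = (double) memstoreFlushBytes * columnFamilies;
        return (long) Math.floor(memstoreBudget / perRegionCost);
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        long mib = 1024L * 1024;
        // Example values only: 16 GiB heap, 0.4 memstore fraction, 128 MiB flush size.
        System.out.println(estimateRegions(16 * gib, 0.4, 128 * mib, 1));
        System.out.println(estimateRegions(16 * gib, 0.4, 128 * mib, 2));
    }
}
```

With these example values the estimate is about 51 regions for one column family and 25 for two, which shows how far 600 regions per server is from the comfortable range.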
We noticed that HBase did not automatically reduce or merge regions on our cluster. Over time, many small or empty regions accumulate on the cluster, which degrades performance.
While researching ways to cope with this problem, we came across the following scripts -
- https://appsintheopen.com/posts/51-merge-empty-hbase-regions
- https://blg.robot-house.us/posts/merging-regions-in-hbase/
These scripts have the following limitations -
- They merge empty regions only.
- The latter Ruby script occasionally creates overlaps.
- There is no check for degenerate, splitting, or meta regions.
- There is no size-based check to ensure that a merge does not produce a store larger than hbase.hregion.max.filesize.
- They use deprecated Java functions.
- They try to merge the same region again, which often leads to a "region not found" error.
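To make the size-based check and the "do not merge the same region twice" rule concrete, here is a minimal planning sketch in plain Java. It is not the utility's source code and does not call the HBase API. The `Region` class, the `eligible` flag (standing in for "not a splitting, degenerate, or meta region"), and `planMerges` are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of choosing adjacent region pairs to merge; not an HBase class. */
final class MergePlanSketch {

    /** Hypothetical, simplified stand-in for region metadata. */
    static final class Region {
        final String name;
        final long sizeBytes;
        final boolean eligible; // false for e.g. splitting or meta regions

        Region(String name, long sizeBytes, boolean eligible) {
            this.name = name;
            this.sizeBytes = sizeBytes;
            this.eligible = eligible;
        }
    }

    /**
     * Walks regions in start-key order and pairs each region with its right-hand
     * neighbour when both are eligible and their combined size fits the limit.
     */
    static List<String[]> planMerges(List<Region> sortedByStartKey, long maxFileSizeBytes) {
        List<String[]> plan = new ArrayList<>();
        int i = 0;
        while (i + 1 < sortedByStartKey.size()) {
            Region left = sortedByStartKey.get(i);
            Region right = sortedByStartKey.get(i + 1);
            boolean fits = left.sizeBytes + right.sizeBytes <= maxFileSizeBytes;
            if (left.eligible && right.eligible && fits) {
                plan.add(new String[] {left.name, right.name});
                i += 2; // both regions are consumed, so neither is merged twice
            } else {
                i += 1; // try the next adjacent pair
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        List<Region> regions = Arrays.asList(
            new Region("r1", 0L, true), new Region("r2", 0L, true),
            new Region("r3", 5000L, true), new Region("r4", 6000L, true),
            new Region("r5", 100L, true), new Region("r6", 200L, true));
        for (String[] pair : planMerges(regions, 10240L)) {
            System.out.println(pair[0] + " + " + pair[1]);
        }
    }
}
```

In this example r3 and r4 are not merged because their combined size exceeds the limit, so the walk moves on and pairs r4 with r5 instead. A single pass like this leaves some candidates for a later run, which is a deliberate trade-off to keep each region in at most one merge request.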
sh merge_hbase_table.sh "<HBase Table Name>" true
Java Source code location -
This utility uses Online Merge https://docs.cloudera.com/runtime/7.2.10/managing-hbase/topics/hbase-online-merge.html , which issues an asynchronous command to merge adjacent regions.
This utility helped us bring the region count down from 18652 to 14500 in a few minutes. However, run it during off-hours, because merging regions may trigger minor and major compactions, which in turn can degrade performance.
Another way is to set an appropriate SPLIT_POLICY on the HBase table -
alter '<TABLE_NAME>', {'SPLIT_POLICY' => '<SPLIT_POLICY>'}
<SPLIT_POLICY> can be replaced with one of the following -
- org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy
A RegionSplitPolicy implementation which splits a region as soon as any of its store files exceeds a maximum configurable size.
- org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy
The split size is the number of regions of the same table hosted on this server, cubed, times twice the region flush size, or the maximum region split size, whichever is smaller.
For example, if the flush size is 128MB, the first region splits after two flushes (256MB). That produces two regions, each of which splits when its size reaches 2^3 * 128MB * 2 = 2048MB.
If one of these regions splits, there are three regions and the split size becomes 3^3 * 128MB * 2 = 6912MB, and so on until the configured maximum file size is reached, which is used from then on.
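The arithmetic above can be reproduced with a small Java sketch. It follows only the rule as described in this post and is not the HBase implementation. The class and method names are hypothetical, and the 10 GiB maximum file size used in the example is an assumed value for illustration.

```java
/** Sketch of the split-size rule described above; not the HBase implementation. */
final class SplitSizeSketch {

    /**
     * @param regionCount      regions of the same table hosted on this server
     * @param flushSizeBytes   region flush size in bytes
     * @param maxFileSizeBytes configured maximum region file size in bytes
     * @return min(regionCount^3 * 2 * flushSize, maxFileSize)
     */
    static long splitSizeBytes(int regionCount, long flushSizeBytes, long maxFileSizeBytes) {
        if (regionCount < 1 || flushSizeBytes <= 0 || maxFileSizeBytes <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        try {
            long n = regionCount;
            long cubed = Math.multiplyExact(Math.multiplyExact(n, n), n);
            long candidate = Math.multiplyExact(cubed, Math.multiplyExact(2L, flushSizeBytes));
            return Math.min(candidate, maxFileSizeBytes);
        } catch (ArithmeticException overflow) {
            // The product exceeds a long, so it is certainly above the maximum.
            return maxFileSizeBytes;
        }
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long flush = 128 * mib;
        long max = 10L * 1024 * mib; // assumed 10 GiB limit, for illustration
        for (int regions = 1; regions <= 4; regions++) {
            System.out.println(regions + " region(s): "
                + splitSizeBytes(regions, flush, max) / mib + " MB");
        }
    }
}
```

With a 128MB flush size this gives 256MB, 2048MB, and 6912MB for one, two, and three regions, matching the figures above. At four regions the computed value (16384MB) exceeds the assumed 10 GiB limit, so the limit is used instead.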
- org.apache.hadoop.hbase.regionserver.BusyRegionSplitPolicy
This class represents a split policy which makes the split decision based on how busy a region is. The metric that is used here is the fraction of total write requests that are blocked due to high memstore utilization. This fractional rate is calculated over a running window of "hbase.busy.policy.aggWindow" milliseconds. The rate is a time-weighted aggregated average of the rate in the current window and the true average rate in the previous window.
- org.apache.hadoop.hbase.regionserver.DelimitedKeyPrefixRegionSplitPolicy
- org.apache.hadoop.hbase.regionserver.KeyPrefixRegionSplitPolicy
- org.apache.hadoop.hbase.regionserver.SteppingSplitPolicy