Avoid HBase Region Server Hot-spotting
HBase
HBase follows a master/worker architecture. Region servers serve data for reads and writes. When accessing data, clients communicate with the region servers directly, while the HBase Master handles region assignment and the creation and deletion of tables. HBase runs on top of the Hadoop Distributed File System (HDFS) and stores all of its data in HDFS files. HBase also uses ZooKeeper to coordinate distributed tasks and track the cluster's health. The HDFS DataNodes store the region servers' data, and region servers are collocated with the DataNodes to preserve data locality. HBase tables are divided horizontally by row key range into regions; a region contains all the rows in the table between the region's start key and end key. HBase maintains a special catalog table called the META table, which tracks the location of every region in the cluster, and ZooKeeper stores the location of the META table.
HBase Region Server Hot-spotting
Records in HBase are stored sorted by row key in lexicographic order. This allows fast access to an individual record by its key and fast fetching of a range of data between given start and end row keys. Row keys often follow a natural sequence at insertion time (for example, timestamps or auto-incremented IDs), and this can cause region server hot-spotting. Because regions are defined by row key ranges, HBase stores rows with similar keys in the same region. When records with sequential keys are written to HBase, all of these writes hit one region, so a large amount of client traffic is directed at one node, or only a few nodes, of the cluster. This would not be a problem if a region were served by multiple region servers, because the writes would then be spread across several servers even though they target a single region.
In the typical situation, however, each region lives on just one region server, and each region has a predefined maximum size. When a region reaches that size, it is split into two smaller regions. One of the new regions then takes all new records and becomes the next hotspot. This limits write throughput to the capacity of a single server instead of making use of multiple or all nodes in the HBase cluster.
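The effect described above can be illustrated with a small simulation. This is only a sketch under assumed conditions: the region start keys in `REGION_START_KEYS` and the helper names `region_for` and `write_distribution` are made up for illustration and do not call any HBase API. It shows how a run of sequential, zero-padded keys all falls inside a single key range.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical region start keys; the first region starts at the empty key.
REGION_START_KEYS = ["", "2000", "4000", "6000", "8000"]


def region_for(row_key: str) -> int:
    """Return the index of the region whose key range contains row_key."""
    # Regions cover [start_key, next_start_key), compared lexicographically.
    return bisect_right(REGION_START_KEYS, row_key) - 1


def write_distribution(row_keys):
    """Count how many writes land in each region."""
    return Counter(region_for(key) for key in row_keys)


# 1,000 sequential, zero-padded keys: "8000" .. "8999".
sequential_keys = [f"{i:04d}" for i in range(8000, 9000)]

# Every write lands in the last region (index 4); the others stay idle.
print(write_distribution(sequential_keys))
```

In this toy setup, four of the five regions receive no writes at all, which is the pattern the solutions below try to break.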
Solutions
Add salt to the row key
Include random data at the start of a row key (a randomly assigned prefix) so that it sorts differently than it otherwise would. The number of distinct salt prefixes can correspond to the number of regions across which we want to spread the data.
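A minimal sketch of salting, assuming four salt values and a `salt_originalkey` layout. The names `NUM_SALT_BUCKETS`, `salted_key`, and `candidate_keys` are hypothetical and only build key strings; no HBase client is involved.

```python
import random

# Assumed number of salt prefixes, e.g. the number of regions to spread over.
NUM_SALT_BUCKETS = 4


def salted_key(original_key: str) -> str:
    """Prefix the key with a randomly chosen salt in [0, NUM_SALT_BUCKETS)."""
    salt = random.randrange(NUM_SALT_BUCKETS)
    return f"{salt}_{original_key}"


def candidate_keys(original_key: str) -> list:
    """All salted keys under which a row with this original key may be stored."""
    return [f"{salt}_{original_key}" for salt in range(NUM_SALT_BUCKETS)]


stored_as = salted_key("20240101-000123")
print(stored_as)
print(candidate_keys("20240101-000123"))
```

Because the salt is random, a reader that only knows the original key cannot tell which prefix was used and may need to try every candidate key. That read-side cost is the trade-off for the evenly spread writes.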
Use of Hashed Row Key
This approach is suitable when an application reads one record at a time. Records are spread across multiple regions and region servers according to the hash function. Because the hash is deterministic, the client can reconstruct the complete row key and use a Get operation to retrieve the row as usual.
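As a rough illustration, a hashed row key could be built as follows. The choice of SHA-256, the four-character prefix length, and the function name `hashed_key` are assumptions made for this sketch, not something the approach prescribes.

```python
import hashlib


def hashed_key(original_key: str, prefix_len: int = 4) -> str:
    """Prefix the key with the first hex characters of its SHA-256 digest."""
    digest = hashlib.sha256(original_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}_{original_key}"


# The same input always yields the same row key, so a reader can
# recompute it from the original key and fetch the row directly.
write_key = hashed_key("user-1001")
read_key = hashed_key("user-1001")
print(write_key == read_key)
```

Unlike a random salt, the prefix here is derived from the key itself, so a single lookup is enough. Range scans over the original key order are still lost, because neighboring original keys get unrelated prefixes.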
Reverse the row key
Reverse a fixed-width or numeric row key so that the part that changes the most often (the least significant digit) is first. This effectively randomizes row keys but sacrifices row-ordering properties.
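A small sketch of key reversal for a numeric sequence, assuming the key is zero-padded to a fixed width first. The helper names `reversed_key` and `original_number` and the default width are illustrative choices.

```python
def reversed_key(sequence_number: int, width: int = 10) -> str:
    """Zero-pad the number to a fixed width, then reverse the digits."""
    return f"{sequence_number:0{width}d}"[::-1]


def original_number(row_key: str) -> int:
    """Recover the sequence number by reversing the key again."""
    return int(row_key[::-1])


# Consecutive numbers now differ in their first character,
# so they sort far apart instead of next to each other.
print(reversed_key(1234, width=6))  # 432100
print(reversed_key(1235, width=6))  # 532100
```

The transformation is reversible, so single-row lookups still work, but a scan over a range of original sequence numbers no longer maps to a contiguous range of row keys.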
Bucketing approach
row_key = (++index % BUCKETS_NUMBER) + "_" + original_key
Where:
- index: the numeric (or any sequential) part of the specific record.
- BUCKETS_NUMBER: the number of "buckets" we want our new row keys to be spread across.
- original_key: the original key of the record we want to write.
The new row keys of bucketed records no longer form a single sequence, but the records within each bucket preserve their original order. Since writes place data in multiple buckets, a scan based on the "original" start and stop keys has to read from all of those buckets and merge the results so that the output stays sorted. The per-bucket scans can run in parallel, so read performance need not degrade significantly.
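The write and read paths of the bucketing approach can be sketched as below. This is a simulation under stated assumptions: the "table" is just a Python set of row keys, `scan_bucket` stands in for a real range scan, the caller passes the index explicitly rather than incrementing it, and all helper names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

BUCKETS_NUMBER = 3  # assumed number of buckets


def bucketed_key(index: int, original_key: str) -> str:
    """Build the row key as (index % BUCKETS_NUMBER) + "_" + original_key."""
    return f"{index % BUCKETS_NUMBER}_{original_key}"


def scan_bucket(table, bucket, start_key, stop_key):
    """Simulated range scan of one bucket; returns original keys in order."""
    prefix = f"{bucket}_"
    low, high = prefix + start_key, prefix + stop_key
    # Start key is inclusive and stop key is exclusive.
    return [key[len(prefix):] for key in sorted(table) if low <= key < high]


def scan_all_buckets(table, start_key, stop_key):
    """Scan every bucket in parallel and merge the sorted partial results."""
    with ThreadPoolExecutor(max_workers=BUCKETS_NUMBER) as pool:
        per_bucket = list(
            pool.map(
                lambda bucket: scan_bucket(table, bucket, start_key, stop_key),
                range(BUCKETS_NUMBER),
            )
        )
    # Each partial result is already sorted, so a k-way merge is enough.
    return list(heapq.merge(*per_bucket))


# Write ten sequential records; they are spread over three buckets.
table = {bucketed_key(i, f"{i:04d}") for i in range(10)}

print(scan_all_buckets(table, "0002", "0007"))
```

The merge step relies on each bucket returning its rows in original-key order, which is the property the bucketing scheme preserves.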
In conclusion, region server hot-spotting can be a significant issue when writing large amounts of data to HBase. It can slow down writes and limit the cluster's write throughput to that of a single server. Several solutions can be applied to avoid hot-spotting: adding salt to the row key, using a hashed row key, reversing the row key, and using the bucketing approach. Each solution has its advantages and disadvantages, but they all aim to spread the data evenly across multiple region servers and improve the performance of the HBase cluster. Choosing the right solution for the application's requirements can effectively mitigate region server hot-spotting.