HBase Concepts & Best Practices
Introduction
The purpose of this article is to give a crisp and quick overview of HBase internals. It is ideal for people who have some familiarity with HBase, or who are using (or planning to use) HBase in production and are looking for best practices.
Concepts
WAL (Write Ahead Log): A file on distributed storage that persists every write, used for recovery. HBase writes go to the WAL by default.
BlockCache: Read cache on each RegionServer; uses LRU eviction.
MemStore: Write buffer on each RegionServer; an in-memory map sorted by key.
- Stores write data that has not yet been persisted to disk.
- There is a separate MemStore for each region and column family.
HFiles
- Actual data stored as KeyValues on disk.
- Store a B+-tree-like index in which each key points to a 64 KB data block.
- Can carry Bloom filters and time ranges so reads can quickly skip files that cannot contain the requested cells.
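As a concrete illustration, here is a minimal sketch of enabling a row-level Bloom filter on a column family, assuming the HBase 2.x Java client API; the table and family names are placeholders:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.regionserver.BloomType;
import org.apache.hadoop.hbase.util.Bytes;

public class BloomFilterSketch {
    // Builds a table descriptor whose column family carries a ROW Bloom filter,
    // letting reads skip HFiles that cannot contain the requested row key.
    public static TableDescriptor withRowBloom(String table, String family) {
        return TableDescriptorBuilder.newBuilder(TableName.valueOf(table))
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                        .newBuilder(Bytes.toBytes(family))
                        .setBloomFilterType(BloomType.ROW)
                        .build())
                .build();
    }
}
```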
Compactions
- Minor: merges several small HFiles into larger HFiles using a merge sort (not necessarily down to one HFile).
- Major: merges all HFiles of a store into one HFile, restores data locality, and removes deleted or expired data. Because it rewrites all of the data it is a heavy I/O and network task and slows down writes while it runs (this rewriting is called write amplification).
- Region splitting: a region is split automatically when it reaches its maximum size. The RegionServer informs the master that it has split the region, and the master may move a daughter region to another RegionServer for load balancing. The data, however, stays on the original RegionServer until the next major compaction.
Write Path
- A Put is first appended to the WAL, a sequential file on disk used for recovery.
- It is then written to the MemStore, and the client receives an acknowledgement.
- Once there is enough data in the MemStore (a configured percentage of the global or the individual MemStore limit), one HFile per MemStore is flushed to disk; there can be many HFiles per region per column family on disk.
- Within a region, if one MemStore is flushed, the MemStores of the other column families are flushed as well.
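A minimal sketch of the client side of this path, assuming the HBase 2.x Java client and a hypothetical table my_table with a column family cf:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WritePathSketch {
    public static void main(String[] args) throws Exception {
        // Table and column family names are placeholders.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("my_table"))) {
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
            // put() returns only after the edit has been appended to the WAL and
            // applied to the MemStore of the owning region; the flush to an HFile
            // happens later, asynchronously.
            table.put(put);
        }
    }
}
```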
How to optimise HBase for write-heavy load
(http://hadoop-hbase.blogspot.com/2014/07/about-hbase-flushes-and-compactions.html)
- Write buffer: if auto-flush is turned off, Puts are sent to the RegionServer only once the client-side write buffer is full. It defaults to 2 MB (see the sketch after this list).
- Keep the MemStore size high (fewer flushes to disk).
- Keep the number of regions low and the region size big (8 GB, 16 GB or more). This gives each region more MemStore space and leaves fewer HFiles to merge.
- Increase hbase.hstore.compactionThreshold to allow more store files to accumulate before a minor compaction starts. You may also need to raise hbase.hstore.blockingStoreFiles and hbase.hstore.compaction.max along with it.
- Increase hbase.regionserver.thread.compaction.small to give minor compactions more threads and keep them from falling behind.
- Increase the local and global MemStore pressure limits (hbase.regionserver.global.memstore.upperLimit, hbase.regionserver.global.memstore.lowerLimit) so more data accumulates before an HFile is written.
- Pre-split the table and add a hash prefix to row keys (very important general guidelines; see Recommendations below).
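A sketch of client-side write buffering, assuming the HBase 2.x Java client, where BufferedMutator replaces the older auto-flush toggle on HTable; the table name, column family, buffer size, and property values are illustrative only, and the server-side properties normally belong in hbase-site.xml rather than client code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.BufferedMutatorParams;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteHeavySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Illustrative values for the compaction knobs mentioned above.
        conf.set("hbase.hstore.compactionThreshold", "5");
        conf.set("hbase.hstore.blockingStoreFiles", "20");

        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            BufferedMutatorParams params = new BufferedMutatorParams(TableName.valueOf("my_table"))
                    .writeBufferSize(8 * 1024 * 1024); // 8 MB client buffer instead of the 2 MB default
            try (BufferedMutator mutator = conn.getBufferedMutator(params)) {
                for (int i = 0; i < 10_000; i++) {
                    Put put = new Put(Bytes.toBytes("row-" + i));
                    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("c"), Bytes.toBytes(i));
                    mutator.mutate(put); // buffered locally, sent to RegionServers in batches
                }
                mutator.flush(); // push out whatever is still buffered
            }
        }
    }
}
```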
Read Path
- The client fetches the location of the HBase meta table from ZooKeeper, caches the region locations, and then queries the relevant RegionServer directly. It only goes back to the meta table if a query fails. The meta table is a sorted, B-tree-like structure mapping (start key, region id) to the RegionServer hosting that region.
- The data to be read can be in the BlockCache, the MemStore, or spread across many HFiles.
- A read is served first from the BlockCache, then from the MemStore. If not all cells are found there, HFiles may have to be read as well.
- If there are many un-compacted HFiles on disk, a read may have to go through all of them. This is called read amplification and slows reads down.
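A minimal read sketch, assuming the same hypothetical my_table/cf schema as in the write example:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadPathSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("my_table"))) {
            Get get = new Get(Bytes.toBytes("row-001"));
            // Restricting the Get to one column lets the RegionServer return as soon
            // as that cell is found in the BlockCache or MemStore, instead of merging
            // every HFile of the column family.
            get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
            System.out.println(value == null ? "miss" : Bytes.toString(value));
        }
    }
}
```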
How to optimise for read-heavy load (gets/small scans vs. large table scans) (https://www.slideshare.net/lhofhansl/h-base-tuninghbasecon2015ok)
Gets/Small scans
- Increase the size of the BlockCache.
- The opposite of the write-heavy tuning: fewer HFiles means faster reads, so decrease the number of store files allowed before a minor compaction starts. This keeps the number of files a read has to go through small.
- Periodically run major compactions (for data locality and a small number of HFiles); a sketch follows this list.
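A sketch of triggering a major compaction through the Java Admin API, assuming the HBase 2.x client; the table name is a placeholder, and in practice this would be scheduled during off-peak hours:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class MajorCompactSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Asynchronously requests a major compaction of the whole table.
            // Heavy: it rewrites every HFile, so run it when load is low.
            admin.majorCompact(TableName.valueOf("my_table"));
        }
    }
}
```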
Large Scans
- Turn off block caching for the scan, or decrease the BlockCache size.
- If you don't care about gets, this is ideal even for write-heavy, bulk-load clusters.
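A sketch of a large scan that bypasses the BlockCache, assuming the HBase 2.x client; the table name and caching value are illustrative:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class LargeScanSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("my_table"))) {
            Scan scan = new Scan();
            scan.setCacheBlocks(false); // do not pollute the BlockCache with one-off blocks
            scan.setCaching(500);       // rows fetched per RPC; tune for your row size
            try (ResultScanner scanner = table.getScanner(scan)) {
                long rows = 0;
                for (Result r : scanner) {
                    rows++;
                }
                System.out.println("scanned rows: " + rows);
            }
        }
    }
}
```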
Recovery
- When a RegionServer dies, all of its regions are quickly reassigned to other RegionServers.
- Its WAL is split per region and distributed among those RegionServers, which replay it to regenerate the earlier in-memory state.
Pros/Cons
Pros:
- Strong consistency: once a write returns, all reads will see the same data.
- Horizontal scaling & auto scaling
- Built-in recovery
- Integrated with Hadoop
Cons:
- WAL replay can be slow, leading to blocked or very slow reads and writes during recovery.
- Major compactions (blocking or extremely slow operations while they run).
- Not highly available: if a RegionServer is slow, the requests it serves become slow too.
Recommendations and best practices
(http://hbase.apache.org/book.html#table_schema_rules_of_thumb)
- Pre-splitting: by default a table is created with a single region, so all reads and writes go to the same RegionServer, and automatic region splitting adds heavy overhead (so avoid relying on it, especially for tables with a small number of regions). Under heavy write load a region may not manage to split and can grow to 5-10x the default size. Ideally pre-split with hex splits or custom split points, and decide the number of regions based on the expected load and data size (see the table-creation sketch after this list).
- Key prefix hashing: hot-spotting arises from unevenly distributed load or monotonically increasing keys. A hash prefix ensures a near-uniform distribution of load across regions.
- Row-key construction and size: design the row key so that every use case can be satisfied by gets or small prefix scans. Keep the row key short, as it is stored again with every column.
- Versioning: HBase can keep multiple versions of each cell; the default is 3. Unless you plan to read older versions, set it to 1.
- Small regions vs. large regions: this depends on the use case. Small regions are good if the data set is small and you still want enough splits for load distribution, or if you read the table with Apache Spark, where very large regions may not fit in memory or may perform badly. Otherwise, for big enough tables, large regions are preferred: HBase recommends a region size >= 4 GB, and >= 16 GB for very large tables.
- Number of regions per table: roughly (number of nodes in the cluster * x), where x is some small integer. If the cluster has hundreds of nodes, pick a minimum such as 10-20 regions; otherwise decide based on the expected table size and the default region size.
- Number of column families (<= 3, ideally always 1): keep it as low as possible. For all practical purposes HBase treats each column family like a separate table, except that all of a region's column families are flushed together, creating even more fragmentation. The only benefit is that you can fetch data from two or more families in a single get/scan call.
- Compression (should be on): Snappy and Gzip both save a lot of disk space (Snappy roughly 4-5x, Gzip 7-10x). The only overhead is compressing on write and decompressing on read, which can slightly increase read latencies.
- TTL: if you use HBase as a cache, set a column-level or column-family-level TTL; expired cells are cleaned up during minor compactions.
- Optimised gets and range scans: if you can limit a scan with a row range, a time range, or column filters, do so; column filters can be applied to gets too. This reduces disk seeks and network transfer (see the range-scan sketch after this list).
- Hundreds of thousands of columns per row key: HBase scales to hundreds of thousands (lakhs) of columns per row key, but be careful: a get on such a row without filters can hit timeouts or OOM issues because of the sheer data size. Rethink your use case if you really need that many columns.
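Here is a sketch that pulls several of the recommendations above together (pre-splitting, a hashed key prefix, a single short column family, one version, Snappy compression, and a TTL), assuming the HBase 2.x Java client; the table name events, family d, the 16-region split, and the saltedRowKey helper are all illustrative assumptions, not a prescribed schema:

```java
import java.security.MessageDigest;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemaSketch {
    public static void main(String[] args) throws Exception {
        TableDescriptor desc = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("events"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder
                        .newBuilder(Bytes.toBytes("d"))   // single, short column family name
                        .setMaxVersions(1)                // one version unless history is needed
                        .setCompressionType(Compression.Algorithm.SNAPPY)
                        .setTimeToLive(7 * 24 * 3600)     // 7-day TTL, in seconds
                        .build())
                .build();

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Pre-split into 16 regions on the first hex character of the hashed key prefix.
            String hex = "123456789abcdef";
            byte[][] splits = new byte[hex.length()][];
            for (int i = 0; i < splits.length; i++) {
                splits[i] = Bytes.toBytes(String.valueOf(hex.charAt(i)));
            }
            admin.createTable(desc, splits);
        }
    }

    // Hypothetical helper: prefix the natural key with a short hash so that
    // monotonically increasing keys spread evenly across the pre-split regions.
    static byte[] saltedRowKey(String naturalKey) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(naturalKey));
        String prefix = Integer.toHexString(digest[0] & 0x0f); // one hex char = 16 buckets
        return Bytes.toBytes(prefix + "_" + naturalKey);
    }
}
```

Rows written with saltedRowKey() land in the region that owns their one-character hash bucket, which spreads monotonically increasing keys across all 16 pre-split regions.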
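And a sketch of a bounded range scan, assuming the hypothetical events/d schema above; the salted prefix a_user42_, the stop-row sentinel, and the one-day time range are illustrative:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) {
            long now = System.currentTimeMillis();
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("a_user42_"))   // prefix range instead of a full-table scan
                    .withStopRow(Bytes.toBytes("a_user42_~"))   // crude upper bound for this prefix
                    .addColumn(Bytes.toBytes("d"), Bytes.toBytes("status")) // only the column we need
                    .setTimeRange(now - 24 * 3600 * 1000L, now); // only cells written in the last day
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```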
Things not covered
- Co-processors
- GC tuning
- Bulk Loading