Apache HBase — a Brief Introduction

Wei Wang
6 min read · Jul 24, 2021

HBase is an open-source, non-relational, distributed database modeled after Google’s Bigtable. It provides a fault-tolerant way of storing large quantities of sparse data and is suitable for use cases that require both low-latency random access and massive storage up to multiple petabytes.

HBase employs a shared-storage architecture, separating the computation layer from the storage layer, which typically resides on HDFS (Hadoop Distributed File System). This architectural choice enables attractive features such as smooth scaling and cost optimization. Here are some use cases where HBase may be a good candidate to consider.

  • a near-real-time, massive, distributed key-value or table-schema storage system, with data volumes from hundreds of terabytes to hundreds of petabytes, or even higher.
  • high-throughput, write-intensive applications that can tolerate latency fluctuations on the order of hundreds of milliseconds.
  • scanning by rowkey order is a required read pattern.
  • use in combination with the Hadoop ecosystem, such as MapReduce, Spark, Flink, Hive, and HDFS.
  • structured or semi-structured data schemas, optionally sparse.

Some specific use cases include:

  • E-commerce ordering systems. Daily orders and user actions can be stored in HBase for low-latency random access over a massive number of orders. Historical order data can also be kept in HBase to run offline analysis jobs via computing frameworks such as MapReduce, Spark, Flink, etc.
  • Search engines. The raw content of crawled web pages, along with analytical feature data, can be stored in HBase to compute search result rankings, etc.
  • Log aggregation and metrics. OpenTSDB (Open Time Series Database), for example, is an open-source time-series database built on top of HBase that is dedicated to metrics collection.

Some notable features of HBase are listed below.

  • Bulkload. Use cases such as data migration may need to import large amounts of data into HBase. Reading each record and writing it through the normal write path would be inefficient and sometimes infeasible. HBase provides bulkload, which writes data files directly to HDFS, bypassing most of the HBase write path to improve efficiency.
  • Coprocessor. HBase supports executing user-defined lifecycle hooks around HBase operations via the coprocessor framework, opening opportunities to extend functionality and allowing fine-grained customization.
  • Filter. HBase allows users to express filtering logic as filter objects and pushes the filtering down to the server side, avoiding RPC costs for unnecessary data (see the sketch after this list).
  • MOB. Medium-sized Object (MOB) storage focuses on optimizing values roughly between 100 KB and 10 MB in size.
  • Snapshot. One of the approaches to back up data in HBase; a new table can also be created by cloning an existing snapshot.
  • Replication. Data in one HBase cluster can be replicated to another HBase cluster by synchronizing the change history (the Write-Ahead Log, to be specific).
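
To make the Filter item above concrete, here is a minimal sketch using the standard HBase Java client (2.x API assumed): it scans a hypothetical table default:test_table and pushes a PrefixFilter down to the region servers, so non-matching rows are never shipped back over RPC. The table name, column family cf1, and qualifier cq_a are placeholders for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterScanExample {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath for cluster/Zookeeper settings.
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("default:test_table"))) {
      // The PrefixFilter is evaluated on the region servers, so rows that do not
      // match the prefix never cross the network.
      Scan scan = new Scan();
      scan.setFilter(new PrefixFilter(Bytes.toBytes("rowKey")));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
          byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("cq_a"));
          System.out.println(Bytes.toString(result.getRow()) + " -> "
              + (value == null ? "<no cq_a>" : Bytes.toString(value)));
        }
      }
    }
  }
}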

How does HBase organize data?

Tables in HBase are semi-structured: column families must be defined in the table schema before use, while column qualifiers can be added on the fly. A data cell is uniquely identified by its rowkey, column family, column qualifier, and timestamp (the version, for multi-versioned cells). Data is stored sparsely, and missing columns take no storage space at all.

HBase supports multi-versioned data, meaning that “a cell” actually consists of multiple cells, one per version, ordered by timestamp in descending order. For intuition, the data organization can be roughly pictured as the JSON-like structure below.

// table name format: "${namespace}:${table}", e.g. "default:test_table"
{
  "rowKey1": {            // rowkey locates a row
    "cf1": {              // column family, must be pre-defined in the table schema
      "cq_a": {           // column qualifier, can be an arbitrary value
        "timestamp3": "value3",
        // a cell is located by (row=rowKey1, column="cf1:cq_a", timestamp=timestamp2)
        "timestamp2": "value2",
        "timestamp1": "value1"
      },
      "cq_b": {
        "timestamp2": "value2",
        "timestamp1": "value1"
      }
    },
    "cf3": {
      "cq_m": { "timestamp1": "value1" },
      "cq_n": { "timestamp1": "value1" }
    }
  },
  "rowKey3": {
    "cf2": {
      // sparse storage: missing columns and column families take no storage space
      "cq_x": {
        "timestamp3": "value3",
        "timestamp2": "value2",
        "timestamp1": "value1"
      },
      "cq_y": { "timestamp1": "value1" }
    }
  }
}

HBase Architecture

HBase cluster architecture

An HBase cluster consists of the following components.

  • Zookeeper. To reach consensus in a distributed system, Zookeeper is used for HBase master leader election, service discovery, distributed task management, etc.
  • Master. The HBase master manages cluster metadata and schedules region servers, handling tasks such as liveness checks and rebalancing. Only the leader master instance is in charge; the follower instances stand by and take over via leader election when the leader goes down, thus achieving high availability.
  • Region Server. Each region server instance serves data within disjoint, contiguous rowkey ranges (regions), and together all region servers cover the complete rowkey space.
  • Thrift Server. HBase also provides a Thrift protocol for manipulating data. Thrift requests are proxied by Thrift servers to the region servers.
Components in an HBase cluster

How does the HBase client perform a read/write request?

The HBase client is configured with the Zookeeper information of the target HBase cluster.

  1. The client asks Zookeeper for the region server, A, that serves the metadata table ‘hbase:meta’.
  2. The client then asks region server A to look up, in the ‘hbase:meta’ table, the region server, B, that serves the target rowkey.
  3. Finally, the client asks region server B to manipulate the data of the target rowkey.

The location information is then cached in the client to avoid repeating the lookup.

The procedure of the HBase client locating the region server serving the target row
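
A minimal sketch of this bootstrap, assuming an HBase 2.x Java client: only the Zookeeper quorum of the target cluster is set explicitly (the hosts, table, and rowkey below are placeholders), and the hbase:meta lookup plus location caching described above happen transparently inside the Connection.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientLookupExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Point the client at the target cluster's Zookeeper ensemble; everything else
    // (the hbase:meta lookup and region location caching) is handled by the client library.
    conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("default:test_table"))) {
      // The first call triggers the Zookeeper + hbase:meta lookup; the resolved region
      // location is cached, so later calls for nearby rowkeys go straight to region server B.
      Result result = table.get(new Get(Bytes.toBytes("rowKey1")));
      System.out.println("row exists: " + (!result.isEmpty()));
    }
  }
}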

What happens when we read/write data to HBase?

HBase implements a data structure called the Log-Structured Merge tree (LSM-tree).

When a write request is received, the change is first recorded in the Write-Ahead Log (WAL), an append-only log persisted on disk. The data is then written to the MemStore, an in-memory data structure that keeps rows in sorted order and supports binary search as well as forward and backward iteration, typically implemented with a ConcurrentSkipListMap.

Once the size of the MemStore reaches a configured threshold, a flush is triggered to write the MemStore contents into a StoreFile on HDFS. Note that each column family has its own MemStore and StoreFiles.
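
This write path can be pictured with the toy sketch below. It is not HBase code, just a deliberately simplified illustration of the same idea: append the change to a write-ahead log, insert it into an in-memory sorted map (a ConcurrentSkipListMap), and flush the map to an immutable sorted file once it grows past a threshold. All names and thresholds are made up.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Toy LSM-style store: WAL append -> in-memory sorted map -> flush to an immutable file.
public class TinyLsmStore {
  private static final int FLUSH_THRESHOLD = 1000; // entries, not bytes, for simplicity
  private final Path walPath = Paths.get("wal.log");
  private final ConcurrentSkipListMap<String, String> memStore = new ConcurrentSkipListMap<>();
  private int flushedFiles = 0;

  public synchronized void put(String rowKey, String value) throws IOException {
    // 1. Durably record the change in the append-only write-ahead log.
    Files.write(walPath, (rowKey + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    // 2. Apply it to the in-memory sorted map (rows stay ordered by key).
    memStore.put(rowKey, value);
    // 3. Flush the map to an immutable, sorted "store file" when it gets large.
    if (memStore.size() >= FLUSH_THRESHOLD) {
      flush();
    }
  }

  private void flush() throws IOException {
    Path storeFile = Paths.get("storefile-" + (flushedFiles++) + ".txt");
    try (BufferedWriter out = Files.newBufferedWriter(storeFile, StandardCharsets.UTF_8)) {
      for (Map.Entry<String, String> e : memStore.entrySet()) {
        out.write(e.getKey() + "\t" + e.getValue());
        out.newLine();
      }
    }
    memStore.clear(); // a real system would also roll or truncate the WAL here
  }
}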

Since the data of a column family may live in the MemStore and in multiple StoreFiles within a region server, how is a row read? A merged iterator is constructed over all the StoreFiles plus the MemStore to form a globally ordered view. First, an iterator is created per StoreFile/MemStore, which iterates rows in order within that StoreFile/MemStore. Second, a min-heap is built over these iterators, using each iterator's current row as the comparison key. This guarantees that the current row of the iterator at the top of the min-heap is the next row in global order; a simplified sketch of this merge follows below.

Write-Ahead Log in HBase
Structure of a ConcurrentSkipListMap
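
The merge itself can be illustrated with the simplified sketch below: a k-way merge over several already-sorted sources using a PriorityQueue as the min-heap, keyed by each iterator's current row. This is a conceptual illustration only, not HBase's actual scanner implementation.

import java.util.*;

public class MergedIterator {
  // Wraps a sorted iterator together with its current element.
  private static final class Cursor {
    final Iterator<String> it;
    String current;
    Cursor(Iterator<String> it) { this.it = it; this.current = it.next(); }
  }

  // Merges several individually sorted row streams into one globally sorted stream.
  public static List<String> merge(List<List<String>> sortedSources) {
    PriorityQueue<Cursor> heap =
        new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.current)); // min-heap on current row
    for (List<String> source : sortedSources) {
      if (!source.isEmpty()) {
        heap.add(new Cursor(source.iterator()));
      }
    }
    List<String> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      Cursor top = heap.poll();   // smallest current row across all sources
      merged.add(top.current);
      if (top.it.hasNext()) {     // advance and re-insert to keep the heap valid
        top.current = top.it.next();
        heap.add(top);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<String> memStore   = Arrays.asList("row2", "row5");
    List<String> storeFile1 = Arrays.asList("row1", "row4");
    List<String> storeFile2 = Arrays.asList("row3", "row6");
    System.out.println(merge(Arrays.asList(memStore, storeFile1, storeFile2)));
    // -> [row1, row2, row3, row4, row5, row6]
  }
}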

Best Practices

  • Rowkey length. The rowkey is stored as part of every cell belonging to the row, so a long rowkey is duplicated many times on disk. Keep the rowkey as short as possible while still carrying sufficient information.
  • Hotkey problem. Since each region server serves contiguous rowkey ranges (regions), some region server instances can become overloaded if most of the traffic targets a narrow subset of the rowkey space, causing instability or even unavailability. A common mitigation is salting or hashing a prefix into the rowkey, as sketched after this list.
  • Column family count. Typically, the number of column families should be limited to no more than three; too many column families can degrade performance.
  • Value size. Usually the value size should be kept below 10 MB; for values in roughly the 100 KB to 10 MB range, consider enabling the MOB feature.
  • Number of regions. The average data size of a region should be kept between 10 and 50 GB.
  • Reuse the client instance as much as possible. As described above, the client needs up to three requests (Zookeeper, ‘hbase:meta’, then the target region server) to locate a row, so creating a new client instance per request would be quite inefficient. Moreover, the resulting load on the meta region server and Zookeeper can turn them into a bottleneck and even make the cluster unavailable.
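
As a small, hedged illustration of the hotkey mitigation mentioned in the list above, the sketch below prepends a hash-derived salt bucket to an otherwise monotonically increasing key (for example, a time-prefixed order id), spreading writes across regions. The bucket count and key layout are made up for illustration.

public class SaltedRowKey {
  private static final int SALT_BUCKETS = 16; // roughly match the number of regions

  // Prepends a stable two-digit salt so sequential keys spread across buckets/regions.
  public static String salt(String naturalKey) {
    int bucket = (naturalKey.hashCode() & 0x7fffffff) % SALT_BUCKETS;
    return String.format("%02d_%s", bucket, naturalKey);
  }

  public static void main(String[] args) {
    // Monotonically increasing keys (e.g. time-ordered order ids) would all hit one
    // region; salting distributes them while keeping the natural key reconstructible.
    System.out.println(salt("20210724120000_order-1001"));
    System.out.println(salt("20210724120001_order-1002"));
  }
}

The tradeoff is that a range scan in natural key order now has to fan out across all salt buckets.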


