An introduction to Apache HBase

Hands On Apache Hbase
May 28, 2020

Written by: Saad Haddadi, Kaoutar Oulahyane, Mohamed Ouftou

As the world becomes more digital and the amount of data to process grows day by day, traditional database management systems have become unable to handle and query this data efficiently. These limitations led to the development of new solutions such as HBase, the main topic of this article.

In this article we are going to cover the basics of HBase and its major components’ functionality.

Overview

History

  • In November 2006, Google released its paper on BigTable.
  • The first HBase prototype was created as a Hadoop contribution in February 2007.
  • The first usable HBase was released in October 2007, along with Hadoop 0.15.0.
  • HBase became a subproject of Hadoop in January 2008.
  • In May 2010, HBase became an Apache top-level project.

Definition

HBase is an open-source, distributed, column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS). It is horizontally scalable.

Hadoop is a framework for handling large datasets in a distributed computing environment.

So what is a column-oriented database?


A column-oriented DBMS (or columnar database management system) is a database management system that stores data tables by column rather than by row. From the user's point of view, working with a column store differs little from working with a row store in the relational DBMS world.

Both columnar and row databases can use traditional database query languages like SQL to load data and perform queries. Both row and columnar databases can become the backbone in a system to serve data for common extract, transform, load (ETL) and data visualization tools.

However, by storing data in columns rather than rows, the database can more precisely access the data it needs to answer a query rather than scanning and discarding unwanted data in rows. Query performance is increased for certain workloads.

For example, a common method of storing a table is to serialize each row of data, one complete record after another. A column-oriented database instead serializes all of the values of a column together, then the values of the next column, and so on.
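To make this concrete, take a small employee table (the values are made up purely for illustration):

RowId  EmpId  Lastname  Firstname  Salary
001    10     Smith     Joe        60000
002    12     Jones     Mary       80000
003    11     Johnson   Cathy      94000

A row-oriented store serializes one complete record after another:

001:10,Smith,Joe,60000; 002:12,Jones,Mary,80000; 003:11,Johnson,Cathy,94000

A column-oriented store serializes each column's values together:

10:001,12:002,11:003; Smith:001,Jones:002,Johnson:003; Joe:001,Mary:002,Cathy:003; 60000:001,80000:002,94000:003

A query such as "average salary" now only has to read the salary column, instead of scanning and discarding every other field of every row.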

Features

Scalability

HBase is horizontally scalable, so what do we mean by that?

To understand horizontal scalability, we need to compare it with vertical scalability.

Horizontal scalability is the ability to increase capacity by connecting multiple hardware or software entities so that they work as a single logical unit. When servers are clustered, the original server is scaled out horizontally. If a cluster requires more resources to improve performance and provide high availability (HA), an administrator can scale out by adding more servers to the cluster. An important advantage of horizontal scalability is that it allows administrators to increase capacity on the fly. Another advantage is that, in theory, horizontal scalability is limited only by how many entities can be connected successfully.

Vertical scalability, on the other hand, increases capacity by adding more resources, such as more memory or an additional CPU, to a machine. Scaling vertically, which is also called scaling up, usually requires downtime while new resources are being added and has limits that are defined by hardware.

What should you consider when choosing horizontal over vertical scalability?

“Scaling horizontally has both advantages and disadvantages. For example, adding inexpensive commodity computers to a cluster might seem to be a cost-effective solution at first glance, but it’s important for the administrator to know whether the licensing costs for those additional servers, the additional operations cost of powering and cooling, and the large footprint they will occupy in the data center truly make scaling horizontally a better choice than scaling vertically.”

Automatic recovery from failure using the write-ahead log (WAL)

  • HFile: stores the rows of data as sorted key/value pairs on disk.
  • MemStore: a write cache that holds new data that has not yet been written to disk; there is one MemStore per column family per region.
  • An HBase Store hosts a MemStore and zero or more StoreFiles (HFiles). A Store corresponds to a column family of a table in a given region.
  • The write-ahead log (WAL) records all changes to data in HBase to file-based storage. If a RegionServer crashes or becomes unavailable before its MemStore is flushed, the WAL ensures that the changes to the data can be replayed.
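On the client side, how aggressively each write is persisted to the WAL can be controlled explicitly. Here is a minimal sketch with the HBase Java client; the table and column names are invented for illustration, and the full connection setup is shown in the Java API section below:

Put put = new Put(Bytes.toBytes("row-42"));
put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("payload"), Bytes.toBytes("value-1"));
// SYNC_WAL (the default) syncs the WAL entry to the filesystem before the
// write is acknowledged; SKIP_WAL trades crash safety for write speed.
put.setDurability(Durability.SYNC_WAL);
table.put(put);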

Consistency

Consistency in database systems refers to the requirement that any given database transaction must change affected data only in allowed ways. Any data written to the database must be valid according to all defined rules, including constraints, cascades, triggers, and any combination thereof.

Writes in HBase are always performed under a strong consistency model, which guarantees that they are ordered and replayed in the same order by all copies of the data. Under timeline consistency, by contrast, get and scan requests can be answered from data that may be stale.
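When a table has region replicas enabled, the Java client can opt into timeline-consistent reads per request. A sketch, assuming the connection and table setup shown in the Java API section below:

Get get = new Get(Bytes.toBytes("row-42"));
get.setConsistency(Consistency.TIMELINE); // allow a secondary replica to answer
Result result = table.get(get);
if (result.isStale()) {
    // the result came from a secondary replica and may lag behind the primary
}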

Java API client

The Java client API for HBase is used to perform CRUD (create, read, update, delete) operations on HBase tables. HBase is written in Java and ships with a native Java API, which provides programmatic access to data manipulation language (DML) operations.
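Below is a minimal, self-contained sketch of the client API. The cluster configuration is read from hbase-site.xml; the table name ("users") and column names are assumptions for illustration, and the table would need to exist with a "personal" column family:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Create / update: a Put is keyed by row key
            Put put = new Put(Bytes.toBytes("user-1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read: fetch the row back with a Get
            Result result = table.get(new Get(Bytes.toBytes("user-1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))));

            // Delete the row
            table.delete(new Delete(Bytes.toBytes("user-1")));
        }
    }
}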

Block cache

HBase supports a block cache to improve read performance. When the block cache is enabled and there is room remaining, data blocks read from StoreFiles on HDFS during a scan are cached in the region server's Java heap, so that subsequent accesses to data in the same block can be served from the cache. The block cache helps reduce the disk I/O needed to retrieve data.

The block cache is configurable at the column-family level: different column families can have different cache priorities, or can disable the block cache entirely. Applications can use this mechanism to match different data sizes and access patterns.
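Both levels are exposed through the Java API. A sketch with the HBase 2.x descriptor builders (table and family names are invented; it assumes an Admin handle obtained via connection.getAdmin() from the setup above):

// Per-family: cache frequently-read data, skip caching for cold data
TableDescriptor desc = TableDescriptorBuilder.newBuilder(TableName.valueOf("events"))
    .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("hot"))
        .setBlockCacheEnabled(true)    // the default
        .build())
    .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cold"))
        .setBlockCacheEnabled(false)   // don't let rarely-read data evict hot blocks
        .build())
    .build();
admin.createTable(desc);

// Per-scan: a one-off full table scan should not pollute the cache
Scan scan = new Scan();
scan.setCacheBlocks(false);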

Bloom filter

A Bloom filter, named for its creator, Burton Howard Bloom, is a data structure which is designed to predict whether a given element is a member of a set of data. A positive result from a Bloom filter is not always accurate, but a negative result is guaranteed to be accurate. Bloom filters are designed to be “accurate enough” for sets of data which are so large that conventional hashing mechanisms would be impractical.

In terms of HBase, Bloom filters provide a lightweight in-memory structure to reduce the number of disk reads for a given Get operation to only the StoreFiles likely to contain the desired Row. The potential performance gain increases with the number of parallel reads.

Consider how Bloom filters reduce the number of I/O operations. The StoreFiles of a column family all cover a similar spread of row keys, even though each file may actually hold updates for only a few specific rows. The block index spans the entire row key range, and therefore almost always reports that a block might contain a given row; without a Bloom filter, the region server would need to load every candidate block to check whether it actually contains a cell of that row.
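Bloom filters are likewise enabled per column family. A sketch with the HBase 2.x API (names are invented; BloomType.ROW indexes row keys only, while ROWCOL also covers column qualifiers at the cost of a larger filter):

ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("d"))
    .setBloomFilterType(BloomType.ROW) // lets Gets skip StoreFiles that cannot hold the row
    .build();
admin.modifyColumnFamily(TableName.valueOf("events"), cf);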

HBase vs RDBMS

HBase Architecture

HBase column-oriented storage

Whereas an RDBMS stores table records in a sequence of rows (row-oriented storage), HBase is a column-oriented database, storing table records in a sequence of columns.

  • Row key: the identifier of a row, used to make record lookups fast.
  • Column families: groupings of columns. Data belonging to the same column family can be accessed together in a single seek, allowing faster processing.
  • Column qualifiers: each column's name is known as its column qualifier.
  • Cell: the storage unit for data. Each cell is addressed by a row key and a column qualifier.
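Putting these pieces together: a cell is addressed by (row key, column family, column qualifier) and also carries a timestamp for versioning. A small sketch reading one specific cell (names are invented, reusing the client setup from the Java API section above):

Get get = new Get(Bytes.toBytes("user-1"));                       // row key
get.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));  // family + qualifier
Result result = table.get(get);
for (Cell cell : result.rawCells()) {
    // each Cell carries its full coordinates plus a timestamp (version)
    System.out.println(cell.getTimestamp());
}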

HBase architectural components

HBase has three crucial components:

  • ZooKeeper, used for coordination and monitoring.
  • The HMaster server, which assigns regions and handles load balancing.
  • Region Servers, which serve data for reads and writes. Region Servers run on the worker machines of the Hadoop cluster; each one hosts a set of regions, an HLog (its write-ahead log), and in-memory MemStores.

To manage this system, ZooKeeper and the HMaster work together. The active HMaster sends heartbeats to ZooKeeper to signal that it is alive, and to guarantee fault tolerance, an inactive HMaster stands by as a backup.

Region Servers likewise send heartbeat signals to ZooKeeper to report their status (ready for read and write operations).

Both the Region Servers and the HMaster are connected to ZooKeeper via sessions.
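Clients also go through ZooKeeper: the only addresses a client needs are those of the ZooKeeper quorum, from which it discovers everything else. A sketch (the hostnames are made up):

Configuration conf = HBaseConfiguration.create();
// The client contacts ZooKeeper first; it never needs the HMaster's address.
conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
conf.set("hbase.zookeeper.property.clientPort", "2181");
try (Connection connection = ConnectionFactory.createConnection(conf)) {
    // use the connection as in the earlier examples
}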

HBase: read and write mechanisms

In this section we will discuss what happens when a client reads or writes data to HBase.

HBase: read mechanism

There is a special HBase catalog table called the META table, which holds the locations of all regions in the cluster.

  • The client asks ZooKeeper for the region server that hosts the META table.
  • ZooKeeper replies with the META table's location.
  • The client queries the META server to find the region server responsible for the row key it wants to access.
  • The client caches this information along with the META table's location, so future requests can skip the lookup.
  • Finally, the client sends its read request to that region server, which returns the row or rows.
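This lookup is transparent to application code, but the client API exposes it through RegionLocator if you want to see which server owns a row. A sketch (the table and row key are invented, reusing the connection from earlier):

try (RegionLocator locator = connection.getRegionLocator(TableName.valueOf("users"))) {
    HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("user-1"));
    // the region server currently serving this row key's region
    System.out.println(location.getServerName());
}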


HBase: write mechanism

The following steps occur when a client issues a write command:

  1. The data is first written to the write-ahead log (WAL), so HBase always has the WAL to fall back on if any error occurs while writing.
  2. Once the data is written to the WAL, it is copied to the MemStore.
  3. Once the data is placed in the MemStore, the client receives an acknowledgement (ACK).
  4. When the MemStore reaches its threshold, it flushes the data into a new HFile.
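The flush threshold is governed by the hbase.hregion.memstore.flush.size property (128 MB per region by default), and an operator can also force a flush through the Admin API. A sketch (the table name is invented):

// Raise the per-region MemStore flush threshold to 256 MB
conf.setLong("hbase.hregion.memstore.flush.size", 256L * 1024 * 1024);

// Force all MemStores of a table to flush to HFiles immediately
try (Admin admin = connection.getAdmin()) {
    admin.flush(TableName.valueOf("users"));
}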

Applications of HBase

Medical

In the medical field, HBase is used for storing genome sequences and running MapReduce over them, and for storing the disease histories of individuals or whole regions.

Sports

For storing match histories for better analytics and prediction.

E-Commerce

For recording and storing logs about customer search histories, as well as for performing analytics and then targeting advertisements for better business results.

Companies Using HBase in 2019

There are many popular companies using HBase, some of them are:

1. Mozilla

Mozilla uses HBase to store all of its crash-report data.

2. Facebook

Facebook uses HBase to store real-time messages.

3. Infolinks

Infolinks, an In-Text ad provider, uses HBase to process advertisement selection and user events for its In-Text ad network. Moreover, to optimize ad selection, it feeds the reports HBase generates back into its production system.

4. Twitter

Twitter also runs HBase across its entire Hadoop cluster. For Twitter, HBase provides a distributed, read/write backup of all MySQL tables in its production backend, which lets engineers run MapReduce jobs over the data while retaining the ability to apply periodic row updates.

5. Yahoo!

Yahoo! also uses HBase, where it stores document fingerprints used to detect near-duplicates.

These are some of the most prominent HBase use cases.

Conclusion

We hope you enjoyed reading this article; we will be publishing more stories that dive deeper into HBase with practical examples.
