Introduction to HBase’s Data Model
HBase is modeled as:
A “sparse, distributed, consistent, multi-dimensional, sorted map”
We will look at what each of these terms means below. HBase is based on Google’s Bigtable and is currently an Apache top-level project. It provides random read/write access to data stored in HDFS (Hadoop Distributed File System) and leverages the capabilities provided by Hadoop and HDFS. In a future post, we will look at the architecture of how HBase stores data. This post is a high-level introduction to the data model used by HBase.
We will start by looking at what each of the terms in the above quote means, and then describe the data model using terms that we are already familiar with.
At its core, HBase is a mapping of keys to values. It serves one of the most basic functions of a data store. It stores values, indexed by a key. It retrieves values, given a key.
HBase guarantees that cells are stored sorted lexicographically by key. This allows for fast range queries (for example, we can ask HBase to return all values with keys from k1 to k4). In contrast, relational databases provide no such guarantee about the sort order of their values.
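To make the range query idea concrete, here is a minimal sketch that uses a plain Java TreeMap as a stand-in for a sorted store. It is not the HBase client API: the class name SortedRangeSketch and the helpers scan and sampleStore are made up for illustration, and it assumes ASCII string keys (so that Java's string ordering matches byte ordering) with an inclusive start key and an exclusive stop key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative stand-in for a sorted key-value store; not an HBase class.
class SortedRangeSketch {

    // Returns the values whose keys fall in [startKey, stopKey), in key order.
    static List<String> scan(TreeMap<String, String> store, String startKey, String stopKey) {
        SortedMap<String, String> slice = store.subMap(startKey, stopKey);
        return new ArrayList<>(slice.values());
    }

    // Keys are inserted out of order on purpose; the map keeps them sorted.
    static TreeMap<String, String> sampleStore() {
        TreeMap<String, String> store = new TreeMap<>();
        store.put("k3", "c");
        store.put("k1", "a");
        store.put("k5", "e");
        store.put("k2", "b");
        store.put("k4", "d");
        return store;
    }

    public static void main(String[] args) {
        // Keys k1 through k4: start at k1, stop before k5.
        System.out.println(scan(sampleStore(), "k1", "k5")); // prints [a, b, c, d]
    }
}
```

Because the keys are already in order, the range read is a single contiguous slice rather than a search over every entry.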
The key in HBase is actually made up of several parts: row key, column family, column and timestamp. Timestamp is the killer feature of HBase. It provides a way to store several versions of a value, which makes HBase a good choice for storing time series data. The key-value pair model looks like this now:
(row key, column family, column, timestamp) -> value
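The composite key can be sketched in plain Java as follows. This is only an illustration of the mapping above, not HBase's own key type: CellKey, ORDER and latest are hypothetical names, strings stand in for byte arrays, and the sketch assumes that timestamps sort newest first within a column, which is how HBase is generally described as ordering versions.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

class CellKeySketch {

    // Hypothetical composite key; the class and field names are illustrative, not HBase types.
    static final class CellKey {
        final String row;
        final String family;
        final String column;
        final long timestamp;

        CellKey(String row, String family, String column, long timestamp) {
            this.row = row;
            this.family = family;
            this.column = column;
            this.timestamp = timestamp;
        }
    }

    // Row, family and column ascending; timestamp descending so the newest version sorts first.
    static final Comparator<CellKey> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.family.compareTo(b.family);
        if (c == 0) c = a.column.compareTo(b.column);
        if (c == 0) c = Long.compare(b.timestamp, a.timestamp);
        return c;
    };

    // Returns the newest value at (row, family, column), or null if no version exists.
    static String latest(TreeMap<CellKey, String> store, String row, String family, String column) {
        // A probe with the largest possible timestamp sorts before every real version of the column.
        Map.Entry<CellKey, String> e =
                store.ceilingEntry(new CellKey(row, family, column, Long.MAX_VALUE));
        if (e == null) return null;
        CellKey k = e.getKey();
        boolean sameCell = k.row.equals(row) && k.family.equals(family) && k.column.equals(column);
        return sameCell ? e.getValue() : null;
    }

    public static void main(String[] args) {
        TreeMap<CellKey, String> store = new TreeMap<>(ORDER);
        store.put(new CellKey("row-1", "info", "temp", 100L), "20.5");
        store.put(new CellKey("row-1", "info", "temp", 200L), "21.0");
        store.put(new CellKey("row-2", "info", "temp", 150L), "18.2");
        System.out.println(latest(store, "row-1", "info", "temp")); // prints 21.0
    }
}
```

Both versions of row-1's value stay in the map; a read without an explicit timestamp simply picks the first (newest) one.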
HBase is a sparse data store in that it stores nothing for empty/null values. There is no cell for a column without a value. In HBase, null values are free to store.
HBase is built for scale. Data stored in HBase can be spread over many physical machines, and a table can hold billions of cells. HBase sits on top of HDFS, which takes care of the distribution and replication of data. In addition to scalability, this “feature” provides protection against node failures.
HBase is strongly consistent. This means that reads will always return the last written and committed value for a key. HBase guarantees that all changes within the same row are atomic.
Now that we have broken down the canonical definition of HBase, let’s take a look at some of the important terms that describe how data is stored in HBase.
The highest level of organization is the Table. This term is similar to the relational definition of the term. We organize logically independent groups of data into Tables. The diagram below shows an empty Table (we will use this diagram to iteratively build our understanding of the different terms).
Each Table is made up of 1 or more Rows. Rows provide a logical grouping of cells. Row keys are lexicographically sorted. Notice in the diagram below that ‘row-10’ is before ‘row-2’: keys are compared byte by byte, not as numbers. Row keys are just arrays of bytes, which allows us to use a variety of types of data as the key. Each row will hold the data for a certain entity. The definition of a Row in HBase is similar to its relational counterpart.
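The ‘row-10’ before ‘row-2’ ordering is easy to reproduce. The sketch below sorts keys by their raw bytes using only the Java standard library (Arrays.compareUnsigned needs Java 9 or later). The class RowKeyOrderSketch and its helpers are hypothetical, and the zero-padding helper shows one common workaround rather than anything HBase requires.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RowKeyOrderSketch {

    // Sorts keys by their UTF-8 bytes, compared as unsigned values, left to right.
    static List<String> sortAsBytes(List<String> keys) {
        List<byte[]> encoded = new ArrayList<>();
        for (String key : keys) {
            encoded.add(key.getBytes(StandardCharsets.UTF_8));
        }
        encoded.sort((a, b) -> Arrays.compareUnsigned(a, b));
        List<String> sorted = new ArrayList<>();
        for (byte[] bytes : encoded) {
            sorted.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return sorted;
    }

    // Builds a key such as row-002 so that byte order agrees with numeric order.
    static String padded(String prefix, int n, int width) {
        return String.format("%s%0" + width + "d", prefix, n);
    }

    public static void main(String[] args) {
        System.out.println(sortAsBytes(Arrays.asList("row-1", "row-2", "row-10")));
        // prints [row-1, row-10, row-2]

        System.out.println(sortAsBytes(Arrays.asList(
                padded("row-", 10, 3), padded("row-", 2, 3), padded("row-", 1, 3))));
        // prints [row-001, row-002, row-010]
    }
}
```

If numeric order matters for scans, the key has to be designed so that its byte order matches it, for instance by padding as above.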
Each Row is made up of 1 or more Columns. Columns are arbitrary labels for attributes of a row. In contrast with an RDBMS, columns do not need to be specified in advance. As soon as we PUT (insert) a value into a column, that column is implicitly created. This allows HBase to be a “semi-structured” database by giving it the flexibility to add columns on the fly, rather than declaring them when the table is initially created.
Columns are grouped into Column Families. They define storage attributes for Columns (compression, number of versions, etc.). Column Families must be declared when a Table is created, and their names must be made up of printable characters. All elements of a column family are stored together on the file system. It is also important to limit the number of Column Families to a relatively small number (we will see the reason for this in a future post).
At the intersection of a Row, Column Family and Column is a Cell. Each cell contains a value and a version (usually a timestamp). HBase allows the client to store many versions of a single cell, so data that spans a period of time can be modeled easily with HBase. Null values are not stored in Cells (see “Sparse” section above).
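The split between fixed column families and free-form columns can be sketched with nested sorted maps. This is a toy model, not the HBase API: SchemaSketch, put and columnsOf are hypothetical names, versions are left out for brevity, and throwing IllegalArgumentException for an undeclared family is a choice made for this sketch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SchemaSketch {

    // Families are fixed when the table is created.
    private final Set<String> families;

    // row key -> column family -> column -> value (versions omitted in this sketch)
    private final TreeMap<String, TreeMap<String, TreeMap<String, String>>> rows = new TreeMap<>();

    SchemaSketch(String... declaredFamilies) {
        this.families = new TreeSet<>(Arrays.asList(declaredFamilies));
    }

    // Any column name is accepted, but the family must have been declared up front.
    void put(String row, String family, String column, String value) {
        if (!families.contains(family)) {
            throw new IllegalArgumentException("Unknown column family: " + family);
        }
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .put(column, value);
    }

    // Columns that actually hold a value for this row; absent columns take no space.
    Set<String> columnsOf(String row, String family) {
        TreeMap<String, TreeMap<String, String>> byFamily = rows.get(row);
        if (byFamily == null || !byFamily.containsKey(family)) {
            return Collections.emptySet();
        }
        return byFamily.get(family).keySet();
    }

    public static void main(String[] args) {
        SchemaSketch table = new SchemaSketch("info");
        table.put("row-1", "info", "name", "Ada");
        table.put("row-2", "info", "email", "ada@example.com"); // new column, created on the fly
        System.out.println(table.columnsOf("row-1", "info")); // prints [name]
        System.out.println(table.columnsOf("row-2", "info")); // prints [email]
    }
}
```

Note that row-1 has no entry at all for email and row-2 has none for name, which is the sparseness described earlier.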
Putting it all together
Overall, the data model of HBase is a multi-dimensional key-value store. If you remember one thing from this post, it should be this:
(Table, RowKey, Family, Column, Timestamp) -> Value
Or, if you like to think in terms of Java generics: