How Was HBase Used To Solve Storage Issues In Facebook Messenger?

Shubham Sinha · Edureka
Published Nov 14, 2016 · 11 min read

As I have mentioned in my Hadoop Ecosystem blog, HBase is an essential part of the Hadoop ecosystem. In this HBase tutorial, I will introduce you to Apache HBase and then walk through the Facebook Messenger case study. We are going to cover the following topics:

  • History of Apache HBase
  • Introduction to Apache HBase
  • NoSQL Databases and Their Types
  • HBase vs Cassandra
  • Apache HBase Features
  • HBase vs HDFS
  • Facebook Messenger Case Study

History of Apache HBase

Let us start with the history of HBase and know how HBase has evolved over a period of time.

  • Apache HBase is modeled after Google’s BigTable, which is used to collect data and serve requests for various Google services like Maps, Finance, Earth, etc.
  • Apache HBase began as a project by the company Powerset for Natural Language Search, which was handling massive and sparse data sets.
  • Apache HBase was first released in February 2007. Later in January 2008, HBase became a sub-project of Apache Hadoop.
  • In 2010, HBase became Apache’s top-level project.

After knowing the history of Apache HBase, you may be curious to know what Apache HBase is. Let us move further and take a look.

Introduction to HBase

HBase is an open-source, multidimensional, distributed, scalable NoSQL database written in Java. HBase runs on top of HDFS (Hadoop Distributed File System) and provides BigTable-like capabilities to Hadoop. It is designed to provide a fault-tolerant way of storing large collections of sparse data sets.

HBase achieves high throughput and low latency by providing fast read/write access on huge datasets. It is therefore the database of choice for applications that require fast, random access to large amounts of data.

It provides compression, in-memory operations, and Bloom filters (a probabilistic data structure that tells whether a value is definitely absent from a set or possibly present) to fulfill the requirement of fast, random reads and writes.
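To make the Bloom filter idea concrete, here is a minimal, illustrative sketch in Python (this is not HBase's actual implementation; the class and parameter names are my own):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: answers 'definitely absent' or 'possibly present'."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, value):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False -> definitely absent; True -> possibly present (false positives possible).
        return all(self.bits[pos] for pos in self._positions(value))

bf = BloomFilter()
bf.add("row-00042")
print(bf.might_contain("row-00042"))  # True
print(bf.might_contain("row-99999"))  # False, except for a tiny false-positive chance
```

HBase keeps one such filter per store file, so a read can skip files that definitely do not contain the requested row.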

Let’s understand it through an example: a jet engine generates various types of data from different sensors (pressure, temperature, speed, etc.) that indicate the health of the engine. This is very useful for understanding the problems and status of the flight. Continuous engine operations generate about 500 GB of data per flight, and there are roughly 300 thousand flights per day. Engine analytics applied to such data in near real time can proactively diagnose problems and reduce unplanned downtime. This requires a distributed environment that stores large amounts of data with fast random reads and writes for real-time processing. Here, HBase comes to the rescue.

As we know, HBase is a NoSQL database. So, before going deeper into HBase, let's first discuss NoSQL databases and their types.

NoSQL Databases

NoSQL stands for "Not only SQL". NoSQL databases are modeled to represent data in formats other than the tabular format of relational databases. Different representation formats give rise to different types of NoSQL databases. Most NoSQL databases favor availability and speed over consistency. Now, let us move ahead and understand the different types of NoSQL databases and their representation formats.

Comparison of Different Types of NoSQL Databases

Key-Value stores:

It is a schema-less database that contains keys and values. Each key points to a value, which is an opaque array of bytes and can hold a string, a BLOB, XML, etc. For example, Lamborghini could be a key pointing to values such as Gallardo, Aventador, Murciélago, Reventón, Diablo, Huracán, Veneno, Centenario, etc.
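A toy in-memory sketch of the key-value model in Python (illustrative only; the function names are my own):

```python
from typing import Optional

# Toy key-value store: each key maps to an opaque array of bytes.
store = {}

def put(key: str, value: bytes) -> None:
    store[key] = value

def get(key: str) -> Optional[bytes]:
    # The store does not interpret the value; the client decides its format.
    return store.get(key)

# "Lamborghini" is a key; the value here is an opaque serialized list of models.
put("Lamborghini", b"Gallardo,Aventador,Huracan")
print(get("Lamborghini"))  # b'Gallardo,Aventador,Huracan'
```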

Key-Value stores databases: Aerospike, Couchbase, Dynamo, FairCom c-tree ACE, FoundationDB, HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, Berkeley DB.

Use-case

Key-value stores handle size well and are good at processing a constant stream of read/write operations with low latency. This makes them a good fit for user preference and profile stores; product recommendations, where the latest items viewed on a retailer's website drive future suggestions; and ad serving, where customer shopping habits produce customized ads and coupons in real time.

Document-Oriented:

It follows the same key-value model, but the value is a semi-structured document such as XML, JSON, or BSON.
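A minimal sketch of the document model in Python, showing that documents in one collection need not share a schema (the collection, field names, and `find` helper are invented for illustration):

```python
import json

# Two documents in the same collection with different (flexible) schemas.
users = {
    "u1": {"name": "Asha", "followers": 120},
    "u2": {"name": "Ravi", "email": "ravi@example.com", "tags": ["bigdata", "hbase"]},
}

def find(collection, **criteria):
    """Return documents whose fields match all given criteria."""
    return [doc for doc in collection.values()
            if all(doc.get(k) == v for k, v in criteria.items())]

print(json.dumps(find(users, name="Asha")))  # [{"name": "Asha", "followers": 120}]
```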

Document Based databases: Apache CouchDB, Clusterpoint, Couchbase, DocumentDB, HyperDex, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB.

Use-Case

Flexible schemas, fast reads and writes, and easy partitioning make document stores suitable for user databases in services like Twitter, e-commerce websites, etc.

Column-Oriented:

In these databases, data is stored in cells grouped by column rather than by row. Columns are logically grouped into column families, which can be created either during schema definition or at runtime.

These databases store all the cells of a column as a contiguous disk entry, making access and search much faster.
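The idea can be sketched in a few lines of Python (a conceptual model, not HBase's actual storage format; the class name is my own):

```python
from collections import defaultdict

# Column-oriented layout: cells are grouped per (column family, qualifier),
# so scanning one column touches a contiguous run of values.
class ColumnStore:
    def __init__(self):
        self.columns = defaultdict(dict)   # (family, qualifier) -> {rowkey: value}

    def put(self, rowkey, family, qualifier, value):
        self.columns[(family, qualifier)][rowkey] = value

    def scan_column(self, family, qualifier):
        # All cells of one column come back together, without touching other columns.
        return sorted(self.columns[(family, qualifier)].items())

cs = ColumnStore()
cs.put("row1", "cf", "price", 9.99)
cs.put("row2", "cf", "price", 4.50)
cs.put("row1", "cf", "name", "widget")
print(cs.scan_column("cf", "price"))  # [('row1', 9.99), ('row2', 4.5)]
```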

Column Based Databases: HBase, Accumulo, Cassandra, Druid, Vertica.

Use-Case

Column-oriented databases support huge storage volumes with fast read/write access. This makes them suitable for storing customer behavior on e-commerce websites, financial data such as Google Finance and stock-market feeds, Google Maps, etc.

Graph Oriented:

It represents data as a flexible graph of nodes and edges rather than SQL tables. These databases easily address scalability problems, since nodes and edges can be added as requirements grow.
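A tiny Python sketch of the node-and-edge model (purely illustrative; the node names and relation labels are invented):

```python
# Nodes carry attributes; edges are (relation, destination) pairs per node.
nodes = {"alice": {"type": "user"}, "bob": {"type": "user"}, "acct1": {"type": "account"}}
edges = {"alice": [("FRIEND", "bob"), ("OWNS", "acct1")], "bob": [("OWNS", "acct1")]}

def neighbors(node, rel=None):
    """Follow outgoing edges, optionally filtered by relation label."""
    return [dst for (r, dst) in edges.get(node, []) if rel is None or r == rel]

print(neighbors("alice", "FRIEND"))  # ['bob']
```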

Graph-based databases: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog.

Use-case

Graph databases are typically used in fraud detection, real-time recommendation engines (in most cases e-commerce), master data management (MDM), network and IT operations, identity and access management (IAM), etc.

HBase and Cassandra are the two famous column-oriented databases. Taking it a level deeper, let us compare and understand the architectural and operational differences between HBase and Cassandra.

HBase VS Cassandra

  • HBase is modeled on Google's BigTable, while Cassandra is based on Amazon's Dynamo design and was initially developed at Facebook.
  • HBase leverages Hadoop infrastructure (HDFS, ZooKeeper) while Cassandra evolved separately but you can combine Hadoop and Cassandra as per your needs.
  • HBase has several components that communicate together, like HBase HMaster, ZooKeeper, NameNode, and Region Servers. Cassandra, in contrast, has a single node type: all nodes are equal and perform all functions. Any node can act as the coordinator, which removes the single point of failure.
  • HBase is optimized for reads and supports single writes, which leads to strict consistency. HBase supports range-based scans, which makes the scanning process faster. Cassandra, by contrast, supports single-row reads with eventual consistency.
  • Cassandra does not support range based row scans, which slows the scanning process as compared to HBase.
  • HBase supports ordered partitioning, in which rows of a column family are stored in RowKey order, whereas in Cassandra ordered partitioning is a challenge. Due to RowKey partitioning, the scanning process is faster in HBase than in Cassandra.
  • HBase does not support read load balancing, one Region Server serves the read request and the replicas are only used in case of failure. While Cassandra supports read load balancing and can read the same data from various nodes. This can compromise the consistency.
  • In terms of the CAP theorem (Consistency, Availability, Partition Tolerance), HBase maintains Consistency and Availability, while Cassandra focuses on Availability and Partition Tolerance.

Now let’s take a deep dive and understand the features of Apache HBase which makes it so popular.

Features of HBase

  • Atomic read and write: HBase provides atomic reads and writes at the row level. While one read or write on a row is in progress, all other processes are prevented from reading or writing that row.
  • Consistent reads and writes: HBase provides consistent reads and writes due to the above feature.
  • Linear and modular scalability: as data sets are distributed over HDFS, HBase is linearly scalable across nodes and modularly scalable: capacity grows simply by adding nodes.
  • Automatic and configurable sharding of tables: HBase tables are split into regions that are distributed across the cluster. These regions split automatically and are redistributed as the data grows.
  • Easy to use Java API for client access: It provides easy to use Java API for programmatic access.
  • Thrift gateway and RESTful web services: it also supports Thrift and REST APIs for non-Java front-ends.
  • Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
  • Automatic failover support: HBase on HDFS maintains a WAL (Write-Ahead Log), which allows automatic recovery from failures.
  • Sorted row keys: as searching is done over ranges of rows, HBase stores row keys in lexicographic order. Using these sorted row keys and timestamps, we can build optimized range requests.
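The benefit of lexicographically sorted row keys can be sketched in Python: a range scan becomes a binary search plus a contiguous slice (a conceptual model, not HBase internals; the key format below is invented):

```python
import bisect

# Row keys kept in lexicographic order, as HBase stores them.
row_keys = sorted(["user#2016-11-01", "user#2016-11-03", "user#2016-11-07",
                   "user#2016-12-01", "admin#2016-11-02"])

def range_scan(start, stop):
    # Binary-search the start and stop keys, then return the contiguous slice.
    lo = bisect.bisect_left(row_keys, start)
    hi = bisect.bisect_left(row_keys, stop)
    return row_keys[lo:hi]

# All 'user' rows from November 2016:
print(range_scan("user#2016-11-", "user#2016-12-"))
# ['user#2016-11-01', 'user#2016-11-03', 'user#2016-11-07']
```

This is why choosing a good row-key design (for example, entity followed by timestamp) matters so much in HBase: it turns common queries into cheap contiguous scans.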

Now moving ahead in this HBase tutorial, let me tell you what are the use-cases and scenarios where HBase can be used and then, I will compare HDFS and HBase.

I would like to draw your attention toward the scenarios in which the HBase is the best fit.

Where can we use HBase?

  • We should use HBase where we have large data sets (millions or billions of rows and columns) and we require fast, random and real-time, read and write access over the data.
  • The data sets are distributed across various clusters and we need high scalability to handle data.
  • The data is gathered from various sources and is semi-structured, unstructured, or a combination of both; HBase handles such data easily.
  • You want to store column-oriented data.
  • You have lots of versions of the data sets and you need to store all of them.

Before I jump to the Facebook messenger case study, let me tell you what are the differences between HBase and HDFS.

HBase VS HDFS

HDFS is a Java-based distributed file system that allows you to store large data across multiple nodes in a Hadoop cluster. So, HDFS is the underlying storage system for the distributed environment. HDFS is a file system, whereas HBase is a database (analogous to the relationship between NTFS and MySQL).

Both HDFS and HBase store any kind of data (structured, semi-structured, and unstructured) in a distributed environment, so let's look at the differences between HDFS, a file system, and HBase, a NoSQL database.

  • HBase provides low latency access to small amounts of data within large data sets while HDFS provides high latency operations.
  • HBase supports random reads and writes, while HDFS follows a WORM (Write Once, Read Many) model.
  • HDFS is primarily accessed through MapReduce jobs, while HBase is accessed through shell commands, the Java API, or the REST, Avro, and Thrift APIs.

HDFS stores large data sets in a distributed environment and leverages batch processing on that data. For example, it would help an e-commerce website store millions of customers' records accumulated over a long period (perhaps 4–5 years or more). Batch processing over that data then reveals customer behaviors, patterns, and requirements, such as which types of products customers purchase in which months. In short, HDFS helps store archived data and execute batch processing over it.

HBase, on the other hand, stores data in a column-oriented manner, where the cells of each column are stored together, so reads become fast enough for real-time processing. For example, in a similar e-commerce environment it stores millions of product records; a search for one product among millions is optimized and returns its result immediately, in real time.

As we know HBase is distributed over HDFS, so a combination of both gives us a great opportunity to use the benefits of both, in a tailored solution, as we are going to see in the below Facebook messenger case study.

Facebook Messenger Case Study

Facebook Messaging Platform shifted from Apache Cassandra to HBase in November 2010.

Facebook Messenger combines messages, email, chat, and SMS into a real-time conversation. Facebook was trying to build a scalable and robust infrastructure to handle this set of services.

At that time, the messaging infrastructure handled over 350 million users sending over 15 billion person-to-person messages per month, and the chat service supported over 300 million users sending over 120 billion messages per month.

By monitoring usage, they found that two general data patterns emerged:

  • A short set of temporal data that tends to be volatile
  • An ever-growing set of data that rarely gets accessed

Facebook wanted to find a storage solution for these two usage patterns and they started investigating to find a replacement for the existing Messages infrastructure.

Earlier, in 2008, they had adopted the open-source Cassandra database, an eventually consistent key-value store that was already serving production traffic for Inbox Search. Their teams also had deep knowledge of using and managing MySQL, so switching away from either technology was a serious concern.

They spent a few weeks testing different frameworks, to evaluate the clusters of MySQL, Apache Cassandra, Apache HBase, and other systems. They ultimately selected HBase.

MySQL failed to handle large datasets efficiently: as the indexes and datasets grew, performance suffered. Cassandra's eventual-consistency model, meanwhile, proved difficult to reconcile with the new Messages infrastructure.

The major problems were:

  • Storing the large, continuously growing sets of data from various Facebook services.
  • A database that could sustain a high processing rate on that data.
  • High performance to serve millions of requests.
  • Maintaining consistency in storage and performance.

For all these problems, Facebook came up with a solution: HBase. Facebook adopted HBase to serve Messenger, chat, email, etc., owing to its various features.

HBase comes with very good scalability and performance for this workload, along with a simpler consistency model than Cassandra. They found HBase the most suitable for their requirements, like auto load balancing and failover, compression support, and multiple shards per server.

HDFS, the underlying file system used by HBase, also provided them with several needed features, such as end-to-end checksums, replication, and automatic load re-balancing.


As they adopted HBase, they also focused on contributing their improvements back to HBase itself and started working closely with the Apache community.

Since Messages accepts data from different sources such as SMS, chats, and emails, they wrote an application server to handle all decision making for a user's messages; it interfaces with a large number of other services. Attachments are stored in Haystack, Facebook's blob-storage system. They also wrote a user discovery service on top of Apache ZooKeeper, which talks to other infrastructure services for friend relationships, email account verification, delivery decisions, and privacy decisions.

The Facebook team spent a lot of time confirming that each of these services was robust, reliable, and performant enough to handle a real-time messaging system.

I hope this HBase tutorial article is informative and you liked it. In this article, you got to know the basics of HBase and its features.

If you wish to check out more articles on the market’s most trending technologies like Artificial Intelligence, Python, Ethical Hacking, then you can refer to Edureka’s official site.

Do look out for other articles in this series which will explain the various other aspects of Big data.

1. Hadoop Tutorial
2. Hive Tutorial
3. Pig Tutorial
4. MapReduce Tutorial
5. Big Data Tutorial
6. HDFS Tutorial
7. Hadoop 3
8. Sqoop Tutorial
9. Flume Tutorial
10. Oozie Tutorial
11. Hadoop Ecosystem
12. Top Hive Commands with Examples in HQL
13. Hadoop Cluster With Amazon EMR?
14. Big Data Engineer Resume
15. Hadoop Developer: Job Trends and Salary
16. Hadoop Interview Questions

Originally published at www.edureka.co on November 14, 2016.


Shubham Sinha, Edureka
Big Data enthusiast, loves solving real-world Big Data problems.