A Glimpse into the World of Embedded Database Feat. RocksDB

Kartik Khare
Walmart Global Tech Blog
6 min readNov 20, 2019

RocksDB is a database created by Facebook. It does not support SQL, doesn’t provide ACID guarantees, and can’t run in a distributed fashion. Still, it’s one of the most popular databases in the developer ecosystem. It is used in high-scale frameworks such as Apache Flink and CockroachDB.

So what makes RocksDB a compelling alternative over its competitors?

To understand its use-case, first, we need to take a look into the world of embedded databases.

Embedded database

As the name implies, developers place embedded databases in the service/application using it. It means that if your application is running in container/server A, the database will also be present in the same place. Network calls lead to significant latency. By avoiding them, you can reduce the DB access time by order of magnitudes.

Latency Numbers Every Programmer Should Know by Jonas Bonér

However, the apparent limitation of this design choice is that neither you won’t be able to guarantee the availability of the database nor you will be able to store the data beyond the capacity of your current machine.

Embedded Database Architecture

When should you use an embedded database?

You should use embedded DBs in cases when a lot of data needs to be stored, but the data can be recreated quickly and is not critical. Storing such data in the memory of your application is unreliable whereas storing on disk in some simple format will lead to high latencies.

One of the most popular databases in this domain is SQLite. It is extensively used by Android Applications to store details such as if the user is already logged in or not along with the user id, dob, address, etc. Fetching every data from the server is expensive, while once you close your application, the data in the memory will be gone.

RocksDB

Although SQLite is fine in most cases, it suffers from a critical limitation. SQLite is a single-thread database and doesn’t support concurrent access. This inability to multi-thread makes it perform extremely slow in high throughput scenarios.

Due to this limitation, Google created LevelDB, which supported multi-threading. Facebook used LevelDB as a backbone to create RocksDB. Another popular embedded database that was created during this time is LMDB. LMDB has an entirely different architecture than LevelDB and performs remarkably better in specific scenarios. However, to keep the content short, I’ll be focusing on RocksDB in this article.

Now, let’s take a look at some of the concepts common to LevelDB and RocksDB, which make them much more powerful than their predecessors.

SSTables

Google created BigTable, which used SSTable underneath to store data. The same SSTables are used in RocksDB. SSTables store the data in key-value fashion. The data is also sorted based on keys. Sorting allows fast access to data since it is easy to search in sorted key-range.

SSTables are also immutable by nature. It means that you can read data from SSTable without worrying about it getting modified in mid (if there is a single SSTable).

Both LevelDB and RocksDB share the same table format except that RocksDB contains more types of metadata blocks. The metadata blocks contain file properties such as the dictionary used to prime the compression library or the filters to be used during compaction. You can find the general SSTable format below:

<beginning_of_file>
[data block 1]
[data block 2]
...
[data block N]
[meta block 1]
...
[meta block K]
[metaindex block]
[index block]
[Footer] (fixed size; starts at file_size - sizeof(Footer))
<end_of_file>

The file contains internal pointers. Each such pointer is called a BlockHandle and contains the following information:

offset:   varint64
size: varint64

Log Structure Merged Tree

If you store the data in just a single SSTable, it can grow quite large and thus reduce the access time. Also, you can’t simply create multiple SSTables of fixed sizes as it will lead to the same issue as the number of files grows.

LSM trees were created to solve the above problem. When SSTable of fixed size grows beyond a number, they are merged to form a single SSTable of larger size. This step is known as compaction, which happens in the background. e.g., if the number of 10MB SSTable files go beyond 10, they are merged to create a single SSTable file of 100MB. For the new data, a new SSTable of the smallest size will be created e.g., 1MB.

The newest data is always found in the smallest files, while the oldest data resides in larger files.

Log-Structured Merged Format

Memtables

The DB first inserts the data into a simple sorted in-memory structure known as Memtable as well as a file in disk known as a commit log. This makes accessing recent data extremely fast as it is already present in the memory.

Since the Memtable is already sorted, it is dumped into the disk to create an SSTable.

Memtable, Commit Log and SSTables organization in RocksDB

Why not merely use LevelDB then?

RocksDB improves upon level DB in almost every aspect, which makes it an attractive choice compared to Google’s offering.

Let’s take a look at the measures which RocksDB took to improve upon its predecessors.

Don’t read the non-existant

BloomFilters allows a user to know quickly if a key is present in the database or not. They help in preventing unnecessary reads to the database and make query response much faster.

RocksDB takes the bloom filter approach to the next step. It implements a bloom filter for each Memtable and SSTable apart from the whole DB.

Update your keys without worries

Column families are a mechanism that helps the user to partition the data in a single DB logically. A user can do safe atomic writes to multiple keys across column families.

Column Families share the write-ahead log and don’t share memtables and table files. By sharing write-ahead logs, you get the benefit of atomic writes. By separating memtables and table files, you can configure column families independently and delete them quickly. An example usage is to store metadata in one column family and the actual data in another. Then you can apply a different type of compression and compaction strategy on metadata and data. If required, you can drop the metadata column family while retaining data.

Take a snapshot

Since RocksDB doesn’t provide ACID guarantees, the data you read can change across multiple requests, e.g., due to compaction replacing one value with another.

However, in case your application requires a consistent view of the data, you can create a consistent snapshot in RocksDB. The limitation is it won’t contain any data inserted after creating the snapshot.

Avoid Partial updates

RocksDB provides support for transactions with both pessimistic and optimistic concurrency control.

In pessimistic flow, the DB takes a lock on all the keys which a user is modifying, which makes it safe but slow. Optimistic flow doesn’t take any lock and resolve any possible conflicts due to this at the time of committing.

Other than the major changes mentioned before, RocksDB also provides some additional functionalities:

  • Out of the box Rate Limiter to control the throughput to your DB
  • Multi-threaded compactions which make them faster
  • Multiple compression algorithms such as LZ4, GZIP, SNAPPY, etc.

For more details you can refer to the articles mentioned below:

  1. SSTable and Log Structured Storage: LevelDB
  2. The History of RocksDB
  3. RocksDB tech talk

--

--

Kartik Khare
Walmart Global Tech Blog

Software Engineer @StarTree | Previously @WalmartLabs, @Olacabs | Committer @ApachePinot