How To Sound Like You Know RocksDB (1)

What People Talk About When They Talk About RocksDB

Kai Liu
RocksDB for everyone

--

As the kick-off of this collection, I’d like to give you a gentle bird’s-eye view of RocksDB. The primary goal of the “How To Sound Like You Know RocksDB” series is to lay a good foundation for prospective RocksDB code readers.

Of course, most of us may just be curious about what on earth RocksDB is and why it’s gaining increasing attention from open source communities these days. Well, congratulations: the secondary goal of this series is that, after reading these articles, even a lazy reader who knows little about RocksDB can talk about it with RocksDB fans or even experts for 15 minutes without being seen through.

Yet-another key-value store?

One of the most frequently asked questions about RocksDB is: why do we need yet another key-value store? We already have Dynamo, Redis, etc., so why did RocksDB shoulder its way into the already crowded key-value store party?

I’d say RocksDB has a totally different focus compared with those data stores. Dhruba Borthakur from Facebook has excellent slides explaining what RocksDB is. In a nutshell, RocksDB is a data store that is:

  1. Persistent: you can store your data safely in non-volatile storage.
  2. Designed for fast storage: RocksDB has been optimized to work on flash devices or as an in-memory database, though its performance is still pretty good on disk.
  3. Embedded: RocksDB is a library so you can use it as a building block for your own program.

And here are RocksDB’s non-goals:

  1. Distribution
  2. Failover
  3. High availability

Don’t panic: remember, RocksDB is just a library, and you can implement all the above-mentioned features on top of it. From this perspective, RocksDB can be viewed as a (relatively) low-level data engine that strives for high performance on fast storage.

LSM (Log-structured merge-tree)

The log-structured merge-tree is the fundamental idea that fuels RocksDB, LevelDB (RocksDB’s predecessor), BigTable, Apache HBase, etc.

We are not going to dig deep into LSM. Let’s start with one simple fact: random write (on either disk or flash) is bad.

How bad? From “What are the numbers that every computer engineer should know”, we can find that:

Disk seek: 10,000,000 ns

Read 1 MB sequentially from disk: 30,000,000 ns

That is, on disk we can do around 100 seeks per second, implying that significant overhead is incurred every time we perform a random write on disk. (Flash devices, though optimized for random reads, are still plagued by random writes.)
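The back-of-the-envelope arithmetic behind those claims can be spelled out directly from the two latency numbers above:

```python
# Deriving throughput from the latency numbers quoted above.
NS_PER_SECOND = 1_000_000_000
disk_seek_ns = 10_000_000         # one random seek: ~10 ms
seq_read_1mb_ns = 30_000_000      # reading 1 MB sequentially: ~30 ms

seeks_per_second = NS_PER_SECOND // disk_seek_ns
print(seeks_per_second)           # 100 random seeks per second

# For comparison, sequential read throughput in MB/s:
seq_mb_per_second = round(NS_PER_SECOND / seq_read_1mb_ns, 1)
print(seq_mb_per_second)          # ~33.3 MB/s when reading sequentially
```

So a workload that seeks before every write tops out at roughly 100 writes per second, while a purely sequential workload can move tens of megabytes in the same time.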

That also explains one question that puzzles many people when they first take a database class.

If we enable the transaction log (or commit log, redo log, etc.) in a database, then for each insertion or update operation we will:

1. Write the data to the log file.
2. Write the data to the database itself.

The data written to the log file is almost the same size as that written to the database. Will the write time be 2x longer?

The answer is no. This is because writing to the log file requires only sequential writes, whereas updating the database may involve multiple random seeks.
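The two-step write path can be sketched with plain Python file objects (a toy model, not any real database’s code; the function name and offset parameter are made up for illustration). Step 1 is a pure append; step 2 must seek to the record’s location first:

```python
import os

# Toy sketch of the classic write path: append-to-log, then update-in-place.
def put(log, db, offset, record: bytes):
    # 1. Append the record to the transaction log (sequential write:
    #    the write head is already at the end, no seek across the platter).
    log.seek(0, os.SEEK_END)
    log.write(record)
    log.flush()
    # 2. Update the database file in place (random write: we must first
    #    seek to wherever this record lives, ~10 ms each time on disk).
    db.seek(offset)
    db.write(record)
    db.flush()
```

Both steps write the same bytes, but on spinning disk the cost is dominated by the seek in step 2, not by the amount of data, which is why the log barely adds to the total write time.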

So can we be more aggressive and design a database that only writes to the log file (since the log file already contains all the truth about the data)?

That’s exactly what motivates LSM.

This is an article about RocksDB. To be pragmatic, I’m not going to talk about the original LSM. Instead, I’ll present an (overly) simplified architecture of RocksDB and illustrate how RocksDB employs that idea to minimize random writes.

Again, please keep in mind that the examples below are just handy simulations. As a first step, let’s focus on the basic idea and bear with the sub-optimal design.

In the simplified RocksDB, there are 2 major components:

  1. Memtable: a buffer that temporarily hosts incoming writes; memtables are normally sorted (in RocksDB, the default memtable implementation is a SkipList), but that is not required.
  2. SSTable: once a memtable reaches a certain size, it is flushed to storage (to keep it simple, let’s assume disk) and generates an SSTable (sorted static table), which is immutable for its entire lifetime.

Optionally, to make sure data written to the memtable is safe from a process or server crash, we often enable the transaction log, which is essentially a redo log that helps RocksDB rebuild the memtable after such events.
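These two components can be modeled as a toy Python class (this is a conceptual sketch, not RocksDB’s actual implementation: a plain dict stands in for the skip list, in-memory tuples stand in for on-disk files, and the size limit is deliberately tiny):

```python
MEMTABLE_LIMIT = 2  # flush after this many entries; tiny, for illustration

class ToyLSM:
    def __init__(self):
        self.memtable = {}   # buffers incoming writes in memory
        self.sstables = []   # newest last; each is an immutable sorted tuple

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # "Write out" the memtable as a sorted, immutable SSTable.
        # In the real system this is one sequential write to disk --
        # the whole point of buffering writes in the memtable first.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
```

Notice that every write either touches only memory or produces one big sequential flush; no random disk write ever happens on the write path.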

The figure below shows the basic components within RocksDB.

A simplified RocksDB

But! What if the user writes the same key with different values? Simply speaking, by default RocksDB keeps all of these copies. As a result:

  • If we perform multiple Put() calls on the same key, RocksDB won’t do any checks but will simply put them in the memtable first. Old versions of the key-value pair may be flushed to SSTables.
  • When the user does a Get(), we search from the memtable to the SSTables one by one, because the key may exist in any of them. So for a query key, we first search the memtable; if not found, we search the newest SSTable; if again not found, the second-newest SSTable… until we find the given key or pass the last file. (Please note that this workflow is roughly similar to that of the real RocksDB, which involves lots of complicated, delicate, carefully optimized details.)

What! So if we keep writing the same keys to the database, will it grow infinitely?

That’s a valid concern! That’s why RocksDB has a background thread that periodically performs “compaction”: merging these SSTables into a bigger SSTable.

As shown in the figure below, SSTable 1 ~ SSTable 4 are compacted into one bigger SSTable. During the process we remove duplicate keys, and for advanced usage we can even define our own “compaction filter” to drop unwanted key/value pairs as we wish (for example, it can be used to build a time-to-live feature for each key-value pair).
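Continuing the toy model, compaction is just a merge that keeps only the newest version of each key, with an optional filter callback standing in for RocksDB’s compaction-filter idea (the `keep` parameter and tombstone convention here are made up for illustration):

```python
# Merge several SSTables into one, newest version of each key winning.
def compact(sstables, keep=lambda k, v: True):
    merged = {}
    for table in sstables:             # iterate oldest to newest ...
        for k, v in table:
            merged[k] = v              # ... so newer versions overwrite older
    # Apply the "compaction filter", then emit one sorted immutable table.
    result = tuple(sorted((k, v) for k, v in merged.items() if keep(k, v)))
    return [result]                    # one bigger SSTable replaces them all

old = [(("a", 1), ("b", 2)), (("a", 3), ("c", 4))]
print(compact(old))                    # [(('a', 3), ('b', 2), ('c', 4))]

# A toy "compaction filter" that discards deleted entries (value None):
print(compact([(("a", None), ("b", 2))], keep=lambda k, v: v is not None))
```

Because all input SSTables are sorted, the real system does this as a streaming k-way merge with sequential reads and one sequential write; the dict here is just the simplest way to show which version survives.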

RocksDB compaction process

That’s pretty much it! We’ll cover more on memtables, SSTables, and compaction algorithms in other articles of this collection.

Further Readings

  1. Dhruba Borthakur’s introductory talk on RocksDB
  2. Original Paper on LSM: The Log-Structured Merge-Tree (LSM-Tree)
  3. RocksDB official site: a bunch of useful resources
  4. RocksDB code repo

--

Kai Liu
RocksDB for everyone

RocksDB developer; Big fan of #Linux/#Unix, #Python; Love and hate #C++; School hater -- yet spent over 8 years in 3 universities