Speeding up Block-based Table Lookup

Kai Liu
RocksDB for everyone
6 min read · Sep 22, 2014


Prerequisites

Before reading this post, you may need to know:

  • RocksDB stores its persistent data in SST (Static Sorted Table).
  • There are several types of SST, and in this post we want to talk about the optimization of block-based table — the default table format, which works on disks and flash devices.
  • RocksDB supports prefix seek, which has been heavily optimized by the RocksDB team.

Problem

One team uses RocksDB as its storage layer. Everything worked perfectly until one day they significantly increased the data size, and CPU usage soared.

From the profiling tool, we identified that the heavy CPU consumption was caused by the binary searches in the memtable and in the block-based table (each ate approximately 30% of the total CPU). As we already had support for a hash-based memtable, that binary search was easily eliminated, and the CPU usage from the memtable soon dropped to under 1%.
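For context, switching to a hash-based memtable in RocksDB is a small configuration change. Here is a minimal sketch of what that setup roughly looks like; the specific factory (NewHashSkipListRepFactory) and the 8-byte prefix length are my assumptions for illustration, not necessarily what that team used:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/memtablerep.h>
#include <rocksdb/slice_transform.h>

rocksdb::Options HashMemtableOptions() {
  rocksdb::Options options;
  // Hash-based memtables require a prefix extractor; the 8-byte fixed
  // prefix here is just an illustrative choice.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));
  // Replace the default skip-list memtable with a hash-based one, so a
  // prefix lookup becomes a bucket probe instead of a binary search.
  options.memtable_factory.reset(rocksdb::NewHashSkipListRepFactory());
  return options;
}
```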

As for the block-based table, hmm, there was no readily available solution. In a block-based table, entries are stored compactly in ascending order. Entries are grouped into blocks of roughly 4KB, which is the finest granularity at which we write/read data from the filesystem.

For every block, we extract its last entry into the index, which is loaded into memory when we open the table. When RocksDB performs a lookup for a key k in a table, it will

  1. do a binary search in the in-memory index.
  2. locate the block and do an internal binary search to check if k exists (the internal search is complicated and well optimized; I may discuss this topic in another post).

And the heavy CPU consumption occurs in step 1.
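To make step 1 concrete, here is a simplified sketch (not RocksDB's actual index code) of the binary search over the in-memory index, where each index entry records the last key of a block:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct IndexEntry {
  std::string last_key;   // last key stored in the block
  uint64_t block_offset;  // where the block starts in the SST file
};

// Step 1: binary-search the index (sorted by last_key) for the first block
// whose last key is >= k; that is the only block that can contain k.
const IndexEntry* FindCandidateBlock(const std::vector<IndexEntry>& index,
                                     const std::string& k) {
  auto it = std::lower_bound(
      index.begin(), index.end(), k,
      [](const IndexEntry& entry, const std::string& key) {
        return entry.last_key < key;
      });
  return it == index.end() ? nullptr : &*it;
}
```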

Another interesting characteristic of this use case is that it only performs prefix lookups, retrieving a group of entries that share the same prefix. The concept of a prefix may sound too abstract, so I'd like to give a concrete example.

RocksDB is a key/value database — keys and values are essentially byte arrays of arbitrary length. Although there is great flexibility in how you construct your keys, one common pattern is to use the prefix as an identifier for entries with similar properties. For example, we may use <user-id>.<timestamp>.<activity-type> as the key for a data store that keeps user activities. With this scheme, we can perform a prefix seek to get a database iterator that points to the first activity of a user in a given time range, and then efficiently call Iterator::Next() to retrieve subsequent entries.
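To illustrate this access pattern, here is roughly what such a prefix scan looks like with the RocksDB iterator API. The fixed-length user-id prefix and the use of prefix_same_as_start are assumptions for this example, not details from the original workload:

```cpp
#include <memory>
#include <string>
#include <rocksdb/db.h>

// Assumes the DB was opened with a prefix extractor covering the
// <user-id> part of the key, e.g.
//   options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(user_id_len));
void ScanUserActivities(rocksdb::DB* db, const std::string& user_id) {
  rocksdb::ReadOptions read_options;
  read_options.prefix_same_as_start = true;  // stop when the prefix changes

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
  // Seek to the first activity of this user, then walk forward.
  for (it->Seek(user_id); it->Valid(); it->Next()) {
    // it->key()   is "<user-id>.<timestamp>.<activity-type>"
    // it->value() is the activity payload
  }
}
```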

Solution: V1

After analyzing the use case, we realized the essential problem was how to quickly find the mapping prefix => block id.

Naturally, we considered adding a hash table to remember such a mapping. The challenge is that the index data structure does more than random lookup; it also supports range lookup with Prev()/Next().

Instead of providing a radically new index format for block-based table, we decided to add an additional/optional hash layer on top of the existing data structure.

Here’s how it works:

  • When we construct the table, we detect the unique prefixes (this is very easy to do because the entries are already sorted) and store the prefixes and their mapped block ids in a metablock (an extensible block that makes future modification of the table format easy) inside the table.
  • When opening a table, if hash-prefix lookup is enabled, we scan through the materialized prefix => block id mapping and construct an in-memory hash map (for V1, we happily used std::unordered_map).
  • When Seek() is called on the in-memory index, it consults the hash table to find the mapped block for a given prefix (see the sketch below).
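A minimal sketch of the V1 hash layer, assuming a simplified standalone structure (the real index integrates this into the existing block-based table reader):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// V1: an optional hash layer on top of the ordered index.
class PrefixHashIndexV1 {
 public:
  // Called at table-open time for each prefix => block id pair read
  // from the metablock.
  void Add(const std::string& prefix, uint32_t block_id) {
    prefix_to_block_[prefix] = block_id;
  }

  // Called from Seek(): returns true and the block id if the prefix is
  // known; otherwise the caller falls back to the ordinary binary search.
  bool Lookup(const std::string& prefix, uint32_t* block_id) const {
    auto it = prefix_to_block_.find(prefix);
    if (it == prefix_to_block_.end()) return false;
    *block_id = it->second;
    return true;
  }

 private:
  // Every unique prefix is stored in full, which is exactly what makes
  // this version memory-hungry.
  std::unordered_map<std::string, uint32_t> prefix_to_block_;
};
```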

This solution works. It reduced the expensive binary search from 30% of CPU to almost nothing. But the size of the hash index is determined by the number of unique prefixes.

For this particular use case, memory usage increased by 20%, but that was fine; we still had plenty of capacity left. In this solution we cared mostly about the cleanness of the newly added semantics and about having a pilot solution that proved the hash approach was the right way to go. If this solution happened to work well enough, we'd be happy to stick with it (in retrospect, though, I think at the API layer we should at least have warned users about the potentially high memory usage).

Little optimization (except the optimization that avoids frequent memory allocation on the heap) was done, since we treated it as an experimental feature and wanted to optimize only after we had gained more experience with similar use cases.

Solution: v2

Everything changed. Soon we discovered that some servers had significantly more unique prefixes, resulting in out-of-memory errors. A more memory-efficient index became mandatory.

From the profiling result, the size of the prefixes (in memory) accounted for 98% of the whole index. There seemed to be many ways to optimize the hash table: for example, std::unordered_map is not the most space-efficient structure, and we could reduce memory usage by 10%~20% by writing a highly specialized hash table. We could also compress the prefixes to further cut memory usage, and so on.

Those solutions sound feasible, but even a 30% reduction still did not seem enough. How could we squeeze even more out of the memory?

As a matter of fact, going back to our initial observation, essentially we only want to speed up the lookup for the mapping prefix => block id. So do we really need to store the prefixes at all? The whole purpose of a prefix in our solution is only to be

  1. Hashed
  2. Compared with the query prefix to see if they match.

Can we skip (2) and live without the prefixes? Yes, we can!

In our improved solution, instead of using a hash map for the prefix => block id mappings, we only maintain an array (essentially a hash set) that stores the mapped block ids (without any prefix being stored).

Workflow for constructing the memory-friendly hash index:

  • When opening the table, scan through the materialized prefix => block id mappings from the on-disk file.
  • For each prefix, hash it and map it to a hash bucket. We’ll store only the associated block id for that prefix.

For example: if prefix “abc” (which maps to block 3) and prefix “def” (which maps to block 9) both hash to bucket 1, then hash bucket 1 will be a linked list of “3” and “9”.
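A minimal sketch of the V2 structure under these assumptions (a fixed bucket count, with std::vector buckets standing in for the linked lists):

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// V2: no prefixes stored, only block ids grouped by the hash of the
// prefix that produced them.
class PrefixHashIndexV2 {
 public:
  explicit PrefixHashIndexV2(size_t num_buckets) : buckets_(num_buckets) {}

  // At table-open time: hash each materialized prefix and keep only the
  // block id in the matching bucket; the prefix itself is discarded.
  void Add(const std::string& prefix, uint32_t block_id) {
    buckets_[Bucket(prefix)].push_back(block_id);
  }

  // At lookup time: return every block id in the bucket.  The caller must
  // probe each candidate block, because colliding prefixes (and
  // non-existent prefixes that hash into the same bucket) cannot be told
  // apart here.
  const std::vector<uint32_t>& Candidates(const std::string& prefix) const {
    return buckets_[Bucket(prefix)];
  }

 private:
  size_t Bucket(const std::string& prefix) const {
    return std::hash<std::string>{}(prefix) % buckets_.size();
  }

  std::vector<std::vector<uint32_t>> buckets_;
};
```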

The tradeoff is that, with the new hash index, a search for prefix “abc” may take two searches, inside both block 3 and block 9. And for a non-existent prefix “xyz” that happens to hash to bucket 1, we don’t have a quick way to determine that “xyz” doesn’t exist; instead, we need to jump into both block 3 and block 9 to see if “xyz” exists.

However, if we hash the prefixes carefully, there are normally not many block ids inside a bucket (most buckets should have 1~2 entries). As a result, we traded a little lookup performance for a 97% reduction in the size of the hash index.

Lessons learned

  • Start with the simplest solution and don’t do premature optimization: it’s tempting to start with a big up-front design and try to nuke every problem with it, but too often that fails to solve the real problem. Before V1 we didn’t even know how much a hash table could alleviate the problem. Looking back, it was rewarding to come up with a quick solution to verify our hypothesis, and then, based on the improvement, follow up with well-directed optimizations.
  • The solution is often simpler than you think. Sometimes, while busy solving a problem, we focus too much on the immediate issues. By taking a step back and reviewing the original problem we want to solve, we can often avoid getting trapped in local maximums.
