Partitioner for LSM-Based KV Stores

Yue Tan
Princeton Systems Course
8 min readMay 24, 2019


Ashwini Raina and Yue Tan, COS 518, Spring 2019

I. Introduction

Log Structured Merge (LSM) trees are widely deployed in many popular key-value store based databases such as Google’s LevelDB and Facebook’s RocksDB. LSM trees offer high write throughput and good read performance, however to achieve so, they suffer from high write amplification. Write amplification is the ratio of the number of bytes written by the LSM tree to the number of bytes written by the application. Modern SSDs have limited lifetime and high write amplification of LSM trees makes that problem worse.

Most of the workloads seen by LSM trees have some amount of skew, typically zipfian in nature. At Facebook, applications have ziph alpha values ranging from 0.9 to 1.8. Through our write workload experiments on RocksDB, we observe that the skew vs write amplification curve has a knee (as shown in figure below). The goal of this project is to exploit this trade-off in order to reduce write amplification.

Consider a key-value store that has skew of ziph alpha=1.0 in its write workload. The keys for this key-value store might be either range or hash partitioned into multiple shards, and each shard is handled by a separate LSM instance. In the example below, we consider an application with three shards (S1, S2 and S3) handled by three LSM instances. With ziph alpha=1.0, each shard will incur a write amplification of ~14. Cumulatively, this system will incur a cluster level write amplification of 14 + 14 + 14 = 42.


If the skew observed by individual shards is managed carefully, the overall write amplification of the system could be improved. Figure below highlights the idea. By rebalancing the skew observed by three shards, S1 can be nudged more towards the high skew regime, thus lowering its write amplification significantly, while the other two shards don’t incur a high write amplification penalty due to lower skew. In this case, system will incur a cluster wide write amplification of 2+16+15 = 33


II. System Architecture Overview

We propose a new key space Partitioner for LSM based key value stores that optimizes for write amplification. Individual LSM instances keep track of top-k most write popular keys. Partitioner collects the top-k list from each LSM instance using a GetTopK RPC. It merge sorts the top-k lists from each instance into a global list and then generates a new key redistribution plan. Using the Repartition RPC, it sends the relevant plan to each Coordinator responsible for that key space. The Coordinators then move the keys to new LSM instances and delete them from the old ones.

system architecture

III. Technical Challenges

  1. Tracking key popularity — Tracking popularity of billions of keys accurately has a high space overhead i.e. linear in the keyspace. This can take up a lot of precious DRAM space which otherwise could be used for storing memtables, other system caches or even run more LSM instances.
  2. Maintaining high write performance — Tracking key popularity can add system load and doing so in the critical path of writes can hurt performance. Also, Coordinator needs to keep track of redistribution of keys across the cluster. Looking up a key location in the write path needs to be extremely low latency.
  3. Consistency — Design needs to maintain serializability at each Coordinator when the keys are being redistributed in the cluster.
  4. Fault tolerance — System should be fault tolerant under Partitioner, Coordiantor and instance failures.

III.1 Tracking key popularity

In a skewed distribution like Zipfian, only few keys are of interest, specifically the low rank (high frequency) keys. Therefore, tracking popularity of entire keyspace has diminishing returns. To that end, each LSM instance only tracks the top-k popular keys using a popular data stream algorithm called Space Saving. Space Saving algorithm only uses O(k) space and tracks top-k items with high accuracy, especially for zipf data.

III.2 Keep writes performant

critical path of writes and our background running Space Saving algorithm

Using Space Saving algorithm solved the memory efficiency problem, i.e. collecting most frequently written keys has O(k) rather than O(number of keys) space complexity. Then the question is where Space Saving algorithm should be placed.

Our requirement is that Space Saving algorithm should see all put requests issued to the LSM instance. Then one choice is to stream put requests to Space Saving algorithm while they are written to wal/memtable. But this is the critical path of writes (shown above), meaning that any additional operation will add to a put request’s latency clients observe. We must run Space Saving in the background. Fortunately, there are two available choices: 1. run Space Saving during the flushing of a memtable; 2. run Space Saving during the pruning of a WAL (write ahead log).

For the first choice, RocksDB swaps out a memtable when it is full and swaps in an ready-to-be-used empty memtable, making memtable flushing run in the background instead of on the critical path. The drawback of this approach is that we need to modify the memtable data structure to have it maintain a counter for each key it holds. It is technically risky (RocksDB code base is large and complex) and also increases memtable’s memory footprint. In contrast to memtable, where keys get overwritten, WAL is append only — each put request corresponds to one WAL entry. No counter is needed. Because we only need to feed each entry in a WAL to Space Saving algorithm, we landed on the second choice (shown above).

exception lists on top of range partitioning

Another performance related operation is to look up which instance is holding a key.We want the lookup time to be constant.

The naive approach is to maintain key-to-instance mappings for the entire key space. This approach consumes lots of memory. Hence, we chose exception lists, maps only containing mappings for keys that are not at their default location. In the diagram above, the default location is determined by a simple range partitioning and each range has an associated exception list.

III.3 Consistency

timeline of put/get during a key movement

Consistency here means serializability at a Coordinator. Key movement decisions are made by Partitioner but the actual movements are done by Coordinator by updating its exception lists. In our design all puts and gets go through a single Coordinator, so the requests are always sent to the correct location, the location Coordinator stores. In the above diagram, once put(k1, v2) goes to instance 2 showing the exception list has been updated, the second get(k1) will go to instance 2 as well.

IV. Evaluation

  • Testbed: Three RocksDB intances, one Coordinator, and one Partitioner runs as different processes on a single physical machine.
  • RockdDB Parameter: memtable size = 16 MB, level-0 size = 64MB.
  • Workload: a write-only Zipfian workload with alpha=0.9, key space size = 4 million, number of writes = 4 million, key size = 8 B, value size = 4 KB. (Note: a memtable can hold ~4000 unique keys)
  • Baseline: hash partitioning (mod 3) and range partitioning (the key space is divided into three evenly sized ranges).
  • Evaluation Metrics: the cluster-wide write amplification.
per-instance and cluster-wide write amplifications for the two baselines and three different experiments

IV.1 Top-1000 Experiment (yellow bars)

In this experiment, we ran the RocksDB cluster for two minutes using hash partitioning. After two minutes, Partitioner collected each instance’s top 1000 most frequently written keys. Then Partitioner reassigned all the reported 3000 keys to instance 1. Then the cluster ran till the end of the workload.

For the workload we use, hash partitioning produces three similarly skewed partitions. This means that each instance’s top 1000 keys together are the cluster-wide top 3000 keys. After the key movement, instance 1 has high skewness while instance 2 and 3 have low skewness. This corresponds to the yellow columns of the plot above — instance 1 observes lowest write amplification while instance 2 and 3 experience higher write amplification. Unfortunately, however, the cluster-wide write amplification does not drop much compared to baselines. We think this is because instance 1 is not skewed enough.

IV.2 Top-5000 experiment (green bars)

This experiment is same as the top-1000 experiment except that each instance reported its top 5000 most frequently written keys. We think this experiment has worse results because the memtable can only hold ~4000 unique keys and when 15,000 most popular keys are assigned to instance 1, spilling over happens.

IV.3 Top-5000-i1-i2-i3 experiment (orange bars)

In this experiment, each instance reported its top 5000 most frequently written keys but only the top 3000 keys among the reported 15,000 keys were assigned to instance 1 and the next top 1000 keys were assigned to instance 2 and the next top 1000 keys were assigned to instance 3.

The top 3000 keys assigned to instance 1 are the same as those in top-1000 experiment. Since we think instance 1 in top-1000 experiment is not skewed enough, we take away instance 1’s relatively unpopular keys in this experiment by assigning the 1000 keys ranked right after the top 3000 keys to instance 2 and the next 1000 keys to instance 3. In the plot above, the first orange bar is slightly lower than the first yellow bar, confirming that our strategy makes instance 1 more skewed but only slightly. Meanwhile, instance 2 and 3 suffer much more write amplification. Altogether, the cluster-wide amplification of this experiment is slightly worse than the baseline and top-1000 experiment.

V. Conclusion

In order to reduce the cluster-wide write amplification, our goal is to move S1 as far right along the skew-vs-write-amplification curve as possible. However, as shown below, none of the three strategies we tried in the three experiments moved S1 right enough. Therefore, in all three experiments, the cluster-wide write amplification was not reduced. The question whether or how S1 can be moved far right along the skew-vs-write-amplification curve remains open.

our result

Source code at: