Log Structured Merge Trees

John Pradeep
Feb 11, 2020 · 4 min read

LSM tree is at the heart of most storage systems that provide high write throughput, be it a key-value storage like dynamodb/cassandra or a messaging system like pulsar which is backed by bookkeeper.

The various components of a typical LSM backed system are shown below.

The main reason why LSM provides high write throughput is that every write request is actually performed only “in-memory” in contrast to traditional B-Tree based implementation where the updates are done to disk which can trigger an update to an index making it very expensive.

So the obvious question is, how does LSM achieve durability? that’s where WAL comes into the picture.

WAL

This allows for systems to recover from the WAL if it crashes before persisting the in-memory data structure to disk.

Why not directly update the write to the disk instead of updating WAL? it’s because WAL updates are cheaper as it’s append-only and doesn’t require any restructuring of data on disk.

MemTable

So now for every write request, the data is appended to WAL and the memtable is updated and a successful response is returned to the client.

For java implementations, the memtable is usually stored off-heap (direct memory) to avoid GC load

SSTable (Sorted Strings Table)

SSTable, as the name indicates, is a sorted array of keys persisted on disk.

The reason it is sorted is to make it easy to look up the data for readings.

Okay, now that is the essence of how LSM provides high throughput using a WAL, MemTable & SSTable.

Usually, even every delete request for a key is also added to memtable with a marker indicating it’s deleted and the same information is flushed to the SSTable.

Compactor

This is the job of a compactor which usually runs in the background, It merges SSTables by removing redundant & deleted keys and creating a compacted/merged SSTables.

the compactor also is responsible for updating an index (typical B-Tree based index) to locate SSTable a key is present in.

Index

Also, the size of SSTables is chosen in such a way that it corresponds to the operating system page size (usually multiples of disk bock size) making it easier to load the data to memory faster.

Although the Index along with SSTable help in faster lookup of keys, all read requests are first consulted in the memtable as it should contain the most recent change. If the key is not in the memtable, then the index is used to identify the possible SSTable the key may be present and then search inside the SSTable in-memory.

Since every read has to check the memtable, index & SSTable to look for a key, it makes read requests very expensive especially for keys that are not present!

For keys that are recently updated, the read request will easily locate it in the memtable, but for keys not recently updated, and for keys that are not present in the system the read path is expensive!

Bloom Filters are used to improve read performance especially for the cases where the key is not present in the system.

Bloom Filter

With bloom filter, False positive match is possible, which means, it may indicate a key is present although it’s not in the system. But false-negative match won’t happen, which means if bloom filter indicates a key is not present, then it is definitely not present in the system, so we could avoid taking the expensive read path.

So the presence of a bloom filter improves the read performance for keys that are missing in the system but for the keys that are present in the system, the read is still expensive as we have to look into the memtable, index and the SSTable.

Summary

The Startup

Get smarter at building your thing. Join The Startup’s +786K followers.

Sign up for Top 10 Stories

By The Startup

Get smarter at building your thing. Subscribe to receive The Startup's top 10 most read stories — delivered straight into your inbox, once a week. Take a look.

By signing up, you will create a Medium account if you don’t already have one. Review our Privacy Policy for more information about our privacy practices.

Check your inbox
Medium sent you an email at to complete your subscription.

John Pradeep

Written by

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +786K followers.

John Pradeep

Written by

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +786K followers.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store