LSM tree is at the heart of most storage systems that provide high write throughput, be it a key-value storage like dynamodb/cassandra or a messaging system like pulsar which is backed by bookkeeper.
The various components of a typical LSM backed system are shown below.
The main reason why LSM provides high write throughput is that every write request is actually performed only “in-memory” in contrast to traditional B-Tree based implementation where the updates are done to disk which can trigger an update to an index making it very expensive.
So the obvious question is, how does LSM achieve durability? that’s where WAL comes into the picture.
WAL is a write-ahead log that is used to provide the durability of data during system failures, what it means is that when a request for write comes in, the data is first added to a WAL file (sometimes called journal) and flushed to the disk (using direct io) before updating the in-memory data structure.
This allows for systems to recover from the WAL if it crashes before persisting the in-memory data structure to disk.
Why not directly update the write to the disk instead of updating WAL? it’s because WAL updates are cheaper as it’s append-only and doesn’t require any restructuring of data on disk.
The in-memory data structure is called a memtable, there are various implementations of memtable, but you can think of memtable as just a binary search tree for the sake of simplicity.
So now for every write request, the data is appended to WAL and the memtable is updated and a successful response is returned to the client.
For java implementations, the memtable is usually stored off-heap (direct memory) to avoid GC load
SSTable (Sorted Strings Table)
As it’s obvious that we cannot keep adding data to memtable to bloat the RAM, the memtable is frequently flushed to disk as an SSTable.
SSTable, as the name indicates, is a sorted array of keys persisted on disk.
The reason it is sorted is to make it easy to look up the data for readings.