A lot of databases use LSM trees for storing the data where high write throughput is required. In this article we will go through the design and data structures used.
For the sake of the argument, let us assume we only have key — value pairs. Also, for the sake of simplicity let’s assume the key and values are strings.
Append only Log File
To ensure high writes we need to have sequential writes and reduce random writes. Let us start with writing the key-values to an append only file.
We also maintain an in memory hash map which contains the key and value as offset to the file from the beginning. Whenever we update a value, we update the hashmap to the new offset in the file.
For put(key,value) we simply append to the file.
For get(key) we check for the offset in the hashmap and do a disk seek to the given offset for the value.
For deletes, we write a tombstone entry in the log file and update the same in the hashmap. We need tombstone as we need to have information that the key was deleted. This comes in handy when we rebuild the hashmap after a failure and also in case of compaction.
If we continue to write to the file, we eventually will run out of space. Hence, we need
Compaction essentially means removing all instance of a key from the log file, except the latest one.
Let us start breaking the file into segments and start a new file, when the previous file has reached a specified threshold of size or keys or any other parameter.
Also, let us keep a hashmap for each segment.
In compaction, we take 2 or more segments, open a new file, and insert all keys (only once) with the latest value, in the new file and also create a new index against this new file. We need tombstone entries to remove the keys that have been deleted while compacting.
Once the new file is created, we discard the older files and the older indexes.
SSTables or Sorted String Table
SSTable is a log file with the following 2 properties.
a) It contains records in sorted order, i.e. the key value pair in the log file are sorted by key.
b) It contains a key only once.
SSTables offer better performance over the above log file, as we can now have sparse indexes. If the keys are sorted, we can have fewer keys in the index map. For a given key not in the index (in memory map), we can lookup the nearest key that is available and do a lookup in the file from that offset.
Also, such a structure now allows for range queries and allows for data between any given keys.
How to create SSTables?
Lets start by maintaining an in memory data structure that stores key value pairs, and stores the keys in sorted order while removing duplicates. We can use AVL Tree or Red Black Trees. You can think of these trees as a self balancing binary search tree. A binary search tree where the difference of the height of the leaf nodes is not greater than 1.
Whenever a write comes in, add it to the in memory balanced tree structure, which is also called as a memtable.
Whenever the size of the in memory tree increases a specified threshold, the tree gets written to the disk in the form of an SSTable. This file satisfies the above properties.
Taking a side track. Lets say we need to check if a key exists in a list or not. Let us also state that we can afford a false positive, i.e. where the check states a key exists, but actually it doesn’t, but we cannot afford false negative, i.e. where the check states a key doesn’t exist, but it does.
If we check for the key in a list using :-
Linear Search — This will be O(n)
Binary Search — This will be O(lg n)
Let us take a m bit array. Let each bit be set to 0.
Also, let us assume k hash functions. Recall that a hash function returns a value between 0-m for any given input.
For a given input say input (wow, what a creative name).
For each hash function hj (where j = 1 to k) (in above we have k = 2, i.e. 2 hash functions)
Set m[hj(inp)] = 1 , i.e. we calculate the hash of input using all the hash functions. For each hash function the index returned is set to 1. In above the hash functions return index 1 and 4.
To check whether a key exists or not, we check if all indexes returned by the hash function are set to 1 or not. In above example when checking for input1 (another creative name), we check if m[h1(input1)]…..m[hk(input1)] all are 1 or not.
Complexity = O(c) + pO(n) where p = probability of a false negative, i.e. the bloom filter states a key exists when it doesn’t. For determined values of m and hash functions, p can be kept small enough to have the complexity as constant.
There is no delete mechanism from a bloom filter.
Getting back on track
Log Structured Merge Trees (LSM Trees)
Now we have all the basic building blocks to create a LSM Tree.
We follow the below steps on a write.
- Whenever we get a write, we first write it to an append only log file. This is done to ensure we can recover after a crash, i.e. when the memtable is lost, we use this append only log file to create the memtable.
- We write the data to a memtable.
- We update the bits in the Bloom Filter for this key
- Flush when memtable exceeds the size (defined below).
We follow the below steps on a read.
- We check the bloom filter for existence of the key. This avoids checking all the indexes for the existence of a key.
- We check the memtable for the key. The most recent write would be here.
- We check for the key in the index maps (in decreasing order) and find the file and offset containing the value.
Few other operations
In order to ensure that the in memory data structure does not exceed a given size, we need to flush it to the disk from time to time.
When the memtable grows beyond a specified size, all its content is flushed to a SSTable on disk.
- Memtable is written to disk.
- Append only file is discarded and a new append only file is created.
- An index hashmap is created against this file, containing the key and offset from the start of the file. Each SSTable has its own index.
- A new memtable is created and all new writes are written to this memtable and the new append only file.
With time the number of files / segments will grow. In order to reduce the disk space, compaction would be required from time to time.
The compaction can be done by using merge sort. The SSTables can be merged using k-way merge, and written to a new file. The index and older file are discarded and the new segment and new index is used to serve reads.
Why did we need bloom filters?
When checking for a key that does not exist, we would need to check over multiple indexes. When the number of indexes (equals the number of segments) is high, for a key that doesn’t exist will need lookups in all of the indexes. Bloom filter works to reduce these number of lookups in these indexes.
Though the above solution is created to improve the write throughputs, in general it depends upon a lot of other factors, specially the work load. This article is to provide an overview and an understanding of how LSM trees work. These trees are now used under a lot of databases such as Cassandra, InfluxDB etc.