Designing a Write-Heavy Data-store

Saurabh AK Bhardwaj
6 min read · Jul 8, 2023


In the previous blog, we looked at the pros and cons of relational, document, and graph data-stores. We also went through an exercise to choose the right type of data-store for a hypothetical use-case. It is highly recommended that you go through the previous blog before reading this one.

In this blog, we will look at how a data-store manages data, and what choices and compromises it makes to optimise query performance.

A Data-store

Fundamentally, any data-store supports two operations:

  1. Store the given data.
  2. Retrieve data already stored in the data-store.

All the design choices boil down to doing these two operations efficiently. To understand the compromises and tradeoffs, let’s design our own data-store. For the time being, we are only aiming for durability, and that in a very restrictive form: when we turn the power off and then back on, data already written to the data-store should still be available.

Iteration 1: A Rudimentary Approach

A very simple approach could be to use a single file as our data-store. Whenever we write to our data-store, we append a <key, value> pair to the end of the file; when we need to read the value corresponding to a key, we read the entire file and return the value from the last occurrence of the key. A file used this way is known as an append-only log. A very simple approach indeed.
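To make this concrete, here is a minimal sketch in Python. It is my own illustration, not code from any real data-store; string keys and values in a comma-separated line format are assumptions made for simplicity.

```python
import os

class AppendOnlyStore:
    """Iteration 1: a single append-only log file is the whole data-store."""

    def __init__(self, path="store.log"):
        self.path = path

    def put(self, key, value):
        # Writes are fast: we only ever append to the end of the file.
        with open(self.path, "ab") as f:
            f.write(f"{key},{value}\n".encode())

    def get(self, key):
        # Reads are slow: scan the entire file and keep the last match,
        # because the most recent append wins.
        if not os.path.exists(self.path):
            return None
        result = None
        with open(self.path, "rb") as f:
            for line in f:
                k, _, v = line.decode().rstrip("\n").partition(",")
                if k == key:
                    result = v
        return result

store = AppendOnlyStore()
store.put("videoId42", "10")
store.put("videoId42", "11")   # an update is just another append
print(store.get("videoId42"))  # -> 11 (last occurrence of the key wins)
```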

Let’s see how our simplistic data-store performs under load.

Write performance:

  1. Write performance of our data-store is actually pretty good. To write or update a key, we just append data to the end of our file, which is very fast.

Really? Aren’t disk operations slow?

Well, it depends.

Most of the time taken in a disk operation is seek time: the time the disk arm spends moving to the exact location on the platter from which it has to read or write data. That is why random reads and writes are slow (most of the time is spent seeking). For sequential reads and writes, there is no per-operation seek; once the correct location on disk is found, the operating system can keep reading or writing data continuously. This makes sequential writes a lot faster.

Read performance:

Read performance of our data-store is poor. To find any key, we have to scan through the entire file, which is a costly operation. Let’s try to improve the read performance of our data-store.

Iteration 2: Optimising Read Performance

  1. Since we are writing to a single file, its size will keep growing and eventually become unmanageable. Let’s pre-decide a size limit, and once the file grows beyond it, start writing to a new file. This way the size of each file stays manageable. We will call each file a segment.
  2. Whenever we write <key, value> to a segment, let’s also store <key, segment, offset> in an in-memory cache. We will call this cache the index.

How do these two optimisations affect read performance?

Well, it turns out they speed up read operations significantly. Since the index always stores the offset of the latest value, we always get the most up-to-date value; and since each segment is small and we know the offset, we can read just the part of the segment needed to satisfy the query.

Write performance is also barely affected: the only additional overhead is one update to the in-memory index.
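Here is how the two optimisations might look in code; a minimal sketch continuing the earlier example, with a made-up 4 KB segment limit and the same comma-separated line format (both assumptions of this sketch, not requirements of the design).

```python
import os

SEGMENT_LIMIT = 4096  # bytes; deliberately tiny so segments roll over quickly

class SegmentedStore:
    """Iteration 2: fixed-size segment files plus an in-memory index that
    maps each key to (segment file, byte offset of its latest value)."""

    def __init__(self, directory="segments"):
        self.dir = directory
        os.makedirs(directory, exist_ok=True)
        self.index = {}  # key -> (segment_path, offset)
        self.current = self._new_segment_path()

    def _new_segment_path(self):
        return os.path.join(self.dir, f"segment-{len(os.listdir(self.dir))}.log")

    def put(self, key, value):
        # Roll over to a fresh segment once the current one is full.
        if os.path.exists(self.current) and os.path.getsize(self.current) >= SEGMENT_LIMIT:
            self.current = self._new_segment_path()
        with open(self.current, "ab") as f:
            offset = f.tell()  # append mode positions us at end-of-file
            f.write(f"{key},{value}\n".encode())
        # The only extra write-path work: one in-memory index update.
        self.index[key] = (self.current, offset)

    def get(self, key):
        # One index lookup, one targeted read -- no full scan.
        if key not in self.index:
            return None
        path, offset = self.index[key]
        with open(path, "rb") as f:
            f.seek(offset)
            _, _, value = f.readline().decode().rstrip("\n").partition(",")
            return value
```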

Everything is not rosy though. This approach also has some problems.

  1. Since the index resides in memory and holds every key, the amount of data our data-store can hold is limited by the RAM available on the machine. We need some way of keeping only a few keys per segment in the index.
  2. Say we use this data-store to store views on a YouTube video. Each time someone views the video, we write a <videoId, views> pair to our data-store. Although the index points to the latest view count, all the outdated counts still sit in our segments. We need some way of cleaning up outdated data.

Iteration 3: Scaling our Data-store

Suppose all the <key, value> pairs in each segment were sorted by key. Then we could store just a few keys per segment in the index. For a read query, we could determine the segments, and the offset ranges within them, where the value for the key could possibly be found. We could then go through those segments (starting from the newest) and return the latest value.
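A sketch of how such a sparse index could narrow a lookup; the sample keys and byte offsets below are made up for illustration.

```python
import bisect

# Sparse index for one sorted segment: every Nth key and its byte offset.
# sparse_keys and sparse_offsets are parallel lists (illustrative values).
sparse_keys    = ["apple", "grape", "mango", "peach"]
sparse_offsets = [0, 4096, 8192, 12288]

def scan_range(key):
    """Return the byte range of the segment that could contain `key`."""
    i = bisect.bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None  # key sorts before every key in this segment
    start = sparse_offsets[i]
    # None means "scan to end of file".
    end = sparse_offsets[i + 1] if i + 1 < len(sparse_offsets) else None
    return (start, end)

print(scan_range("kiwi"))      # -> (4096, 8192): scan only this small slice
print(scan_range("aardvark"))  # -> None: the key cannot be in this segment
```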

This sorted order would also help us clean up outdated data. Since the segments are sorted, we can periodically merge them, keeping only the latest value for a key that was updated multiple times. This would reduce the overall data size on disk.

[Figure: Segment merge]
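The merge itself is essentially the merge step of merge sort. A minimal sketch, with segments represented as in-memory lists of (key, value) pairs for brevity:

```python
def merge_segments(older, newer):
    """Merge two segments sorted by key, each a list of (key, value) pairs.
    On a duplicate key, keep the value from the newer segment."""
    out, i, j = [], 0, 0
    while i < len(older) and j < len(newer):
        ok, nk = older[i][0], newer[j][0]
        if ok < nk:
            out.append(older[i]); i += 1
        elif ok > nk:
            out.append(newer[j]); j += 1
        else:
            # Same key: the newer value wins, the outdated one is dropped.
            out.append(newer[j]); i += 1; j += 1
    out.extend(older[i:])
    out.extend(newer[j:])
    return out

old = [("videoA", 10), ("videoB", 7)]
new = [("videoA", 12), ("videoC", 3)]
print(merge_segments(old, new))
# -> [('videoA', 12), ('videoB', 7), ('videoC', 3)]
```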

This sounds like a silver bullet. But how do we keep the keys of a segment in sorted order without compromising write performance? (Remember, the write performance of our data-store is good only because our writes are sequential; if we started doing random writes to keep each segment sorted, write performance would get much worse.)

Binary Search Tree to the rescue:

We can keep a self-balancing binary search tree (BST) in memory. Whenever we write data, we write it to the BST first. When the BST grows past a set threshold, we dump its contents into a segment (an in-order traversal of the BST gives us the keys already sorted), add a few keys from the new segment to the index, and start a fresh BST in memory. For a read query, we first search for the key in the BST; if it is not found there, we fall back to an index lookup.
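A rough sketch of this flow. Note that Python’s standard library has no self-balancing BST, so a plain dict sorted at flush time stands in for one here; a real store would keep keys ordered as they arrive.

```python
MEMTABLE_LIMIT = 4  # entries; deliberately tiny so the example flushes quickly

class Memtable:
    """Stand-in for the in-memory BST: buffer writes, flush sorted segments."""

    def __init__(self):
        self.data = {}
        self.segments = []  # flushed sorted segments, newest last

    def put(self, key, value):
        self.data[key] = value
        if len(self.data) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # Dump the buffer as one segment sorted by key, then start fresh.
        self.segments.append(sorted(self.data.items()))
        self.data = {}

    def get(self, key):
        # Check the in-memory buffer first, then segments newest to oldest.
        if key in self.data:
            return self.data[key]
        for segment in reversed(self.segments):
            for k, v in segment:  # a real store would use the sparse index here
                if k == key:
                    return v
        return None
```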

This approach solves the problem of RAM limiting the size of the data-store, and it also gives us a mechanism for cleaning up outdated values.

There are a few problems left in this approach too.

  1. We want our data-store to be durable. What if the power goes off before the data in the BST is written to a segment? There is a possibility of data loss.
  2. If a read request comes for a key that is not in our data-store, in the worst case we may need to scan through all the segments. Such a query would be very costly on a large data-store.
  3. Merging two segments is an O(m+n) operation. We need some strategy to keep the time spent merging under control.

Iteration 4: Further Optimisations

  1. To overcome the durability issue, whenever we add data to the BST, we also append it to a write-ahead log file on disk. This log is used to reconstruct the BST if the power goes off. Once the BST has been written to a segment, the write-ahead log is discarded and we start a new one for the new in-memory BST.
  2. To reduce the average cost of read queries for keys that are not in the data-store, we can maintain Bloom filters. A Bloom filter can, in most cases, rule out keys that are not present, letting us skip the segment scan entirely (see the sketch after this list).
  3. We can keep segment sizes in increasing powers of 2: 4KB, 8KB, 16KB, 32KB and so on, and only merge two segments of similar size, so that the time taken by any single merge stays bounded.
  4. If we power off our data-store and power it back on, the in-memory index is lost. Recreating it by scanning all the segments would be very inefficient. Instead, we can keep a write-ahead log of the index on disk; at startup, this log is read to quickly re-create the index.
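For illustration, here is a toy Bloom filter. The bit-array size, hash count, and SHA-256-based hashing scheme are arbitrary choices of this sketch; production stores use faster non-cryptographic hashes and tune the parameters to the expected key count.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k hash functions over an m-bit array.
    May report false positives, but never false negatives."""

    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, key):
        # Derive k bit positions by salting the key with the hash index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # If any bit is unset, the key is definitely absent.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add("videoId42")
print(bf.might_contain("videoId42"))  # True
print(bf.might_contain("videoId99"))  # almost certainly False: skip the scan
```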

With these optimisations we have improved the read performance of our data-store significantly (compared to iteration 1). This is not a hypothetical design: many write-heavy data-stores, such as Cassandra, use a very similar approach, known as a log-structured merge (LSM) tree. These data-stores typically treat merging segments (compaction) as a low-priority background operation, much like garbage collection in modern programming languages.

Summary:

In this blog, we designed a data-store as an exercise. We started with a very basic append-only log and refined it over multiple iterations to improve query performance. Many commercially available write-heavy data-stores use a very similar approach.

We will see how read-heavy and balanced read-write data-stores manage performance in the next blog.

If you have any queries or feedback, you can write them in the comments.
