bitC-DB: A key Value Store Based on BitCask Storage Format

Pradeep Kumar Singh
6 min readMar 22, 2023

--

In this blog, I will discuss BitCask storage format presented in BitCask’s paper and its one such implementation by me in form of BitC-DB. BitC-DB is a key value store implemented in python following the storage format presented in BitCask’s paper. Please note, bitC-DB implementation is only for educational purposes. The code for the bitC-DB implementation can be found on github.

BitCask

Bitcask is a simple, highly performant key-value storage format used in some database systems, such as Riak. It was designed to provide fast writes and reads, with minimal disk I/O, making it well-suited for use cases that require high write throughput and low latency reads.

The Bitcask storage format uses an append-only file called datafile for writes, where each key-value pair is stored as a record. The records are stored in the order they are written, with each record containing the key, the value, and some metadata. Below image shows the record format stored in data files for each record:

First four fields are called metadata or header and the last two fields are payloads.

CRC: This is the checksum of header and payload combined.

Timestamp: This is the epoch time when entry was appended in the file.

Key Size: Size of the key

Value Size: Size of the value

Key: It is the key specified by user

Value: It is the value specified by user

Once an append-only datafile reaches a certain threshold, it is rotated. During rotation two things happen:

  1. Current append only file is closed and opened in read only mode to avoid further writing to this file.
  2. A new file is created and opened in append only mode.

So at a time there can be only one file where write or read operation will happen, and many other files where only read can happen.

BitCask also maintains an in-memory index named KeyDir(it is a hash map data structure) which holds a mapping between key and its position in a given file.

The below image shows one such entry in KeyDirectory:

Here timestamp is time of record insertion, file_id is the file where record is present, offset is the position in file where record is present and size is the record size.

Below diagram shows the relation between KeyDirectory and datafiles mentioned above:

bitC-DB Operations

Put Operation

When a user invokes put operation with a given value, BitC-DB calculates the checksum, prepares the header. It appends header, key and value as a single record in binary format to the current datafile. The file write call returns offset in the file where the entry was stored. Please note being an append operation , it does not require a disk seek resulting in high write throughput.

BitC-Db then adds an entry in KeyDirectory which points the given key to a position at the returned file offset. Both of these operations are performed atomically.

Get Operation

During the get operation the user gives a key for which they want the value. BitC-Db first checks in its in memory index named KeyDir to find if the given key exists or not. If it does not exist then it simply returns empty value. If it exists then, BitC-DB gets the file_id, offset and size from the index entry and reads the value from the given file at the given offset. After reading it also calculates the CRC again and matches it with the CRC present in the header, if CRC matches then it returns the value otherwise raises an error.

Please note reading a value from the given offset is a random read operation and requires one file seek.

Delete Operation

Delete operation is the same as put operation. When the user specifies a key to be deleted, BitC-Db first checks the in-memory index keyDir and finds out if the key exists. If it does not exist then BitC-DB returns False to the user. If it exists then it appends a special entry in the append only file with value ‘TOMBSTONE’. Once the entry is appended to the data file, it deletes the corresponding entry from the in-memory index called KeyDirectroy.

Its performance is same as put operation and requires no disk seek.

Merge or Compaction

Since all the update and delete operations in BitC-DB are append only, it will create a lot of data files over the time. These files contain lots of obsolete entries that came because of update and delete operations. BitC-Db runs a periodic merge operation on the read only files and creates a new read only file with current entries, i.e. free from duplicate and tombstone entries.

During the merge it reads data files in reverse order of their creation i.e. the latest file is processed first. It creates a map which holds non-duplicate and latest entries. The merge process then creates a new file and writes all entries present in the map to this file. Along the process it also builds a new temporary in-memory KeyDir for this file and later merges this with the current main KeyDirectory. After merging it closes all the older read only files and deletes them. Below diagram explains the same:

Recovery

Since BitC-DB stores all entries to the file, it is easy for it to recover after the crash. It can simply read all the data files and rebuild the in memory index KeyDirectory again. To speed up the process, BitC-Db uses hint files. These hint files are created during the merge operation and there is one-to-one mapping between hint file and datafile. These hint files store only metadata for a key which results in their small size and faster read time during the recovery to rebuild the index.

Below image shows the format of hintfile. It stores the key and its offset in a given datafile.

Strengths and Limitations

Overall, BitCask’s simple design and fast read and write operations make it a popular choice for high-write, low-latency applications where data durability is important. There are some limitations of BitCask storage format as well. Since it stores indexes for all the keys in-memory index named KeyDirectory, key namespace is limited. BitCask storage also does not support range queries efficiently since it has to read for every key in a given range sequentially in different files.

But BitCask can be a better choice where we update a given key’s value frequently rather than inserting a lot of unique keys. It will keep BitCask’s in-memory index under limit.

Happy Reading!! Comments and Suggestions are welcome!

--

--

Pradeep Kumar Singh

Senior Site Reliability Engineer — Google. Views are my own.