Lessons learnt from writing my own Database Engine — PART 1

muayyad alsadi
7 min read · May 27, 2024


The first time I encountered CDB (Constant Database) was more than 10 years ago, when I was contributing to a search engine called Whoosh (something like Apache Lucene, but for Python), which used a variation of CDB as the storage engine for its on-disk search segments (called filetables in filedb). I was shocked by how clean and simple it is; the CDB spec fits on a single page. I tried to write my own implementation and run some experiments, and here is what I've learned.

What is CDB? And how does it work?

Basically, a CDB file is a fast immutable on-disk key-value database. How fast is fast? Assuming no hash collisions, it offers constant access time, O(1): no matter how large the database is (thousands, millions, or billions of keys), access time stays the same. It's much faster compared to DBM-like file formats (like GNU dbm, ndbm, Berkeley DB, OpenLDAP's LMDB, LevelDB, and RocksDB), which I believe are O(log n).

It’s immutable that is written once at once. It’s on-disk two levels “hashtables”, the first level is fixed size that is 256, the second level have variable size. It works like this, Data is stored in the file after fixed sized header in the form of:

  • 32-bit key length
  • 32-bit value length
  • the actual key
  • the actual value
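
A rough sketch of how a single record would be packed (assuming CDB's unsigned 32-bit little-endian integers; pack_record is just an illustrative helper, not part of any CDB library):

import struct

def pack_record(key: bytes, value: bytes) -> bytes:
    # two unsigned 32-bit little-endian lengths, then the raw key and value bytes
    return struct.pack('<II', len(key), len(value)) + key + value

pack_record(b'monkey', b'banana')
# b'\x06\x00\x00\x00\x06\x00\x00\x00monkeybanana'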

While adding keys and values, we keep track of the offset each record was written at and the 32-bit hash of its key. The first 8 bits of the hash (0–255) indicate which hashtable to put the key in; the remaining bits, modulo the hashtable size, indicate where to place it within that hashtable. All of this is calculated after inserting all keys, so the size of each hashtable is known and fixed.
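
As an illustration, here is a minimal sketch of that placement logic using CDB's djb hash (the table_lengths list below is a stand-in for the 256 second-level table sizes, which are only known after all keys have been added):

def djb_hash(key: bytes) -> int:
    # CDB's hash: start at 5381, then h = (h*33) ^ byte, truncated to 32 bits
    h = 5381
    for c in key:
        h = (((h << 5) + h) ^ c) & 0xFFFFFFFF
    return h

def placement(key: bytes, table_lengths: list) -> tuple:
    h = djb_hash(key)
    table_num = h & 0xFF                        # first 8 bits pick one of the 256 tables
    slot = (h >> 8) % table_lengths[table_num]  # the remaining bits pick the slot within it
    return table_num, slot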

Some variations support fast ordered reads similar to B* trees, others support mmap-ed files (the entire DB must fit in memory), others extend it to 64-bit, others put the fixed 256 pointers at the end to avoid a backward seek, etc.

Although CDB is designed for constant immutable data, future inserts can be implemented by having multiple files (in Solr/Whoosh terms these are called segments): one file is kept constant while the other receives the inserts/updates, and then the two are merged.
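
A minimal sketch of that segment idea (the reader/writer objects and their get/put/items/finalize methods are hypothetical placeholders, not an actual CDB API): look keys up in the newest segment first, and occasionally merge two segments into a new immutable file.

def lookup(key, segments):
    # segments: open read-only CDB-like readers, newest first
    for seg in segments:
        val = seg.get(key)
        if val is not None:
            return val
    return None

def merge(old_seg, new_seg, writer):
    # the newer segment wins when the same key exists in both
    seen = set()
    for key, val in new_seg.items():
        writer.put(key, val)
        seen.add(key)
    for key, val in old_seg.items():
        if key not in seen:
            writer.put(key, val)
    writer.finalize()  # write out the hash tables of the new immutable file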

My Implementation Trials

I wanted to test whether I could do:

  • a correct implementation
  • benchmark different total number of keys (1k, 10k, 100k, 1m keys)
  • benchmark different hashing functions and their performance
  • pure python functions, c functions, WASM functions
  • CDB’s djb_hash, crc32_hash, adler32_hash, SipHash-1–3 (or SipHash-2–4 depending on your python version) or even 32-bit part of md5
  • try different hashtable size multiplication factors (1x, 1.14x, 1.33x, or 2x the number of its elements), that is, load factors of 100%, 87.5%, 75%, and 50%
  • try growing the hashtable size to be of the form 6n+1 or 6n-1 and coprime with 5 and 7 (if the size is 60, make it 61, 65, 67, or 71, then exclude 65 because it's divisible by 5)
  • different ways to handle collisions: in case of a collision, instead of a step size of 1, try to use the first coprime of some hash that is ≤ size (when the size is of the form 6n-1 it's most likely easy to find the coprime); see the sketch after this list
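
For the last two items, here is a rough sketch of what is meant (illustrative helpers for the experiment, not part of CDB):

from math import gcd

def adjust_size(n: int) -> int:
    # grow n to the next number of the form 6k+1 or 6k-1
    # that is also coprime with 5 and 7 (e.g. 60 -> 61)
    m = n
    while m % 6 not in (1, 5) or m % 5 == 0 or m % 7 == 0:
        m += 1
    return m

def probe_step(h: int, size: int) -> int:
    # derive the probe step from the hash instead of always using 1:
    # take the largest value <= (h % size) that is coprime with the table size
    step = h % size or 1
    while gcd(step, size) != 1:
        step -= 1
    return step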

Correctness

It’s very easy to get things wrong, make a faster file store, that fails to retrieve. For this reason, before talking about advantages of different approaches and trials, a trial must be correct. I made a script that generates a given number of random keys and values, I thought about making it spit them out as CSV or ND-JSON / JSONL (one json per-line) but I used the following format just because CDB tooling use it (like cdbmake and cdbdump), it’s very simple

key = "monkey"
val = "banana"
key_len = len(key)
val_len = len(val)
print(f"+{key_len},{val_len}:{key}->{val}")

To make sure I get high entropy, I did the following:

  • I used cryptographic-grade HMAC with UUIDv4 to generate keys
  • keys and values are base64
  • key lengths are made to be 30–44 bytes and value lengths 16–1024 bytes
#!/usr/bin/env python3
# gen.py [N]

import random
import hashlib
import hmac
import sys
import uuid
import base64

def rnd_key():
    # random-length salt fed into HMAC(UUIDv4, salt, sha256), then base64
    l = random.randint(10, 100)
    salt = bytearray([random.randint(0, 255) for i in range(l)])
    key = base64.b64encode(hmac.HMAC(uuid.uuid4().bytes, salt, hashlib.sha256).digest())
    return key[:random.randint(30, 44)].decode('utf-8')

def rnd_value(min_size=16, max_size=1024):
    # random bytes of random length, base64-encoded then truncated to that length
    l = random.randint(min_size, max_size)
    b = bytearray([random.randint(0, 255) for i in range(l)])
    return base64.b64encode(b)[:l].decode('utf-8')

def main():
    n = int(sys.argv[1])
    for i in range(n):
        key = rnd_key()
        val = rnd_value()
        key_l = len(key)
        val_l = len(val)
        print(f"+{key_l},{val_l}:{key}->{val}")
    print("")

if __name__ == "__main__":
    main()

I used that script to generate 1k, 10k, and 100k random records, 3 sets of each, named A, B, and C.

python3 ./gen.py 1000 > records_1k_A.src.txt
python3 ./gen.py 10000 > records_10k_A.src.txt
python3 ./gen.py 100000 > records_100k_A.src.txt

File sizes were ~0.5MB, ~5MB, and ~50MB respectively. Then I took a 1k random sample of each using random sort and head:

sort -R records_1k_A.src.txt > sample_1k_A.src.txt   
sort -R records_10k_A.src.txt | head -n 1000 > sample_10k_A.src.txt
sort -R records_100k_A.src.txt | head -n 1000 > sample_100k_A.src.txt

We use the same random files for all of the tests. A basic test is something similar to cdbmake: it takes the random records and stores them into a db (named records_1k_A.cdb, records_10k_A.cdb and records_100k_A.cdb). Then another script reads the random sample line by line, fetches the key from the CDB file, and makes sure the returned value matches the one from the sample:

# reader is an open CDB reader; parse_line() splits a "+klen,dlen:key->data" line
counter = 0
for line in sys.stdin.buffer:
    # +klen,dlen:key->data
    line = line.strip()
    if not line.startswith(b'+'):
        continue
    counter += 1
    key, val = parse_line(line)
    val_db = reader.get(key)
    if val != val_db:
        print("** EE ** value mismatched for key=", key)

Undocumented Killer Feature

It worked, but my files were slightly smaller than those produced by the original tools, and the number of collisions was large (and thus so were the redundant hops, a.k.a. redundant probing): most keys were not in their expected place and needed at least one hop. In the file with 256,000 keys, on average you need to look in ~19 places before finding your key. Only about half of the keys were in their expected places; in the rest of the cases you need to look in multiple places before you find the key, because of the collisions.

[figure: first trial, with hashtable size = number of items]

The distribution was terrible: 0 hops covers only 50% of the keys, the 75th percentile happens at ~4 hops, and 9 hops barely covers 83% of the keys.

[figure: 256k file with hashtable factor = 1x]

What went wrong? It seems that there is an undocumented feature in CDB: it uses a multiplication factor of 2, that is, the hashtable size is 2x the number of its items (a load factor of 50%; CDB hashtables are half used, half empty, so to hold 100 items you need a table capacity of 200).

# python-pure-cdb/cdblib/cdblib.py
def finalize(self):
    '''Write the final hash tables to the output file, and write out its
    index. The output file remains open upon return.'''
    index = []
    for tbl in self._unordered:
        length = len(tbl) * 2  # <-------- HERE
        ordered = [(0, 0)] * length

When I changed this simple multiplication factor, 75% of the keys were in the first place you look for them (0 hops), ~90% of the keys were within 1 hop, and a 2nd hop covers ~95% of the keys.

[figure: 256k file with hashtable factor = 2x]

Using this simple trick made the average number of hops ~0, compared to ~18 before; that is, you are expected to find the key in the first place you look. With such a distribution it can deliver the promise of constant access time regardless of the database size.
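
To see why the load factor matters this much, here is a small simulation sketch (a simplified linear-probing model of one of CDB's 256 second-level tables, not the actual benchmark code): insert random hashes at a given multiplication factor and count how many extra hops each key needs.

import random

def average_hops(keys_per_table: int, factor: float) -> float:
    size = max(int(keys_per_table * factor), 1)
    table = [None] * size
    total_hops = 0
    for _ in range(keys_per_table):
        h = random.getrandbits(32)
        slot = (h >> 8) % size
        hops = 0
        while table[slot] is not None:  # collision: keep hopping to the next slot
            slot = (slot + 1) % size
            hops += 1
        table[slot] = h
        total_hops += hops
    return total_hops / keys_per_table

# a 256k-key file spreads roughly 1,000 keys over each of the 256 tables
print(average_hops(1000, 1.0))  # factor 1x: on the order of ~20 hops on average
print(average_hops(1000, 2.0))  # factor 2x: well under one hop on average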

After doing this change, the CDB file produced by my code matches not only the size but also the checksum of the one produced by the original tool.

And doubling the hashtable size did not double the file size; it barely made a dent in it, because it only costs 8 bytes times the number of keys, and most of the file size is taken by the actual data of the keys and values, not the 32-bit pointers to them. For the 256,000-key file (~140MB), doubling the hashtable size added only 2MB (256,000 keys x 8 bytes = 2,048,000 bytes, which is exactly the difference below):

  • 146,841,448 records_256k_x1.cdb
  • 148,889,448 records_256k_x2.cdb

First Lesson Learnt

Unrelated to the on-disk database engine: I used to think that it's good to pass the hashtable capacity up front to avoid rehashes on each put operation.

// Import the HashMap class
import java.util.HashMap;

public class Main {
    public static void main(String[] args) {
        // Create a HashMap object called capitalCities
        HashMap<String, String> capitalCities = new HashMap<String, String>();

        // Add keys and values (Country, City)
        capitalCities.put("England", "London");
        capitalCities.put("Germany", "Berlin");
        capitalCities.put("Norway", "Oslo");
        capitalCities.put("USA", "Washington DC");
        System.out.println(capitalCities);
    }
}

After this experiment, it seems that this is not enough: I should pass twice as much and use a load factor as small as 0.50, because Java defaults to 0.75 and Rust defaults to 0.875, both of which are large values (thus an almost-full hashtable and thus a higher chance of collisions).

NOTE: I know the default capacity is 16, so setting it to 8 is useless, but this is just a demo; think of an example with 100 items, where you set the capacity to 200.

// Import the HashMap class
import java.util.HashMap;

public class Main {
    public static void main(String[] args) {
        // HERE we pass an initial capacity of twice the elements and a load factor of 0.5
        HashMap<String, String> capitalCities = new HashMap<String, String>(8, 0.5f);

        // Add keys and values (Country, City)
        capitalCities.put("England", "London");
        capitalCities.put("Germany", "Berlin");
        capitalCities.put("Norway", "Oslo");
        capitalCities.put("USA", "Washington DC");
        System.out.println(capitalCities);
    }
}

[figure: 256k items tested with different load factors]

What’s next

  • Trying different hash functions
  • Probing strategies
  • How to implement non-immutable data
