AJ Pham
16 min readNov 20, 2018

Link image

Implement Key-Value Store by B-Tree

Introduction

The Key-Value store is the simplest of the NoSQL databases that are used in almost every system in the world. It can be as simple as a hash table and at the same time, it can also be a distributed storage system. And A Key-Value store is implemented by different data structures. As LevelDB uses LSM-Tree, Redis uses hashtable, InnoDB uses B+Tree,..

The blog is written by Tai Pham and Thuyen Phan.

In this blog, we’ll talk about Key-Value store service using BTree structure. Key-Value stores is also a topic of our mentor assigned to us during our internship at VNG.

Motivation

Implementing Key-Value Store by B-Tree is a challenge for us. But this will be a first step stone for database research. And we make this article aimed at sharing the learning process as well as being implementing the challenge.

What is a Key-Value Store?

A key-value database, also called a key-value store, is a type of nonrelational database that uses a simple key-value method to store data. Key-value databases have emerged as an alternative to many of the limitations of traditional relational databases, where data is structured in tables and the schema must be predefined.

Values are identified and accessed via a key, and store values can be binary, numbers, string, XML, JSON, HTML, image.

A key-value store offers several advantages like: flexible data modeling, high performance, high availability, operational simplicity, massive scalability.

There are multiple implementations of the key-value store that are used in production grade systems worldwide. Some example open-source implementations are Redis, CouchDB, MongoDB, and Cassandra (which uses a b-tree as the underlying data structure).

This is DB-Engines Ranking of Key-value Stores

The problem of storing data on memory

As we know, accessing data on memory is faster than hard disk. But memory is limited, data is not. In addition, saving data on memory is not safe in case the program crashes for any reason. So, we must store data under on hard disk when there is big data and data must be persisted.

Life is not like a dream, the use of disk for storage needs to solve disk access time problems. Therefore, we need to design a structure to store data under the file system so as to reduce the number of disk accesses.

Redis and memcache both choose the hashtable for this data structure. It’s a good choice because it is fast; with a good hash function, both read and write operations take place in O(1) time.

LevelDB and RocksDB both use the log-structured merge-tree to store its data. LevelDB and RocksDB implementations offer very good performance for writing, and good performance for reading.

And there are many data-store implementations that use the BTree as LMDB, BerkeleyDB. This blog post will show you how to implement a key-value store with B-Tree.

Why do use B-Tree to implement this?

Let’s take a quick talk at both the read and write data types of HDD (Hard Disk Drive) and SSD (Solid State Drive).

As you might have read, using traditional HDD for reading or writing is very low. When the data needs to be read or written, the actuator with an arm needs to go to a particular sector on the track to read or write data. This is measured as seek time. After that, the driver needs to rotate to reach a particular sector (rotational latency). This happens really fast but when we are reading huge amounts of data, it might become a bottleneck since disk has to continuously move to specific sectors. Average seeks times vary from 4ms for high end servers to 9ms for commonly used desktop drives.

SSD (Solid State Drive) offers superior performance because they have electrical connection for locating specific memory areas. But still, they also have average seek time of .10ms which might add up overtime for large data sets. Also, SSDs are expensive as compared to HDDs so not everyone is able to afford those.

Why do we talk to hard disk access? When we are accessing large data sets and doing frequent read or write from disk, we need consider this cost of read/write.

What’s B-Tree related here? Does using the B-Tree structure reduce this cost? Most of the operations of the B-Tree (search, insert, delete, max, min, …) require disk access with O(h) times. O(h) is the height of the B-Tree. B-Tree is a self-balancing search tree or fat tree. The height of B-Trees is kept low by putting maximum possible keys in a B-Tree node. Since the data is read from disk in the form of blocks. So, we have to design a B-Tree node size that is equal to disk block size. At the same time, the height of B-Tree is low, so the total disk access for most of the operations is reduced significantly. In addition, rebalancing of the tree occurs less often. And time complexity to search, insert and delete of B-Tree is O(logn), it will help increase performance.

Moreover, there is a data structure that can solve the problem of reducing disk access: B+Tree. The B+Tree is a modification of the b-tree that stores data only in leaf nodes, minimizing search cost in the common and worst case, and (optionally) links together all the leaf nodes in a linked list, optimizing ordered access. B+ trees are extensively used in both database and file systems because of the efficiency they provide to store and retrieve data from external memory.

About B-Tree

In computer science, a B-Tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions is logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children.

Why have B-trees been invented? Actually, binary trees are faster because the number of comparisons you have to make to retrieve a value is also equal to the number of levels. If the binary-tree is balanced, this will be log2(nb values). In a B-tree, as we hold up to N values in each node, the number of nodes you will fetch is logN(nb values), but for each fetched node, we also have to do log2(n values in node) comparison per level.

B-Tree is less efficient than balanced binary-trees, that’s a fact. Why then should we use a B-Tree?. Because we don’t process data in memory only: we usually fetch it from disk.

Properties of B-Tree:All leaves are at same level.

  • A B-Tree is defined by the term minimum degree ‘t’. The value of t depends upon disk block size.
  • Every node except root must contain at least t-1 keys. Root may contain a minimum of 1 key.
  • All nodes (including root) may contain at most 2t-1 keys.
  • The number of children of a node is equal to the number of keys in it plus 1.
  • All keys of a node are sorted in increasing order. The child between two keys k1 and k2 contains all keys in the range from k1 and k2.
  • B-Tree grows and shrinks from the root which is unlike Binary Search Tree. Binary Search Trees grow downward and also shrink from downward.
  • Like other balanced Binary Search Trees, time complexity to search, insert and delete is O(Logn).

Some operating system knowledge

Memory Allocation
Simply memory allocation means: Reserving memory for specific purposes.

Programs and services are assigned with a specific memory as per their requirements when they are executed. Once the program has finished its operation, the memory is released and allocated to another program or merged within the primary memory.

Memory allocation has two types:
Static Memory Allocation:

  • The program is allocated memory at compile time.
  • Use partition stack on virtual memory.

Dynamic Memory Allocation:

  • The programs are allocated with memory at run time.
  • Use partition heap on virtual memory.

Memory Mapping
Between the break point and the base pointer is unallocated memory, but it may not be unused. This region can be memory mapped, which is the process of loading data directly from files into memory. You can directly memory map files, but often this is done automatically for you when you read and write files.

Another common use for the middle addresses is the loading of dynamic shared libraries.

Example:
When you make a call to a function like printf() or malloc(), the code for those functions exists in shared libraries, the standard C library, to be precise. Your program must run that code, but the operating system doesn’t want to have more code load into memory than needed. Instead, the O.S. loads shared libraries dynamically, and when it does so, it maps the necessary code into the middle address spaces.

Linux Thread Synchronization
Thread synchronization is defined as a mechanism which ensures that two or more concurrent processes or threads do not simultaneously execute some particular program segment known as critical section.
Process access to the critical section is controlled by using synchronization techniques. When one thread starts executing the critical section the other thread should wait until the first thread finishes. If this is not applied, it may cause a race condition where the variables may be unpredictable and vary depending on the timings of context switches of the processes or threads.

Readers — Writers Problem
Consider a situation where we have a file shared between many people.
If one of the people tries editing the file, no other person should be reading or writing at the same time, otherwise changes will not be visible to him/her.
However, if some person is reading the file, then others may read it at the same time.

Implement Key-Value Store by B-Tree

Overview
We use C++ programming language to implement this project. We approach it in two different ways. I would like to introduce the first way and also my implement.

The first way implements
In my project, I have applied some techniques like:

  • I built my project on a Linux OS environment.
  • I write class template for B-Tree structure, so i can adapt to more than one data type without repeating the entire code for each data type. But i will use string data for the program.
  • I will create a key-value store using Object-Oriented Programming.
  • I use the Binary Search Algorithmic.
  • Since i wanted to build a Key-Value Store service, i applied the server-client model. And I chose the thrift library to build server-client.
  • I also solve I/O problems with disk access and use of the file system.
  • And i have read some OS knowledge.

I will split a Key-Value Store Service into three parts to implement including B-Tree structure, Server-Client model and store data on disk.

The project implementation steps are as follows:
Firstly, I built the client-interface Key-Value store class with the basic methods that they can interact with it. This is also referred to as the API. The minimum API for a key-value store must include the methods Get(), Set(), Exist() and Delete().

On the server side of Key-Value Store service, I also built the same Key-Value Store interface. And B-Tree Store will inherit it. After that, I define template class data key-value as string, i call the key-value pair as an entry. Next, I built B-Tree structure with the key defined entry. And the node structure of B-Tree is as follows:

Make sure that i have carefully tested the B-Tree structure methods before
processing to the next step. The B-Tree structure you can refer to here.

In this section I will talk about how to store data on disk. First of all, I define the file structure for storing data on disk. The file structure database is as follows:

In order to store data of B-Tree on file, I have to change the B-Tree structure a bit like this:

I will explain the file structure as follows:

  • Header file includes two values:
    — Minimum degree ‘t’: 4byte.
    - Position root BTree: 4byte.
  • Node structure in file:
    - flag (1byte):
    - 0000 0000: the leaft.
    - 0000 0001: the node has a child.
    - 0000 0010: the node was deleted.
    - nCurr Key (4byte): the current number of keys of node.
    - array byte save positon of key: 4byte * (2*t — 1).
    - arry byte save positon of child node: 4byte * 2*t.
  • Entry structure in file:
    - flag (1byte):
    - 0000 0000: the entry was removed or edited.
    - 0000 0001: the entry was active.
    - lenkey (4byte): len key.
    - key (string any size).
    - lenValue (4byte): len value.
    - value (string any size).

How does it work?
When the server is started, the node root of B-Tree will read and kept on memory. Actions with the B-Tree Store as insert, remove, search still follows the B-Tree algorithm. But we will do it with position of entry and node on file. So, when the client adds a new entry to the B-Tree store, a new entry will be written at the end of the file. This is similar when writing a new node to the file. When the client wants to remove any entry, i will mark that node by changing the value of the flag at the beginning of the entry. This is also similar when removing a node in the case merging two nodes. When a client edits the value of an entry, I also mark that entry and write a new entry to the last file. When a client wants to get the value of any key, I will search algorithm of B-Tree so as to find position node in file where contains key. After i read the node and use binary search in order to find the position of key in file. Then i just read value at this position.

How do the entries or nodes that are marked remove or edit process? In my idea, these entries or nodes will be removed to file after a period of time by compaction file. And these entries or nodes can help us track user behavior.

The technique I use in this section is the basic method of accessing a file is via the read() and write() system calls. Before a file can be accessed, however, it must be opened via a open() or creat() call.Once done using the file, it should be closed using the system call close() .

At the end of implementing, I would like to introduce the server-client architecture.
The following is the image of architecture.

The client will connect to the server and call API. The server handle Api based on API that B-Tree Store provided.

I measured server performance with locust:

  • Test API Set Key-Value:
  • Test API Get Key-Value
  • Test API Remove Key-Value:

I have an idea that we can cache the position of entry or node to help increase the performance of the Key-Value Store. But we have to solve the problem of data consistency.

The above is my approach to implement Key-Value Store by B-Tree. It is not a good implement, because I cannot identify T (the value of t depends upon disk block size) of B-Tree in this approach. So, the performance of this service will not be heigh. But my partner will be better than one for this problem.

This is repo of the first way implement: https://github.com/phamtai97/key-value-store

The second way to implement
The second way to implement key-value-stores is simpler than the first way. In this way I also use B-Tree for data structure to store pair of key-value.
Almost Key Value Database must have to some method like: set, get, exist, remove and so on, so I do. In my mini project, the key value store has some methods for accessing databases like get, set, remove, exist.

Each database item is stored as a key, or attribute name, and is associated with a value. This can work very well for unstructured data as the database does not require a set schema across key-value pairs. This can scale very well and have high performance due to the simplicity of design and the fact that only the key is of interest to the database.

The first thing we need to know to implement the key-value-store is about system programming. Basic knowledge about system call, file I/O and memory management was needed.

I am starting this project in a fresher course at VNG Corp. After this project I’ve gained a lot of experience and valuable knowledge of hardcore back-end engineering, but I know that I am going to learn a lot of new things too. This project allows me to review:

  • The C++ programming language
  • Algorithmics and data structures
  • Memory management
  • Concurrency control with multi-threading
  • Networking with a server/client model
  • I/O problems with disk access and use of the file system
  • Some problems about concurrency control: producer-consumer problem, read-write problem.
  • A little bit about preprocessors in C programming.
  • Implementing key-value-stores using B-Tree, I read some eBook or blog about some algorithms in B-Tree, to understand clearly, I recommend you read a book named ‘Introduction to Algorithms’ by Thomas H. Cormen.

Implement B-Tree

BTree has some important fields like:

  • next_pos: mean next position in file, new node created will be here.
  • root_pos: mean that position belongs to root node.
  • degree: the goal of a b-tree is to minimize the number of disk accesses.
  • degree depends upon disk block size.
  • child: list of BTreeNode contains child of current node
  • num: number of key stores in node, number of child equal num plus one
  • child_pos: very important field, this array contains the position of child — node in file.
  • pos: position of current node
  • leaf: identify node was leaf or not

How it works
When you modify a node, using some method like: set or remove. Any node effect will be written to disk. And one more thing, when restart server, B-Tree will rebuild depending upon nodes.dat file and tree.dat file, read all available nodes to memory. In this way, we need time to rebuild trees, but some access like get, exist will be faster. But this way is sometimes bad, because you don’t have enough space in memory, so the structure of B-Tree in this way is used when you don’t need to store a huge amount of data, and you want to faster when search or check exist.
I think the above way is so bad, so I decided to implement another version:

This looks like the older version, but that fields BTreeNode child[2t] was removed. This means we don’t have any tree structure in memory. Instead of using pos in every node identify offset data node in file, when we want to access this node, we use pos and read node from file in exact pos we have. At each node we have fields call childpos[2t], this fields contain 2t value integer to identify child position in file.

So, when we restart server, we don't need to rebuild tree structure, we just read tree metadata file, it contains root position and next position. We use this node to figure out all the nodes we need.

This is repo of the second way implement: https://github.com/billhcmus/key-value-store

Benchmark

You can refer to here

Issues surrounding the Key-Value store by B-Tree

Write amplification (WA) is an undesirable phenomenon associated with flash memory and solid-state drives (SSDs) where the actual amount of information physically written to the storage media is a multiple of the logical amount intended to be written. The NAND flash memory must be erased before it can store new data. In other words, data cannot be overwritten directly as it is in a hard disk drive. You can see more here and here.

Since we have to design a B-Tree node size that is equal to disk block size. So if we store data on SSD, we will have problem with “write amplification”. But LSM solved this problem. You can see more here

Conclusion

We are done with this blog post. Through this post, we would like to share with you about implements Key-Value Store by B-Tree. Besides, there is some new knowledge related to databases that we learn in the process. At the end, we were very happy and would like to thank you our mentor anhld2 and tannt3.

Reference

[1] http://basho.com/resources/key-value-databases/
[2] http://blog.justinsb.com/blog/2013/12/08/cloudata-day-2/
[3] https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics
[4] https://loveforprogramming.quora.com/Memory-locality-the-magic-of-B-Trees
[5] https://www.geeksforgeeks.org/b-tree-set-1-introduction-2/
[6] https://www.quora.com/In-filesystems-what-is-the-advantage-of-using-b-trees-or-b+trees-as-opposed-to-using-inode-indexes
[7] https://en.wikipedia.org/wiki/B-tree
[8] https://blogs.apache.org/directory/entry/let-s-meet-b-trees
[9] https://vnreview.vn/tu-van-may-tinh/-/view_content/content/445083/co-ban-ve-luu-tru-so-phan-4-o-luu-tru-ssd
[10] https://www.techspot.com/news/54107-understanding-ssds-why-ssds-hate-write-amplification.html
[11] http://nisl.wayne.edu/Papers/Tech/wacost_tos.pdf
[12] https://www.quora.com/How-does-the-log-structured-merge-tree-work

AJ Pham

I am a Senior Software Engineer at VNG. I often write blogs to share experiences as well as read more blog posts. linkedin.com/in/aj-pham-abc27011997