How I built a key value store in Go
Databases (or DBs) have played a very important part in the recent evolution of computers. At the time, computers were basically giant calculators and data (names, phone numbers) was considered the leftovers of processing information. Computers were just starting to become commercially available, and when business people started using them for real-world purposes, this leftover data suddenly became important.
In this article, we would go through the process of understanding the problem and deriving the corresponding solution in Go.
In a key value database, all access to the database is done using a primary key. Typically, there is no fixed schema or data model. The key can be identified by using a random lump of data. Key-value stores “are not” useful when there are complex relationships between data elements or when data needs to be queried by other than the primary key.
An element can be any single “named” unit of stored data that might, or might not, contain other data components.
Advantages of key value stores over RDBMS:
- Flexible data modeling: Because a key-value store does not enforce any structure on the data, it offers tremendous flexibility for modeling data to match the requirements of the application.
- High performance: Key-value architecture can be more performant than relational databases in many scenarios because there is no need to perform lock, join, union, or other operations when working with objects. Unlike traditional relational databases, a key-value store doesn’t need to search through columns or tables to find an object. Knowing the key will enable very fast location of an object.
- Massive scalability: Most key-value databases make it easy to scale out on demand using commodity hardware. They can grow to virtually any scale without significant redesign of the database.
- High availability: Key-value databases may make it easier and less complex to provide high availability than can be achieved with relational database. Some key-value databases use a masterless, distributed architecture that eliminates single points of failure to maximize resiliency.
- Operational simplicity: Some key-value databases are specifically designed to simplify operations by ensuring that it is as easy as possible to add and remove capacity as needed and that any hardware or network failures within the environment do not create downtime.
Some scenarios in which a key value store makes more sense:
- NoSQL database
- Storing Entire Blockchain data
- Cache Replacement
- Inside Recommendation Engines
- Storing very large object graphs
- Log storage
- User Profile data
- Inside a url shortener
Some of the most popular key value stores currently are BoltDB, Badger, LevelDB, etc.
I was building a blockchain protocol last year and while designing the system, i saw that almost all the existing blockchain clients of different protocols were using a key-value store for their storage needs. Apart from being used in blockchain, the thing which got me excited about building a database was the complexity of the problem which can only be solved by using proper data structures and algorithms, and I got very few chances of working on a production level system software.
The problem — Optimizing Disk Seeks
The seek time is the time it takes a specific part of a hardware’s mechanics to locate a particular piece of information on a storage device. This value is typically expressed in milliseconds (ms), where a smaller value indicates a faster seek time.
What seek time is not is the total amount of time it takes to copy a file to another hard drive, download data from the internet, burn something to a disc, etc.
Seek time is often called access time, but in reality, the access time is a bit longer than the seek time because there exists a small latency period between finding data and then actually accessing it.
So, in order for a database to search for particular data, we need to minimize the number of seeks as well as the time.
The Solution — Btree
We need to store data in such a way that we can speed up the writing and reading process on disk. Before diving into the solution, let us think about some random ideas which come to our mind. Here are a few of them:
- Comma Separated Records
- Storing everything in multiple JSON files
- XML based storage
- Serializing data objects to disk
We can use any method for data storage, but explaining the pros and cons of every method listed above is beyond the scope of this article. Let us move straight towards our solution.
A B-tree is a method of placing and locating files (called records or keys) in a database. The B-tree algorithm minimizes the number of times a medium must be accessed to locate the desired record, thereby speeding up the process.
B-trees are preferred when decision points, called nodes, are on hard disk rather than in random-access memory (RAM). It takes thousands of times longer to access a data element from hard disk as compared with accessing it from RAM, because a disk drive has mechanical parts, which read and write data far more slowly than purely electronic media. B-trees save time by using nodes with many branches (called children), compared with binary trees, in which each node has only two children. When there are many children per node, a record can be found by passing through fewer nodes than if there are two children per node.
Sectors and Blocks of a hard disk
A sector is a physical spot on a formatted disk that holds information. When a disk is formatted, tracks are defined (concentric rings from inside to the outside of the disk platter). Each track is divided into a slice, which is a sector. On hard drives and floppies, each sector can hold 512 bytes of data.
A block, on the other hand, is a group of sectors that the operating system can address (point to). A block might be one sector, or it might be several sectors (2,4,8, or even 16). The bigger the drive, the more sectors that a block will hold.
Blocks as BTree nodes
If we map each Btree node to a block, we would be able to convert our in memory Btree to a disk based one. Let us first design a struct for our Btree node.
// DiskNode - In memory node implementationtype DiskNode struct {keys []*pairschildrenBlockIDs []uint64blockID uint64blockService *blockService}
Now we need a block struct which would represent our actual data on disk
// diskBlock -- Make sure that it is accomodated in blockSize = 4096type diskBlock struct {id uint64 // 4096 - 8 = 4088currentLeafSize uint64 // 4088 - 8 = 4080currentChildrenSize uint64 // 4080 - 8 = 4072childrenBlockIds []uint64 // 352 - (8 * 30) = 112dataSet []*pairs // 4072 - (124 * 30) = 352}
Manipulating Bytes with Go
Since we are dealing with the file system, we need to read and write data in bytes, so this poses another problem: Converting a block struct to bytes and vice versa. Let us use the below function to get back bytes from a block
func (bs *blockService) getBufferFromBlock(block *diskBlock) []byte {blockBuffer := make([]byte, blockSize)
blockOffset := 0//Write Block index
copy(blockBuffer[blockOffset:], uint64ToBytes(block.id))
blockOffset += 8
copy(blockBuffer[blockOffset:], uint64ToBytes(block.currentLeafSize))
blockOffset += 8
copy(blockBuffer[blockOffset:], uint64ToBytes(block.currentChildrenSize))
blockOffset += 8//Write actual pairs now
for i := 0; i < int(block.currentLeafSize); i++ {
copy(blockBuffer[blockOffset:], convertPairsToBytes(block.dataSet[i]))
blockOffset += pairSize
}// Read children block indexes
for i := 0; i < int(block.currentChildrenSize); i++ {
copy(blockBuffer[blockOffset:], uint64ToBytes(block.childrenBlockIds[i]))
blockOffset += 8
}
return blockBuffer
}
Similarly, let us generate a block struct if we have read a Byte array of size 4KB from disk:
func (bs *blockService) getBlockFromBuffer(blockBuffer []byte) *diskBlock {blockOffset := 0
block := &diskBlock{}//Read Block indexblock.id = uint64FromBytes(blockBuffer[blockOffset:])blockOffset += 8block.currentLeafSize = uint64FromBytes(blockBuffer[blockOffset:])blockOffset += 8block.currentChildrenSize = uint64FromBytes(blockBuffer[blockOffset:])blockOffset += 8//Read actual pairs nowblock.dataSet = make([]*pairs, block.currentLeafSize)for i := 0; i < int(block.currentLeafSize); i++ {block.dataSet[i] = convertBytesToPair(blockBuffer[blockOffset:])blockOffset += pairSize}// Read children block indexesblock.childrenBlockIds = make([]uint64, block.currentChildrenSize)for i := 0; i < int(block.currentChildrenSize); i++ {block.childrenBlockIds[i] = uint64FromBytes(blockBuffer[blockOffset:])blockOffset += 8}return block}
We have seen how we are able to read bytes from the hard disk and map them to our block struct, but wait, how does this block fit into our Btree, our Btree only understands Nodes, so now we need a way to convert our block struct into a DiskNode Struct, let us see how we can do that:
func (bs *blockService) convertBlockToDiskNode(block *diskBlock) *DiskNode {node := &DiskNode{}node.blockService = bsnode.blockID = block.idnode.keys = make([]*pairs, block.currentLeafSize)for index := range node.keys {node.keys[index] = block.dataSet[index]}node.childrenBlockIDs = make([]uint64, block.currentChildrenSize)for index := range node.childrenBlockIDs {node.childrenBlockIDs[index] = block.childrenBlockIds[index]}return node}
For understanding everything in detail, please go through the github repository.
We are so obsessed with using libraries and frameworks that we do not try to understand the problem and think of a solution first, instead, we straight away start a quest for finding the next cool framework or library on the market, let us try to dig in more about how our world works by building your own versions of your favorite tools and components.
There is no harm in using libraries and there is no need to re invent the wheel, yet the best way to learn and understand something is to build it on your own, and a language like Go makes it even more enjoyable.