DB4: hengfeiyang/lsmdb

Arjun Sunil Kumar
Distributed Systems Engineering
8 min readJun 11, 2022
https://segmentfault.com/a/1190000041198407/en

NOTE: Before starting, the codebase doesn't have Level-based/Size-based SST compaction. So, the SST files reside in Level 0. But, this would anyways be a good starter for learning WAL, Memtable, ImmutableMemtable, and SSTable. It also doesn’t have Bloom Filters.

These days, I am reading more database codes, than writing any of them. I have been reading codes in Java, Go, and Rust. I started my journey by reading java code for SimpleDB (a traditional RDBMS), then tokio mini-redis written in rust, RheaKV in Java again (this was similar to TiDB, TiKV based on RAFT), Raft-KV in Go, and KeyDB in Go. I did find similarities in the patterns used, but yes so far, these were all toy projects to get myself clear on the DB internal. There is a lot left to understand about a production-ready database.

It took me a while to juggle between languages a figure out what I should learn to be in this domain. My best bet was on RUST, but honestly, It requires a lot of time investment to get myself to master RUST. And code is usually verbose and I was not really liking it. Then the next bet was either Go or Scala. I find Go to be more aligned with the type of work that I would be interested in future, ie more into Distributed Systems (Go) side of things, rather than the Data Engineering side(Scala).

On a side note, TiDB and TiKV architecture has greatly influenced my pursuit of DB specialization. I really like the modern tech stack (Go and Rust) they have chosen for the DB. They have great learning materials with immense potential if translated to English. Hope they make their contents globally accessible,

There is a lot to learn in Java itself. I am ashamed to say, even after spending 4 years working on Java, I still can’t say I have mastered the language. If I could undo my past, I would definitely read netty internal architecture, focus more on the distributed systems side of things than Data Engineering and try Open source contributions in a Java DB. But, enough of my pity talk, let's start with the dissection of a new DB here.

Project Structure

I liked the project structure laid out for this one.

cmd/lsmdb/main.go

internal/server/server.go

We are passing the Gin engine to the router.Route(). We are passing it by reference, so that, the router can add the routes defined in that class. Then we invoke Run() on the Gin engine.

internal/router/router.go

The API’s

internal/service/get.go

internal/service/set.go

internal/service/flush.go

internal/service/bulk.go

internal/pkg/lsm

This is the crux of our database. Here we will be looking into a small LSM tree implementation. Below is the SET and GET data flow.

Also, a snip from a translated article.

https://www.skyzh.dev/posts/articles/2021-08-07-lsm-kv-separation-overview/

internal/pkg/lsm/server.go

  • Here in all the helper functions, we are calling the functions from the memory_table.go
  • Compaction is not yet implemented thought.

internal/pkg/lsm/memory_table.go

This is the main file, which has the GET, SET, LoadFromDiskTable, LoadFromWAL, etc.

  • When the activeTable is full, we move the contents to immutableTable and the set the activeTable as new SSTable ready to receive fresh writes.
  • Data Structure definition
  • Usage of NewMEMTable in server.go
  • Alter commands ie SET/Delete
  • Here you can see that, we lock the mem_table, and then check if the size is ≥ blockKeyNum, if it is, then we do the switch.
  • If the restore is false, ie the normal write, we first append it to the WAL.
  • Then we do a lock again, call the activeTable.set(command) API and call the unlock.
  • The public Query function checks if it already Deleted an entry. If so, it throws the error NotExists.
  • Essentially, we delete items by setting a delete flag or setting the Command to CommandTypeDelete (soft delete), instead of removing the entry from the file (hard delete). We do level compaction, we ensure that the marked for deletion items are not included in the merged SST.

The actual query follows this path

query

→ memory SSTable →active SSTable

→immutable SSTables

→sparse Index → SST file

Now we are going to enter the meat of this memory_table, the Flush, LoadFromDisk & LoadFromWAL.

Let's see how the File byte layout looks like for SSTable (ie sdb file)

Setting the metadata partial data

Covered the last 3 in the layout.

We will do deviation here, read about Sparse Index, and come back to this later.

Sparse Index

Just watched some short videos (video 1, video 2), and here is the gist

https://www.youtube.com/watch?v=DgIrXSclo2k
http://mlwiki.org/index.php/Sparse_Index

So now coming back.

Here is the pseudocode of what is happening in the Flush()

Excerpt from an article

https://segmentfault.com/a/1190000041198407/en
https://segmentfault.com/a/1190000041198407/en
https://segmentfault.com/a/1190000041198407/en

From here on let's deviate for a while before coming back to LoadFromDiskTable() & LoadFromWAL().

Let’s now focus on the DTOs that will go inside the disk.

spase_index.go

metadata.go

sparse_index.go

I’m not sure how TableName is not written into the file. May be we read it from the file name or so.

This is how the data folder looks like:

Let’s see the Datastructures in here.

wal.go

https://segmentfault.com/a/1190000041198407/en

NOTE: The “Google: Big Table” paper on creating a CP database. They said that they can ensure higher availability(A in CAP theorem) by doing redundant replication(Raft or Paxos), and improving the Quality of Hardware. And these days, most data centers have more specialized hardware than commodity hardware.

sstable.go

Here we are using Skip List library.

https://en.wikipedia.org/wiki/Skip_list

What happens in Bytes() should be clear by now.

We have in a sense covered all the files except sstable_disk.go. We will cover it later. Now lets go back to the memory_table.go and finish the rest of the functions.

memory_table.go [Cont.]

From the above code, we now have pretty go idea what Restore() does within the /disk files.

Restore is like, restoring the state of that object by reading the byte[].

Now, we are left with just sstable_disk.go

sstable_disk.go

Usage of NewDiskSSTable()

Code:

Here you can see, we are keeping a Map<FileName, DiskSSTable> for faster disk access by application caching.

The DiskSSTable contains an array of SSTable (ie Blocks), in that file. A quick reminder, SSTable is nothing but a wrapper around SkipList.

If the block is not in memory, load it and do the Query on that SSTable.

Load is fairly simple as well.

Since Compaction is not implemented, we are not doing any garbage collection (removal of deleted key-value entries) here.

https://www.skyzh.dev/posts/articles/2021-08-07-lsm-kv-separation-overview/

Conclusion

We have covered the complete lsm-DB. This was a great codebase to learn about Go structuring and also a beginner introduction to Memtable, WAL, LSM tree, etc. I made some minor modifications to the lsm-DB on my personal fork. Feel free to check that out as well. That’s it for today. Have a great weekend everyone!

--

--

Arjun Sunil Kumar
Distributed Systems Engineering

Writes on Database Kernel, Distributed Systems, Cloud Technology, Data Engineering & SDE Paradigm. github.com/arjunsk