Everything is a string

Designing in-memory databases with sdb


Recently I have been toying with my own key-value database, named SDB.

SDB stands for String DataBase and it can be described as a “strings-only memcache with disk storage”. This may sound like reinventing the wheel, and it is indeed.. but it spawns some new possibilities and ideas for data storage.

The public API performs some actions in a memory hashtable which can be synced to disk atomically at any time. It is also possible to instantiate children databases refering them by a string. That’s called “namespaces”.

Data Types

In SDB everything is stored as a null terminated string, but it provides proper conversions for the following types:

  • numbers (unsigned 64bit integers)
  • lists (comma separated)
  • strings (base64 encoded)
  • json (get/set json fields)
  • binary (base64 encoding)

You may add even more custom data types… and all them will still be serializables into a simple string.

That’s why I decided to build SDB on top of a single data type: the string.

Serialization

Most key value databases use raw byte arrays as root data type, instead of a null-terminated string. But you end up having to de-serialize the contents into something usable by the application, which end ups in complex, non-portable or repeating the same code all the time.

SDB standing on top of strings ensures the data will be always human readable and printable, which simplifies application/database debugging a lot.

The API provides optimal serialization primitives for converting numbers (dec/hex), boolean, and other types .. from/to strings. Also, strings are encoded in base64 in order to avoid charset issues or using special chars like ‘,’ or ‘;’ delimiters which are used by SdbQuery.

Weak pointers

One of the goods of running an in-process database is that we can use it to keep track of pointers in memory, so we can mix stable and unstable data on the same database. Those weak pointer keys shouldn’t be synced to disk, because they loose the meaning when used on another process instance.

In order to avoid syncing them to disk you can do the following:

  • Use a separate Sdb instance without specifying any filename.
  • Never call sdb_sync() on the given database
  • Specify an expiration time for those keys (temporary keys do not sync)

Bear in mind that SDB permits serializing C structs into a strings and viceversa, in that case, you should take care and skip all the memory pointers referenced on those structs to avoid having weak data stored in the database.

Data structures

At this point you may probably noticed that SDB is implemented on top of a hashtable with an internal linked list. This is simple enough to cover all the needs and allows me to keep the code short and smart.

Most applications require more complex data access, which can’t be solved by simple O(1) hashtable access or a slow and wrong O(n) linked list iteration.

Here’s where higher level data structures join the game, they can extend the possibilities of a very simple database in order to provide higher speeds for acessing the information.

Implementation

As an example, I have already implemented some basic data structures on top of a key value storage in Vala, which is named ‘SdbTypes’.

https://github.com/radare/sdb/tree/master/bindings/vala/types

We can implement an AST, btree, hashtable, ranged indexes and more using those simple concepts and staying away from messing with pointers, memory leaks and other common issues when doing hand made data structures.

I will discuss other data types with proper benchmarks and comparisons in the future, but for now as an introduction I will show you how to implement a single linked list on top of a key-value data storage.

Single Linked List

As long as we don’t follow a strict schema we will need to define the requirements for the instances of each element, so we will easily spot the key name and how to walk the contents.

  • Name of the data structure (slist)
  • Each root instance must be indexed by name
  • Each element but last contain a pointer to the next element
  • Contained data will be a string, but we can deseriealize that content
  • We want to get the length of the list quickly

Following those rules we can save the linked list as key-values:

slist.<name>=<length>
slist.<name>.last=<index>
slist.<name>.first=<index>
slist.<name>.<index>=<data>
slist.<name>.<index>.next=<data>

Then we will have to define some functions in the application side to handle this structure and let the “complex” part away from the users. In Javascript:

  • add
  • del
  • forEach

This is the source of an SList implementation using the NodeJS bindings.

Source of slist.js for NodeJS

Running the above code we get the following output:

Output of slist.js

Final notes

Implementing data structures on top of a hashtable can result tedious and may look slow, compared to native data structures. But it brings some benefits that makes it an interesting option to choose for your next application.

  • Implicit disk storage (load/save)
  • Data serialization (ut64, binary, array, C structs, boolean, ..)
  • Dynamic Typing (everything casts to a string)
  • Ease refactoring (no schemas)
  • Ease debugging (interactive query interface)
  • Distributed capabilities (network)
  • Temporary keys (expiration time)
  • Access from bindings (Python, NodeJS, NewLisp, Vala, …)

Having only one base type, the string; simplifies the logic of the database and the API provides a set of functions to convert various native types from/to a string.


—pancake