Why you should build your own NoSQL database

9 min readSep 12, 2016

> DISCLAIMER: This article was created several years ago, it’s not the most up to date reference for such content

If you code for the web, you probably work with databases, which leads you to integrate, setup and most importantly understand how a database works.

Let us say at some point your project starts to draw attention, many users connecting and consuming data at the same time. You will have to scale!

One of the secrets behind a scalable architecture for a web application relies on knowing which database to use as well when and how to use them. For such, you must understand them.

In order to acquire such knowledge of the challenges and paradigms behind a project, it is interesting to look at the code and make sense of it.

Most of the times this process of diving into the code can be a demystifying adventure. The experience can provide you with a clear picture of its functionalities.

Therefore, always remember:

I hear and I forget. I see and I remember. I do and I understand — Confucius

By creating your own, even if it is just for fun, you can understand the concepts and paradigms behind a project like so.

Found it interesting? Keep scrolling!

What is a NoSQL database?

Before anything else, let us refresh our memory with some concepts

Relational or SQL databases emerged in the 80’s and they are still largely used.

By establishing a common query language, providing persistence, reporting, and support to transactions, relational databases grounded their success and became a reliable base for applications.

It is important to highlight that this success is strictly related to the application’s requirements at the time.

However, since its conception, relational databases had several problems, and maybe the most important one was a hard time on mapping real-world entities to a structured form.

In the 90s object-oriented databases unsuccessfully attempted to solve this mapping problem by storing entire objects. The failure is usually associated with the fact that at the time relational databases were used as an integration interface between different applications.

If you have ever worked with multiple applications connecting to a single database you probably know that the effort to change the integration database is almost the same as a complete rewrite of the applications.

Later on, in the late 2000s the exponential growth of the internet had a direct effect on the requirements for web applications, pointing design flaws on current architectures. That drove big players, companies like Google and Amazon, to come up with their own solutions towards solving scalability issues. As an example: Bigtable and Dynamo emerged at that time.

Those solutions had several common factors, they were: non-relational, cluster-friendly, schema-less, and most of them open-source.

The foundation of what is known today by NoSQL relies upon those initiatives. But the term “NoSQL” appeared for the first time around 2009/2010, it was supposed to be more of a joke than anything else. The fact that it became such a buzzword to identify those modern kinds of databases was purely accidental because it is supposed to mean “not only SQL” and not “no-sql”.

Unlike the usual SQL databases, where there is support for a structured relational schema to store data, NoSQL databases have no relational support for storing data what so ever. Although you may emulate this behavior, let us leave that for another time.

The catch here is to understand that scalability for those modern databases relies upon distribution. It is common sense that relational databases were not initially designed to be distributed, they were not able to handle consistency on cluster level, they were supposed to be vertically scaled only. Meaning that if you want to increase your demand you would increase the server's processing power or add more memory.

Although that was not much of an option for companies like Google and Amazon, mostly, because a single database server could not handle Google’s billions of requests. The solution to scale those applications was based on changing the disposition of the servers, instead of increasing memory and processors, they started creating small and distributed clusters, what we know today as horizontally scaling.

Is currently known that horizontally scaling has several benefits, it makes possible to easily increase or decrease the number of servers over demand, which has a huge impact on the cost of hosting such applications.

Imagine Netflix for instance, let us say that they have usually 100.000 people online simultaneously on a regular day, but when a new season is released that number can reach the millions.

Maybe they would be able to come up with a huge server and keep it running all the time, but how much would it cost to maintain such an infrastructure? By horizontally scaling, they are able to grow the number of servers on demand in order to handle this temporary flood, and then decrease again to normal bases after the massive demand. They actually do this several times during the day.

Since horizontal scaling relies upon several small units/servers, it is crucial to understand some concepts of distributed systems, like the CAP theorem.

CAP

Eric Brewer, a well-known scientist, once said it is impossible for a distributed system to simultaneously support the following three guarantees:

Consistency — all servers see the same data at the same time
Availability — every request receives a response whether it succeeded or failed
Partition tolerance — the system continues to operate despite arbitrary partitioning due to network failures

Usually, people will say that you have to choose two of the aspects, in the real world is not a binary option because databases can focus on consistency over availability for some operations and the other way around for others.

Even so, it is possible to divide known databases by their focus on two of the three guarantees, calling them CA, CP or AP.

Note that this is not a binary rule, it is more of a way to understand the focus.

CA — Consistency & Availability — MySQL, PostgreSQL

Those databases will always ensure consistency, the data will always be reliable, as well as you can be sure to have a response for every request.

However, since they do not consider partition failures they will not perform well on clusters. It is also interesting to consider that they will choose consistency over response time, that can be something you are not willing to afford.

CP — Consistency & Partition Tolerance — MongoDB, Redis, MemcacheDB, HBase

Such databases will ensure consistency and partition tolerance over availability. Meaning that it is important to keep the data consistent between all nodes, if for some reason a node is not available the system will not operate.

Again, the response time can be directly affected by the consistency check between all nodes.

AP — Availability & Partition Tolerance — Cassandra, CouchDB, Voldemort, Riak

Finally, this kind of database focus on availability, they will always be operating and responding to all the requests. To afford that they may ignore the consistency between the nodes, if a node is not reachable the other nodes will skip the consistency check and perform the operation.

Remember:

Only use SQL databases when you need consistent transactions!

Data models

The different kinds of NoSQL

In order to distinguish the purpose of use, we can also divide NoSQL databases by the way they handle data. Usually, either they have a key-value, document or graph approach.

Key-Value

This article focuses on key-value databases. A key-value database provides only an identifier, the key, and the value when handling data. You may think of it as a big shared hash table or dictionary.

Given its simplicity, key-value stores can be considered the most primitive type of NoSQL database.

Showtime

As mentioned before, to understand how they work, let us create one.

The idea is to start from the scratch and make it simple, check out the basic specification of the project:

The conceptual schema has several boxes, with an abstract representation of the components of the system. Those being:

Server

The enclosure of internal abstractions. Usually, the server is the whole project, it is what is mentioned when someone refers to a database.

Storage

Every database must have a storage abstraction, a way to store data for further use. For this project we are going to use a simple in-memory strategy, which means that the content will be stored only in the volatile memory, no backup or filesystem support is provided.

Also, it is important at this stage to define the structure of our key value storage, for example purposes both the key and the value are going to be Strings.

Communication

A database has no use without an interface to provide inter-process communication support. Allowing other processes or even other machines to be able to exchange messages with the database is crucial for the project purpose.

Several different kinds of interfaces or transports could be used in order to accomplish inter-processes communication. e.g: Unix Socket, TCP Socket, HTTP or even WebSocket and such.

For simplicity purposes let us use a TCP server that accepts simple commands and return responses.

Core

The core is the code that orchestrates the incoming and outcoming connections and messages and performs storage actions.

Regarding the API, let us follow a simple message exchange format:

command key value

command —an action to be performed (required)
key — a target for the action to perform (required)
value —a value when necessary (required for the set command only)

Available commands:
set — insert or override a key with the given value
get — retrieve the content of the key

The set command needs a key, foo, and the value, bar.

e.g.:

set foo bar

The server will return the value of the inserted key. As for the get command, it only needs the key, foo.

e.g:

get foo

Again, the server will return the value of the key “foo”, in this case “bar”.

Programming Language

The programming language should not be a limitation to follow the article, mostly because pretty much every modern language has support for TCP and Hashes.

In order to keep it simple, the chosen language for this project was Crystal, a new and powerful compiled language with a syntax inspired by Ruby.

With a simple and elegant syntax the code pretty much explains itself, but let us go over it. Take a look at the code:

Note that at this stage the code it is not the goal, yet a tool to guide you in understanding the concepts. For now, the code may look horrible regarding basic object-oriented principles, it does have wrong and premature assumptions and severals bugs. Do not pay attention to that just yet.

That said, let us analyze the code.

The first 2 lines define the TCP server, our communication interface, as well a Hash with key and values of Strings, our storage abstraction.

Later on, the loop and spawn statements surround the code, in order to make it ‘concurrent’. The code eternally seeks for new TCP connections, when a new connection is available it spawns to a new thread* in order to not block the execution of the program for other possible connections.

(*) Crystal does not support threads, yet Fibers, for simplicity of the article let us consider as a thread.

Soon enough inside the spawned “thread”, once again a loop and another spawn surround the ‘request’, in order to provide support for concurrent requests coming from the same connection, or the same client.

The request is processed, any received package will be split on the space character, given origin to a command, a key, and a value if available.

Soon enough the request is evaluated with 3 possible scenarios: the given command is a get, a set or it does not exist (the server returns an error message).

Finally, memory operations will be performed as demanded, as well as the proper response message for each case.

Testing

In order to test the database, you can open a TCP connection with telnet.

telnet 127.0.0.1 5000

Type a message, hit enter and wait for the response:

>set foo bar
bar
>get foo
bar
>invalid
error: "invalid" is not a valid command

Yay, it is working!

Final notes

Creating a simple database makes it easy to understand some basics of the problems faced by projects like it. Now I dare you to go beyond, to choose and work the code towards ensuring either consistency or availability.

The code was the first scratch of a recent project that I have been working on, it is called BoJack and its source is available on GitHub.

If you have enjoyed the article, and I hope you did, I recommend diving into BoJack’s code in order to have an idea of how it is progressing towards the mentioned challenges of a NoSQL database.

For more articles like this, please recommend and share this one.

Inspired by