Have you ever wondered what databases were built on?

Published in

The Startup

5 min readDec 12, 2019

Do we know what stands beneath the tools we’re using every day?

Here is a database picture as blog post often start with a big nice picture

As software developers, we use databases in every day (especially when we are backend engineers). We have learned to use them, to query them and how to optimise our queries. But what do we really know about databases? Do we know which models were used to built them? And how we should choose one?

I will come back in this post on a part of their history, why NoSQL became popular, finally, I’ll highlight a database model (ACID) and a distributed database theorem (CAP theorem) that help differentiate/choose above all databases.

A bit of history

The development of database technology can be divided into three eras based on data model or structure: navigational, SQL/relational, and post-relational.

Early navigational data models (hierarchical model and the CODASYL) won’t be discussed here as it is no longer used today.

Then appeared the relational model, first proposed in 1970 by insisting that applications should search for data by content, rather than by following links.

Finally, the next generation of post-relational databases in the late 2000s became known as NoSQL databases, introducing fast key-value stores and document-oriented databases. Those databases were pushed by the apparition of web-based application causing spikes of write/read for which RDBMS were not designed (RDBMS databases can still be used but with caching to solve this issue).

What is NoSQL (Not only SQL)?

A NoSQL database provides a mechanism for storage and retrieval of data that is shaped in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but did not obtain the “NoSQL” moniker until a surge of popularity in the early twenty-first century, triggered by the needs of Web companies such as Facebook (Cassandra), Google (BigTable), and Amazon(DynamoDB). NoSQL databases are increasingly used in big data and real-time web applications.

No Concept of JOIN. NoSQL databases don’t provide all ACID concepts (see below).

Why using NoSQL?

It is not RDBMS and it is important to avoid the domination of a single model on the market.
Unstructured data as relational database model may not be the best solution for all situations.
NoSQL databases are often cheaper when it comes to scaling
The velocity of development is higher with NoSQL (often using JSON), at least at the beginning. If it is easier to get started, people will tend to use it more!
Deliver highly responsive experiences to a globally distributed base of users (they were designed for that).
Finally, they handle big amount of data more easily (explosion of social media, IoT, cloud-based data, dynamic types)

Scaling-up vs scaling-out

The main challenge of databases is how do we improve their performance? For this there are two types of scaling: scaling-up or scaling-out.

Scaling-up: upgrade server characteristics but beware, by the end of the day, it will reach a bottleneck as it is not infinite.

Scaling-out: create slave replication and offload requests. The first type of offload was that write happens on master and read on slave. But what happens when the slave cannot catch up with the writes? Because of this, master/master replication appeared and sharding.

Sharding: A shard is a horizontal partition of data in a database. Each individual partition is referred to as a shard. Each shard is held on a separate database server instance, to spread load. The server instance on which data is stored is determined by a specific key derived from the data to store.

ACID

The ACID model of database design is one of the oldest and most important concepts of database theory. It sets forward four goals that every database management system must strive to achieve: atomicity, consistency, isolation, and durability.

Atomicity

Atomicity states that database modifications must follow an “all or nothing” rule. Each transaction is said to be “atomic.” If one part of the transaction fails, the entire transaction fails. It is critical that the database management system maintains the atomic nature of transactions in spite of any DBMS, operating system, or hardware failure.

Consistency

Consistency states that only valid data will be written to the database. If for some reason, a transaction is executed that violates the database’s consistency rules, the entire transaction will be rolled back, and the database will be restored to a state consistent with those rules. On the other hand, if a transaction successfully executes, it will take the database from one state that is consistent with the rules to another state that is also consistent with the rules.

Isolation

Isolation requires that multiple transactions occurring at the same time not impact each other’s execution. For example, if Joe issues a transaction against a database at the same time that Mary issues a different transaction, both transactions should operate on the database in an isolated manner. The database should either perform Joe’s entire transaction before executing Mary’s, or vice-versa. This prevents Joe’s transaction from reading intermediate data produced as a side effect of part of Mary’s transaction that will not eventually be committed to the database. Note that the isolation property does not ensure which transaction will execute first — merely that transactions will not interfere with each other

Durability

Durability ensures that any transaction committed to the database will not be lost. Durability is ensured through the use of database backups and transaction logs that facilitate the restoration of committed transactions in spite of any subsequent software or hardware failures.

CAP theorem

CAP theorem, also named Brewer’s theorem, states that it is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:

Consistency: Every read receives the most recent write or an error
Availability: Every request receives a (non-error) response — without the guarantee that it contains the most recent write
Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

Common databases per CAP guarantees supported

The CAP theorem implies that in the presence of a network partition, one has to choose between consistency and availability. Note that consistency, as defined in the CAP theorem, is quite different from the consistency guaranteed in ACID (see above)!

When a network partition failure happens should we decide to:

Cancel the operation and thus decrease the availability but ensure consistency
Proceed with the operation and thus provide availability but risk inconsistency

Please leave your thoughts about database models that I might have missed in the comments below!