What an in-memory database is and how it persists data efficiently

Hey guys.

Probably you’ve heard about in-memory databases. If not, then check out a quick overview of what they are: https://en.wikipedia.org/wiki/In-memory_database

To make the long story short, an in-memory database is a database that keeps the whole dataset in RAM. What does that mean? It means that each time you query a database or update data in a database, you only access the main memory. So, there’s no disk involved into these operations. And this is good, because the main memory is way faster than any disk. A good example of such a database is Memcached.

But wait a minute, how would you recover your data after a machine with an in-memory database reboots or crashes? Well, with just an in-memory database, there’s no way out. A machine is down — the data is lost. Forget about it.

Is it possible to combine the power of in-memory data storage and the durability of good old databases like MySQL or Postgres? Sure! Would it affect the performance? Surprisingly, it won’t!

Here come in-memory databases with persistence like Redis, Aerospike, Tarantool.

You may ask: how can in-memory storage be persistent? The trick here is that you still keep everything in memory, but additionally you persist each operation on disk in a transaction log. Look at the picture:

The first thing that you may notice is that even though your fast and nice in-memory database has got persistence now, queries don’t slow down, because they still hit only the main memory like they did with just an in-memory database. Good news! :-) But what about updates? Each update (or let’s name it a transaction) should not only be applied to memory but also persisted on disk. A slow disk. Is it a problem? Let’s look at the picture:

Transactions are applied to the transaction log in an append-only way. What is so good about that? When addressed in this append-only manner, disks are pretty fast. If we’re talking about spinning magnetic hard disk drives (HDD), they can write to the end of a file as fast as 100 Mbytes per second. If you don’t believe me, then try to run this test in a Unix/Linux/MacOS shell:

cat /dev/zero >some_file

and check out how fast some_file is growing. So, magnetic disks are pretty fast when you use them sequentially. On the other hand, they’re utterly slow when you use them randomly. They can normally complete around 100 random operations per second.

If you write byte-by-byte, each byte put in a random place of an HDD, you can see some real 100 bytes per second as the peak throughput of the disk in this scenario. Again, it is as little as 100 bytes per second! This tremendous 6-order-of-magnitude difference between the worst case scenario (100 bytes per second) and the best case scenario (100,000,000 bytes per second) of disk access speed is based on the fact that, in order to seek a random sector on disk, a physical movement of a disk head has occurred, while you don’t need it for sequential access as you just read data from disk as it spins, with a disk head being stable.

If we consider solid-state drives (SSD), then the situation will be better because of no moving parts. But still the best/worst ratio is similar — sequential access gives 200–300 Mbytes per second, and random access gives 1000–10000 seeks per second — that is 4–5 orders of magnitude. Here are some numbers: https://en.wikipedia.org/wiki/IOPS.

So, what our in-memory database does is it floods the disk with transactions as fast as 100 Mbytes per second. Is that fast enough? Well, that’s real fast. Say, if a transaction size is 100 bytes, then this will be one million transactions per second! This number is so high that you can definitely be sure that the disk will never be a bottleneck for your in-memory database. Here is a detailed benchmark of one in-memory database showing one million transactions per second, where the bottleneck is the CPU, not the disk: https://gist.github.com/danikin/a5ddc6fe0cedc6257853.

To summarize all that was said above about disks and in-memory databases:

  1. In-memory databases don’t use disk for non-change operations.
  2. In-memory databases do use disk for data change operations, but they use it in the fastest possible way.

Why wouldn’t regular disk-based databases adopt the same techniques? Well, first, unlike in-memory databases, they need to read data from disk on each query (let’s forget about caching for a minute, this is going to be a topic for another article).

You never know what the next query will be, so you can consider that queries generate random access workload on a disk, which is, remember, the worst scenario of disk usage. Second, disk-based databases need to persist changes in such a way that the changed data could be immediately read.

Unlike in-memory databases, which usually don’t read from disk unless for recovery reasons on starting up. So, disk-based databases require specific data structures to avoid a full scan of a transaction log in order to read from a dataset fast. One such data structure of the kind is a B/B+ tree. The flip side of this data structure is that you should change a B/B+ tree on each change operation, which could constitute random workload on a disk. While improving the performance for read operations, B/B+ trees are degrading it for write operations. There is a handful of database engines based on B/B+ trees. These are InnoDB by MySQL or Postgres storage engine. There is also another data structure that is somewhat better in terms of write workload — LSM tree. This modern data structure doesn’t solve problems with random reads, but it partially solves problems with random writes. Examples of such engines are RocksDB, LevelDB or Vinyl. You can see a summary on this picture:

So, in-memory databases with persistence can be real fast on both read/write operations. I mean, as fast as pure in-memory databases, using a disk extremely efficiently and never making it a bottleneck.

P.S.

The last but not least topic that I want to partially cover here is snapshotting. Snapshotting is the way transaction logs are compacted. A snapshot of a database state is a copy of the whole dataset. A snapshot and latest transaction logs are enough to recover your database state. So, having a snapshot, you can delete all the outdated transaction logs that don’t have any new information on top of the snapshot.

Why would we need to compact logs? Because the more transaction logs, the longer the recovery time for a database. Another reason for that is that you wouldn’t want to fill your disks with outdated and useless information (to be perfectly honest, old logs sometimes save the day, but let’s make it another article).

Snapshotting is essentially once-in-a-while dumping of the whole database from the main memory to disk. Once we dump a database to disk, we can delete all the transaction logs that do not contain transactions newer than the last transaction checkpointed in a snapshot. Easy, right? This is just because all other transactions from the day one are already considered in a snapshot.

You may ask me now: how can we save a consistent state of a database to disk, and how do we determine the latest checkpointed transaction while new transactions keep coming? Well, see you in the next article. Stay tuned :-)