How to save one million dollars on databases with Tarantool

Denis Anikin
Apr 5, 2016 · 13 min read

Let’s talk databases. Why do we need databases rather than good old files? How are files inferior to databases, or how are databases superior to files?

The answer is obvious. A database is a better-structured storage system: it lets you perform transactions, run queries, and so on. That’s why we love databases. But sometimes we hate them too, when they give us a headache.

Here’s the simplest problem that you may have with a database. You have a server with a database and a number of applications that make requests to the server. The database is responding, performing queries, making updates, and everything goes well until the workload grows too high for the database to handle.

Assuming that this is a read-only workload, the problem can be solved using replication. You can set up as many replicas as you need and run all read-only operations on the replicas, while all updates are performed on the master.
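To make the read/write split concrete, here is a minimal sketch in Python; the `master` and `replicas` connection objects and their `execute` method are placeholders, not any particular driver’s API:

```python
import random

class ReadWriteRouter:
    """Send writes to the master and spread reads across the replicas."""

    def __init__(self, master, replicas):
        self.master = master        # connection to the master (placeholder object)
        self.replicas = replicas    # list of replica connections (placeholders)

    def execute_write(self, query, params=()):
        # Every update must go to the master so that it reaches all replicas.
        return self.master.execute(query, params)

    def execute_read(self, query, params=()):
        # Any replica can serve a read-only query; pick one at random.
        return random.choice(self.replicas).execute(query, params)
```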

If this is a read/write workload, then replication — and even master-master replication — won’t help, since all updates will hit all the replicas or masters at the end of the day. Therefore, no matter how many replicas you set up, you won’t reduce the workload. In this scenario, you need sharding.

If the database can’t handle a write workload, you shard it. In other words — cut your database into pieces and put the pieces onto different machines. You can shard your database almost endlessly. Sharding is more complex than replication, because you need to somehow cut your dataset into pieces. You can do this by tables, or inside a table by hash, range, field, etc. — there are many options.
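For illustration, here is a minimal sketch of hash-based shard routing in Python; the `shards` connection list and its `get`/`set` methods are placeholders, and a production setup would add consistent hashing and rebalancing:

```python
import hashlib

class HashSharder:
    """Map each key to one of N shards using a stable hash of the key."""

    def __init__(self, shards):
        self.shards = shards  # one placeholder connection per shard

    def shard_for(self, key):
        # A stable hash keeps a given key on the same shard across restarts.
        digest = hashlib.md5(str(key).encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self.shard_for(key).set(key, value)

    def get(self, key):
        return self.shard_for(key).get(key)
```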

So, by adding replicas and by sharding a database, you can handle almost any workload. Seems like there is nothing else to wish for, the issue is closed, right?

But there is a problem…

… that lies beyond the field of technology. Watching the rapidly growing number of servers, your boss gets frustrated because all of that costs money. The number of user requests goes up, and so does the number of queries and transactions, so you keep adding more servers. After all, you’re just a techie, you don’t think about money — leave that to Finance. So you tell your boss: “It’s OK. We have an infinitely scalable system. We add servers and it works great”. And your boss replies: “Great, but we are losing money. We need to do something about that. And if we don’t solve this problem, we’ll have to close the whole thing down. Even though our business is growing, our costs for databases and servers are growing even faster”. And this problem is for you to solve, not for Finance, because it most likely has something to do with technology. What’s next? Move to a cloud service? That’s even more expensive. Optimization? Let’s assume you’ve already optimized all the queries and created all the necessary indexes.

The solution might be to cache frequently read data: keep it in a cache so it can be served without numerous replicas or shards.
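The usual way to do this is the cache-aside pattern: check the cache first and fall back to the database only on a miss. A minimal sketch, where the `cache` and `db` objects and the TTL value are assumptions rather than any specific client API:

```python
CACHE_TTL = 300  # seconds; an assumed expiration policy

def get_profile(user_id, cache, db):
    """Serve hot reads from the cache; go to the database only on a miss."""
    profile = cache.get(user_id)
    if profile is not None:
        return profile                       # cache hit: no database work at all
    profile = db.select_profile(user_id)     # cache miss: one trip to the database
    cache.set(user_id, profile, CACHE_TTL)   # keep it warm for subsequent requests
    return profile
```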

Problems with a cache

Great, problem solved: one Memcached instance replaces a whole rack of replica servers. But this comes at a price:

Cold start

There was once an idea to keep the data “warm” so that it never “cools off”. To do so, the cache needs to be persistent, i.e. the data must be stored somewhere on disk — and then everything is fine: the cache starts up and loads its data. But there’s a catch: a cache is in-memory storage and has to be fast; if you pair it with a disk, won’t that make the cache as slow as a database? Actually no, it won’t, if persistence is implemented properly.

The easiest way to “persist” your cache is to regularly dump it all to disk. This can be done asynchronously, in the background: dumping doesn’t slow down any operation, nor does it load the processor too much. What it does do is make the warm-up period considerably shorter: a newly started cache already has its own data snapshot, which it reads linearly and quickly. This warms the cache up far faster than any number of database replicas could.
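A minimal sketch of such background dumping, assuming a plain dict as the cache and pickle as the on-disk format (a real cache like Tarantool uses its own snapshot format):

```python
import os
import pickle
import threading
import time

def snapshot_loop(cache, path, interval=300):
    """Periodically dump the whole cache to disk from a background thread."""
    while True:
        time.sleep(interval)
        snapshot = dict(cache)            # copy so the dump doesn't block writers for long
        with open(path + ".tmp", "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(path + ".tmp", path)   # atomic rename: never expose a half-written file

cache = {}
threading.Thread(target=snapshot_loop, args=(cache, "cache.snap"), daemon=True).start()
# On startup, the warm-up is just: cache = pickle.load(open("cache.snap", "rb"))
```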

Can the solution be that easy? Well, suppose we make a dump every 5 minutes. If there is a crash in between, all the changes made since the previous dump are lost. For applications like statistics that’s acceptable; for many others it is not.

Now let’s consider another problem with dumping: a dump requires a lot of disk bandwidth, and there may be another operation (e.g. event logging) that needs the disk at the same time. During a dump, other disk operations run slowly, and this happens over and over again. We can avoid this by maintaining a transaction log instead of regularly dumping the cache. The inevitable question is: “How is that possible? It’s a cache, it’s supposed to be fast, and here we are logging every transaction”. Actually, that’s not a problem. If we log transactions to a file sequentially, even on a regular spinning hard drive the write speed will be up to 100 MB/sec. Say an average transaction is 100 bytes; that’s one million transactions per second. With these numbers in mind, we’ll clearly never exceed the disk’s capabilities while logging the cache. By the way, here is a test showing that one million transactions per second is really possible.

Transaction logging instead of dumping also solves the IOPS problem: we load the disk exactly as much as needed to persist all the data. So, the data is always “fresh”, we don’t lose it, and the warm-up is fast.
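Here is a minimal sketch of such an append-only transaction log, with the back-of-the-envelope arithmetic in a comment; the JSON record format and the LSN counter are assumptions for illustration:

```python
import json

# Back-of-the-envelope: ~100 MB/sec of sequential writes / ~100 bytes per record
# ≈ one million logged transactions per second before the disk becomes the bottleneck.

class TransactionLog:
    """Append-only log: every update is written sequentially to a single file."""

    def __init__(self, path):
        self.f = open(path, "ab")
        self.lsn = 0  # log sequence number; a real system would restore it on restart

    def append(self, op, key, value):
        self.lsn += 1
        record = {"lsn": self.lsn, "op": op, "key": key, "value": value}
        self.f.write((json.dumps(record) + "\n").encode())
        self.f.flush()  # how often to fsync is a durability/throughput trade-off
        return self.lsn
```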

But transaction logging has its downsides. When maintaining a log, updates to the same element don’t get grouped into a single record, so if an element is updated many times, the cache has to replay every one of those records at startup, which can take longer than starting from a dump. Besides, the log itself can take up plenty of space — it might not even fit on the disk.

To solve this problem, let’s combine dumping and logging. Why not? We make a dump, i.e. create a snapshot, just once a day, and keep logging all updates as they occur. In the snapshot, we save the ID of the latest modification it covers. When we need to restart the cache, we read the snapshot, load it into memory, then read the log starting from that latest modification and apply the updates to RAM. That’s it: the cache is warmed up, and it’s as fast as reading from a dump. So we’re done with the cold start; now let’s solve the other problems on our list.
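A minimal sketch of that recovery path, assuming the snapshot stores both the data and the LSN of the last modification it covers, and reusing the hypothetical log format from the sketch above:

```python
import json
import pickle

def recover(snapshot_path, log_path):
    """Rebuild the in-memory store: load the snapshot, then replay newer log records."""
    with open(snapshot_path, "rb") as f:
        snapshot = pickle.load(f)
    data = snapshot["data"]
    last_lsn = snapshot["last_lsn"]          # the latest modification ID in the snapshot

    with open(log_path, "rb") as f:
        for line in f:
            record = json.loads(line)
            if record["lsn"] <= last_lsn:    # already reflected in the snapshot
                continue
            if record["op"] == "set":
                data[record["key"]] = record["value"]
            elif record["op"] == "delete":
                data.pop(record["key"], None)
    return data
```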

The other three problems

Let’s consider the temperature of data — I mean hot and cold data. Hot data is the most frequently accessed data; all the rest is cold. Since we’re talking about a cache as an efficient way of handling frequently accessed data, we clearly have a way to divide our dataset into hot and cold parts.

As a rule, there is plenty of cold data and very little hot data. That’s just the way it goes. But we replicate and shard the whole database into multiple replicas and shards mostly in order to process queries and transactions on the hot data. So a small amount of hot data costs us a lot. We may ask ourselves: “Why copy everything? Let’s shard the hot data only”. But that won’t help: we’d still need the same number of servers, because we replicate and shard not because the data doesn’t fit into memory or onto disk, but because we run out of CPU power. So sharding and replicating hot data alone is not the solution, and your boss is still mad because he has to pay for new servers.

What can be done? We have a cache, and the real problem is the hot data in the database. But wait: the cache stores data just like the database does, and we can replicate it for fault tolerance. So what stops us from using the cache as the primary data source? Lack of transactions? Fine, we can add transactions. Lack of other database properties? We can implement them in the cache.

This way, we solve the other three problems, because hot data no longer has to live in the database at all — only in the cache. That means “no” to inconsistency and “yes” to database features, which we can implement inside the cache. We don’t need sharding either, since we don’t have to distribute the database among multiple servers: the cache handles the whole read/write workload on its own. And yes, a persistent cache can handle a heavy write workload because, remember, transaction logging and snapshot dumping affect throughput only slightly.

So, all database features can be implemented in a cache. That’s what we did, and the resulting product is called Tarantool. It reads and writes as fast as a cache while having all the database features we need: a fast and durable single source of truth. That means we don’t have to back Tarantool with any other database. All problems solved!

Tarantool capabilities and specifics

We’ve been replicating and sharding lots of cold data just in order to process a small amount of hot data. Now the rarely requested and modified cold data stays in a traditional database, and hot data goes to Tarantool. In other words, Tarantool is a database for hot data. As a result, two Tarantool instances (master and replica) are enough for most tasks. Actually, we can get away with just one instance, since its access pattern and throughput are the same as for a regular cache despite the fact that Tarantool is a true database.

There is another problem solved by Tarantool — a psychological one: how can you give up on a database, dependable data storage with its cozy ACID transactions, and go down to a cache? Indeed, as soon as you start using Memcached or any other cache on top of a database, you ditch the benefits of a database — remember our earlier discussion of inconsistency and lost updates. From this perspective, Tarantool not only speeds up your work and helps you save money, it also brings you back to the world of databases with transactions, stored procedures, secondary indexes, etc.

Let me say a few words about parallel transactions. Tarantool can act as a Lua application server. It treats a Lua script as one transaction: it performs all reads from memory and sends all changes to a temporary buffer, so they eventually get written to disk as one piece. While that data is being written, another transaction can already read the old version of the data — without any locks! Transactions only slow down if you exceed the throughput of sequential disk writes, which, remember, gives you the tremendous capacity of roughly one million transactions per second.
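To make the idea concrete, here is a very simplified sketch of the “read from memory, buffer the writes, commit as one piece” model; it is not Tarantool’s actual implementation, just an illustration built on the hypothetical log from the earlier sketch:

```python
class InMemoryStore:
    def __init__(self, log):
        self.data = {}   # the whole dataset lives in memory; all reads are served from here
        self.log = log   # append-only transaction log (see the earlier TransactionLog sketch)

    def run_transaction(self, fn):
        """Run fn with reads from current data and writes buffered until commit."""
        buffer = {}

        def read(key):
            # The transaction sees its own pending writes, otherwise the committed state.
            return buffer[key] if key in buffer else self.data.get(key)

        def write(key, value):
            buffer[key] = value              # invisible to other readers until commit

        fn(read, write)

        # Commit: log the changes sequentially, then publish them to memory in one go.
        for key, value in buffer.items():
            self.log.append("set", key, value)
        self.data.update(buffer)
```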

How we move data from hot to cold

So far, this process is not automatic in Tarantool, but we’re working on it. We analyze the logs of our applications and decide which data to consider hot. For example, the user profiles of the Mail.Ru email service are mostly hot because they are frequently requested, so we transfer all the profiles to Tarantool. We inevitably grab some cold ones too, since some users no longer use the email service. But even with this overhead, it’s much more efficient to use Tarantool here than an SQL database. For one thing, Tarantool has a highly optimized memory footprint. Here’s an interesting fact: an SQL database stores everything on disk, but 10–20% of the data must be cached in memory. At the same time, the memory footprint of a traditional SQL database is 1.5–2 times larger than Tarantool’s, which turns that 20% into 30–40%. So, with the same workload, an SQL server is no better than Tarantool even in terms of memory — and it can’t handle the workload anyway.
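Since the split is not automatic yet, one simple (and entirely hypothetical) way to pick the hot part is to count accesses per key in the application logs over some window; the threshold and log format below are assumptions made for this sketch:

```python
from collections import Counter

HOT_THRESHOLD = 100  # accesses per day; an arbitrary cut-off for this sketch

def classify_keys(access_log_lines):
    """Split keys into hot and cold by how often they appear in the access log."""
    counts = Counter(line.split()[0] for line in access_log_lines)  # assume the key is the first field
    hot = {key for key, n in counts.items() if n >= HOT_THRESHOLD}
    cold = set(counts) - hot
    return hot, cold
```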

Tarantool vs. Redis

For us, there are two essential differences between Tarantool and Redis.

One million dollars

This sum of money is not just made up for an eye-catching headline; it’s what we actually saved in one of the Mail.Ru Group projects. We had to store user profiles somewhere. They used to be kept in old storage, and we were wondering where to move them. First we considered MySQL. We deployed 16 MySQL replicas and shards and began to gradually duplicate the read and write workload from the profiles onto them. Profiles are small portions of data (500 bytes to 1 KB) containing names, the number of sent emails, various flags and service data — the kind of information needed for almost every page load, so it is frequently requested and updated. The 16-server MySQL farm went down at 1/8 of our workload — and that was after all the optimization we had performed. Then we decided to give Tarantool a try. Guess what? Tarantool needed just four servers to handle the workload that would otherwise have been distributed among more than 128 servers. Actually, one server was enough: we added three for fault tolerance. Our savings — 128 servers and the corresponding hosting expenses — added up to the one million dollars from the title.

And that’s just one case. We’ve found a use for Tarantool in many of our projects. For example, there are 120 hard-working instances in our email and cloud services. If we used SQL-based solutions, we’d need tens of thousands of servers running MySQL or some other SQL database management system — PostgreSQL, Oracle, whatever — and you can hardly estimate the cost of all that. The moral of this story: every task needs the proper tool, which saves you from losing tons of money on a simple feature. Cold data should be stored in an SQL database, while hot data — requested and updated frequently — belongs in a storage system specially adapted for that purpose, which is Tarantool.

We are currently developing Tarantool version 1.7, which will feature a completely automatic cluster solution with sharding and Raft-based replication. Stay tuned!
