Let’s talk databases. Why do we need databases rather than good old files? How are files inferior to databases, or how are databases superior to files?
The answer is obvious. Database is a better-structured storage system. It allows performing transactions, making queries, etc. That’s why we love databases. But sometimes we hate them as well when databases cause us a headache.
Here’s the simplest problem that you may have with a database. You have a server with a database and a number of applications that make requests to the server. The database is responding, performing queries, making updates, and everything goes well until the workload grows too high for the database to handle.
Assuming that this is a read-only workload, the problem can be solved using replication. You can set up as many replicas as you need and run all read-only operations on the replicas, while all updates are performed on the master.
If this is a read/write workload, then replication — and even master-master replication — won’t help, since all updates will hit all the replicas or masters at the end of the day. Therefore, no matter how many replicas you set up, you won’t reduce the workload. In this scenario, you need sharding.
If the database can’t handle a write workload, you shard it. In other words — cut your database into pieces and put the pieces onto different machines. You can shard your database almost endlessly. Sharding is more complex than replication, because you need to somehow cut your dataset into pieces. You can do this by tables, or inside a table by hash, range, field, etc. — there are many options.
So, by adding replicas and by sharding a database, you can handle almost any workload. Seems like there is nothing else to wish for, the issue is closed, right?
But there is a problem…
… that lies beyond the field of technologies. Watching the rapidly growing number of servers, your boss gets frustrated because all of that costs money. The number of users’ requests goes up, and so does the number of queries and transactions, so you keep adding more servers. After all, you are just a techie, you don’t think about money — leave it to the Finance. So you’re telling your boss: “It’s OK. We have an infinitely scalable system. We are adding servers and it works awesome”. And your boss replies: “Great, but we are losing money. We need to do something about that. And if we don’t solve this problem, we’ll have to close the whole thing down. Even though our business is growing, our costs for databases and servers are growing even faster”. And this problem is for you to solve, not for the Finance, because most likely it has something to do with technology. What’s next? Go to a cloud service? It’s way more expensive. Optimization? Let’s assume that you’ve already optimized all the queries and created all the necessary indexes.
The solution might be to cache frequently selected data. It can be stored in some cache and accessed without numerous replicas or shards.
Problems with a cache
Great, problem solved: one Memcached instance substitutes the whole rack of replica servers. But it comes with a price:
- Your application is writing data to both the cache and the database, which are not replicated between each other, and that causes data inconsistency. For example, you’re changing the cache first, and then the database. For some reason, the operation on the database fails: the application crashes, or there are network errors. Then a user gets an error message, but the cache changes are not rolled back. So, we have some data in the cache, and different data in the database. Nobody knows about that, and the application continues to work assuming that the cache is the source of truth, but actually it is not. When the cache is restarted, the updated copy of data in the cache is lost, because there is a different version in the database.
The funny thing is that data is lost, no matter in which order you write. See for yourself: you first write data to the database, and then the cache update fails. The application works with the old piece of data in the cache, while the corresponding piece of data in the database is different, and again — nobody knows about that. The cache is restarted, and the data is lost again.
So, in either case an update is lost. That means, you’re losing some basic database property, more specifically — a guarantee that the committed data remains in the database forever, i.e. a commit isn’t a commit anymore.
You can deal with data inconsistency by developing a smart cache, so that the application would work with the cache only, which in turn would update the database in a write-through manner. The smart cache can first change the database, and only then change itself. If for some reason the data wasn’t changed in the database, nor should it be changed in the cache. This way, the data will almost always be consistent. An internal cache update can’t fail, because the cache is in-memory storage, and an in-memory update would always succeed, except the case when the memory is damaged, but this problem is extremely rare and beyond the scope of this article. Anyway, even when RAM is damaged, a smart cache will be very likely to crash and bury all the cached data. This is bad, but it won’t result in data inconsistency.
Still there is another rare case event with a smart cache when the data turns out inconsistent: an application updates the cache, the cache further applies the change to the database, so the data is actually committed to the database. Next, the database confirms to the cache that the operation was successful, but at this moment network connection fails, and the cache doesn’t receive the confirmation. It thinks that the data wasn’t committed to the database, so the cache doesn’t change the data in its RAM. But the data was actually committed to the database. The application keeps working with the old data, then the cache restarts — oops, the data is different again. This case is very rare, but it may happen.
You can read some details about how Facebook minimizes the probability of inconsistency inside their caching layer named TAO. So you can use some special techniques to minimize it, but not to eliminate it.
- Worse still, a cache (either smart or not) doesn’t solve the sharding problem, because updates become no faster. Every commit has to be committed to a durable store, that is not the cache. Meanwhile, your boss hates sharding because it requires ever more costs to purchase ever more servers.
- Next problem: a cache is not a database, but normally a regular key/value store. Forget about complex queries and transactions. Say goodbye to stored procedures or secondary indexes. So you have to completely rewrite your application.
- Problem number four — the “cold start”. A newly-started cache is empty, it has no data. At this point, all the SELECT requests go straight to the database past the cache because there is nothing in the cache yet. Therefore, you need to add more replicas again — maybe fewer than you’d need without the cache — to “warm up” the cache somehow and process the big number of SELECT requests that temporarily go straight to the database. Thus we need a number of replicas only to warm up the cache. It looks rather wasteful, doesn’t it? But you can’t have a quick start without these replicas. Let’s take a closer look at this problem.
There was an idea once to not “cool off” the data in order to keep it “warm“. To do so, a cache needs to be persistent, i.e. the data must be stored somewhere on disk, and then everything will be fine. The cache will start up and upload the data. But there’s a catch: cache is in-memory storage, it has to be fast; but when you pair it with a disk, wouldn’t this make the cache as slow as a database? Actually no, it wouldn’t, if persistence is implemented properly.
The easiest way to “persist” your cache is to regularly dump it all on disk. This can be done asynchronously, in the background. Dumping doesn’t make any operation slower, nor does it load the processor too much. Instead, it makes the warm-up period considerably shorter: a newly-started cache already has its own data snapshot, and it reads the snapshot linearly and quickly. This allows the cache to get warm way faster than using any number of database replicas.
Can the solution be that easy? Well, assume that we make a dump every 5 minutes. If there is a crash in between, all the changes made since the previous dump will be lost. For applications like statistics it’s OK, while for many other applications it is unacceptable.
Now let’s consider another problem with dumping: a dump requires a lot of disk bandwidth, but there can be another operation (e.g. event logging) that needs the disk at the same time. During the dump, other disk operations will be running slow, and that will keep happening. We can avoid this by maintaining a transaction log instead of regularly dumping the cache. Here inevitably comes a question: “How’s that possible? It’s a cache, it’s fast, and we are here logging each transaction”. Actually, that’s not a problem. If we’re logging transactions into a file sequentially, on a regular spinning hard drive, the write speed will be up to 100 Mb/sec. Let’s say that an average transaction size is 100 bytes; that’s one million transactions per second. Keeping these numbers in mind, we’ll obviously never exceed the disk performance while logging the cache. By the way, here is a test showing that one million transactions per second is really possible. Transaction logging instead of dumping also solves the IOPS problem: we load the disk just as much as necessary to persist all the data. So, the data is always “fresh”, we don’t lose it, and the warm-up is fast.
But transaction logging has its cons. When maintaining the log, updates for the same element don’t get grouped in a single record. When you end up with multiple updates, and the cache has to go through all the log records at startup, it can take longer than starting from the dump. Besides, the log itself can take up plenty of space; maybe not even fit on the disk.
To solve this problem, let’s combine both dumping and logging. Why not? We can make a dump, i.e. create a snapshot, just once a day, and keep logging all updates as they occur. In the snapshot, we are saving the latest modification ID. When we need to restart the cache, we read the snapshot and upload it to the memory, then read the log starting from the latest snapshot modification, and apply the updates to RAM. That’s it: the cache is warmed up. It’s as fast as reading from the dump. So, we’re done with the cold start; now let’s solve the other problems in our list.
The other three problems
Let’s consider the temperature of data. I mean hot and cold data. Hot is the most frequently accessed data, and all the other data is cold. Since we are talking about a cache as an efficient way of handling frequently accessed data, we definitely have the ability to divide our dataset into cold and hot parts.
As a rule, there is plenty of cold data and very little hot data. That’s the way it goes. But we are replicating and sharding the whole database into multiple replicas and shards mostly to process queries and transactions with the hot data. So small but hot data costs us too much. We can ask ourselves: “Why copying everything? Let’s shard hot data only”. But that won’t help: we’ll have to use the same number of servers, since we are replicating and sharding not because the data doesn’t fit into the memory or disk, but because we are running out of CPU power. Therefore, sharding and replicating hot data alone is not the solution. And your boss is still mad because he has to pay for the new servers.
What can be done? We have a cache, while the problem is hot data in the database. But wait, the cache also stores data just like the database. We can replicate it to ensure fault-tolerance. So, what stops us from using the cache as a primary data source? Lack of transactions? Fine, we can add transactions. Lack of other database properties? We can implement them in the cache.
This way, we solve the other three problems as long as we don’t have to store hot data in the database, only in the cache! Which means “no” to inconsistency and “yes” to database features which we can implement inside the cache. We don’t need sharding either, since we don’t have to distribute the database among multiple servers; the cache successfully handles the whole read/write workload. Yes, a persistent cache handles heavy write workload because, remember, transaction logging and snapshot dumping affect throughput only slightly.
So, all database features can be implemented in a cache. That’s what we did, and the name of the resulting product is Tarantool. In terms of reading and writing, it works as fast as a cache meanwhile having all the database features we need. The fast and durable single source of truth. Therefore, we don’t have to back Tarantool with any other database. All problems solved!
Tarantool capabilities and specifics
We’ve been replicating and sharding lots of cold data just in order to process a small amount of hot data. Now the rarely requested and modified cold data stays in a traditional database, and hot data goes to Tarantool. In other words, Tarantool is a database for hot data. As a result, two Tarantool instances (master and replica) are enough for most tasks. Actually, we can get away with just one instance, since its access pattern and throughput are the same as for a regular cache despite the fact that Tarantool is a true database.
There is another problem solved by Tarantool. Sort of a psychological problem: how can you give up on a database, dependable data storage with its cozy ACID transactions, and go down to cache? Indeed, as soon as you start using Memcached or any other cache on top of a database, you ditch the benefits of a database. Remember our talk about inconsistency and lost updates earlier in this article. From this perspective, Tarantool doesn’t only speed up your work and helps you save money, but it also brings you back to the world of databases with transactions, stored procedures, secondary indexes, etc.
Let me say a few words about parallel transactions. Tarantool can act as a Lua application server. It considers Lua script as one transaction: it performs all reads from the memory, and sends all changes to a temporary buffer, so they eventually get written to disk as one piece. While the data is being written, another transaction can already read the old version of data without any locks! Transactions can slow down only if you exceed the throughput capacity of sequential disk writes — which, remember, gives you the tremendous throughput of roughly one million transactions per second.
How we move data from hot to cold
So far, this process is not automatic in Tarantool. But we’re working on it. We analyze the logs of our applications and make a decision to consider some data hot. For example, the users’ profiles of the Mail.Ru email service are mostly hot because they’re frequently requested; therefore, we transfer all the profiles to Tarantool. We might grab the cold ones, too, as long as some users don’t use the e-mail service anymore. But even with this overrun, it’s much more effective to use Tarantool here than an SQL database. For one thing, Tarantool has a highly optimized memory footprint. Here’s a very interesting fact: an SQL database stores everything on disk, but 10–20% of data must be cached in the memory. Still the memory footprint of traditional SQL databases is 1.5–2 times more than Tarantool’s footprint, turning these 20% into 30–40%. So, with the same workload, an SQL server is no better than Tarantool even in terms of memory, although it doesn’t handle this workload anyway.
Tarantool vs. Redis
For us, there are two essential differences between Tarantool and Redis.
- Tarantool is a real database, with commit and rollback statements, secondary keys, etc. You can write server-side scripts and stored procedures in Lua. Redis also has Lua, but its power is very limited. In this regard, Tarantool is perfect. It’s faster, and it allows using transactions anywhere you want. There is no need to get a key from the cache, update it, and put it back, while someone else can be simultaneously modifying the key.
- According to our tests, Tarantool is 30% faster. See the test results on Tarantool’s website and in this article.
One million dollars
This sum of money is not just made up for an eye-catching header. It’s the money actually saved in one of the Mail.Ru Group projects. We had to store the users’ profiles somewhere. They used to be kept in old storage, and we were wondering where we could move them. First, we were considering MySQL. We deployed 16 MySQL replicas and shards, and began to gradually duplicate the reading and writing workload from the profiles. Profiles are small portions of information (500b-1Kb) containing names, number of sent e-mails, various flags and service data that is usually required for every page load. This data is frequently requested and updated. The 16-server MySQL farm went down at 1/8 of our workload. And that’s after all the optimization we performed! Then we decided to give Tarantool a try. Guess what? Tarantool proved to need just four servers to handle the workload that could have been distributed among more than 128 servers. Actually, one server was enough: we added 3 for fault-tolerance. Our economy — 128 servers and reduced hosting expenses — equaled the above-declared one million dollars.
And that’s just one case. We found a use for Tarantool in many of our projects. For example, there are 120 hard-working instances in our e-mail and cloud services. If we used SQL based solutions, we’d need tens of thousands servers with MySQL or other commercial SQL database management systems — PostgreSQL, Oracle, whatever. You can hardly estimate the cost of all of this. The moral of this story is as follows: every task needs a proper tool; this allows not to lose tons of money on a simple feature. Cold data should be stored in an SQL database, and hot data — which is requested and updated frequently — should be stored in a storage system specially adapted for this purpose, which is Tarantool.
We are currently developing Tarantool version 1.7 which will feature a completely automatic cluster solution with sharding and RAFT-type replication. Stay tuned!