Asynchronous processing with in-memory databases or how to handle one million transactions per second on a single CPU core

Hey!

In a couple of my last articles, I was talking about persistence with in-memory databases. Check this out here and here.

In this article, I would like to touch upon the performance problems of in-memory databases. For starters, let’s just talk about performance in the simplest case, when you change the value of a specified key. And let’s simplify this case even further: assume there’s no database server at all. I mean, no client-server interaction over network. So, the database resides totally inside your application’s RAM space.

If you didn’t have a database server, then you would probably store key-value pairs inside your application’s memory in a hash table. In C/C++, it would be a data structure like this:

std::unordered_map

In order to check how fast this data structure is, I created a file named

1.cpp

with the content as follows:

#include <map>
#include <unordered_map>
#include <iostream>
const int SIZE = 1000000;
int main()
{
std::unordered_map<int,int> m;
m.reserve(1000000);
long long c = 0;
   for (int i = 0; i < SIZE;++i)
c += m[i*i] += i;
   std::cout << c << std::endl;
}

Then I compiled and ran it:

g++ -std=c++11 -O3 1.cpp -o 1
MacBook-Air-anikin:Downloads anikin$ time ./1

and got the result:

real 0.465s
user 0.422s
sys 0.032s

What observations can we do from that? What I did was:

  1. I love C/C++ :-)
  2. I love my good old MacBook Air (actually, I don’t, because it’s getting slow over time, but let me leave it for another article)
  3. I love using -O3 option. Some folks are afraid of that, but please don’t, otherwise performance will be as poor as this:
MacBook-Air-anikin:Downloads anikin$ g++ -std=c++11 1.cpp -o 1
MacBook-Air-anikin:Downloads anikin$ time ./1

The result without O3 is twice as worse:

real 0.883s
user 0.835s
sys 0.033s

4. The application spent most of the time for the user mode. A little bit of time spent for the system mode here was a fixed cost needed, I believe, to initially allocate pages for the hash table (and there is no O3 helping to reduce this time), to perform mmaps and to load the executable file.

5. This application inserts roughly one million keys into the hash table. The word roughly here means that it can be less than one million because of repeating keys caused by an overflow at i*i. So, insertions can become updates. But the number of operations with the hash table is still one million.

6. This application inserts one million keys and runs around 0.5 seconds, so it makes roughly two million set-a-value-by-a-key operations per second.

The last observation is very interesting to me. You can consider that you have already got an in-memory key-value storage engine named std::unordered_map, which is able to perform two million key-value operations per second on a single CPU core on a good old MacBook Air:

MacBook-Air-anikin:Downloads anikin$ uname -a
Darwin MacBook-Air-anikin.local 13.4.0 Darwin Kernel Version 13.4.0: Mon Jan 11 18:17:34 PST 2016; root:xnu-2422.115.15~1/RELEASE_X86_64 x86_64

Notice that I used integer keys and values. I could use strings, but I didn’t just because I didn’t want memory allocation and copying to interfere with the test. Also, if you use std::unordered_map<std::string, …>, then you’re likely to have collisions that will reduce performance.

What we see now is that an in-memory hash table is able to perform 2 million operations per second on a single CPU core. But, remember, I was going to talk about in-memory databases. How is an in-memory database different from an in-memory hash table? Well, a database is a server application, whilst a hash table is a library. So, a database is a hash table plus something more. :-) And this something more includes at least a server application.

Let’s build a database server application around std::unordered_map<int, int>. A naive approach could look like this:

1. Accept connections in the main thread.

2. Create a new thread for each accepted connection (or let’s have a thread pool with threads created in advance).

3. Protect std::unordered_map by some synchronization primitive, e.g. mutex.

4. And don’t forget about persistence — log each update to a transaction log.

I don’t want to bore you with writing code for this server. So, assume we have already done that. Indeed, look at any database server based on a one-thread-per-connection architecture (MySQL, MariaDB, Postgres etc). It can perform tens of thousands of requests per second on a single CPU core in the best case scenario. The best performance for traditional databases that I could find on Internet was around one million queries per second. It was MariaDB running on a 20-core machine. See details here. So, it is 50K requests per second. One of the best databases on the market, tuned by probably the best database people on earth can only provide 50K requests per second on a single CPU core.

50K vs 2 million — the difference is 40 times as compared with std::unordered_map. How do you like this? You just surrounded a data structure with a server in order to grant other applications remote access to this data structure and got a 40-time performance penalty! This is so sad that I think that we should forget about multi-tier architecture and write all the business logic and all the database logic in a single application inside a single process. Or … we can try to possibly optimize the database server.

Let’s look closer at what’s happening when a database server with the above-mentioned architecture processes a transaction in terms of system calls.

1. Read a request from the network.

2. Lock the hash.

3. Unlock the hash.

4. Write to the transaction log.

5. Write to the network.

These are at least 5 syscalls to a database per request. Each syscall requires entering the kernel mode and exiting the kernel mode.

On entering/exiting the kernel mode, there is a context switching. Context switching implies tons of extra work — you need to back up and to restore all the registers and other sensitive information. See details here.

To give you an idea of how bad syscalls are, I wrote another program (in C):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
int main()
{
int fd = open(“/dev/zero”, O_RDONLY);

for (int i = 0; i < 1000000; ++i)
{
unsigned char c;
if (read(fd, &c, 1) == -1)
{
fprintf(stderr, “error on read\n”);
break;
}
}
   close(fd);
return 0;
}

What it does is just calling read byte by byte from the /dev/zero file one million times. The result of this test is as follows:

MacBook-Air-anikin:Downloads anikin$ time ./2
real 0.639s
user 0.099s
sys 0.495s

First of all, the program spends nearly all the time in the kernel mode. Second of all, it makes roughly 1.5 million syscalls per second. Remember that the hash lookup rate was about two million times per second. Isn’t it interesting that a single read syscall is 30% slower than a hash table lookup? And that was a very simple syscall. It didn’t access disk or network. It just returned zeros.

As you see above, within a database server we need at least 5 syscalls to handle a hash table lookup. So, we need at least 5*1.3=6.5x time just for syscalls! If you think about syscalls as taxes, then this is like an 85% tax. Would you be happy if you paid an 85% tax on your salary? I mean, what if you got in hand only $15 out of $100 that you made? Then let’s think about the real read, write and other syscalls. They do a lot of work: read from a network buffer, allocate slabs in the Linux kernel, look up and change internal kernel structures etc. So, 95%+ of the tax does not look that fantastic. Indeed, you can strace MySQL or any other traditional database and check out how many syscalls it issues during request processing.

Ok, syscalls are evil. Can we abandon syscalls and move all the database logic to the kernel? That’s a good idea. But probably there’s something more feasible. Let’s look at another example:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
int main()
{
int fd = open(“/dev/zero”, O_RDONLY);
for (int i = 0; i < 1000; ++i)
{
unsigned char c[1000];
if (read(fd, &c, 1000) == -1)
{
fprintf(stderr, “error on read\n”);
break;
}
}
   close(fd);
return 0;
}

And the result is:

MacBook-Air-anikin:Downloads anikin$ time ./2
real 0.007s
user 0.001s
sys 0.002s

This program does exactly the same thing as the previous one — it copies one million bytes from /dev/zero. And its running time is 7 ms compared with 639 ms in the previous example. That is almost 100 times faster! What is the trick? The trick here is that we did less syscalls and more work per a syscall. It turns out that syscalls are not bad if they do a lot of work. You pay only fixed cost per a syscall, and then you can use them almost for free. It’s like in the Disneyland — pay once for admission and use it all day long. Or less than all day long, still paying the same admission, but paying more per an attraction.

So, in order to speed up our database server, we just need to issue less syscalls and do more work within each syscall. How to do that? Let’s just group requests and processing:

1. Read 1000 requests from the network (via a single read syscall).

2. Lock the hash.

3. Process 1000 requests.

4. Unlock the hash.

5. Write 1000 transactions to the transaction log (via a single write/writev syscall).

6. Write 1000 responses to the network (via a single write syscall).

Great! But wait a minute, a database is not a batch processor. It is an online transaction processor. As soon as it got a request, it should process it immediately with the best latency it can. It can’t wait for a whole batch of 1000 requests to come in because it can wait forever.

How to solve this problem?

Let’s look at public transport. They have already solved this problem in the last hundred years. A bus ticket is cheaper than a taxi cab, because buses have higher throughput — less cost per passenger. But the latency (waiting time) of a bus (or a train — it depends on the public transport strategy of a specific city) is roughly the same as that of a taxi cab in a busy downtown (I saw that trend at least in New York City, Moscow, Paris, Berlin, Tokyo). What is the trick?

The trick is that a bus never waits for a hundred of passengers to arrive to a bus stop. It always has enough passengers at a bus stop to pick them up, because the downtown is busy (the workload is high). This way, a bus pulls in, picks everybody up (until the bus is full or until there is nobody at the bus stop), and pulls out immediately. Then the next bus pulls in and picks up enough people again, because they’ve already arrived at the bus stop since the previous bus was gone.

To make the same thing to a database, we need to treat a network subsystem, a transaction processor and a disk subsystem as independent buses (or trains if you wish). Each of those buses is working asynchronously with respect to others. And each of those buses onboards as many people as it has at the bus stop. If there is not enough people at the bus stop, then — yes — we use CPU not efficiently, because buses with high fixed costs are almost empty. But on the other hand, who cares? We perfectly serve the workload that we get. This way, we can have 99% CPU usage on 10K requests per second and the same 99% CPU usage on 1000K requests per second with the same good latency, because the number of syscalls is the same. And that number matters more than the amount of bytes transferred within a syscall. The CPU is always busy, but it magically stretches up to higher workloads staying as busy as it was with smaller workloads. Remember that the latency of a syscall doing 100x work is almost the same as that of a syscall doing 1x work, because of an extremely high fixed cost per a syscall.

How to implement the whole thing best? Just by doing everything asynchronously:

1. In Tarantool, we maintain three threads: the network thread, the transaction processing (we call it TX) thread and the disk thread.

2. The network thread reads requests from the network (whatever amount of requests it can read from a network buffer without blocking on I/O, whether it’s 1 or 1000 or more). And then it puts all the requests to the TX. Also, it gets responses from the TX and writes them to the network. And again, it does it within a single network package, no matter if it contains a single response or thousands of responses.

3. The TX thread just processes transactions in memory, group by group, taking them from the network thread. After it’s done with processing a group in memory, TX passes the group to the disk thread. And again, it passes transactions to the disk thread group by group, handling as much data as it has. Like people step out of the train within a big group and all go to the bus. The bus takes everyone who reaches the bus until it has no people at the bus stop. Folks who are running late are going to take another bus. A bus doesn’t wait even a millisecond — if there is nobody just after the last person, then it hits the road. As long as the disk thread is done with the group, it returns this group to the TX thread to commit transactions and to return all the requests within this group to the network thread.

That’s how we significantly reduced the number of syscalls in Tarantool. And this is how it works and handles one million transactions on a single CPU core: https://gist.github.com/danikin/a5ddc6fe0cedc6257853

You can see the entire workflow at this picture:

The main thing here is that each thread is working in parallel and does not interfere with other threads making their job. The more parallel the workload is and the heavier the workload is, the less the number of syscalls per request is and the more requests per second the system can handle.

At the same time, the latency is good, because threads don’t wait for other threads, they just handle as much work as they have right now, and while they’re handling it, a new piece of work is being prepared in parallel.

There will be new articles about transaction processing in an in-memory database. Stay tuned! :-)