Asynchronous work with Tarantool in Python

You may have already seen some recent articles about the Tarantool NoSQL DBMS and how it’s used at its alma mater, Mail.Ru Group, and elsewhere. Yet no Tarantool recipes have been published for Pythonistas. In this article, I’ll talk about how we cook “Tarantool with Python sauce” in our projects, what problems and pitfalls we encounter, the pros and cons, and of course “what’s the point” of this dish. :-)

So, first things first.

Tarantool is a Lua application server. It can store data on disk and provide quick access to it. Tarantool is designed to work with significant data flows: we’re talking about tens or hundreds of thousands of operations per second. For example, one of my projects generates more than 80,000 requests per second (SELECT, INSERT, UPDATE, DELETE), and the workload is evenly distributed among 4 servers with 12 Tarantool instances. Not every DBMS is ready for such a workload. Furthermore, time spent waiting for a request to complete is very expensive when processing this much data, which is why applications should be able to switch quickly from one task to another. To keep all CPU cores efficiently and evenly loaded, you’ll need Tarantool and asynchronous programming techniques.

How does the Tarantool-Python connector work?

Before we start talking about asynchronous code in Python, let’s make sure that we have a clear idea of how regular synchronous code interacts with Tarantool. I’m going to use Tarantool 1.6 for CentOS; it’s very easy to install, and you can find detailed instructions on the project website along with the user guide. Thanks to the recently upgraded documentation, it’s now much easier to start and run Tarantool instances. See also our recent article Getting started with Tarantool 1.6.

So, Tarantool is up and running. To start working with Python 2.7, we need to install the tarantool-python connector from pypi:

$ pip install tarantool-python

That’s quite enough for our tasks. As an example, let’s take a real task from one of our projects. We once needed to “put” our data flow into Tarantool for further processing; the size of one data batch was approximately 1.5 KB.

We did some fieldwork aiming to test the approaches (synchronous vs asynchronous programming) and choose the proper tools. We started with a performance test script. It looked as easy as ABC, and it took just a couple of minutes to write:

import tarantool
import string

# 100 different payloads, 1536 bytes (1.5 KB) each
mod_len = len(string.printable)
data = [string.printable[it] * 1536
        for it in range(mod_len)]

tnt = tarantool.connect("127.0.0.1", 3301)

# every insert is a blocking round trip to Tarantool
for it in range(100000):
    r = tnt.insert("tester", (it, data[it % mod_len]))

The test algorithm is simple: we perform 100,000 successive insertions into Tarantool. My virtual machine runs this piece of code in 32 seconds on average, which is about three thousand insertions per second. If we were happy with this performance, there would’ve been nothing else to do. As you know, premature optimization is the root of all evil. ;-) However, that was not enough for our project. Tarantool can boast much higher performance.

Code profiling

Before we rush in, let’s take a closer look at our piece of code and the way it works. Many thanks to my co-worker dreadatour for his articles about Python code profiling (in Russian).

It’s good to know how your code works under the hood before you run a code profiler. After all, the best profiling tool is the developer’s head. ;-) The script itself is simple, with not much to think about, so let’s dig a little deeper. If we look at the implementation of the tarantool-python connector, we’ll see that requests are packed using the msgpack library and sent to a socket with sendall, and then the response length and the response itself are read from the socket. That sounds more interesting. How many operations on the Tarantool socket will there be as we run our piece of code? In our case, for one tnt.insert request there will be one socket.sendall call (sending data) and two socket.recv calls (getting the response length and the response itself). So for a 100-thousand-record insertion there will be 100K writes + 200K reads = 300K read/write system calls. The code profiler (I used cProfile and kcachegrind) confirms our calculations.
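The call pattern described above can be reproduced without Tarantool. The sketch below is only an illustration: it uses a local socket pair, a made-up 4-byte length prefix instead of the real Tarantool wire format, and hypothetical helper names (recv_exactly, fake_server, sync_request, count_calls). It counts one write and two logical reads per request on the client side.

```python
import socket
import struct
import threading


def recv_exactly(sock, size):
    """Read exactly `size` bytes; a single recv may return fewer."""
    chunks = []
    while size > 0:
        chunk = sock.recv(size)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def fake_server(sock, n_requests):
    """Stand-in for the database: read a framed request, answer b'ok'."""
    for _ in range(n_requests):
        (length,) = struct.unpack(">I", recv_exactly(sock, 4))
        recv_exactly(sock, length)
        sock.sendall(struct.pack(">I", 2) + b"ok")


def sync_request(sock, payload, calls):
    """One blocking round trip: one write, then two logical reads."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)
    calls["write"] += 1
    (length,) = struct.unpack(">I", recv_exactly(sock, 4))  # response length
    calls["read"] += 1
    body = recv_exactly(sock, length)                        # response body
    calls["read"] += 1
    return body


def count_calls(n_requests, payload=b"x" * 1536):
    """Run n_requests round trips and return the client-side call counters."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=fake_server, args=(server, n_requests))
    worker.start()
    calls = {"write": 0, "read": 0}
    try:
        for _ in range(n_requests):
            sync_request(client, payload, calls)
    finally:
        client.close()   # unblocks the server thread if something went wrong
        worker.join()
        server.close()
    return calls


if __name__ == "__main__":
    print(count_calls(1000))
```

Each logical read is at least one recv call, so the real number of system calls can only be higher than what the counters show.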

What can be improved in our algorithm? First of all, of course, we’d like to reduce the number of system calls, i.e. operations on the Tarantool socket. This can be done by grouping tnt.insert requests into a “batch” and calling socket.sendall to send all the requests at once. Likewise, we can read Tarantool’s response “batch” with a single socket.recv. It’s not that easily done using classic-style programming: we need a data buffer and time to wait while data collects in the buffer, yet the request results should be returned without delays. And what shall we do if we have many requests at first, but then suddenly very few? Again there will be delays, which we were trying to avoid. All in all, we need an absolutely new approach, and most importantly, we need to keep our code as simple as it was. This is when asynchronous frameworks come in handy.
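A naive, framework-free version of the batching idea might look like the sketch below. BatchingSender, its max_batch threshold and the send callable are hypothetical names used only for illustration. The sketch also shows the problem mentioned above: when traffic slows down, requests stay in the buffer until somebody flushes it explicitly.

```python
class BatchingSender:
    """Collects encoded requests and hands them to `send` in one call."""

    def __init__(self, send, max_batch=100):
        self._send = send            # for example, socket.sendall
        self._max_batch = max_batch
        self._pending = []

    def add(self, request):
        """Queue a request; flush automatically when the batch is full."""
        self._pending.append(request)
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self):
        """Send everything queued so far with a single `send` call."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._send(b"".join(batch))

    @property
    def pending(self):
        return len(self._pending)


sent = []                            # each element stands for one write call
sender = BatchingSender(sent.append, max_batch=100)

for i in range(250):
    sender.add(b"request-%d;" % i)

print(len(sent), sender.pending)     # two full batches written, 50 requests waiting
sender.flush()                       # without this call the last 50 are delayed
print(len(sent), sender.pending)
```

Deciding when to call flush without hurting latency is exactly the part that becomes much simpler with an asynchronous framework.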

Gevent and Python 2.7

I’ve worked with several asynchronous frameworks: twisted, tornado, gevent and others. My favorite one is gevent, primarily due to its efficient handling of I/O operations and its ease of use. There is a good tutorial for this library, which offers the classic example of a fast crawler:

import time
import gevent.monkey
# patch sockets before anything else uses them
gevent.monkey.patch_socket()

import gevent
import urllib2
import json

def fetch(pid):
    url = 'http://json-time.appspot.com/time.json'
    response = urllib2.urlopen(url)
    result = response.read()
    json_result = json.loads(result)
    return json_result['datetime']

def synchronous():
    for i in range(1,10):
        fetch(i)

def asynchronous():
    threads = []
    for i in range(1,10):
        threads.append(gevent.spawn(fetch, i))
    gevent.joinall(threads)

t1 = time.time()
synchronous()
t2 = time.time()
print('Sync: %.3f' % (t2 - t1))

t1 = time.time()
asynchronous()
t2 = time.time()
print('Async: %.3f' % (t2 - t1))

I got the following test results on my virtual machine:

Sync: 1.529
Async: 0.238

Hmm, nice performance improvement, isn’t it? To make my synchronous piece of code work asynchronously with gevent, I had to wrap the fetch function call into gevent.spawn, effectively parallelizing the URL downloads. I also had to run monkey.patch_socket(), which makes all socket calls cooperative. That way, while one URL is being downloaded and the program is waiting for a response from the remote server, gevent switches to other tasks and tries to download other documents instead of staying idle. All gevent greenlets are still executed one at a time, but since no time is wasted waiting (there are no blocking system calls), the whole job finishes faster.
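The same effect can be reproduced without gevent or a network connection. The sketch below is an assumption-laden stand-in, not the tutorial’s code: it uses asyncio from the standard library (Python 3.7+) instead of gevent, and asyncio.sleep plays the role of waiting for a remote server. The function names are invented for the example.

```python
import asyncio
import time


async def fetch(pid, delay=0.05):
    # asyncio.sleep stands in for waiting on a remote server
    await asyncio.sleep(delay)
    return pid


async def sequential(n):
    # each fetch starts only after the previous one has finished
    return [await fetch(i) for i in range(n)]


async def concurrent(n):
    # all fetches wait at the same time; the event loop switches between them
    return list(await asyncio.gather(*(fetch(i) for i in range(n))))


def timed(coro_fn, n):
    """Run coro_fn(n) in a fresh event loop and measure the wall-clock time."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    print("Sync:  %.3f" % timed(sequential, 9)[1])
    print("Async: %.3f" % timed(concurrent, 9)[1])
```

Nothing runs in parallel here either: the concurrent version is faster only because the waits overlap.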

The whole idea looks good, and most importantly, this approach suits our needs perfectly. However, the tarantool-python connector doesn’t work with gevent out of the box, so I had to implement my own gtarantool connector on top of it.

Gevent and Tarantool

The gtarantool connector works with gevent and Tarantool 1.6, and it’s available to install from pypi:

$ pip install gtarantool

Meanwhile, the new solution for our task looks like this:

import gevent
import gtarantool

import string

mod_len = len(string.printable)
data = [string.printable[it] * 1536
        for it in range(mod_len)]
cnt = 0

def insert_job(tnt):
    global cnt

    for i in range(10000):
        # cnt is read before tnt.insert yields to other greenlets,
        # so every record gets a unique key
        cnt += 1
        tnt.insert("tester", (cnt, data[i % mod_len]))


# one connection shared by all greenlets
tnt = gtarantool.connect("127.0.0.1", 3301)

jobs = [gevent.spawn(insert_job, tnt)
        for _ in range(10)]

gevent.joinall(jobs)

What’s different here from the initial synchronous piece of code? We distributed the insertion of 100K records among ten asynchronous greenlets, each making 10K tnt.insert calls in a loop; and we need to connect to Tarantool just once! The program runtime dropped to 12 seconds, which is almost 3 times faster than the synchronous version, and the insertion rate rose to about 8,000 records per second. Why does this scheme work faster? How does the magic work?

The gtarantool connector uses a buffer of requests to send to a Tarantool socket, and separate read/write greenlets for this socket. Let’s take a look at the results in a code profiler (this time I was using Greenlet Profiler, a version of the yappi profiler adapted for greenlets).

Analyzing the results in kcachegrind, we can see that the number of socket.recv calls dropped from 200K to 10K, and the number of socket.send calls decreased from 100K to 2.5K. Actually, that’s what makes the work with Tarantool more effective: fewer heavy system calls, thanks to the lighter and “cheaper” greenlets. And the most important (and nicest) point is that our piece of code remained basically “synchronous”. It has no ugly twisted callbacks.
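The buffering behind these numbers can be sketched in a few lines. The code below is not gtarantool’s implementation: it uses asyncio from the standard library (Python 3.7+) instead of gevent so that it runs without extra packages, and BufferedConnection, fake_send and the other names are invented for the illustration. Each job queues a request and waits; a single writer task sends everything that has accumulated since its last wake-up in one call.

```python
import asyncio


class BufferedConnection:
    """Queues requests and lets one writer task send them in batches."""

    def __init__(self, send):
        self._send = send                 # coroutine called once per batch
        self._buffer = []                 # requests waiting to be written
        self._waiters = {}                # request id -> Future with the reply
        self._next_id = 0
        self._wakeup = asyncio.Event()

    async def request(self, payload):
        self._next_id += 1
        req_id = self._next_id
        future = asyncio.get_running_loop().create_future()
        self._waiters[req_id] = future
        self._buffer.append((req_id, payload))
        self._wakeup.set()                # tell the writer there is work
        return await future

    async def writer(self):
        while True:
            await self._wakeup.wait()
            self._wakeup.clear()
            batch, self._buffer = self._buffer, []
            if not batch:
                continue
            # one "system call" for the whole batch
            for req_id, response in await self._send(batch):
                self._waiters.pop(req_id).set_result(response)


async def insert_job(conn, count):
    for _ in range(count):
        await conn.request(b"record")


async def main(jobs=10, per_job=100):
    stats = {"writes": 0, "requests": 0}

    async def fake_send(batch):
        # pretend to write the batch and read the replies back
        stats["writes"] += 1
        stats["requests"] += len(batch)
        await asyncio.sleep(0)
        return [(req_id, b"ok") for req_id, _ in batch]

    conn = BufferedConnection(fake_send)  # created inside the running loop
    writer = asyncio.ensure_future(conn.writer())
    await asyncio.gather(*(insert_job(conn, per_job) for _ in range(jobs)))
    writer.cancel()
    return stats


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With several jobs running, each write carries more than one request, so the number of writes ends up below the number of requests, and insert_job still reads like plain sequential code.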

We successfully use this approach in our project. Here are some other pros:

  1. We don’t use forks. We can use several Python processes, with one gtarantool connection for each process (or connection pooling).
  2. A switch inside greenlets is quicker and more effective than a switch between Unix processes.
  3. The reduced number of processes allows for significantly reduced memory consumption.
  4. The reduced number of operations with the Tarantool socket boosts the efficiency of Tarantool itself as it doesn’t load the CPU that much.

What about Python 3 and Tarantool?

One of the differences among various asynchronous frameworks is their support for Python 3. For example, at the time of writing, gevent doesn’t support it.

A jedi’s path is thorny. I really wanted to compare the asynchronous work with Tarantool using Python version 2 and 3. So I decided to rewrite everything for Python 3.4. You know, it was a little weird to write the code after Python 2.7:

  • print “foo” doesn’t work
  • all strings are objects of the str class
  • no long type

But I got a hold of it pretty quickly, and now I’m trying to write my Python 2.7 code in a manner compatible with Python 3 right away.

Here is yet another example of dissimilarity:

Python 3.4:

>>> a=dict()
>>> a[b"key"] = 1
>>> a["key"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'key'

Python 2.7:

>>> a=dict()
>>> a[b"key"] = 1
>>> a["key"]
1

This kind of minor stuff is a good example of the difficulties with porting code to Python 3; even tests don’t always help here. Everything seems to work, but it works half as fast, and in our case that makes a big difference. So, we try to run our code in Python 3 — it works!

Well, it’s getting better! The synchronous connector completes the task in 35 seconds on average, which is not as fast as Python 2.7, but we can live with that.
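One way to guard against the bytes/str mismatch shown above is to normalize keys at the point where data enters the program. The helpers below are hypothetical and not part of any Tarantool connector; they are a minimal Python 3 sketch that assumes the keys are UTF-8 encoded.

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes to str; leave every other type untouched."""
    return value.decode(encoding) if isinstance(value, bytes) else value


def normalize_keys(mapping, encoding="utf-8"):
    """Return a copy of `mapping` whose bytes keys are decoded to str."""
    return {to_text(key, encoding): value for key, value in mapping.items()}


record = {b"key": 1, "other": 2}
fixed = normalize_keys(record)
print(fixed["key"], fixed["other"])   # both lookups now use str keys
```

Doing this once at the boundary keeps the rest of the code free of b"..." literals and of checks for both key types.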

Moving to asyncio in Python 3

The asyncio package is a coroutine framework that ships with Python 3 “out of the box”. There are documentation, examples, and libraries for asyncio and Python 3. At first sight, asyncio looks complicated and confusing (at least in comparison with gevent); however, after a while everything falls into place. Not without difficulty, I created a Tarantool connector version for asyncio, aiotarantool.

This connector is already available to install from pypi:

$ pip install aiotarantool

With asyncio, our piece of code became a bit more complex. We got the yield from constructions and the @asyncio.coroutine decorators, but I like it overall, and it’s not too different from gevent:

import asyncio
import aiotarantool
import string

mod_len = len(string.printable)
data = [string.printable[it] * 1536
        for it in range(mod_len)]
cnt = 0

@asyncio.coroutine
def insert_job(tnt):
    global cnt

    for it in range(10000):
        cnt += 1
        args = (cnt, data[it % mod_len])
        yield from tnt.insert("tester", args)


loop = asyncio.get_event_loop()
tnt = aiotarantool.connect("127.0.0.1", 3301)

# asyncio.async is the original Python 3.4 name;
# since Python 3.4.4 it is called asyncio.ensure_future
tasks = [asyncio.async(insert_job(tnt))
         for _ in range(10)]

loop.run_until_complete(asyncio.wait(tasks))
loop.close()

It solves the task in 13 seconds on average (about 7.5K insertions per second), a bit slower than Python 2.7 with gevent, but much better than all the synchronous versions. The aiotarantool connector has a minor but very important difference from the other libraries listed at asyncio.org: the aiotarantool.connect call is made outside of the asyncio event loop. In fact, this call doesn’t create any real connection: it’ll be created later, inside one of the coroutines, at the first yield from tnt.insert call. This approach seemed easier and more convenient to me when coding with asyncio.
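The “connect later” idea can be sketched as follows. This is not aiotarantool’s code: LazyConnection, its echo method and the tiny echo server are invented for the illustration, and the sketch uses the modern async/await syntax (Python 3.7+) rather than yield from. The object is created in plain synchronous code, and the socket is opened inside the event loop on the first request.

```python
import asyncio
import socket


class LazyConnection:
    """Stores the address only; the socket is opened on first use."""

    def __init__(self, host, port):
        self.host = host
        self.port = port
        self._reader = None
        self._writer = None
        self._lock = None

    @property
    def connected(self):
        return self._writer is not None

    async def _ensure_connected(self):
        if self._lock is None:
            self._lock = asyncio.Lock()   # created inside the running loop
        async with self._lock:            # only one coroutine opens the socket
            if self._writer is None:
                self._reader, self._writer = await asyncio.open_connection(
                    self.host, self.port)

    async def echo(self, line):
        # simplified: one request at a time, no matching of concurrent replies
        await self._ensure_connected()
        self._writer.write(line + b"\n")
        await self._writer.drain()
        return (await self._reader.readline()).rstrip(b"\n")

    async def close(self):
        if self._writer is not None:
            self._writer.close()
            await self._writer.wait_closed()
            self._writer = None


def make_listener():
    """Bind a TCP socket to a free local port for the demo server."""
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    return listener


async def handle(reader, writer):
    # echo every line back until the client disconnects
    while True:
        line = await reader.readline()
        if not line:
            break
        writer.write(line)
        await writer.drain()
    writer.close()


async def demo(conn, listener):
    server = await asyncio.start_server(handle, sock=listener)
    before = conn.connected
    reply = await conn.echo(b"ping")
    after = conn.connected
    await conn.close()
    server.close()
    return before, reply, after


if __name__ == "__main__":
    listener = make_listener()
    # no event loop is running here and no I/O happens yet
    conn = LazyConnection("127.0.0.1", listener.getsockname()[1])
    print(asyncio.run(demo(conn, listener)))
```

The lock matters once several coroutines share the object: without it, each of them could try to open its own socket on the first request.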

Again, here are the profiling results (I used the yappi profiler, but I’m not sure it counts function calls correctly when working with asyncio):

Here we have 5K calls of StreamReader.feed_data and StreamWriter.write. No doubt, that’s much better than the 200K socket.recv calls and 100K socket.sendall calls we had in the synchronous mode.

Comparing the approaches

Now let’s compare the efficiency of the synchronous and asynchronous Tarantool connectors. You can find our benchmark source code in the tests directory of the gtarantool and aiotarantool libraries. The test scenario is to insert, select, update and delete 100,000 records (1.5 KB each). We ran every test 10 times; average rounded figures are listed in the tables below. The precise values depend on the hardware and are not important; only their ratios matter.

We compared:

  • synchronous tarantool-python connector for Python 2.7;
  • synchronous tarantool-python connector for Python 3.4;
  • asynchronous gtarantool connector for Python 2.7;
  • asynchronous aiotarantool connector for Python 3.4.

Test runtime, in seconds (lower is better)

Operations per second (higher is better)

gtarantool is slightly faster than aiotarantool. We’ve been using gtarantool in our project for a while, and this solution copes really well with heavy workloads; however, gevent isn’t supported in Python 3. Besides, gevent is a third-party library that has to be compiled during installation. The asyncio package, in turn, is fast and new; it comes “out of the box” in Python 3 and needs no “duct tape” like the monkey.patch utility in gevent. But as of now, aiotarantool hasn’t been used under real workload in our project. Well, the night is still young!

Getting every last drop of performance out of the CPU

Let’s make our benchmark code a tad more complex to get the most out of our CPU. We’ll be simultaneously deleting, inserting, updating and selecting data (a fairly common type of workload) in one Python process, and we’ll run several such processes, say 22. The magic of this number is as follows: on a machine with 24 cores, we leave one core to the system (just in case), another core is for Tarantool (it’ll be fine with just one!), and the remaining 22 cores are for our 22 Python processes. Let’s run the test for both gevent and asyncio. You can find our benchmark source code for gtarantool and for aiotarantool on GitHub.
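The core arithmetic above, and one possible way to start the workers, can be sketched like this. The function names and the placeholder benchmark worker are assumptions made for the example; the actual benchmark scripts may be organized differently.

```python
import multiprocessing


def worker_count(total_cores, system_cores=1, tarantool_cores=1):
    """Cores left for Python workers after the reserved ones are set aside."""
    return max(1, total_cores - system_cores - tarantool_cores)


def run_workers(target, count):
    """Start `count` processes running target(worker_id) and wait for them."""
    procs = [multiprocessing.Process(target=target, args=(worker_id,))
             for worker_id in range(count)]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return [proc.exitcode for proc in procs]


def benchmark(worker_id):
    # placeholder: a real worker would open its own connection to Tarantool
    # and run its share of the insert/select/update/delete workload
    pass


if __name__ == "__main__":
    print(worker_count(24))   # 24 cores - 1 (system) - 1 (Tarantool)
    # To actually start the workers on this machine:
    # run_workers(benchmark, worker_count(multiprocessing.cpu_count()))
```

Each worker is a separate process with its own event loop and its own connection, so nothing is shared between them except Tarantool itself.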

For easier comparison, it’d be great to present our test results in a smart-looking way. That’s where we can appreciate the capabilities of the new Tarantool version 1.6: it’s actually a Lua interpreter, so we can write absolutely any Lua code right in the database. Let’s write a very simple program that makes Tarantool send its statistics to Graphite. We add this piece of code to our Tarantool startup script (in a real project, of course, we’d put it into a separate module).

fiber = require('fiber')
socket = require('socket')
log = require('log')

-- address of the Graphite (carbon) UDP listener
local host = '127.0.0.1'
local port = 2003

fstat = function()
    local sock = socket('AF_INET', 'SOCK_DGRAM', 'udp')
    while true do
        local ts = tostring(math.floor(fiber.time()))
        -- current requests-per-second counters
        local info = {
            insert = box.stat.INSERT.rps,
            select = box.stat.SELECT.rps,
            update = box.stat.UPDATE.rps,
            delete = box.stat.DELETE.rps
        }

        -- one datagram per metric: "<name> <value> <timestamp>"
        for k, v in pairs(info) do
            local metric = 'tnt.' .. k .. ' ' .. tostring(v) .. ' ' .. ts
            sock:sendto(host, port, metric)
        end

        fiber.sleep(1)
        log.info('send stat to graphite ' .. ts)
    end
end

-- run the reporter in a background fiber
fiber.create(fstat)

Now we start up Tarantool and automatically receive our statistics graphs. Cool! I liked this feature very much.
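To check what such a script sends without setting up Graphite, a small UDP listener is enough. The sketch below is a hypothetical helper, not part of the benchmark: it parses lines in the “name value timestamp” form used above and, for the demo, sends itself two made-up sample metrics over the loopback interface on a free port (not 2003).

```python
import socket


def parse_metric(line):
    """Split a Graphite plaintext line: '<path> <value> <timestamp>'."""
    path, value, timestamp = line.strip().split(" ")
    return path, float(value), int(timestamp)


def receive_metrics(sock, count):
    """Read `count` UDP datagrams from a bound socket and parse each one."""
    return [parse_metric(sock.recv(1024).decode("ascii"))
            for _ in range(count)]


def loopback_demo():
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))        # any free port
    receiver.settimeout(2.0)               # don't hang if a datagram is lost
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # made-up sample values, only to exercise the parser
        for name, rps in (("insert", 8000), ("select", 12000)):
            line = "tnt.%s %d %d" % (name, rps, 1450000000)
            sender.sendto(line.encode("ascii"), receiver.getsockname())
        return receive_metrics(receiver, 2)
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    for metric in loopback_demo():
        print(metric)
```

To listen to a real Tarantool instance instead, bind the receiver to the host and port configured in the Lua script.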

Now let’s run two benchmark tests. In the first one, we’ll be deleting, inserting, updating and selecting data. In the second one, we’ll only be selecting data. In all the graphs, the X axis shows time and the Y axis shows the number of operations per second.

gtarantool (insert, select, update, delete)
aiotarantool (insert, select, update, delete)
gtarantool (select only)
aiotarantool (select only)

Let me remind you that Tarantool was using only one core. In the first test, Tarantool loaded that core to 100%; in the second test, it used only 60% of it.

Based on these results, we decided to adopt the above-described approaches in our project.

Conclusions

The examples in this article are somewhat artificial, of course. Real-life tasks are more complex and diverse, but the solution is mostly the same as shown above: Tarantool and asynchronous programming. This tandem works great when you need to process “lots, and lots, and lots more requests per second”. Coroutines are effective when a task spends most of its time waiting for events (blocked in system calls); a classic example of such a task is a crawler.

Coding for asyncio or gevent is not as hard as it looks, but you have to pay close attention to code profiling: asynchronous code often doesn’t behave the way you expect.

Tarantool and its protocol are really well-suited for asynchronous programming. Once you immerse yourself into the world of Tarantool and Lua, you’ll be endlessly surprised at their powerful capabilities. Python code can work effectively with Tarantool; Python 3 is almost as good for developing with asyncio coroutines.

I hope this article will be useful for the community and will enrich our knowledge of Tarantool and asynchronous programming. I think asyncio and aiotarantool will eventually be used in production in our projects, and then I’ll have more information to share.

And, of course, the Tarantool connectors: tarantool-python, gtarantool and aiotarantool.

It’s about time to give them a try!