I’m not going to spend much time describing what gevent is. I think the one-sentence overview from its website does a better job than I could:
gevent is a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of libevent event loop.
What follows are my experiences using gevent for an internal project here at Mixpanel. I even whipped up some performance numbers specifically for this post!
The main draw of gevent is obviously performance, especially when compared with traditional threading solutions. At this point, it’s pretty much common knowledge that, past a certain level of concurrency, doing I/O asynchronously vastly outperforms synchronous I/O in separate threads.
What gevent adds is a programming interface that looks very much like traditional threaded programming, but underneath does asynchronous I/O. Even better, it does all of this transparently. You can continue to use normal python modules like urllib2 to make HTTP requests and they’ll use gevent instead of the normal blocking socket operations. There are some caveats, but I’ll get back to those later.
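To make the "transparent" part concrete, here is a minimal sketch rather than code from the tests in this post. It uses the standard `time.sleep` as a stand-in for a blocking network call such as a urllib2 request, and the names `slow_task` and `run_all` are invented for illustration. After `monkey.patch_all()`, the blocking call should yield to other greenlets instead of stalling the whole process.

```python
from gevent import monkey
monkey.patch_all()  # patch the standard library before it is used

import time
import gevent


def slow_task(n):
    # time.sleep is the ordinary blocking call; once patched, it
    # yields to other greenlets instead of blocking the process.
    time.sleep(0.1)
    return n * 2


def run_all(count):
    """Run `count` tasks concurrently; return (results, elapsed seconds)."""
    start = time.time()
    jobs = [gevent.spawn(slow_task, i) for i in range(count)]
    gevent.joinall(jobs)
    return [job.value for job in jobs], time.time() - start
```

Run sequentially, 50 of these tasks would take about five seconds. With the patched `time.sleep` they should all wait concurrently, so the total is expected to be close to a single 0.1 second sleep.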
For now, here’s the kind of performance improvement you can expect:
- Ignoring everything else, gevent outperforms a threaded solution (in this case paste), by a factor of 4.
- The number of errors rises linearly with the number of concurrent connections in the threaded solution. These were all connection timeouts; I could probably have increased the timeout interval, but from a user's perspective extremely long waits are just as bad as failures. gevent has no errors until 10,000 simultaneous connections, or at least until somewhere north of 5,000 simultaneous connections.
- The actual requests completed per second were remarkably stable in both cases, at least until gevent fell apart in the 10,000 simultaneous connections test. I actually found this somewhat surprising. I initially guessed that requests per second would degrade at least a little bit as concurrency went up.
- The 10,000 simultaneous connections threaded test failed completely. I could have probably gotten this to work (seemed like something that some more ulimit tweaking could have solved), but I was mostly doing the test for fun so I didn’t spend any time on it.
- If this kind of thing interests you, we’re hiring. (Yeah, I just intermingled content and advertising.)
Here’s the python code I used for both tests:
For the client, I used Apache bench with the following options:
- -c NUM: where NUM is the number of simultaneous connections. This matched the number used on the server command line in each test.
- -n 100000: all tests were over 100,000 requests. In the graph above, errors are not a rate, but rather the actual number of failed requests out of 100,000.
- -r: continue even if there is a failure.
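For repeatable runs, the options above can be assembled into a command line from a small script. This is only a sketch: the helper names and the URL `http://127.0.0.1:10001/` are assumptions for illustration (the post does not give the exact URL that was tested), while `-c`, `-n` and `-r` are the flags listed above.

```python
import subprocess


def ab_command(concurrency, url, total_requests=100000):
    """Build the argument list for one Apache bench run."""
    return [
        "ab",
        "-c", str(concurrency),     # simultaneous connections
        "-n", str(total_requests),  # total number of requests
        "-r",                       # keep going after socket errors
        url,
    ]


def run_ab(concurrency, url):
    """Run ab and return its text report (requires ab to be installed)."""
    result = subprocess.run(
        ab_command(concurrency, url),
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

For example, `ab_command(1000, "http://127.0.0.1:10001/")` produces the argument list for a 1,000-connection run against a hypothetical local server.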
All tests were done with both client and server running on the same low-end, 512MB Rackspace Cloud VPS. I initially thought I would need some way to limit the threaded solution to one CPU, but it turns out even though there are “four” cores on the VPS, you’re limited to 100% of one core. Not impressed.
Linux tweaks for load testing
I ran into a whole host of issues getting Linux working past ~500 connections per second. Almost all of these are related to all the connections being between the same two IP addresses (127.0.0.1 <-> 127.0.0.1). In other words, you probably wouldn’t see any of these problems in production, but almost certainly would in a test environment (except maybe if you’re running behind a single proxy).
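A rough back-of-the-envelope check suggests where a ceiling of about 500 connections per second can come from. The sketch below assumes the Linux defaults of the time, which are not stated in this post: an ephemeral port range of 32768 to 61000 and a TIME_WAIT period of 60 seconds. Each closed client connection holds its port for the whole TIME_WAIT period, so the sustainable rate between one pair of addresses is roughly the number of ports divided by that period.

```python
TIME_WAIT_SECONDS = 60  # assumed Linux TIME_WAIT duration


def max_connection_rate(low_port, high_port, time_wait=TIME_WAIT_SECONDS):
    """Approximate new connections per second one address pair can sustain."""
    ports = high_port - low_port + 1
    return ports / time_wait


# Assumed default range: 28233 ports, about 470 connections per second
default_rate = max_connection_rate(32768, 61000)

# Widened range of 1024 to 65535: 64512 ports, about 1075 per second
widened_rate = max_connection_rate(1024, 65535)
```

Under these assumptions, widening the port range only about doubles the ceiling, which is consistent with also needing the TIME_WAIT tweaks listed below.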
- Increase the client port range
echo -e '1024\t65535' | sudo tee /proc/sys/net/ipv4/ip_local_port_range
This increases the number of available ports to use for client connections. You’ll run out of ports very quickly without this (they get stuck in TIME_WAIT).
- Enable TIME_WAIT recycling
echo 1 | sudo tee /proc/sys/net/ipv4/tcp_tw_recycle
This helps with connections stuck in TIME_WAIT as well and is basically required past a certain number of connections per second at least if the IP address pair remains the same. There’s another option tcp_tw_reuse that is available as well, but I didn’t need to use it.
- Disable syncookies
echo 0 | sudo tee /proc/sys/net/ipv4/tcp_syncookies
If you see “possible SYN flooding on port 10001. Sending cookies.” in dmesg, you probably need to disable tcp_syncookies. Don’t do this on your production server, but for testing it doesn’t matter, and leaving syncookies enabled can cause connection resets.
- Disable iptables if you’re using connection tracking
You’ll quickly fill up the netfilter connection table. Alternatively, you can try increasing /proc/sys/net/netfilter/nf_conntrack_max, but I think it's easier just to disable the firewall while testing.
- Raise open file descriptor limits
At least on Ubuntu, the open files limit for normal users defaults to 4096. So, if you want to test with more than ~4000 simultaneous connections, you need to bump this up. The easiest way is to add a line like "* hard nofile 16384" to the system limits configuration and then run
ulimit -n 16384
before running your tests.
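The file descriptor limit can also be checked and raised from inside a Python test client or server using the standard `resource` module. This is a sketch, not what was used for the tests in this post, and `raise_nofile_limit` is a name made up for illustration. An unprivileged process can only raise its soft limit as far as the hard limit, so the hard limit still has to be increased system-wide first.

```python
import resource


def raise_nofile_limit(target):
    """Raise the soft open-files limit toward `target`, capped at the hard limit.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= target:
        return soft, hard  # already high enough, nothing to do
    new_soft = target if hard == unlimited else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Calling `raise_nofile_limit(16384)` at startup and logging the returned pair makes it obvious when a test is about to run with fewer descriptors than it needs.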
It can’t be all good, right? Right. Actually, most of the problems I had with gevent could be solved with better, more thorough documentation, which leads me to:
Simply put: the documentation is not good. I probably read more gevent source code than I did gevent documentation (and it was more useful!). The best documentation is actually in the examples directory in the source tree. If you have a question, look there first. Seriously. I also spent more time googling through mailing list archives than I'd like to.
gevent also doesn’t mix with other asynchronous libraries; I’m specifically talking about eventlet here. In retrospect, this makes sense, but it can lead to some baffling failures. We had some MongoDB client code that was using eventlet. It simply didn’t work from the gevent-based server process I was working on.
Order matters. Ugh.
Daemonize before you import gevent or at least before you call monkey.patch_all(). I didn't look into this deeply, but what I gathered from a mailing list post or two is that gevent modifies a socket in python internals. When you daemonize, all open file descriptors are closed, so in children, the socket will be recreated in its unmodified form, which of course doesn't work right with gevent. gevent should handle this type of thing or at least provide a daemonize function that is compatible.
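A sketch of that ordering, assuming the behaviour described above: fork into the background using only the standard library, and import and patch gevent afterwards. The `daemonize` function is a generic double fork written for illustration, not something gevent provides, and `main` is a hypothetical entry point.

```python
import os


def daemonize():
    """Classic double fork using only the standard library."""
    if os.fork() > 0:
        os._exit(0)   # first parent exits
    os.setsid()       # become session leader, drop the controlling terminal
    if os.fork() > 0:
        os._exit(0)   # second parent exits
    os.chdir("/")
    # Point stdin, stdout and stderr at /dev/null.
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)


def main():
    daemonize()                # 1. fork first, before any gevent state exists
    from gevent import monkey  # 2. only now import gevent
    monkey.patch_all()         # 3. and patch the standard library
    # 4. start the server here
```

The important design choice is that nothing at module level imports gevent, so no event loop or patched socket exists in the process until after the final fork.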
Monkey patching. Sometimes?
So, most operations are patched by executing monkey.patch_all(). I'm not a huge fan of doing this sort of thing, but it is nice that normal python modules continue to function. Bizarrely, though, not everything is patched. I spent a while trying to figure out why signals weren't working until I found gevent.signal. If you're going to patch some functions, why not patch them all?
The same applies to gevent.queue vs. the standard python queue. Overall, it needs to be clearer (as in a simple list) when you need to use gevent-specific APIs versus standard modules/classes/functions.
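As an illustration of the gevent-specific side, here is a minimal producer and consumer sketch built on `gevent.queue.Queue`. The function names and the `None` sentinel are choices made for this example. A `put` on a full queue or a `get` on an empty one blocks only the calling greenlet, so the two greenlets take turns.

```python
import gevent
from gevent.queue import Queue


def producer(queue, count):
    for i in range(count):
        queue.put(i)      # waits cooperatively while the queue is full
    queue.put(None)       # sentinel: nothing more to come


def consumer(queue, results):
    while True:
        item = queue.get()  # waits cooperatively while the queue is empty
        if item is None:
            break
        results.append(item * item)


def run(count):
    queue = Queue(maxsize=2)
    results = []
    gevent.joinall([
        gevent.spawn(producer, queue, count),
        gevent.spawn(consumer, queue, results),
    ])
    return results
```

Because the queue is first-in, first-out and there is a single consumer, `run(5)` collects the squares of 0 through 4 in order.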
gevent has no built in support for multiprocessing. This is much more a deployment issue than anything else, but it does mean that to fully utilize multiple cores, you’re going to need to run multiple daemon processes on multiple ports. Then, most likely, you’re going to need to run something like nginx (at least if you’re serving HTTP requests) to distribute requests among the server processes.
Really, the lack of multiprocessing capability just means another abstraction layer on your server that you might have added anyway for availability.
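One way to picture that layout is one server process per core on consecutive ports, with nginx balancing across them. The sketch below only plans the ports and renders a matching nginx `upstream` block. The base port 10001, the upstream name `gevent_app` and both helper names are assumptions for illustration, not the configuration actually deployed.

```python
def worker_ports(base_port, workers):
    """Consecutive ports, one per server process."""
    return [base_port + i for i in range(workers)]


def nginx_upstream(name, ports, host="127.0.0.1"):
    """Render an nginx upstream block listing every worker."""
    lines = ["upstream %s {" % name]
    lines += ["    server %s:%d;" % (host, port) for port in ports]
    lines.append("}")
    return "\n".join(lines)


# Example: four workers starting at the assumed base port.
config = nginx_upstream("gevent_app", worker_ports(10001, 4))
```

Each port in the list would be passed to one daemon process, and the rendered block would be referenced from a `proxy_pass http://gevent_app;` directive in the nginx server configuration.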
It’s a bigger issue when using gevent for client load testing. I ended up implementing a multiprocess load client that used shared memory to aggregate and print statistics. It was a lot more work than it should have been. (If anyone’s doing something similar, ping me and I can send the shell of the client program.)
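For a sense of what such a client involves, here is a much-reduced sketch using the standard `multiprocessing` module. It is not the client described above: the function names are invented, and `do_request` is a placeholder for whatever issues a single request. Each worker process increments two shared-memory counters, and the parent reads the totals once the workers finish.

```python
import multiprocessing as mp


def worker(ok, failed, n_requests, do_request):
    """Issue n_requests calls and record each outcome in shared counters."""
    for _ in range(n_requests):
        try:
            do_request()
        except Exception:
            with failed.get_lock():
                failed.value += 1
        else:
            with ok.get_lock():
                ok.value += 1


def run_load(n_procs, n_requests, do_request):
    """Start n_procs workers and return (succeeded, failed) totals.

    do_request must be a module-level function so that it can be
    handed to the child processes.
    """
    ok = mp.Value("i", 0)       # shared integer guarded by its own lock
    failed = mp.Value("i", 0)
    procs = [
        mp.Process(target=worker, args=(ok, failed, n_requests, do_request))
        for _ in range(n_procs)
    ]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
    return ok.value, failed.value
```

Inside each worker, `do_request` could itself fan out across greenlets, so that every core runs its own gevent loop while the counters are still aggregated in one place.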
If you’ve gotten this far, you’ll have noticed that I spent two full sections on negative aspects of gevent. Don’t let that fool you though. I’m convinced that gevent is a great solution for high performance python networking. There are problems, but mostly they’re problems with documentation, which will only improve with time.
We’re using gevent internally. In fact, our server is so efficient that we’ll easily run out of bandwidth resources before computing resources (both processors and memory) for the VPS size we’re using.
Originally published at https://engineering.mixpanel.com on October 29, 2010.