Memcache is great. Here at Mixpanel, we use it in a lot of places, mostly to cache MySQL queries but also for other data stores. We also use kestrel, a queue server that speaks the memcache protocol.
Because we use eventlet, we need a pure python memcache client so that eventlet can patch the socket operations to be non-blocking. The de-facto standard for this is python-memcached, which we used until recently.
In which things fail
When customers send data to our /track API endpoint, it hits a load balancer which forwards it to an API server. The API server sticks the data on a queue and we process it “later” (usually in a couple of milliseconds). Our kestrel queues implement the same interface as memcached, so we simply .set() to enqueue and .get() to dequeue.
On April 16, 2012, our /track API endpoint went down. Part of the reason is that one of our kestrel servers went down, but that shouldn’t have been a problem — that’s why we have redundancy and failover. Unfortunately, failover only works when failure is detected. This is where we had problems.
It turns out that socket operations that fail in python-memcached don’t return an error or raise any exceptions. Calling .get() on a dead server returns None on failure, the same thing that is returned when there’s no value. As such, our API servers weren’t able to recover by marking the kestrel server as down and kept trying to enqueue events to it.
With python-memcached, it’s difficult to control how long the client should wait before timing out. When our API servers tried to .set() onto the down queue, it would take around 30 seconds before failing. By then, the load balancers had already given up and our poor API server would be marked down through no fault of its own.
These two issues combined caused us to lose most of the data sent during the 85 minutes it took to fix things. Data loss is a worst-case scenario for us, so we took this incident very seriously.
If you’re like us and memcache is a critical part of your infrastructure, you need to know when your servers are down — your client needs to throw an exception on failure. But even if you’re just using memcached for caching, it’s important that you have control over your timeouts — you don’t want your caches to become a performance bottleneck when things go south.
So, we set out to write a memcache client that suited our particular needs. We gave each instance of the client a configurable connect_timeout and timeout. It’s not quite a drop-in replacement; most of our code didn’t have to be updated at all, but there were some changes that needed to be made.
The main differences are:
- We only connect to a single memcached.
We don’t support autosharding over slabs.
- TCP sockets only.
No support for UDP or UNIX sockets.
- Bytestring keys and values only.
No pickling, casting to str, or auto-encoding of unicode.
- And, of course, timeouts.
Timeouts get raised on get’s and set’s, so you’ll need to catch those if you want the old behavior.
An interlude about types
You may be wondering why we got rid of the automatic pickling of keys and values. Our library doesn’t support some operations like incr and we also dropped support for sharding, but that was because we didn’t need them. Enforcing a “bytestrings in, bytestrings out” rule, however, was a conscious decision.
First of all, we noticed that, within our own codebase, there were different conventions for using the client. Some call sites would cast everything to strs themselves, some would rely on the autopickling, and some were just blindly using it and it happened to work — most of the time.
We think it’s important to know exactly what data you’re putting in and what data you’re getting back. Our more disciplined approach means you don’t get subtle errors when some unicode gets automatically encoded. We enforce the calling convention with code in our library that looks like this:
The code is robust about things like non-fatal network errors but it is intentionally brittle about how you interact with our API. As a side effect, the keys and values you store will be accessible from other clients, Python or otherwise.
Where we stand now
memcache_client has been a critical part of our infrastructure for about two months now. It’s running on over 200 machines, speeding up queries on both our external API and internally. In addition, every event we’ve received — approximately 10 billion during this time period — has successfully passed through our memcache client to make it onto the kestrel queues.
Now that it’s been battle-tested, we’re ready to open-source it. If you have a python application and you use memcached, we hope you’ll check it out. It was a bit of work to integrate, but the integration revealed our inconsistent handling of types and the resulting code is both aware of and resilient to hardware and network failures.
As always, patches welcome (also, a big thanks to the people who’ve submitted patches to our other projects).
Originally published at https://engineering.mixpanel.com on July 16, 2012.