We went down, so we wrote a better pure python memcache client

Mixpanel Eng
Jul 16, 2012 · 4 min read

Memcache is great. Here at Mixpanel, we use it in a lot of places, mostly to cache MySQL queries but also for other data stores. We also use kestrel, a queue server that speaks the memcache protocol.

Because we use eventlet, we need a pure python memcache client so that eventlet can patch the socket operations to be non-blocking. The de-facto standard for this is python-memcached, which we used until recently.

In which things fail

When customers send data to our /track API endpoint, it hits a load balancer which forwards it to an API server. The API server sticks the data on a queue and we process it “later” (usually in a couple of milliseconds). Our kestrel queues implement the same interface as memcached, so we simply .set() to enqueue and .get() to dequeue.

On April 16, 2012, our /track API endpoint went down. Part of the reason is that one of our kestrel servers went down, but that shouldn’t have been a problem — that’s why we have redundancy and failover. Unfortunately, failover only works when failure is detected. This is where we had problems.

It turns out that socket operations that fail in python-memcached don’t return an error or raise any exceptions. Calling .get() on a dead server returns None on failure, the same thing that is returned when there’s no value. As such, our API servers weren’t able to recover by marking the kestrel server as down and kept trying to enqueue events to it.

With python-memcached, it’s difficult to control how long the client should wait before timing out. When our API servers tried to .set() onto the down queue, it would take around 30 seconds before failing. By then, the load balancers had already given up and our poor API server would be marked down through no fault of its own.

These two issues combined caused us to lose most of the data sent during the 85 minutes it took to fix things. Data loss is a worst-case scenario for us, so we took this incident very seriously.

If you’re like us and memcache is a critical part of your infrastructure, you need to know when your servers are down — your client needs to throw an exception on failure. But even if you’re just using memcached for caching, it’s important that you have control over your timeouts — you don’t want your caches to become a performance bottleneck when things go south.

Enter memcache_client

So, we set out to write a memcache client that suited our particular needs. We gave each instance of the client a configurable connect_timeout and timeout. It’s not quite a drop-in replacement; most of our code didn’t have to be updated at all, but there were some changes that needed to be made.

The main differences are:

  • We only connect to a single memcached.
    We don’t support autosharding over slabs.
  • TCP sockets only.
    No support for UDP or UNIX sockets.
  • Bytestring keys and values only.
    No pickling, casting to str, or auto-encoding of unicode.
  • And, of course, timeouts.
    Timeouts get raised on get’s and set’s, so you’ll need to catch those if you want the old behavior.

An interlude about types

You may be wondering why we got rid of the automatic pickling of keys and values. Our library doesn’t support some operations like incr and we also dropped support for sharding, but that was because we didn’t need them. Enforcing a “bytestrings in, bytestrings out” rule, however, was a conscious decision.

First of all, we noticed that, within our own codebase, there were different conventions for using the client. Some call sites would cast everything to strs themselves, some would rely on the autopickling, and some were just blindly using it and it happened to work — most of the time.

We think it’s important to know exactly what data you’re putting in and what data you’re getting back. Our more disciplined approach means you don’t get subtle errors when some unicode gets automatically encoded. We enforce the calling convention with code in our library that looks like this:

The code is robust about things like non-fatal network errors but it is intentionally brittle about how you interact with our API. As a side effect, the keys and values you store will be accessible from other clients, Python or otherwise.

Where we stand now

memcache_client has been a critical part of our infrastructure for about two months now. It’s running on over 200 machines, speeding up queries on both our external API and internally. In addition, every event we’ve received — approximately 10 billion during this time period — has successfully passed through our memcache client to make it onto the kestrel queues.

Now that it’s been battle-tested, we’re ready to open-source it. If you have a python application and you use memcached, we hope you’ll check it out. It was a bit of work to integrate, but the integration revealed our inconsistent handling of types and the resulting code is both aware of and resilient to hardware and network failures.

Check out the code at https://github.com/mixpanel/memcache_client
And the documentation at http://mixpanel.github.com/memcache_client/

As always, patches welcome (also, a big thanks to the people who’ve submitted patches to our other projects).

Originally published at https://engineering.mixpanel.com on July 16, 2012.

Mixpanel Engineering

Stories from eng @ Mixpanel!

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store