Lessons learned writing highly available code

After more than two years at Imgur, I’ve had to learn a lot about the principles behind writing highly-available (but not AP) fault-resilient systems. Systems still go down occasionally, but it’s the mornings I come in to work and realize that overnight a safeguard we put in place triggered automatically, or the system caught an error and recovered on its own, that make me thankful for good design principles. Here are a few of the things I’ve noticed in particular:

1. Put limits on everything.

That queue you have for batch processing items: does it really need to be unbounded? Stick an upper bound on it, one million items or whatever it is, and start discarding either the newest or the oldest items once it fills up. When connecting to another service over the network, do you really need to block indefinitely? Add a timeout! Do connections to your server need to stay open forever, or is five minutes so long that you’re probably better off killing them? You get the picture.
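
To make that concrete, here is a minimal Go sketch of the same ideas: a queue with a hard capacity that drops work instead of growing forever, an outbound call with a timeout, and a server that won’t hold idle connections indefinitely. The capacity, URL, and timeouts are arbitrary placeholders, not values from Imgur’s stack.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// A bounded work queue: the buffered channel's capacity is the hard limit.
// When the queue is full, enqueue drops the newest item instead of blocking.
var queue = make(chan string, 1000000)

func enqueue(item string) bool {
	select {
	case queue <- item:
		return true
	default:
		return false // at the limit: discard rather than grow without bound
	}
}

func main() {
	enqueue("some work item")

	// Never call another service without a deadline.
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("https://example.com/status") // placeholder URL
	if err != nil {
		fmt.Println("request failed or timed out:", err)
	} else {
		resp.Body.Close()
	}

	// And don't let clients hold your server's connections open forever.
	srv := &http.Server{
		Addr:        ":8080",
		ReadTimeout: 10 * time.Second,
		IdleTimeout: 5 * time.Minute,
	}
	_ = srv // srv.ListenAndServe() would start it; omitted in this sketch
}
```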

There is a cron job at Imgur called the “long query killer”: it scans the queries that MySQL is executing on behalf of users’ requests, checks how long each has been running, and kills any that exceed a threshold. Since PHP times out after 30 seconds (max_execution_time), no query from a request should still be running after a few minutes. That query killer has probably single-handedly prevented lots of long nights. If it doesn’t exist, build it. Your future self will thank you.
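
Imgur’s actual script isn’t shown here, but the idea fits in a page. A hypothetical Go version, assuming a MySQL account that is allowed to inspect and kill other sessions and a made-up 120-second threshold, could be run from cron like this:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // MySQL driver
)

const maxSeconds = 120 // hypothetical threshold; well above PHP's 30s limit

func main() {
	// Placeholder DSN; point it at the database you want to police.
	db, err := sql.Open("mysql", "killer:secret@tcp(db.internal:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Find queries that have been running longer than the threshold.
	rows, err := db.Query(
		`SELECT id, time FROM information_schema.processlist
		 WHERE command = 'Query' AND time > ?`, maxSeconds)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id, secs int64
		if err := rows.Scan(&id, &secs); err != nil {
			log.Fatal(err)
		}
		log.Printf("killing query %d (running for %ds)", id, secs)
		// KILL QUERY aborts the statement but leaves the connection alive.
		if _, err := db.Exec(fmt.Sprintf("KILL QUERY %d", id)); err != nil {
			log.Printf("could not kill %d: %v", id, err)
		}
	}
}
```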

2. Retry, but with exponential back-off.

Jim Gray wrote in Why Do Computers Stop and What Can Be Done About It that “most production software faults are soft. If […] the failed operation is retried, the operation will usually not fail the second time.” In the face of transient bugs, we can increase the availability of the system by retrying failed operations. But be careless about it and you can easily DDoS yourself! Follow rule (1) and retry only a fixed number of times, and make the system wait an increasingly long time between attempts so that the load is spread out over time instead of arriving all at once.
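
A minimal sketch of that policy in Go: a capped number of attempts, an exponentially growing delay, and a little jitter so that many clients recovering from the same outage don’t all retry in lockstep. The attempt count and base delay are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retry runs op up to maxAttempts times (rule 1) and sleeps exponentially
// longer between attempts, with jitter added to spread the retries out.
func retry(maxAttempts int, base time.Duration, op func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		// base, 2*base, 4*base, ... plus up to one extra base of jitter.
		delay := (base << uint(attempt)) + time.Duration(rand.Int63n(int64(base)))
		time.Sleep(delay)
	}
	return fmt.Errorf("giving up after %d attempts: %v", maxAttempts, err)
}

func main() {
	calls := 0
	err := retry(5, 100*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errors.New("transient failure") // a "soft" fault, per Gray
		}
		return nil
	})
	fmt.Printf("err=%v after %d calls\n", err, calls)
}
```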

3. Use supervisors and watchdog processes.

Erlang, a language frequently used for software with very strong availability requirements such as telecommunications, has a design pattern of supervisors: every task the program performs is structured so that it runs under the watch of a scrutinizing task master. If the supervisor detects that a task has unexpectedly quit, it restarts it (similar to rule (2)) from a known good state; it wouldn’t make sense to retry something that will fail forever! Monit is a great tool that can automatically restart your web server or daemon process if it crashes, and it is a great alternative for most languages, or even for the top-level Erlang VM if you are not using heart.
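
For a feel of the pattern without Erlang or Monit, here is a toy watchdog in Go that simply restarts a child process whenever it exits. Real supervisors add restart limits, escalation, and known-good-state handling, so treat this purely as an illustration; the daemon name and flags are placeholders.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// supervise keeps a child process alive: whenever it exits, for any reason,
// it is started again from scratch, after a short pause so that a process
// which crashes instantly doesn't turn into a tight restart loop.
func supervise(name string, args ...string) {
	for {
		log.Printf("starting %s", name)
		cmd := exec.Command(name, args...)
		if err := cmd.Run(); err != nil {
			log.Printf("%s exited with error: %v", name, err)
		} else {
			log.Printf("%s exited cleanly", name)
		}
		time.Sleep(2 * time.Second) // crude back-off before restarting
	}
}

func main() {
	// "./my-daemon" is a placeholder for whatever worker must stay up.
	supervise("./my-daemon", "--listen", ":8080")
}
```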

4. Add health checks, and use them to re-route requests or automate rollbacks.

As a developer, think about how to boil all the variables around your system down to a single boolean: “Is this thing healthy? Is it working?” At Imgur we use ELB’s great built-in port-monitoring health checks to quickly and automatically route around instances that are down.
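
A port check only proves something is listening; a small HTTP health endpoint lets the load balancer ask that boolean question directly. This is a sketch, not Imgur’s actual check: it assumes the database is the dependency that defines “healthy” and uses a placeholder DSN.

```go
package main

import (
	"database/sql"
	"log"
	"net/http"

	_ "github.com/go-sql-driver/mysql" // assumption: MySQL is the dependency that matters
)

// healthHandler boils the instance's state down to one answer:
// 200 if we can do useful work right now, 503 if we can't.
func healthHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := db.Ping(); err != nil {
			http.Error(w, "unhealthy: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	}
}

func main() {
	// Placeholder DSN; point it at whatever your service actually depends on.
	db, err := sql.Open("mysql", "app:secret@tcp(db.internal:3306)/app")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/health", healthHandler(db))
	// The load balancer polls /health and routes traffic away from any
	// instance that stops answering 200.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The same signal can drive deployments: if freshly deployed instances fail their health checks, the deploy tooling can stop the rollout and revert instead of waiting for a human.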

5. Redundancy is more than just a nice-to-have, it’s a requirement.

If you’re using a cloud provider, instances can die at any time, which makes redundancy not just a nice-to-have but a requirement for fault-tolerant systems.

[Screenshot: an email from AWS giving notice that an instance will be retired. My favorite email to receive. Thanks, Amazon.]

Sometimes, Amazon is kind enough to give you two weeks’ notice, as above. Other times you are not so lucky; on one occasion, we received such a notice after the instance had already been terminated (and after our PagerDuty and Nagios alarms had told us about it).

6. Prefer battle-tested tools over the “new hotness”.

It can be tempting to reach for the newest tools that promise big new features. For example, CoreOS is driving much of the momentum in the DevOps community around lightweight LXC-style containers, but in my own testing a core component (FleetD) has a highly surprising flaw in how it handles scheduling in certain cases, one that could lead to service interruption. I ended up working around it, but only after hours of post-mortem debugging at 2:00 AM. New technologies frequently have failure modes that are unknown to you and unknown to their authors.

Newer tools also tend to have immature facilities for running in production. For example, Golang lacks an official debugger, and until a few months ago there was no debugger even in the open source community. Go’s runtime tracing and monitoring facilities don’t come close to Java’s JMX or Erlang’s.

Thoughts? What’s your favorite uptime-increasing trick?