Some of the most useful lessons in life come out of failures, and building web properties that can handle service outages and large bursts of traffic is no exception. Here are two patterns I’ve picked up that work really well in these scenarios.

ASYNCHRONOUS QUEUES

The pattern described here comes from watching the work of folks like Scott VanDenPlas and Nick Leeper. It’s become a standard part of my toolkit and one I recommend to clients a lot with good success.

One of the big stress points in any large-scale web property is its database, especially when you’re dealing with a relational database like MySQL. Generally speaking, scaling up database read operations is easy, but scaling up writes can be challenging. There are always tradeoffs in managing a high volume of writes. What if, instead of writing directly to the database when we have new data, we put that data into a queue to be added later? Bear in mind that we’re generally talking about milliseconds later, though one of the benefits of this structure is being able to pause new writes when needed.

SETTING UP A QUEUE

The first part of this system is the queue itself. There are many systems out there for this, but the one I generally use is Amazon’s Simple Queue Service. SQS is relatively trivial to set up. Your service takes a data payload, which is arbitrary text, and saves it to SQS. This can be any format that makes sense for your application; I generally use JSON data.

Let’s say, for instance, you’re running a messaging service like Twitter. When a user writes a new message, your API takes that post and its associated metadata and pops it into the queue. Now, that new message isn’t immediately available for viewing, but that’s OK. It will be momentarily, once a worker processes it.
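As a rough sketch of that API step, here’s what accepting a post and pushing it onto SQS might look like in Python with boto3 (the AWS SDK). The queue URL, field names, and `enqueue_message` helper are all hypothetical, just illustrating the shape of the pattern:

```python
import json

# Assumes boto3 is installed and AWS credentials are configured.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/new-messages"  # hypothetical

def build_payload(user_id, text):
    """Serialize a new post and its metadata as the arbitrary-text queue payload."""
    return json.dumps({"user_id": user_id, "text": text})

def enqueue_message(sqs_client, user_id, text):
    """Push the post onto SQS instead of writing it to the database directly."""
    sqs_client.send_message(QueueUrl=QUEUE_URL, MessageBody=build_payload(user_id, text))

# Usage (hypothetical):
#   import boto3
#   enqueue_message(boto3.client("sqs"), user_id=42, text="Hello, world")
```

The API returns to the user as soon as `send_message` succeeds; the database write happens later, on someone else’s clock.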

WORKER SCRIPTS

Workers are where the work really gets done. They’re minimal scripts, written in whatever language works best for you, that grab items off the queue and do something with the data. In our example scenario, a worker requests new items from SQS, sees a new message and writes it to the database. Then it waits for more new messages. In an ideal world, that happens nearly instantaneously, but the real power here is in what happens when the conditions are less than ideal.
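A minimal worker might look like the following sketch, again assuming boto3 and a hypothetical `db.insert_post` helper. One detail worth noting: SQS messages are deleted only after the database write succeeds, so if a worker crashes mid-write, the message becomes visible again and gets retried:

```python
import json

def process_message(db, body):
    """Parse one queue payload and write it to the database."""
    post = json.loads(body)
    db.insert_post(post)  # hypothetical database helper
    return post

def run_worker(sqs_client, db, queue_url):
    """Long-poll SQS for new items; delete each message only after a successful write."""
    while True:
        resp = sqs_client.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            process_message(db, msg["Body"])
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
```

The `WaitTimeSeconds=20` enables long polling, so an idle worker isn’t burning requests when the queue is empty.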

WHEN THE WORLD COMES CRASHING DOWN

Let’s say everything in your queued system is working as expected. Users post new messages, a worker script sees those and puts them in the database. Then your database crashes. In a direct-write system, your service is offline and users are seeing service errors. But in our queued system, users keep posting and the queue keeps receiving data. You put up a message to users that new posts are delayed because of a problem, then you get to work fixing the database.

Step one is to pause the workers. You don’t want them hammering an already suffering database. Then you fix whatever issue took the DB server down. Now, you begin bringing the workers back online. As your system recovers, the messages in the queue get put in the database where they belong. No data was lost, and that’s the whole point here.
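The pause mechanism can be as simple as a shared flag the workers check before each poll. Here’s one way to sketch it, using a flag file an operator touches or removes (the path and loop structure are illustrative assumptions, not a prescribed design; a Redis key or feature flag works just as well):

```python
import os
import time

PAUSE_FILE = "/tmp/workers.paused"  # hypothetical shared flag file

def workers_paused(flag_path=PAUSE_FILE):
    """True while an operator has the pause flag in place."""
    return os.path.exists(flag_path)

def worker_loop(fetch_and_process):
    """Skip queue polls while paused; messages simply accumulate in SQS."""
    while True:
        if workers_paused():
            time.sleep(5)  # back off instead of hammering the recovering database
            continue
        fetch_and_process()
```

While the flag is set, nothing is lost: the queue keeps absorbing new posts, and the workers drain the backlog once you resume them.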

This happened one night at the Obama campaign. The database that handled donation transactions was nearing its limits. Our database administrator Jay told us that we had a few minutes to act before it went down hard. We needed to move to a larger server, which would take only a few minutes. But this night was the vice-presidential debate. We were already handling lots of donations, and the debate was about to end, meaning we’d be sending a full-list fundraising email. This was about to be a costly disaster.

But our payment system had implemented the same kind of queue we’ve been talking about. We were fairly certain we could do the maintenance without any significant risk. Jay took the database offline and the DevOps team got the new server booting up. Those felt like the longest eight minutes of my life, and I was merely watching. But everything worked as architected: the new server came online, we set the queue workers loose on the accumulated data, and within a few minutes they’d caught up. Crisis averted.

STATIC SITES ARE SIMPLE TO SCALE

Just a few days earlier, on the first afternoon of the Democratic National Convention, the service we were using to power our live blog went down. We needed a better solution, and we needed it fast. That night I put together a simple Django site that served as the CMS for a new live blog. Instead of having our JavaScript query the Django system directly, I built the CMS to copy out all the data to a static JSON file served on Amazon’s Simple Storage Service. It worked like a charm. S3 cost us much less than scaling up a whole server system would have, and it never blinked under the load. We ran a single EC2 micro instance for the admin site and that was it.
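The publish step in that pattern is small. A sketch of what the CMS might do on every save, assuming boto3 and hypothetical bucket, key, and helper names:

```python
import json

BUCKET = "liveblog-static"    # hypothetical S3 bucket
KEY = "liveblog/latest.json"  # the file the front-end JavaScript polls

def render_feed(posts):
    """Flatten the current CMS entries into the JSON document browsers fetch."""
    return json.dumps({"posts": posts}, separators=(",", ":"))

def publish_feed(s3_client, posts):
    """Copy the live-blog state out to S3 so readers never touch the Django app."""
    s3_client.put_object(
        Bucket=BUCKET,
        Key=KEY,
        Body=render_feed(posts).encode("utf-8"),
        ContentType="application/json",
    )
```

Every reader hits S3, a service built to serve files at enormous scale; the Django app only ever sees the handful of editors using the admin.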

WHATEVER CAN BE STATIC SHOULD BE

This wasn’t the first time we’d relied on S3 to reduce our footprint. Earlier in the year we’d moved our donate pages to Jekyll and seen significant improvements in page speed, conversion rates and our ability to test new ideas. Kyle Rush has covered this transition in detail on his blog and in a talk at Velocity. Not only were we using S3 to make hosting simple, but we put a CDN in front of that for even better performance.

We used S3 for our static assets like CSS and JavaScript for the entirety of the campaign, which were obvious choices, and we continued to find more ways to leverage static files. When we needed to use data from a vendor to display on a page, we’d set up a worker script running on a cron job (a way to schedule a computer to do work at regular intervals) to copy that data over to a JSON file on S3. This took the load off of the vendor and gave us confidence that the data would always be there. When we decided to test whether using a person’s ZIP code to pre-fill their city and state on a form could improve conversions, we built out 38,000 separate JSON files to handle the data lookups. You can grab those files on my GitHub page.
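Generating those per-ZIP lookup files is a one-time batch job. A sketch of the idea (the function name, data shape, and output layout are assumptions for illustration): write one tiny JSON file per ZIP code, so the browser’s lookup is just a static GET for `60601.json` with no server-side code at all.

```python
import json
import os

def write_zip_lookups(zip_to_city_state, out_dir):
    """Write one small JSON file per ZIP so lookups become plain static GETs."""
    os.makedirs(out_dir, exist_ok=True)
    for zip_code, (city, state) in zip_to_city_state.items():
        path = os.path.join(out_dir, f"{zip_code}.json")
        with open(path, "w") as f:
            json.dump({"city": city, "state": state}, f)
```

Sync the output directory to S3 and the form’s JavaScript can fetch `/{zip}.json` directly; a miss is just a 404 the page ignores.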

Get Out the Vote alert on BarackObama.com

This kind of thinking permeated our teams. On Election Day, everything that we could make static was static. The banners (pictured above) that alerted folks to changes in voting guidelines or let them know where to go vote followed the same CMS -> JSON pattern that our live blog used, which also powered the homepage for the day. As Scott put it, we weren’t necessarily revolutionary, we were just making lots of smart decisions every day.


Developing a culture that understands where systems can fail, and testing for those failures, will help you build resilient web apps that can handle most situations. If your project serves a smaller audience, these suggestions may seem like overkill, but they can reduce costs and make you ready for the day when the Internet decides to come crashing at your door.

I’ll be giving a related talk on front-end performance as part of the Code & Creativity series on October 1st. For more information, check out the Eventbrite page.