Some of the most useful lessons in life come out of failures, and that's certainly true of building web properties that have to handle service outages and large bursts of traffic. Here are two patterns I've picked up that work really well in those scenarios.
The patterns described here come from watching the work of folks like Scott VanDenPlas and Nick Leeper. They've become a standard part of my toolkit and ones I often recommend to clients, with good results.
One of the big stress points in any large-scale web property is its database, especially a relational database like MySQL. Generally speaking, scaling up database reads is easy, but scaling up writes is hard, and there are always tradeoffs in handling a high volume of writes. What if, instead of writing directly to the database when we have new data, we put that data into a queue to be written later? Bear in mind that "later" generally means milliseconds later, though one of the benefits of this structure is being able to pause new writes when needed.
SETTING UP A QUEUE
The first part of this system is the queue itself. There are many systems out there for this, but the one I generally use is Amazon’s Simple Queue Service. SQS is relatively trivial to set up. Your service takes a data payload, which is arbitrary text, and saves it to SQS. This can be any format that makes sense for your application; I generally use JSON data.
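As a rough sketch of that enqueue step, assuming Python and boto3 (the queue URL, field names, and helper names here are my own illustration, not a prescribed schema):

```python
import json


def build_payload(user_id, text):
    """Serialize a new message as the JSON payload we'll drop on the queue."""
    return json.dumps({"user_id": user_id, "text": text})


def enqueue_message(queue_url, user_id, text):
    """Hand the payload to SQS; note that nothing touches the database here."""
    import boto3  # AWS SDK for Python; imported here so the pure parts run without it

    sqs = boto3.client("sqs")
    sqs.send_message(QueueUrl=queue_url, MessageBody=build_payload(user_id, text))
```

Because the payload is arbitrary text, the JSON structure is entirely up to you; the only contract is that your workers know how to decode it.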
Let’s say, for instance, you’re running a messaging service like Twitter. When a user writes a new message, your API takes that post and its associated metadata and pops it into the queue. Now, that new message isn’t immediately available for viewing, but that’s OK. It will be momentarily, once a worker processes it.
Workers are where the work really gets done. They’re minimal scripts, written in whatever language works best for you, that grab items off the queue and do something with the data. In our example scenario, a worker requests new items from SQS, sees a new message and writes it to the database. Then it waits for more new messages. In an ideal world, that happens nearly instantaneously, but the real power here is in what happens when the conditions are less than ideal.
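A minimal worker along those lines might look like this sketch (again assuming Python and boto3; `write_to_database` is a stand-in for whatever your storage layer actually does):

```python
import json


def parse_message(body):
    """Decode the JSON payload a producer put on the queue."""
    return json.loads(body)


def write_to_database(record):
    """Placeholder for your actual storage layer."""
    raise NotImplementedError


def run_worker(queue_url):
    import boto3  # imported locally so the parsing logic is testable without AWS

    sqs = boto3.client("sqs")
    while True:
        # Long polling: wait up to 20s for messages instead of busy-looping.
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            write_to_database(parse_message(msg["Body"]))
            # Delete only after a successful write; if the worker crashes
            # mid-process, SQS redelivers the message once its visibility
            # timeout expires.
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
```

One design note: SQS delivers messages at least once, not exactly once, so the database write should be idempotent (for example, keyed on a message ID generated at enqueue time).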
WHEN THE WORLD COMES CRASHING DOWN
Let’s say everything in your queued system is working as expected. Users post new messages, a worker script sees those and puts them in the database. Then your database crashes. In a direct-write system, your service is offline and users are seeing service errors. But in our queued system, users keep posting and the queue keeps receiving data. You put up a message to users that new posts are delayed because of a problem, then you get to work fixing the database.
Step one is to pause the workers. You don't want them hammering an already suffering database. Then you fix whatever issue took the database server down and begin bringing the workers back online. As your system recovers, the messages in the queue get written to the database where they belong. No data has been lost, and that's the whole point.
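The pause itself can be as simple as a shared flag every worker checks before each poll. Here's one sketch, assuming a flag file on disk (the path is hypothetical, and a shared config store or feature-flag service works the same way):

```python
import os
import time


def workers_paused(flag_path="/etc/myapp/workers.paused"):  # hypothetical path
    """Ops pauses the fleet by creating this file, and resumes by removing it."""
    return os.path.exists(flag_path)


def worker_loop(poll_once, flag_path="/etc/myapp/workers.paused"):
    """Run poll_once repeatedly, backing off whenever the pause flag is set."""
    while True:
        if workers_paused(flag_path):
            time.sleep(5)  # back off and leave the struggling database alone
            continue
        poll_once()
```

Meanwhile the queue keeps absorbing new messages, so pausing the workers costs you latency, not data.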
This happened one night at the Obama campaign. The database that handled donation transactions was nearing its limits. Our database administrator, Jay, told us we had a few minutes to act before it went down hard. We needed to move to a larger server, which would take only a few minutes. But this was the night of the vice-presidential debate: we were already handling lots of donations, and when the debate ended we'd be sending a full-list fundraising email. This was shaping up to be a costly disaster.
But our payment system had implemented the same kind of queue we've been talking about, and we were fairly certain we could do the maintenance without any significant risk. Jay took the database offline and the DevOps team got the new server booting up. Those were the longest eight minutes of my life, and I was merely watching, but everything worked as architected. The new server came online, we set the queue workers loose on the backlog, and within a few minutes they'd caught up. Crisis averted.
STATIC SITES ARE SIMPLE TO SCALE
WHATEVER CAN BE STATIC SHOULD BE
This wasn’t the first time we’d relied on S3 to reduce our footprint. Earlier in the year we’d moved our donate pages to Jekyll and seen significant improvements in page speed, conversion rates and our ability to test new ideas. Kyle Rush has covered this transition in detail on his blog and in a talk at Velocity. Not only were we using S3 to make hosting simple, but we put a CDN in front of that for even better performance.
This kind of thinking permeated our teams. On Election Day, everything that we could make static was static. The banners (pictured above) that alerted folks to changes in voting guidelines or let them know where to go vote followed the same CMS -> JSON pattern that our live blog used, which also powered the homepage for the day. As Scott put it, we weren't necessarily revolutionary, we were just making lots of smart decisions every day.
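The CMS -> JSON pattern boils down to flattening editor-managed records into a static document you can publish to S3 and let the page poll. A sketch, with field names that are purely illustrative rather than the campaign's actual schema:

```python
import json


def export_banners(records):
    """Turn CMS banner rows into the JSON document the static page fetches."""
    doc = {"banners": [{"id": r["id"], "message": r["message"]} for r in records]}
    return json.dumps(doc, separators=(",", ":"))  # compact output for the wire
```

The resulting string gets re-uploaded on every publish, so the "dynamic" banner is served as cheaply as any other static asset, with no application server in the request path.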
Developing a culture that understands where systems can fail, and testing for those failures, will help you build resilient web apps that can handle most situations. If your project serves a smaller audience, these suggestions may seem like overkill, but they can reduce costs and make you ready for the day when the Internet decides to come crashing at your door.
I’ll be giving a related talk on front-end performance as part of the Code & Creativity series on October 1st. For more information, check out the Eventbrite page.