When I joined AdStage in the Fall of 2013 we were already running on Heroku. It was the obvious choice: super easy to get started with, less expensive than full-sized virtual servers, and flexible enough to grow with our business. And grow we did. Heroku let us focus exclusively on building a compelling product without the distraction of managing infrastructure, so by late 2015 we were running thousands of dynos (containers) simultaneously to keep up with our customers.
We needed all those dynos because, on the backend, we look a lot like Segment, and like them many of our costs scale linearly with the number of users. At $25/dyno/month, our growth projections put us breaking $1 million in annual infrastructure expenses by mid-2016 when factored in with other technical costs, and that made up such a large proportion of COGS that it would take years to reach profitability. The situation was, to be frank, unsustainable. The engineering team met to discuss our options, and some quick calculations showed us we were paying more than $10,000 a month for the convenience of Heroku over what similar resources would cost directly on AWS. That was enough to justify an engineer working full-time on infrastructure if we migrated off Heroku, so I was tasked with becoming our first Head of Operations and spearheading our migration to AWS.
It was good timing, too, because Heroku had become our biggest constraint. Our engineering team had adopted a Kanban approach, so ideally we would have a constant flow of stories moving from conception to completion. At the time, though, we were generating lots of work-in-progress that routinely clogged our release pipeline. Work was slow to move through QA and often got sent back for bug fixes. Too often things “worked on my machine” but would fail when exposed to our staging environment. Because AdStage is a complex mix of interdependent services written on different tech stacks, it was hard for each developer to keep their workstation up-to-date with production, and this also made deploying to staging and production a slow process requiring lots of manual intervention. We had little choice in the matter, though, because we had to deploy each service as its own Heroku application, limiting our opportunities for automation. We desperately needed to find an alternative that would permit us to automate deployments and give developers earlier access to reliable test environments.
So in addition to cutting costs by moving off Heroku, we also needed to clear the QA constraint. I otherwise had free rein in designing our AWS deployment so long as it ran all our existing services with minimal code changes, but I added several desiderata:
- Simple system administration: I’d worked with tools like Chef before and wanted to avoid the error-prone process of frequently rebuilding systems from scratch. I wanted to update machines by logging into them and running commands.
- Boring: I wanted to use “boring” technology known to work rather than try something new and deal with its issues. I wanted to concentrate our risk in our business logic, not in our infrastructure.
- Zero downtime: Deploying on Heroku tended to cause our users to experience “blips” due to some user requests taking longer to run than Heroku allowed for connection draining. I wanted to be able to eliminate those blips.
- Rollbacks: If something went wrong with a deploy I wanted to be able to back out of it and restore service with the last known working version.
- Limited complexity: I was going to be the only person building and maintaining our infrastructure full-time, so I needed to scope the project to fit.
Knowing that Netflix managed to run its billion-dollar business on AWS with nothing fancier than Amazon machine images and autoscaling groups, I decided to follow their reliable but by no means “sexy” approach: build a machine image, use it to create instances in autoscaling groups, put those behind elastic load balancers, and connect the load balancers to DNS records that would make them accessible to our customers and each other.
Thus I set out to build our AWS deployment strategy.
Becoming an AWS Sumo
When I’m engineering a system, I like to spend a lot of time up front thinking things through and testing assumptions before committing to a design. Rich Hickey calls this hammock driven development.
Our office doesn’t have a hammock, so I used our Sumo lounger instead.
Over the course of a couple months in the Spring of 2016 I thought and thought and put together the foundations of our AWS deployment system. Its architecture looks something like this:
At its core is what we call the AdStage unified image. This machine image is used to create instances for all services in all environments, from development and test to staging and production. On it are copies of all our repos and the dependencies needed to run them. Depending on the values of a few instance tags, the instance can come up in different modes to reflect its usage.
When an instance comes up in “review” mode, for example, all the services and their dependent databases run together on that instance and talk to each other. This lets engineers doing development and QA access an isolated version of our full stack running any arbitrary version of our code. Whatever they do on these review boxes doesn’t affect staging or production and doesn’t interact with other review boxes, completely eliminating our old QA/staging constraint. And as an added bonus, as soon as a review box passes QA, it can be imaged and that image can be deployed into production.
That works because when an instance starts in “staging” or “production” mode it’s also told what service it should run. This is determined by tags the instance inherits from its autoscaling group, letting us bring up fleets of instances running the same code that spread out the load from our customers. For autoscaling groups serving web requests, they are connected to elastic load balancers that distribute requests evenly among our servers. The load balancers give us a fixed point we can smoothly swap out instances under, enabling zero-downtime deployments, and make rollbacks as easy as keeping old versions of the unified image on standby ready to swap in.
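The tag-driven startup described above could be sketched roughly as follows. This is an illustration only, not AdStage's actual boot code: the tag keys `mode` and `service` and the service names are hypothetical, since the post doesn't name them.

```ruby
# Hypothetical service names; the post does not list the real ones.
ALL_SERVICES = %w[web api importer].freeze

# Decide what a freshly booted instance should run, based on its tags.
# The tag keys "mode" and "service" are assumptions for illustration.
def boot_plan(tags)
  mode = tags.fetch("mode")
  case mode
  when "review"
    # Review boxes run the whole stack, databases included, on one instance.
    { services: ALL_SERVICES, local_databases: true }
  when "staging", "production"
    # Fleet instances run one service, named by a tag inherited from
    # the autoscaling group.
    service = tags.fetch("service")
    raise ArgumentError, "unknown service: #{service}" unless ALL_SERVICES.include?(service)
    { services: [service], local_databases: false }
  else
    raise ArgumentError, "unknown mode: #{mode}"
  end
end

p boot_plan({ "mode" => "review" })
p boot_plan({ "mode" => "production", "service" => "importer" })
```

Keeping the decision in one small function means the same image can serve every environment, with only the tags differing between a review box and a production fleet.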
The AWS resources we use don’t fully coordinate themselves, though, so we wrote a Ruby Thor app that uses the AWS Ruby SDK to do that. It takes care of starting review boxes, building images, and then deploying those images into staging and production environments. It automatically verifies that deploys are working before switching the load balancers over to new versions and will suggest a rollback if it detects an issue after a deploy completes. It also uses some clever tricks to coordinate multiple deploys and lock key resources to prevent multiple engineers from corrupting each other’s deploys, so anyone can start a deploy, although they’ll be stopped if it would cause a conflict.
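A minimal sketch of that flow (lock, verify, switch, roll back) might look like the following. It is a simulation in plain Ruby rather than the real Thor app: `DeployLock`, `LoadBalancer`, `deploy`, and `rollback` are hypothetical names, the lock is in-memory only (a real tool would need a lock every engineer's machine can see), and no AWS SDK calls are made.

```ruby
# In-memory stand-in for a shared lock. The real locking mechanism is not
# described in the post; this only illustrates the behaviour.
class DeployLock
  def initialize
    @holders = {}
    @mutex = Mutex.new
  end

  # True if `owner` now holds `resource`; false if someone else holds it.
  def acquire(resource, owner)
    @mutex.synchronize do
      return false if @holders.key?(resource) && @holders[resource] != owner
      @holders[resource] = owner
      true
    end
  end

  def release(resource, owner)
    @mutex.synchronize { @holders.delete(resource) if @holders[resource] == owner }
  end
end

# Hypothetical load balancer model: it only remembers which image is live.
LoadBalancer = Struct.new(:live_image, :previous_image)

# Deploy `new_image` behind `lb`. `healthy` is any callable that checks the
# new fleet before traffic is switched over to it.
def deploy(lb, new_image, owner:, lock:, healthy:)
  return :locked unless lock.acquire(:production, owner)
  begin
    return :failed_verification unless healthy.call(new_image)
    lb.previous_image = lb.live_image
    lb.live_image = new_image
    :deployed
  ensure
    lock.release(:production, owner)
  end
end

# Swap the last known working image back in.
def rollback(lb)
  raise "no previous image to roll back to" if lb.previous_image.nil?
  lb.live_image, lb.previous_image = lb.previous_image, lb.live_image
end
```

For example, `deploy(lb, "image-v2", owner: "alice", lock: lock, healthy: ->(_image) { true })` returns `:deployed` and leaves `"image-v1"` on standby, while a second engineer deploying at the same moment would get `:locked` instead of clobbering the first deploy.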
That covered all our desiderata: imaging instances allowed easy system administration, the setup was boring and widely used, there was zero downtime inherent in the deployment process, deployment was automated with support for rollbacks, and it wasn’t very complex at fewer than 1,500 lines of code all in. And since it solved the QA constraint and by our estimates would save over $10k a month in operating expenses, all that remained was to plan the live migration from Heroku to AWS.
A Live Migration
July of 2016 was typical for San Francisco. Most days the fog and frigid air kept me working inside while, across the street from our office, unprepared tourists shivered in shorts as they snapped selfies at Dragon’s Gate. It was just as well, because everything was set to migrate from Heroku to AWS, and we had a helluva lot of work to do.
Our customers depend on us to manage their ad campaigns, automate their ad spend, and report on their ad performance. When we’re down they get thrown back into the dark ages of manually creating and updating ads directly through the networks’ interfaces. They couldn’t afford for us to go offline while we switched over to AWS, so we were going to have to do the migration live. Or at least as live as reasonable.
We implemented a 1-week code freeze and found a 1-hour window on a Saturday morning when AdStage would go into maintenance mode while I switched databases and other services that couldn’t easily be moved while running. In preparation we had already performed migrations of our staging systems and written a playbook that I would use to cut over production. I used the code freeze to spend a week tweaking the AWS deployment to match the Heroku deployment. All seemed fine on Saturday morning. We went down, I cut over the databases, and then brought AdStage back up. I spent the day watching monitors and staying close to the keyboard in case anything went wrong, but nothing did. I went to sleep that night thinking everything was in a good state.
After a lazy Sunday morning, I started to get some alerts in the afternoon that our importers were backing up. As we looked into the issue the problem quickly became apparent: we somehow had less CPU on AWS than on Heroku despite having nominally more compute resources. As a result we couldn’t keep up, and every hour we got further and further behind. We had to decrease the frequency of our imports just to keep the queues from overflowing, and we ultimately had to switch back on our Heroku apps to run alongside AWS to keep up with the workload. It was the opposite of saving money.
What we figured out was that Heroku had been telling us a happy lie. We were officially only getting about 2 ECUs per dyno, but the reality was that we were getting something closer to 6 since our neighbors on Heroku were not using their full share. This meant that our fleet of AWS instances was 3 times too small, and in spite of everything Heroku was actually cheaper! If only there were a way to pay less for more instances…
That’s when we hit upon using spot instances. We’d thought about using them because they’re about 1/10th the price of on-demand, but they come with some risk because they can be terminated at any time if the spot market price rises above your bid price. Luckily, this doesn’t happen very often, and autoscaling groups otherwise take care of the complexity of managing spot instances for you. Plus it was easy to have backup autoscaling groups that use on-demand instances sitting in the wings, ready to be scaled up by a single command if we temporarily couldn’t get enough spot instances to meet our needs. We ultimately were able to convert about 80% of our fleet to spot instances, getting our costs back down within expected targets despite using 3 times more resources than originally expected.
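The numbers above can be sanity-checked with a quick back-of-the-envelope calculation. This is only a sketch: it uses the rounded figures from the text and assumes the original plan was priced entirely at on-demand rates, which the post does not state explicitly.

```ruby
# Rounded figures from the text; treat them as rough estimates.
ASSUMED_ECU_PER_DYNO = 2.0  # nominal compute per dyno
ACTUAL_ECU_PER_DYNO  = 6.0  # roughly what each dyno really delivered
SPOT_PRICE_RATIO     = 0.1  # spot at about 1/10th the on-demand price
SPOT_FLEET_SHARE     = 0.8  # about 80% of the fleet moved to spot

# How many times larger the fleet must be than originally planned.
def fleet_multiplier(assumed_ecu, actual_ecu)
  actual_ecu / assumed_ecu
end

# Cost relative to the original plan (1.0 = the planned budget), assuming
# that plan was all on-demand.
def relative_cost(multiplier, spot_share, spot_ratio)
  multiplier * (spot_share * spot_ratio + (1 - spot_share))
end

multiplier = fleet_multiplier(ASSUMED_ECU_PER_DYNO, ACTUAL_ECU_PER_DYNO)
cost = relative_cost(multiplier, SPOT_FLEET_SHARE, SPOT_PRICE_RATIO)

puts format("fleet multiplier: %.1fx", multiplier)   # fleet multiplier: 3.0x
puts format("cost vs. original plan: %.2fx", cost)   # cost vs. original plan: 0.84x
```

Under those assumptions, a fleet three times larger with 80% of it on spot comes out at roughly 84% of the originally planned spend, which is consistent with costs landing back within target.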
Aside from our surprise underestimation of capacity, switching from Heroku to AWS went smoothly. Don’t get me wrong, though: it’s something that was worth doing only because we had reached a scale where the economics of bringing some of our infrastructure operations in house made sense. If we weren’t spending at least one engineer’s salary worth of money on opex that could be saved by switching to AWS and if infrastructure hadn’t become a core competency, we would have stuck with Heroku and had that person (me!) work on things more essential to our business. It was only because of changing economics and processes that migrating from Heroku to AWS became part of our story.
Interested in AdStage becoming part of your story? We’re hiring!
There is now a part 2.