How Putting AdStage in a Box Sent Us Off to the Races

Last time on the AdStage engineering blog I explained how we migrated from Heroku to AWS, and much of that explanation covered what our deployment system looks like. A key aspect of that system is our use of machine images — specifically our use of a single machine image for every server in our fleet across all environments. Of this machine image I previously said:

At its core is what we call the AdStage unified image. This machine image is used to create instances for all services in all environments, from development and test to staging and production. On it are copies of all our repos and the dependencies needed to run them. Depending on the values of a few instance tags, the instance can come up in different modes to reflect its usage.
When an instance comes up in “review” mode, for example, all the services and their dependent databases run together on that instance and talk to each other. This lets engineers doing development and QA access an isolated version of our full stack running any arbitrary version of our code. Whatever they do on these review boxes doesn’t affect staging or production and doesn’t interact with other review boxes, completely eliminating our old QA/staging constraint. And as an added bonus, as soon as a review box passes QA, it can be imaged and that image can be deployed into production.
That works because when an instance starts in “staging” or “production” mode it’s also told what service it should run. This is determined by tags the instance inherits from its autoscaling group, letting us bring up fleets of instances running the same code that spread out the load from our customers.

This design implies that the code we run when a server starts must contain much of the complexity of the deploy system, so it deserves a closer look from anyone interested in setting up a deployment environment like ours. Let’s look at how we execute that code, how it’s controlled, and the process it goes through to eventually run our applications.

On Your Mark

When an instance starts we need a way to tell it what to do. AWS encourages the use of cloud-init, a simple service that runs a script at the end of startup, by preconfiguring it in many of their officially supported images to automatically execute an instance’s user data as a script. The advantage of this method is that you can constantly update your autoscaling groups to use the latest images from Amazon without having to do much system maintenance yourself, but at the cost of needing to set up machines from scratch every time you start one. That seemed like a reasonable tradeoff to us at first, but as our cloud-init script grew it quickly became difficult to work with.

The trouble is that to fully test your cloud-init script you have to start a new instance. For every change you make during the development cycle you’re looking at 2 to 5 minutes for an instance to start, plus however long your script takes to run, and only then can you log in and check whether the script did what you expected. As we put more and more of our code on the box, we had to install more and more dependencies during every startup, and pretty soon it was taking 30 minutes to get a review box all the way up. Even if the cloud-init script worked flawlessly, this was much longer than we were willing to wait for an instance to become operational, whether for review or production, so we needed another solution.

The simplest alternative was to not install dependencies every time. Instead, we could have all that stuff already loaded on the image by maintaining our own custom image rather than using one of Amazon’s. Yes, this meant a bit more work on our end to keep the image up-to-date with patches, but in exchange we cut our startup times dramatically and gained a lot more control over our run environment. Once we made this change we realized there wasn’t a lot of reason to keep using cloud-init because we could store our startup scripts in the image rather than in the user data, so we also gave up cloud-init and switched to normal init scripts using runit.

Runit is a drop-in replacement for SysV init, similar to daemontools, but with numerous enhancements. It adds support for supervised daemon processes among other things, which means that rather than writing clunky SysV-style init scripts that have to manage everything themselves, you get tools to help your daemon processes restart themselves, start their dependencies, manage logs, and respond to signals. It took about an hour to perform the initial conversion to runit, and that made it possible to rerun the startup script without restarting the instance.

Or at least it did in theory, because to get all the way to restartability the script needed to cleanly shut itself down, and to do that we’d have to make much of what was implicit about AdStage’s run environment explicit.

Get Set

To understand how we did this, it helps to know a bit about the history of AdStage. In production we ran on Heroku, and since Heroku is containerized and service oriented, it gave us process isolation, required services to communicate over remote sockets, and encouraged stateless execution. In development, however, we ran the full AdStage stack on our laptops where none of those features were enforced, so to avoid cries of “but it works on my machine”, we had to be careful about how we ran our services locally. This forced us to design an architecture able to leverage the dynamic scalability of the cloud while carefully ensuring the kind of isolation needed to run on mainframes and minicomputers.

So the good news was that we already knew how to run all of AdStage on a single box in a way that mimicked the way AdStage ran in production. The trouble was that we either depended on Heroku to provide isolation via containers or on our engineers being smart enough with their local configs to not let our services crash into each other. Our new system couldn’t use either of these methods, so it was time to write some code!

We could have used control groups or containers to achieve this, but because our services were capable of running side-by-side with only process isolation, we decided to eschew the extra complexity this would introduce and manage the processes directly. We began by creating isolated language runtime environments for each service. Since we did this in development using RVM and NVM, on our AWS instances we started out doing the same. After we added Elixir to our stack, we switched to ASDF because it supports more languages.

Next we needed to set environment variables because, following the 12-factor app guidelines, our configuration depends on them. Setting environment variables on a per-process basis is as simple as prefixing the command with VAR="VAL", and using the chpst utility included with runit we were able to load them from directories, where each file is named for a variable and holds its value. This let us manage our configuration in a git repo, giving us a detailed history of our configs. We were also able to nest chpst -e [dir] calls inside each other so we could have a hierarchy of configs, with inner directories overriding outer ones, allowing us to share common configs across environments and services. The biggest challenge we faced here was copying our configs from Heroku without making any mistakes.
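
To make the layering concrete, here is a minimal Python sketch of how nested chpst -e calls resolve. It is an illustration of the semantics, not the shell we actually run, and it simplifies what chpst does: each file in a directory names a variable, the first line of the file is its value with trailing spaces and tabs removed, an empty file unsets the variable, and later directories override earlier ones. The function names, directory names, and variable names below are all hypothetical.

```python
import os
import tempfile
from pathlib import Path


def load_envdir(directory, base=None):
    """Return a copy of `base` with envdir-style settings from `directory` applied."""
    env = dict(os.environ if base is None else base)
    for entry in sorted(Path(directory).iterdir()):
        # Variable names cannot contain "=", so skip such files (and subdirectories).
        if not entry.is_file() or "=" in entry.name:
            continue
        data = entry.read_bytes()
        env.pop(entry.name, None)
        if not data:
            continue  # An empty file means "unset this variable".
        first_line = data.split(b"\n", 1)[0].decode()
        env[entry.name] = first_line.rstrip(" \t")
    return env


def layered_env(directories, base=None):
    """Apply directories in order, like nesting one chpst -e call inside another."""
    env = dict(os.environ if base is None else base)
    for directory in directories:
        env = load_envdir(directory, env)
    return env


def write_envdir(root, name, values):
    """Demo helper: create a config directory with one file per variable."""
    path = Path(root) / name
    path.mkdir()
    for key, value in values.items():
        (path / key).write_text(value)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical shared config, then a more specific layer on top of it.
        common = write_envdir(root, "common", {"LOG_LEVEL": "info\n", "CACHE_HOST": "localhost\n"})
        review = write_envdir(root, "review", {"LOG_LEVEL": "debug\n", "CACHE_HOST": ""})
        print(layered_env([common, review], base={}))
```

Because each layer is just a directory of small files, a diff in the config repo shows exactly which variable changed for which environment or service.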

Finally we needed a way to handle shutdown. As mentioned, runit has a facility for handling signals, so we used this to add scripts to tell our services how to shut down gracefully. Most of the time that’s just sending TERM to processes, but sometimes more cleanup is needed and this gave us a chance to do it. Combined with signal traps in the wrapper scripts around our application processes, we were able to get a system that could cleanly shut itself down and return the instance to a state where the startup script could run again as if it were on a new instance.
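
The wrapper idea can be sketched as follows. This is a hedged Python illustration of the pattern rather than our actual wrapper scripts, and the function name run_supervised and the cleanup callback are hypothetical: the wrapper traps TERM and INT, forwards them to the application process, waits for it to exit, and then runs whatever extra cleanup is needed so the next start sees a clean slate.

```python
import signal
import subprocess
import sys


def run_supervised(command, cleanup=None):
    """Run `command`, forward TERM/INT to it, and run `cleanup` once it has exited."""
    child = None

    def forward(signum, _frame):
        # Pass the signal on to the application instead of dying and orphaning it.
        if child is not None and child.poll() is None:
            child.send_signal(signum)

    # Install the traps before starting the child so no signal is handled by default.
    handled = (signal.SIGTERM, signal.SIGINT)
    previous = {sig: signal.signal(sig, forward) for sig in handled}
    try:
        child = subprocess.Popen(command)
        return child.wait()
    finally:
        # Restore the original handlers, then do any extra cleanup.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
        if cleanup is not None:
            cleanup()


if __name__ == "__main__":
    # Hypothetical usage: wrap a short-lived command and report when cleanup runs.
    code = run_supervised(
        [sys.executable, "-c", "print('service running')"],
        cleanup=lambda: print("cleanup done"),
    )
    sys.exit(code)
```

A real wrapper would likely also escalate to KILL if the process ignores TERM for too long; that is left out here to keep the sketch short.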

Go!

That got us to a point where we could run all of AdStage on an instance. Running code in production would then just be a matter of telling an instance to only run one service instead of all of them. But how would we do that? We probably could have used user data, but our instances don’t need a lot of information to figure out what they are supposed to do: they just need to know if they are review instances that need to run all services or production instances that should run one component within a service. Additionally, since we run many instances configured to start the same way as part of autoscaling groups, it would be nice if it were easy to find instances that were part of the same group by directly inspecting the instances rather than looking them up by autoscaling group. Thus tags were our perfect solution because they are small, easy to work with, and searchable within EC2.

Tags are also easy to read from scripts running on an instance because from every instance you can use the ec2-describe-tags command to get them. They come back as a JSON string you’ll need to parse, but once you do you can load them into variables in your scripts and use them for flow control. It’s then as simple as setting a few tags in our autoscaling group’s launch configs to create a fleet of self-configuring instances.
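
As a rough sketch of that flow control, the Python below parses a tags payload and decides which services to start. Several things here are assumptions for illustration: the JSON shape follows what the AWS CLI's describe-tags call returns (a "Tags" list of Key/Value entries), and the tag keys "mode" and "service" and the service names are hypothetical, not our real ones.

```python
import json

# Hypothetical service names; a real list would come from the repos on the image.
ALL_SERVICES = ("web", "worker", "scheduler")


def parse_tags(raw_json):
    """Turn a describe-tags style JSON string into a plain {key: value} dict."""
    payload = json.loads(raw_json)
    return {tag["Key"]: tag["Value"] for tag in payload.get("Tags", [])}


def services_to_run(tags):
    """Decide what this instance should start based on its tags."""
    mode = tags.get("mode")
    if mode == "review":
        # Review boxes run the whole stack on one instance.
        return list(ALL_SERVICES)
    if mode in ("staging", "production"):
        service = tags.get("service")
        if service not in ALL_SERVICES:
            raise ValueError(f"unknown or missing service tag: {service!r}")
        return [service]
    raise ValueError(f"unknown or missing mode tag: {mode!r}")


if __name__ == "__main__":
    sample = json.dumps({
        "Tags": [
            {"Key": "mode", "Value": "production", "ResourceType": "instance"},
            {"Key": "service", "Value": "worker", "ResourceType": "instance"},
        ]
    })
    print(services_to_run(parse_tags(sample)))
```

Failing loudly on a missing or unrecognized tag is a deliberate choice in this sketch: an instance that guesses its role is harder to debug than one that refuses to start.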

So that’s how we got AdStage to run on a single instance and how it enabled our AMI-based deploy system. In the year since we completed the transition our uptime numbers have soared, fewer bugs slip past QA, and we ship more features per engineer. The only thing holding us back is finding more great talent to join our team. Maybe that’s you?