The 9 Circles of Deployment Hell

Tips for shipping software

Marvin Li
10 min read · Jun 16, 2014

Deploying code is paramount for a developer, because it’s how we bring value to our products and services. As your team and infrastructure grow, shipping code can become either a healthy challenge or an absolute nightmare, depending on the choices you make.

In a former life, I lived in software deployment hell. I heard engineers cry in agony as they released code over and over, not knowing if their build would succeed, or crash and burn. Do not abandon all hope, ye who enter here. Take a journey with me through the nine circles of deployment hell, each one more punishing than the last, and you too can emerge to find salvation! Because the path to paradise begins in hell.

Circles[0] — No Infrastructure Automation

The first circle is the lack of infrastructure automation. Web applications often need specific versions of operating systems, web servers, programs, and libraries to run. They also need databases, load balancers, and other appliances. Collectively, these things are the infrastructure. And they need installation and configuration. This can be done either manually, or via automation.

It’s extremely time-consuming to manually set up application servers. Even worse, the most meticulous humans make mistakes. Tools like Chef and Puppet let you automate your infrastructure by describing it with code. My team at Condé Nast recently rigged up Chef to configure our auto-scaling AWS environment, and used Capistrano to deploy code to it.
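
To make that concrete, the core idea behind these tools is describing the desired state of a server in code and converging the machine toward it. Chef and Puppet each have their own DSL; the following is only a rough Python sketch of the concept, with made-up package names and an assumed Debian-style system, not how either tool actually works.

```python
# Rough sketch of "infrastructure as code": declare what a web server needs,
# then converge the machine toward that state. Chef and Puppet do this far
# more robustly; this only illustrates the idea. Package names are invented.
import subprocess

DESIRED_PACKAGES = ["nginx", "git", "postgresql-client"]  # illustrative
DESIRED_SERVICES = ["nginx"]

def is_installed(pkg):
    # dpkg exits non-zero when the package is missing (Debian/Ubuntu assumed)
    result = subprocess.run(["dpkg", "-s", pkg],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

def converge():
    for pkg in DESIRED_PACKAGES:
        if not is_installed(pkg):
            subprocess.run(["apt-get", "install", "-y", pkg], check=True)
    for svc in DESIRED_SERVICES:
        subprocess.run(["service", svc, "start"], check=True)

if __name__ == "__main__":
    converge()
```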

PaaS offerings like Heroku and Engine Yard come pre-baked with infrastructure automation that you can manage directly from a website or the command line. Google App Engine takes it one step further by managing your resources for you. Of course, there’s no need to automate if you only have a few servers. Once you grow, though, invest in automation. With the plethora of options available today, there’s no reason for infrastructure to slow down your deployment.

Circles[1] — Waiting Too Long to Integrate

“Integrate early and often.” That’s the mantra of continuous integration. The reasoning is simple: If your software already works and never changes, you have little risk of it breaking. Changing your software introduces risk. Really big changes introduce a lot of risk. But making small, incremental, frequent changes is far less risky, and it makes problems easier to isolate.

Before I knew better, I used to be on three-week release cycles. My team and I kept a release branch of the code in production, along with an integration branch in which we checked in new work. But our integration branch was not stable. Meanwhile, production issues popped up, so we would make bug fixes on the release branch. At the end of each release cycle, we would scramble to get everything to work in the integration branch, and then merge the changes into release. But sometimes we would have a bad merge, and the bug fixes made in the release branch would be lost. It was a mess!

By contrast, we now have one master branch, and we integrate code every single day. We make sure master is in production-ready shape. It seldom breaks. We deploy code to production from master multiple times a day. When we work on big, time-consuming features, we will integrate working pieces of it into master before the entire feature is done. The difference is very dramatic! Instead of integrating three weeks’ worth of code, we’re integrating a few hours’ worth at a time.

Circles[2] — No Staging Environment

Without a proper staging environment, things can go haywire faster than you can say, “Let us descend into the blind world.” Sometimes, a feature is tested on a developer’s local environment and works well, but then it ends up behaving very poorly on production. There are bound to be differences between your dev and production environments.

For instance, your production servers may be behind a load balancer, but you have a single VM in your local environment. You may have different cache settings or an HTTP reverse proxy like Varnish in production. All of these differences can potentially introduce unexpected behavior.

If you have staging set up similarly to production, you can deploy there first as a sanity check. Our staging environment is a mini version of our production environment. We have fewer servers in staging, but they’re configured the same way. Occasionally, we ramp up staging to match production when we want to perform a load test to find the breaking point of our production environment.

Circles[3] — No Performance Testing

Load tests are an important part of performance testing. When you have lots of users or a resource-intensive web application, you have to start measuring performance. Run load tests whenever you introduce major refactoring or new resource-intensive features.

We have a script that downloads and parses our production server log files and generates a list of requests. Once you have a list of requests, you can use a tool like JMeter or LoadRunner to play them back in your staging environment. You set an SLA for yourself, perhaps a 50 ms response time. Just keep generating more traffic against your environment until you exceed your SLA. At that point, you will know your app’s maximum throughput.
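
Our own script isn’t reproduced here, but a stripped-down version of the idea might look like the Python sketch below: pull request paths out of an access log, then replay them against staging and time each response. The log format, staging host, and 50 ms SLA are assumptions for illustration; in practice you’d hand the request list to JMeter or LoadRunner to generate real load.

```python
# Sketch of log-driven load testing: parse an access log into request paths,
# then replay them against staging and time each response.
# The staging host and SLA are illustrative assumptions.
import time
import requests

STAGING = "https://staging.example.com"  # hypothetical staging host
SLA_SECONDS = 0.050                      # 50 ms response-time target

def parse_requests(log_path):
    """Pull GET request paths out of a combined-log-format access log."""
    paths = []
    with open(log_path) as log:
        for line in log:
            parts = line.split('"')
            if len(parts) < 2:
                continue
            fields = parts[1].split()
            if len(fields) >= 2 and fields[0] == "GET":
                paths.append(fields[1])
    return paths

def replay(paths):
    over_sla = 0
    for path in paths:
        start = time.time()
        requests.get(STAGING + path, timeout=5)
        if time.time() - start > SLA_SECONDS:
            over_sla += 1
    print(f"{over_sla} of {len(paths)} requests exceeded the SLA")

if __name__ == "__main__":
    replay(parse_requests("access.log"))
```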

If you can’t meet your expected traffic and maintain your SLA, go back to the drawing board and optimize more. To help you do this, there are great performance profilers that you can use. These tools measure the execution time of various chunks of your code, and they show you the hot spots where you should focus your energy.
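
As one example, Python ships with cProfile, which can show where the time in a slow code path actually goes. The function below is a hypothetical stand-in for whatever endpoint is slow.

```python
# Profile a suspect code path and print the ten slowest calls by cumulative
# time. render_homepage() is a hypothetical stand-in for a slow endpoint.
import cProfile
import pstats

def render_homepage():
    # imagine your real view, template, and query code here
    return sum(i * i for i in range(1000000))

profiler = cProfile.Profile()
profiler.enable()
render_homepage()
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```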

Circles[4] — Manual Deployment

In this circle of deployment hell, developers carry around a lengthy checklist of things they do every time they deploy code. The list may involve logging into production machines, or running deployment scripts. The most offensive checklists call for editing configuration files, mucking around with hardware, and running data migrations.

This is dangerous, because humans make mistakes. Deployment mistakes can be severe and cause downtime. If you plan on deploying often, and it takes 30 minutes to run your scripts, deployment could consume the bulk of your day.

It’s critical to have a fully automated, reliable deployment from the start. Ideally, you should either run a single command from a command line, or click a single button. Continuous integration purists even forego the button and deploy to production automatically as soon as all tests pass—which brings us to our next circle, focused on testing.
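
We happen to use Capistrano, which is written in Ruby; purely to illustrate the shape of a one-command deploy, here is a hypothetical Python sketch that runs the same fixed steps on every web server over SSH. The host names, paths, and helper script are invented.

```python
# Hypothetical one-command deploy: run the same fixed steps on every web
# server over SSH. This only shows the shape of the automation; the hosts,
# app directory, migration script, and service name are made up.
import subprocess
import sys

WEB_SERVERS = ["web1.example.com", "web2.example.com"]  # illustrative hosts
APP_DIR = "/srv/app"                                    # illustrative path

def ssh(host, command):
    subprocess.run(["ssh", f"deploy@{host}", command], check=True)

def deploy(git_ref):
    for host in WEB_SERVERS:
        ssh(host, f"cd {APP_DIR} && git fetch && git checkout {git_ref}")
        ssh(host, f"cd {APP_DIR} && ./run_migrations.sh")  # hypothetical script
        ssh(host, "sudo service app restart")              # hypothetical service
        print(f"deployed {git_ref} to {host}")

if __name__ == "__main__":
    deploy(sys.argv[1] if len(sys.argv) > 1 else "origin/master")
```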

Circles[5] — Manual Testing

Out of all the circles, making the choice between manual and automated testing probably has the biggest impact on your deployment time.

How do you test? I used to outsource QA responsibilities to an offshore team in India. They would spend their days on the site, trying to break things. Every time we built a feature, they would add a new test scenario. We didn’t often retire features, so, as our feature set grew, our testing time grew. Eventually it took several days for a team of four to test everything!

How do you scale manual testing? You can add more people, but eventually you would have more testers than developers. You could extend your QA time, but then your release cycle keeps getting longer with every feature you ship.

So how do you automate testing? First, you need test coverage. Write unit tests that ensure correct behavior of individual methods of a given class. Next, add an integration test to ensure that your features are working. If you manage an ecommerce site, for example, this could include tests for registration and shopping cart checkout. Maybe throw in a few smoke tests to ensure that you’re getting a 200 response for key pages. If you’re starting with zero tests, add tests incrementally.
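
The smoke tests, for instance, can be tiny. Here is a sketch using Python’s built-in unittest module and the requests library; the staging host and page paths are placeholders.

```python
# Minimal smoke tests: key pages should return HTTP 200 after a deploy.
# The staging host and paths are placeholders for illustration.
import unittest
import requests

BASE_URL = "https://staging.example.com"
KEY_PAGES = ["/", "/login", "/cart"]

class SmokeTests(unittest.TestCase):
    def test_key_pages_return_200(self):
        for path in KEY_PAGES:
            with self.subTest(path=path):
                response = requests.get(BASE_URL + path, timeout=10)
                self.assertEqual(response.status_code, 200)

if __name__ == "__main__":
    unittest.main()
```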

Finally, use a continuous integration server like Jenkins to run all of your tests whenever anyone on your team checks in new code. With automated testing, a three-day manual QA process can be reduced to fewer than 15 minutes. Take it from someone who’s been on both sides—the difference is staggering.

Circles[6] — No Monitoring

If one of your servers is engulfed by unholy hellfire, will you know about it? Without monitoring, you might not. Deployment without monitoring is like dropping your code into a dark abyss, not knowing what will happen to it.

If you want to have some control over your work, monitoring is critical. To start, you need transparency into your server-side performance. This is where application performance monitoring (APM) comes in handy. One of the most popular APM tools is New Relic. A shameless plug: I used to work on a leading .NET APM called LeanSentry. APM tools are designed to be very lightweight, yet they collect a wealth of information. They give you stack traces of uncaught exceptions, and show you things like memory consumption, CPU utilization, and database latency.

Second, you need to make sure everything is healthy on the client side, too. We manage video sites, so we want to know when our video player throws errors or takes too long to load. We use statsd to measure this. We use Chartbeat and Google Analytics to get a great real-time view of traffic against our sites. And so that we hear about emergencies even when we’re away from work, PagerDuty emails, texts, and calls us whenever critical issues are detected.
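
Reporting a metric to statsd takes only a couple of lines of application code. This sketch uses the statsd Python client; the metric names and the localhost daemon address are invented for illustration.

```python
# Send player health metrics to a statsd daemon. The metric names and the
# localhost statsd address are assumptions for illustration.
from statsd import StatsClient

stats = StatsClient("localhost", 8125, prefix="video")

def report_player_load(duration_ms, errored):
    stats.timing("player.load_time", duration_ms)  # how long the player took
    if errored:
        stats.incr("player.errors")                # count playback errors

report_player_load(duration_ms=320, errored=False)
```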

Finally, we log every single imaginable event with Elasticsearch. It lets you search, slice, dice, and aggregate events. If something bizarre happens that defies explanation, this will help us figure it out.
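
As a rough illustration, pushing an event into Elasticsearch can be as simple as a small HTTP POST to its REST API. The index name, event fields, and local Elasticsearch URL below are assumptions, not our actual setup.

```python
# Index an application event into Elasticsearch over its HTTP API so it can
# be searched and aggregated later. Index name, fields, and URL are invented.
import datetime
import requests

ES_URL = "http://localhost:9200"

def log_event(event_type, **fields):
    event = {
        "type": event_type,
        "timestamp": datetime.datetime.utcnow().isoformat(),
        **fields,
    }
    requests.post(f"{ES_URL}/events/_doc", json=event, timeout=2)

log_event("player_error", video_id="abc123", error_code=4002)
```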

Circles[7] — No Version Control

We are now reaching the deepest, darkest bowels of deployment hell. Believe it or not, I have heard horror stories about engineers at billion-dollar companies who shell into production servers to fiddle with code, instead of using version control. Every time someone does that, evil wins.

Don’t let evil win. Use version control. Version control is the cornerstone of deployment. It’s where continuous integration starts. Everything goes into your version control system: your infrastructure scripts, your environment configuration, your automated tests, and of course, your application code. The only exception is your sensitive passwords.

Beyond the obvious reasons to use version control, it brings some critical benefits to deployment. Sometimes you need to roll back a deployment when something goes horribly wrong, and version control will have a snapshot of your last good build. To find out what went wrong, we look through commits in our Git repository, and use git blame to see when and why the offending code was changed, and who we should talk to about it.

Circles[8] — Bad Communication

The innermost circle of fiery, satanic deployment hell is reserved for teams who fail to communicate. Poor communication is a leading cause of bad deployments. “What do you mean we had to run the schema update?!” Over-communicate, and don’t make assumptions! If a software engineer plans to ship some high-risk code, it’s his or her responsibility to let everyone know. Other team members can then pitch in to run load tests, review the code, and help watch for regressions after the deployment. Do not tolerate cowboys who take risks recklessly.

Occasionally, my team gathers together for demos, where questions are asked and challenges are discussed. It’s important for everyone to see what the rest of the team is working on. When you’re in the weeds implementing a feature, sometimes you miss things that are obvious to someone who has some distance.

We also send out release notes to our company, so everyone knows what’s going on. We ask everyone to dogfood the product. We truly believe that quality is everyone’s responsibility, and people respond well to our involving them. They send us bug reports, and when issues arise, people are not afraid to tell us.

Tour Complete!

Deployment hell is not a fun place to be. It’s a good thing that we were just visiting. Shipping code does not need to be scary. It can be totally awesome. What are some lessons learned? Make your deployment fast and reproducible through automation. Once it’s automated, deploy often. Do your best to make your environment transparent. And lastly, teamwork makes the dream work.

A mighty flame follows a tiny spark. Do one thing this week to improve your deployment process!

Cross-post: http://marvinli.com/2014/06/16/9-circles-deployment-hell/
