This is the chronicle of a multi-year odyssey of transforming a 40-person development team and a 7-year-old Ruby on Rails code base from a monthly deploy cycle to continuous deployment. This is not the typical “sunshine and rainbows” post about how easy it is to do continuous deployment when starting a project from scratch with 5 developers. This is what it really looks like. It’s slow. It’s painful. There was wasted effort. There are still things that are not quite right. But it works. And it is infinitely better, faster, and more pleasant than it was before.
When I started at Invoca, almost three years ago, the team was doing an enormous number of things very well. They had:
- Religiously practiced Test Driven Development from day one. The extensive test suite hosted on SemaphoreCI running on each push to GitHub greatly reduced defect rates.
- Virtualized staging environments. These could be spun up or down in about 20 minutes making it possible for QA and developers to manually integration test features in an environment that was similar to production.
- Semi-automated deployments. Operations team members could, using a few commands, executed on a properly configured development machine, deploy the updated software to production and run database migrations.
- Detailed manual integration test plans. The plans, which covered the majority of the popular features of the application, were run before each deploy ensuring that platform killing bugs very rarely made it into the wild.
- A heavily instrumented production system, with thousands of metrics plumbed into a Graphite server that powered a home grown alerting system.
The team had been running a reliable, albeit labor intensive, development and deployment process for several years. Deploys happened once a month, after business hours, following almost a week of extensive manual integration testing. The monthly deploys often included thousands of lines of code encompassing multiple major features and dozens of bug fixes. A large “deploy plan” document was created for each deploy that listed all the items being shipped, checks to be executed after the deploy, and which team members would be involved. Before each deploy, an enormous “Go/No-Go” meeting was held with all the technical and business stakeholders. This process caused a great many problems:
- We could not react quickly to customer needs, particularly in addressing important, but not critical, bug fixes.
- The monthly cycle created enormous pressure to ship a feature, even if it wasn’t quite ready, to avoid waiting another month.
- At the “Go/No-Go” every item in the deploy was scrutinized for excess risk, often leading to 11th hour decisions to postpone the deploy, modify the code, or remove an item from the deploy. These changes themselves were often risky and stress inducing.
- The QA team had little to do at the start of the month, then was absolutely slammed at the end. This pattern wasted productivity and increased stress.
- We had to lock users out of the system during deploys that contained database migrations. (With a monthly cycle, virtually all deploys contained migrations.)
- When things went wrong, it was very difficult to tell which change introduced the bug. Because all of the changes were coupled together, the pressure to find and fix the problem, rather than roll back to the previous stable version, was intense.
- Critical bug fixes would be shipped in between the monthly deploys, without rigorous integration testing, occasionally leading to new bugs being shipped along with the fix.
Reacting to all of the above, we felt we could alleviate some of the pain by deploying twice a month. We created what we called the “early deploy” which would occur mid-month, as opposed to the main deploy which would occur at the end of the month. While this allowed us to react to customer needs faster, and reduced the size of our deploys, it obviously doubled the pain for QA and operations. This was the point at which the team decided we could not carry on as-is. Major changes were needed to meet the business’ needs while maintaining the team’s sanity.
Automated Integration Testing
We identified the pre-deploy manual testing as the pain point we wanted to tackle first. Automating the manual testing was the obvious solution, but doing so was not trivial. After several months of effort, the QA team created a test suite using RSpec, Capybara and Selenium that could be run against one of the virtualized staging environments. At first the suite only covered a portion of the manual testing, but we kept at it until, almost nine months later, all of the manual integration tests had been automated. The suite was painfully slow and full of phantom failures caused by race conditions. But a well trained operator could re-run the failed tests and get a “green build” in about a day. This was a massive milestone: the semi-monthly week long QA death march was over.
Now that we could deploy with as little as 24 hours notice, we hit two more major problems:
- The sprint based Scrum process was based around fixed length cycles. We had shoehorned mid-sprint “early” deploys into the Scrum process, but it could not handle arbitrary, or even weekly, deploys.
- Our customers required several days notice if we were going to lock them out during a deploy. This limited our ability to deploy migrations, often leading to delays shipping features that did not require migrations, but had been merged with other features that did.
Scrum + Kanban = Scrumban
We knew that monthly sprints were incompatible with continuous deployment. Weekly sprints were an option, but the overhead of such short cycles would kill productivity. Inspired by the streamlined flow of Kanban, we re-imagined Scrum into a lighter weight process suited to continuous deployment. After a fair amount of experimenting, we settled upon:
- Daily standups
- Monthly retrospectives
- Ad hoc individual or small group grooming
- Weekly, compressed, estimation and planning meetings
- Hard limits on story sizes
We tackled the migration problem by using, and later extending, a tool for MySQL from Percona called Online Schema Change. It also involved careful management of the code using the database which led to the creation of our first set of deploy safety checks. You can read more about how we incorporated the tool into our workflow for zero downtime migrations.
Front End “Migrations”
However, during a deploy, the front ends are running a mix of the old code and the new code. If a customer loads the page from a new server, but submits to an old server, the request will almost certainly fail. To further combat this, we enabled cookie based session affinity on our ELB during deploys. The affinity ensures that requests are routed back to the server from which they originated, eliminating this problem.
At this point, with our automated functional test suite, streamlined Scrumban process, and online database migrations, we could easily deploy weekly and during business hours. Daily deploys were in our sights, but were still impossible for several reasons:
- The automated integration test suite was too slow and required manual intervention
- Communication and coordination around the deploy was still manual
- The deploy, while semi-automated, still took an operator about an hour to perform
- Deployments involved risks, like accidentally shipping code that wasn’t ready, which were not being mitigated
- The application generated so much exception noise it was nearly impossible to tell when new problems were introduced
Speed Up the Suite!
The automated functional test suite took almost a day for a skilled operator to run, a waste of resources and an obvious blocker for daily deploys. However, reducing the runtime and making the suite more reliable was no easy task. To achieve the desired runtime of less than one hour, we “dockerized” our application so we could launch multiple test environments in parallel. With parallelization, runtime of the test suite plummeted to 30 minutes.
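One common way to spread a suite across parallel environments is to assign spec files to shards by their recorded runtime. The sketch below is an illustration only: the post does not say how the work was divided, and the `shard_specs` helper, the file names and the timings are all hypothetical.

```ruby
# Hypothetical helper: greedily assign spec files to shards so that each
# shard ends up with a similar total runtime. Timings (in seconds) are
# assumed to come from a previous run of the suite.
def shard_specs(timings, shard_count)
  shards = Array.new(shard_count) { { files: [], seconds: 0.0 } }
  # Place the slowest files first, always into the currently lightest shard.
  timings.sort_by { |file, seconds| [-seconds, file] }.each do |file, seconds|
    lightest = shards.min_by { |shard| shard[:seconds] }
    lightest[:files] << file
    lightest[:seconds] += seconds
  end
  shards
end

# Made-up timings for four feature specs, split across two environments.
timings = {
  "spec/features/login_spec.rb" => 300,
  "spec/features/reports_spec.rb" => 240,
  "spec/features/billing_spec.rb" => 180,
  "spec/features/signup_spec.rb" => 120
}

shard_specs(timings, 2).each_with_index do |shard, index|
  puts "shard #{index}: #{shard[:seconds]}s #{shard[:files].join(' ')}"
end
```

Each shard's file list could then be handed to a separate test environment. Balancing by runtime rather than by file count keeps one slow shard from dominating the wall clock time.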
While the suite now ran quickly, it was not reliable. It failed almost 50% of the time. The QA team configured the automated test suite to run every hour, 24/7, and cataloged all the failures. Debugging the failures they found was slow and painful. Often the issues were race conditions in our platform, the tests or Capybara itself. After many months of effort, success rates approached the mid-nineties, an acceptable level. The team continues to work to maintain this success rate as the platform, tests, and Capybara/Selenium change underneath the automated test suite.
Risk Management and Communication
At this point, we still had the massive “Go/No-Go” meeting with technical and business stakeholders before each deploy. Every item in the deploy was scrutinized for excess risk, often leading to 11th hour decisions to postpone the deploy, modify the code, or remove an item from the deploy. These changes were often risky and stress inducing. It was already difficult to schedule the meeting on a monthly basis; trying to get all the stakeholders together weekly was nearly impossible. We needed to decentralize the risk management process and, ideally, move it earlier into a story’s life cycle, so it would be less disruptive to deploys.
We started by educating developers on how to assess the risk of a story, in particular the risk that the story introduces a defect in a different, seemingly unrelated, part of the system. We formalized this process by adding a field to JIRA that contained a summary of the story’s known risks. The field was required to be filled out before the story entered code review. Code reviewers were instructed to review the risk assessment, in addition to the code.
With this change in place, we stopped calling the “Go/No-Go” meetings. At first, everything went well, but after a few months a pattern emerged that indicated we had missed something. Every few deploys our customer service department would be surprised by a change or feature that had been deployed. They were not prepared to support it and/or roll it out to customers. The support team wasted time investigating issues that were, unbeknownst to them, related to changes we had just deployed. We realized that although organizational awareness of stories being deployed was not the purpose of the “Go/No-Go” meetings, it was a very valuable side effect.
Similar to the changes that allowed us to cancel the “Go/No-Go”, we addressed this problem by distributing responsibility along with education and tools. Developers and product managers were educated on the types of changes that required cross team awareness. Obviously, all stories that contain customer visible changes need to be communicated broadly. In addition, stories that have an abnormally high risk, as assessed per the process described, need to be communicated to customer service and support so they can be on the lookout for problems. The team was entrusted with notifying the appropriate teams about upcoming deployments. As a safety net, we added a webhook to JIRA that would post a message in Slack whenever a story was deployed.
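As a rough illustration of the safety net, the sketch below turns a JIRA webhook payload into a Slack message. It assumes a small relay sits between the two services, which the post does not describe. The payload fields used (`issue.key`, `issue.fields.summary`) follow JIRA's issue webhooks and the `text` field follows Slack's incoming webhooks; the method names, the story key and the summary are hypothetical.

```ruby
require "json"
require "net/http"
require "uri"

# Build the Slack message body for a deployed story from a JIRA issue
# webhook payload (already parsed into a Hash).
def deploy_notification(jira_payload)
  issue = jira_payload.fetch("issue")
  summary = issue.fetch("fields").fetch("summary")
  { "text" => "Deployed #{issue.fetch('key')}: #{summary}" }
end

# Post the message to a Slack incoming webhook. The URL is supplied by the
# caller; this method is defined here but not invoked in the example.
def notify_slack(webhook_url, message)
  Net::HTTP.post(URI(webhook_url), JSON.generate(message),
                 "Content-Type" => "application/json")
end

# Made-up payload for illustration.
payload = JSON.parse(
  '{"issue":{"key":"STORY-123","fields":{"summary":"Add call recording export"}}}'
)
puts deploy_notification(payload)["text"]
```

Keeping the message builder separate from the HTTP call makes the formatting easy to test without touching Slack.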
All that remained of the old risk management system was the “deploy plan” document. The deploy plan was a template that would be filled in by the team in the week leading up to the deploy. It contained:
- A list of the team members who would be “on call” during the deploy.
- A list of all the stories being deployed
- The duration of all the online schema migrations in the deploy
- “Post deploy checks” for all the stories that should be performed after the deploy to ensure the feature is working properly and did not break any adjacent features.
- Verification that all of the stories being deployed were in the correct state in JIRA and had been merged to the deploy branch
- Sign offs for the above
We addressed the “Post deploy checks” by creating fields in JIRA for this data and requiring they be filled in before code review. As a replacement for the “sign off”, the code reviewer was responsible for approving the “Post deploy checks” written in JIRA.
To ensure that we didn’t get stuck waiting hours (or even days) for a deploy to execute, the QA team would execute the online schema migration on a test system, time it, then put the results into the deploy plan document. We replaced this process by creating a simple migration timing tool that estimated how long a migration would take based on how many rows were in the table being migrated. This very simple approach produced results that were more than accurate enough for our needs. We incorporated the estimator into our CI process and made it fail deploys that contained migrations that would take longer than 30 minutes. Long running migrations would be executed separately from normal deploys, typically overnight.
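A row count based estimator can be very small. The sketch below is a minimal illustration, not the tool described here: the throughput constant, the table names and the row counts are assumptions, and only the 30 minute limit comes from the text.

```ruby
# Assumed throughput; a real tool would calibrate this against past migrations.
ROWS_PER_SECOND = 5_000
# The 30 minute limit for migrations that ship with a normal deploy.
MAX_MIGRATION_SECONDS = 30 * 60

# Estimate how long migrating a table will take from its row count alone.
def estimate_migration_seconds(row_count, rows_per_second: ROWS_PER_SECOND)
  (row_count.to_f / rows_per_second).ceil
end

# Return the tables whose estimated migration time exceeds the limit.
def slow_migrations(row_counts_by_table, limit: MAX_MIGRATION_SECONDS)
  row_counts_by_table
    .select { |_table, rows| estimate_migration_seconds(rows) > limit }
    .keys
end

# Made-up row counts for two tables touched by a deploy.
row_counts = { "calls" => 40_000_000, "users" => 250_000 }
too_slow = slow_migrations(row_counts)

if too_slow.empty?
  puts "All migrations fit within the limit"
else
  puts "Too slow for a normal deploy: #{too_slow.join(', ')}"
end
```

In a CI job, the final branch would exit with a non-zero status instead of printing, which is what fails the build.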
Finally, to provide visibility into what the deploy contained, and verify the related JIRA and Git state, we created a “Pre-deploy Checker” service. The checker receives webhooks from GitHub when a branch is pushed. The checker service diffs the deploy branch with the production branch and extracts the list of JIRA issues it contains from the commit messages. (We use Git pre-commit hooks to ensure each commit contains a JIRA issue number.) The checker then retrieves data about the JIRA issues and checks their state and other attributes. It posts a build status back to GitHub that contains the result of the checks. In addition, it exposes a web UI where users can see a list of everything that a deploy contains and address any discrepancies the checker found.
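The core of such a checker, pulling issue keys out of commit messages and comparing their state, might look like the sketch below. It is a simplified illustration under stated assumptions: the key pattern, the helper names, the story keys and the "Ready to Deploy" status are hypothetical, and the commit messages would in practice come from something like `git log --format=%B production..deploy`.

```ruby
# Matches JIRA style keys such as STORY-123. The pattern is an assumption and
# will also match look-alikes (for example "UTF-8"), so a real checker may
# want to restrict it to known project prefixes.
ISSUE_KEY = /\b[A-Z][A-Z0-9]+-\d+\b/

# Collect the unique issue keys mentioned in a list of commit messages.
def issue_keys(commit_messages)
  commit_messages.flat_map { |message| message.scan(ISSUE_KEY) }.uniq.sort
end

# Return the keys whose status is not the one expected for a deploy.
# An issue missing from the lookup counts as a discrepancy.
def discrepancies(keys, status_by_key, expected_status)
  keys.reject { |key| status_by_key[key] == expected_status }
end

# Made-up commit messages and JIRA states.
messages = [
  "STORY-12 Fix login redirect",
  "Merge STORY-7 and STORY-12",
  "Bump version"
]
statuses = { "STORY-12" => "Ready to Deploy", "STORY-7" => "In Review" }

keys = issue_keys(messages)
puts "Issues in deploy: #{keys.join(', ')}"
puts "Needs attention: #{discrepancies(keys, statuses, 'Ready to Deploy').join(', ')}"
```

The list of discrepancies is what would determine the build status reported back for the branch.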
With the pre deploy checker service, the post deploy checks in JIRA and the migration timing estimator in place, we were able to do away with the deploy plan entirely. Finally, the manual, centralized, risk mitigation processes had been fully replaced by a distributed, scalable, process that empowered our team members while providing tools and safeguards to protect the business.
Everyone knew we were in exception “bankruptcy” and had been for years. The production platform generated tens of thousands of exceptions each day. The team had collectively declared the vast majority of them “not a problem” or “probably being retried”. This made finding failures introduced by a deploy virtually impossible.
To make matters worse, the exceptions were being emailed to a Gmail account. During outages, which generated massive volumes of exceptions, Google would throttle the email account, leaving us even more blind to the exceptions.
The first order of business was increasing the visibility of exceptions by using an aggregation service. We chose HoneyBadger.io. Next was chasing the noise out of the system. We let HoneyBadger run for a few weeks, then sorted the exceptions by count to get a list of the worst offenders. We formed a team of experienced developers from several teams, and sequestered them in a conference room for a week with direction to suppress as much noise as possible. They made a dent, but not nearly enough. It took three more of these week long sessions (we called them “Exception Bashes”) to get the noise under control. Then, the real work began: fixing actual bugs previously hidden by the noise. After about 18 months of sustained effort, we got to the point where we could report new exceptions directly into Slack and address them as they happened.
Around the same time we recognized our exception bankruptcy, our operations team pointed out that alerting had a similar problem. Dozens of alerts regularly “flapped” from critical to normal. The operations team had a mental list of alerts that could be “safely” ignored. We followed the same process as for the exceptions, tuning or eliminating alerts to ensure they were all reasonable and actionable.
Another blocker for daily deploys was the deploy process itself. While it was semi-automated, it still took an operator about an hour to perform it. We first automated all of the deploy steps by aggregating them into a single Capistrano file, then implemented a Slackbot that could be used by the developers to request deploys. The Slackbot retrieves all of the build statuses for the branch to be deployed (unit tests, functional tests, pre-deploy checker) and posts them to Slack. If the on-call Ops person liked what she saw, she could approve the deploy with another Slackbot command.
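The status summary step can be sketched as follows. This assumes the statuses arrive in the shape GitHub's commit status API returns (a list of entries with `context` and `state`); the context names and the `deploy_summary` helper are hypothetical, not the Slackbot's actual code.

```ruby
# Hypothetical context names for the three builds mentioned above.
REQUIRED_CONTEXTS = ["unit-tests", "functional-tests", "pre-deploy-checker"].freeze

# Summarize build statuses for a branch into a message plus a ready flag.
# A required context with no reported status is shown as "missing".
def deploy_summary(statuses)
  state_by_context = statuses.map { |s| [s["context"], s["state"]] }.to_h
  lines = REQUIRED_CONTEXTS.map do |context|
    "#{context}: #{state_by_context.fetch(context, 'missing')}"
  end
  ready = REQUIRED_CONTEXTS.all? { |context| state_by_context[context] == "success" }
  { text: lines.join("\n"), ready: ready }
end

# Made-up statuses: the functional tests are still running.
statuses = [
  { "context" => "unit-tests", "state" => "success" },
  { "context" => "functional-tests", "state" => "pending" },
  { "context" => "pre-deploy-checker", "state" => "success" }
]

summary = deploy_summary(statuses)
puts summary[:text]
puts summary[:ready] ? "Ready for approval" : "Not ready to deploy"
```

The text portion is what would be posted to Slack for the approver to review.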
This was a massive improvement, but we were still not comfortable enough with the process to allow developers to run deploys without an on-call Ops person being aware. We needed more safety checks to prevent operator error. We added several more checks:
- To prevent accidental rollbacks, a check to ensure the SHA being requested is ahead of the tip of the production branch.
- Ensure that private gems referenced by SHA are being pulled from the master branch of the gem’s repo.
- Ensure that all the secrets required by the SHA are present in the production secret store.
- Run bundler-audit to check for security vulnerabilities in our dependencies.
- Run Brakeman to check for security vulnerabilities in our own code.
With all of these checks in place, we removed the requirement for Ops to be on-call, and allowed developers to run deploys at-will.
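The rollback check in the list above can be expressed with `git merge-base --is-ancestor`, which succeeds when its first commit is an ancestor of its second. The sketch below is one possible implementation, not necessarily the one used here; the helper names and the injectable `git:` runner (included so the logic can be exercised without a repository) are assumptions.

```ruby
require "open3"

# Default runner: shells out to git and returns [trimmed output, success flag].
RUN_GIT = lambda do |*args|
  output, status = Open3.capture2e("git", *args)
  [output.strip, status.success?]
end

# True only when `candidate` is strictly ahead of `production`: the two refs
# resolve to different commits and production is an ancestor of candidate.
def ahead_of_production?(candidate, production, git: RUN_GIT)
  candidate_sha, found_candidate =
    git.call("rev-parse", "--verify", "#{candidate}^{commit}")
  production_sha, found_production =
    git.call("rev-parse", "--verify", "#{production}^{commit}")
  return false unless found_candidate && found_production
  return false if candidate_sha == production_sha

  _output, is_ancestor =
    git.call("merge-base", "--is-ancestor", production_sha, candidate_sha)
  is_ancestor
end
```

Inside a checkout, a call such as `ahead_of_production?(requested_sha, "origin/production")` would return false for an older commit, for the current production tip, and for a commit on a diverged branch, which covers the accidental rollback case. The ref name here is a placeholder.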
More recently, we have been extending our deploy automation to check key metrics for services (successful HTTP requests, exception counts, etc.) during deploys and halt if they go out of bounds.
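A minimal sketch of that idea follows, assuming a rolling deploy that checks metrics after each server is updated. The metric names, the bounds, and the `read_metrics` and `deploy` callables are all hypothetical stand-ins for whatever the real automation uses.

```ruby
# Hypothetical acceptable ranges for two key metrics.
BOUNDS = {
  "http_success_rate" => (0.99..1.0),
  "exceptions_per_minute" => (0..20)
}.freeze

# Return the names of metrics that are missing or outside their range.
def out_of_bounds(metrics, bounds = BOUNDS)
  bounds.reject { |name, range| metrics.key?(name) && range.cover?(metrics[name]) }.keys
end

# Deploy to servers one at a time, checking metrics after each one and
# stopping at the first server where any metric goes out of bounds.
def rolling_deploy(servers, read_metrics:, deploy:)
  servers.each do |server|
    deploy.call(server)
    violations = out_of_bounds(read_metrics.call)
    return { halted_at: server, violations: violations } unless violations.empty?
  end
  { halted_at: nil, violations: [] }
end
```

Treating a missing metric as a violation is a deliberate choice in this sketch: if the monitoring data cannot be read, continuing the deploy would mean continuing blind.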
It’s been a journey that started nearly three years ago, and of course there is always more work to do, but we have come so far and the whole team is really proud of the progress and improvements we have made to date.