Our Journey To Continuous Deployment

Mike Weaver
Jun 25, 2018 · 13 min read

The Good

When I started at Invoca, almost three years ago, the team was doing an enormous number of things very well. They had:

  • Virtualized staging environments. These could be spun up or down in about 20 minutes, making it possible for QA and developers to manually integration-test features in an environment similar to production.
  • Semi-automated deployments. Operations team members could, using a few commands executed on a properly configured development machine, deploy the updated software to production and run database migrations.
  • Detailed manual integration test plans. The plans, which covered the majority of the popular features of the application, were run before each deploy, ensuring that platform-killing bugs very rarely made it into the wild.
  • Heavy production instrumentation. The production system emitted thousands of metrics, plumbed into a Graphite server that powered a home-grown alerting system.

The Bad

The team had been running a reliable, albeit labor-intensive, development and deployment process for several years. Deploys happened once a month, after business hours, following almost a week of extensive manual integration testing. The monthly deploys often included thousands of lines of code encompassing multiple major features and dozens of bug fixes. A large “deploy plan” document was created for each deploy that listed all the items being shipped, checks to be executed after the deploy, and which team members would be involved. Before each deploy, an enormous “Go/No-Go” meeting was held with all the technical and business stakeholders. This process caused a great many problems:

  • The monthly cycle created enormous pressure to ship a feature, even if it wasn’t quite ready, to avoid waiting another month.
  • At the “Go/No-Go,” every item in the deploy was scrutinized for excess risk, often leading to eleventh-hour decisions to postpone the deploy, modify the code, or remove an item from the deploy. These changes themselves were often risky and stress-inducing.
  • The QA team had little to do at the start of the month, then was absolutely slammed at the end. This pattern wasted productivity and increased stress.
  • We had to lock users out of the system during deploys that contained database migrations. (With a monthly cycle, virtually all deploys contained migrations.)
  • When things went wrong, it was very difficult to tell which change introduced the bug. Because all of the changes were coupled together, the pressure to find and fix the problem, rather than roll back to the previous stable version, was intense.
  • Critical bug fixes would be shipped in between the monthly deploys, without rigorous integration testing, occasionally leading to new bugs being shipped along with the fix.

Automated Integration Testing

We identified the pre-deploy manual testing as the pain point we wanted to tackle first. Automating the manual testing was the obvious solution, but doing so was not trivial. After several months of effort, the QA team created a test suite using RSpec, Capybara, and Selenium that could be run against one of the virtualized staging environments. At first the suite covered only a portion of the manual testing, but we kept at it until, almost nine months later, all of the manual integration tests had been automated. The suite was painfully slow and full of phantom failures caused by race conditions, but a well-trained operator could re-run the failed tests and get a “green build” in about a day. This was a massive milestone: the week-long QA death march before each deploy was over.
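
To give a sense of what these automated tests look like, here is a minimal RSpec/Capybara feature spec driven by Selenium against a staging environment. It is only a sketch: the URL, field labels, and environment variables are placeholders, not pieces of our actual suite.

    # spec/features/login_spec.rb
    # A minimal Capybara + RSpec feature spec driven by Selenium against a staging
    # environment. The URL, labels, and credentials below are placeholders.
    require "capybara/rspec"
    require "selenium-webdriver"

    Capybara.default_driver = :selenium          # drive a real browser
    Capybara.run_server     = false              # hit staging, not a local app server
    Capybara.app_host       = ENV.fetch("STAGING_URL", "https://staging.example.com")

    RSpec.describe "Signing in", type: :feature do
      it "shows the dashboard after a valid login" do
        visit "/login"
        fill_in "Email",    with: ENV["TEST_USER_EMAIL"]
        fill_in "Password", with: ENV["TEST_USER_PASSWORD"]
        click_button "Sign In"

        expect(page).to have_content("Dashboard")    # Capybara waits for the page to settle
      end
    end

Capybara’s implicit waiting helps, but timing-dependent steps like these are exactly where the suite’s phantom failures tended to come from.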

Automating the tests removed one bottleneck, but another remained: our customers required several days’ notice if we were going to lock them out during a deploy. This limited our ability to deploy migrations, often delaying features that did not require migrations but had been merged with other features that did.

Scrum + Kanban = Scrumban

We also streamlined our development process into a Scrum/Kanban hybrid built around:

  • Monthly retrospectives
  • Ad hoc individual or small-group grooming
  • Weekly, compressed, estimation and planning meetings
  • Hard limits on story sizes

Database Migrations

We tackled the migration problem by using, and later extending, a tool for MySQL from Percona called Online Schema Change. Doing so also required careful management of the code using the database, which led to the creation of our first set of deploy safety checks. You can read more about how we incorporated the tool into our workflow for zero-downtime migrations.
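
As a rough illustration of the approach, here is a sketch of a Rails migration that shells out to pt-online-schema-change instead of issuing a blocking ALTER TABLE. The table, column, and online_alter helper are hypothetical; our actual tooling and checks are covered in the zero-downtime migrations write-up mentioned above.

    # A sketch of a Rails migration that shells out to pt-online-schema-change
    # instead of issuing a blocking ALTER TABLE. The table, column, and
    # online_alter helper are hypothetical.
    class AddStatusToCalls < ActiveRecord::Migration[5.2]
      def up
        online_alter("calls", "ADD COLUMN status VARCHAR(32) DEFAULT 'new'")
      end

      def down
        online_alter("calls", "DROP COLUMN status")
      end

      private

      def online_alter(table, alter_sql)
        config = ActiveRecord::Base.connection_config   # database/host/user/password
        dsn    = "D=#{config[:database]},t=#{table},h=#{config[:host]},u=#{config[:username]},p=#{config[:password]}"

        # pt-osc copies rows into a shadow table, keeps it in sync with triggers,
        # then renames it into place, so the table stays available while it is rebuilt.
        system("pt-online-schema-change", "--alter", alter_sql, dsn, "--execute") ||
          raise("pt-online-schema-change failed for #{table}")
      end
    end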

Front End “Migrations”

Weekly Deploys

At this point, with our automated functional test suite, streamlined Scrumban process, and online database migrations, we could easily deploy weekly and during business hours. Daily deploys were in our sights, but were still impossible for several reasons:

  • Communication and coordination around the deploy was still manual
  • The deploy, while semi-automated, still took an operator about an hour to perform
  • Deployments involved risks, like accidentally shipping code that wasn’t ready, which were not being mitigated
  • The application generated so much exception noise it was nearly impossible to tell when new problems were introduced

Speed Up the Suite!

The automated functional test suite took almost a day for a skilled operator to run: a waste of resources and an obvious blocker for daily deploys. However, reducing the runtime and making the suite more reliable was no easy task. To achieve the desired runtime of less than one hour, we “dockerized” our application so we could launch multiple test environments in parallel. With parallelization, the suite’s runtime plummeted to 30 minutes.
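
One way to wire up that parallelism is sketched below: split the spec files into buckets and run each bucket in its own isolated docker-compose project. The “web” service name, worker count, and round-robin split are illustrative assumptions, not a description of our actual runner.

    #!/usr/bin/env ruby
    # Sketch: fan the feature specs out across N dockerized environments.
    # Assumes a docker-compose.yml with a "web" service that can run RSpec;
    # the service name and worker count are illustrative.
    WORKERS = Integer(ENV.fetch("WORKERS", "8"))

    spec_files = Dir["spec/features/**/*_spec.rb"].sort
    buckets    = Array.new(WORKERS) { [] }
    spec_files.each_with_index { |file, i| buckets[i % WORKERS] << file }

    threads = buckets.each_with_index.map do |files, i|
      Thread.new do
        next true if files.empty?
        # Each worker gets its own compose project, so its containers (and database)
        # are isolated from the other workers.
        system("docker-compose", "-p", "suite#{i}", "run", "--rm", "web",
               "bundle", "exec", "rspec", *files)
      end
    end

    exit(threads.map(&:value).all? ? 0 : 1)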

Risk Management and Communication

At this point, we still had the massive “Go/No-Go” meeting with technical and business stakeholders before each deploy, where every item was scrutinized for excess risk, often leading to eleventh-hour decisions to postpone the deploy, modify the code, or remove an item from it. These changes were often risky and stress-inducing. Scheduling the meeting had already been difficult on a monthly basis; getting all the stakeholders together weekly was nearly impossible. We needed to decentralize the risk management process and, ideally, move it earlier in a story’s life cycle, so it would be less disruptive to deploys.

For each deploy, a lightweight deploy document now captured:

  • A list of all the stories being deployed
  • The duration of all the online schema migrations in the deploy
  • “Post-deploy checks” for all the stories, to be performed after the deploy to ensure each feature works properly and did not break any adjacent features
  • Verification that all of the stories being deployed were in the correct state in JIRA and had been merged to the deploy branch (a sketch of automating this check follows the list)
  • Sign-offs for the above
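
The JIRA verification step, in particular, lends itself to automation. Below is a hedged sketch using JIRA’s REST API; the branch names, required status, and story-key pattern are placeholders rather than our real configuration.

    # Sketch of the JIRA verification: confirm every story on the deploy branch
    # is in the required workflow state. Branch names, the status name, and the
    # story-key pattern are placeholders.
    require "net/http"
    require "json"
    require "uri"

    JIRA_BASE       = ENV.fetch("JIRA_URL")     # e.g. https://yourcompany.atlassian.net
    REQUIRED_STATUS = "Ready to Deploy"         # hypothetical workflow state

    def jira_status(issue_key)
      uri = URI("#{JIRA_BASE}/rest/api/2/issue/#{issue_key}?fields=status")
      req = Net::HTTP::Get.new(uri)
      req.basic_auth(ENV.fetch("JIRA_USER"), ENV.fetch("JIRA_TOKEN"))

      res = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(req) }
      JSON.parse(res.body).dig("fields", "status", "name")
    end

    # Pull story keys (e.g. "WEB-1234") out of the commit subjects on the deploy branch.
    story_keys = `git log origin/production..origin/deploy --pretty=%s`.scan(/\b[A-Z]+-\d+\b/).uniq

    not_ready = story_keys.reject { |key| jira_status(key) == REQUIRED_STATUS }
    abort("Stories not ready to deploy: #{not_ready.join(', ')}") if not_ready.any?
    puts "All #{story_keys.size} stories are in '#{REQUIRED_STATUS}'."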

Exception Bankruptcy

Everyone knew we were in exception “bankruptcy” and had been for years. The production platform generated tens of thousands of exceptions each day, the vast majority of which the team had collectively written off as “not a problem” or “probably being retried.” That noise made it virtually impossible to spot failures introduced by a deploy.

Deploy Automation

Automating the deploy itself also let us bolt on more of the deploy safety checks mentioned earlier. Before each deploy, the tooling would:

  • Ensure that private gems referenced by SHA are being pulled from the master branch of the gem’s repo (a sketch of this check follows the list).
  • Ensure that all the secrets required by the SHA are present in the production secret store.
  • Run bundler audit to check for security vulnerabilities in our dependencies.
  • Run Brakeman to check for security vulnerabilities in our own code.
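
As an example, the first check can be approximated by parsing the GIT sections of Gemfile.lock and asking git whether each pinned revision is reachable from master. This is a sketch under those assumptions, not our production implementation.

    # Sketch of the "private gems come from master" check: parse the GIT sections
    # of Gemfile.lock and ask git whether each pinned revision is on master.
    # The clone-per-gem strategy is illustrative only.
    require "tmpdir"

    # Gemfile.lock GIT blocks look like:
    #   GIT
    #     remote: git@github.com:Invoca/some_gem.git
    #     revision: 3f2c9a1...
    git_sources = File.read("Gemfile.lock").scan(/^GIT\n  remote: (\S+)\n  revision: (\h+)/)

    failures = git_sources.reject do |remote, revision|
      Dir.mktmpdir do |dir|
        # A bare clone is enough to ask whether the revision is an ancestor of master.
        system("git", "clone", "--quiet", "--bare", remote, dir) &&
          system("git", "-C", dir, "merge-base", "--is-ancestor", revision, "master")
      end
    end

    if failures.any?
      abort "Gems pinned to SHAs that are not on master:\n" +
            failures.map { |remote, rev| "  #{remote} @ #{rev}" }.join("\n")
    end
    puts "All git-sourced gems are pinned to commits on master."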
