How We Got To Continuous Deployment With Rails, CircleCI, and Heroku

Continuous Deployment in real life

So, you want to deploy continuously? We decided to take the dive at Opendoor to reduce the human cost and error potential of manual deployments. We currently use GitHub, Circle, and Heroku to deploy our main Rails app, so implementing continuous deployment required a bit of thought to coordinate all of these services.

Continuous deployment had been a long-running desire of our engineers, but there were perennial tooling constraints that had made it just difficult enough that it never happened. Serendipitously, the release of Circle Workflows over the summer gave us the primitives within Circle to guarantee atomic lock-step releases. That, plus a recent switch to using Heroku Review Apps, made the migration more tractable.

We wanted to share some code that runs our deployments and note some best practices for deploying the specific stack we’re using.

Lifecycle of a Deploy

Before we get into the nitty-gritty, here’s how a commit eventually lands on production:

  • The pull request runs tests and is marked as good to merge into master
  • The engineer merges the PR and tests are again run against master
  • That commit of master deploys to our centralized staging server
  • Sanity checks run automatically on the staging server
  • That commit then deploys to production
  • Sanity checks run on production
  • Success and failure notifications are sent via Slack at every step

Our process biases towards redundancy and taking extra precautions to ensure production stability. Your team’s implementation might look different — for example, some deployment environments can deploy new code to a percentage of production instead of having a dedicated staging environment.
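
To make the lifecycle concrete, here is a minimal Ruby sketch of a pipeline that runs each stage in order, reports every result, and stops at the first failure. It is an illustration rather than the actual deploy script: `run_pipeline`, the stage names, and the `notify` callback are all hypothetical.

```ruby
# Runs each named stage in order, reporting every result through `notify`.
# Stops at the first failing stage and returns false; returns true if all pass.
# `stages` is an array of [name, callable] pairs; each callable returns true/false.
def run_pipeline(stages, notify:)
  stages.each do |name, stage|
    ok = stage.call
    notify.call("#{name}: #{ok ? 'succeeded' : 'failed'}")
    return false unless ok
  end
  true
end

# Example wiring with stand-in stages (real ones would call Heroku, run checks, etc.).
messages = []
stages = [
  ['deploy to staging',     -> { true }],
  ['staging sanity checks', -> { false }], # simulate a failed check
  ['deploy to production',  -> { true }]
]
result = run_pipeline(stages, notify: ->(msg) { messages << msg })
```

With the simulated failure, `result` is `false` and the production stage never runs, which is the property that matters: a failed staging check blocks everything after it.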

Orchestrating Circle Workflows

Workflows are how we kick off tests and deployments on Circle. We run a variety of tests on every pull request and on the master branch upon merging; when all of those workflow jobs finish, we kick off a deployment job. The relevant part of our Circle configuration looks like this:

workflows:
  version: 2
  pr:
    jobs:
      - lint
      - unit_test
      - ...
      - deploy:
          requires:
            - lint
            - unit_test
            - ...
          filters:
            branches:
              only: master

Having each of your “pre-deploy” tests separated into their own semantic workflow jobs is helpful for alerting engineers as soon as possible if a particular step fails on a pull request, without waiting for the whole workflow to finish. For example, linting takes less than a minute, and having it run independently of the longer unit and integration tests means a lint failure notification is sent minutes earlier.

The deploy job doesn’t require much setup to deploy to Heroku: we only need to check out the code and install any dependencies our deploy scripts need. That part of our Circle config looks like:

version: 2
jobs:
  deploy:
    docker:
      - ...
    parallelism: 1
    steps:
      - checkout
      - ...
      - run:
          name: Deploy
          command: bin/ci/circle-lock --branch master --job-name deploy bin/ci/deploy

In the last step, we wrap our underlying bin/ci/deploy script with a circle-lock script. The lock script enforces that only one deploy job on the master branch can be running at a time; if there are multiple, the ones occurring afterward will poll Circle until they are unblocked.

We adapted this script posted in the Circle forums to support one additional behavior: if there are three deploy builds, the “middle” build will exit when it detects the latest build. We added this behavior to deal with stacking deploys (more on that below).
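
The lock behavior described above boils down to a three-way decision for each deploy build. The Ruby sketch below shows one way to express it; it is not the circle-lock script itself. The `Build` struct, the status names, and `lock_decision` are hypothetical, and a real script would fill the build list from the Circle API.

```ruby
# Hypothetical record for a deploy build on master. `number` is the CI build
# number; `status` is one of 'queued', 'running', 'finished' (names assumed).
Build = Struct.new(:number, :status)

# Decides what the build numbered `my_number` should do:
#   :proceed - no older deploy is active, so this build holds the lock
#   :skip    - an older deploy is active AND a newer one exists ("middle" build)
#   :wait    - an older deploy is active and this is the newest build
def lock_decision(my_number, builds)
  active = builds.select { |b| %w[queued running].include?(b.status) }
  older = active.select { |b| b.number < my_number }
  newer = active.select { |b| b.number > my_number }
  return :proceed if older.empty?
  return :skip if newer.any?
  :wait
end
```

A waiting build would call this on every poll: `:wait` means poll again, `:skip` means exit and let the newer build ship the change, and `:proceed` means start deploying.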

We do a number of tasks inside our deploy script. Most of our deploy communication happens using the Heroku API, which is authenticated through environment variables. The Circle docs have a script that we used verbatim to configure the build environment at the start of the deploy. We use jq heavily to parse Heroku API responses inside our bash scripts.

Stacking Deploys

Once your team hits a certain size, they will be merging code faster than your system can deploy. You need to consider how you want “stacking” deploys to work in your continuous deployment setup.

In one type of setup, each deploy will have exactly one merge. There are benefits:

  • Makes it as easy as possible to pin-point what version of code caused a regression
  • Determining whether a deploy can be safely rolled back is simpler
  • In general, it is a more controlled way of deploying code

However, it means that deploys can back up for hours, depending on how long each deploy takes.

The other type of setup “skips” stacked deploys, which is what we chose to get code out into the wild faster, at the cost of some deploys growing in size. We adapted the circle-lock script into our own version which automatically skips if later builds are detected. This implicitly relies on Circle build numbers being incremented as the branch moves forward, so we don’t rebuild old master branch deploys. If a deploy flakes and there’s another one coming, we wait for that.
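
A small simulation makes the trade-off between the two setups easier to see. The Ruby sketch below is a simplified model, not a measurement: it assumes a single deployer, a fixed deploy duration, and merge times given in minutes in ascending order. The function names and numbers are made up for illustration.

```ruby
# Each merge gets its own deploy, processed first-in, first-out.
# Returns the minute at which each merge lands on production.
def one_merge_per_deploy(arrivals, deploy_minutes)
  free_at = 0
  arrivals.map do |arrived|
    start = [arrived, free_at].max
    free_at = start + deploy_minutes # value of the block: the finish time
  end
end

# Stacked merges ship together: when a deploy starts, it includes every
# merge that has arrived by then. Returns the landing minute per merge.
def skip_stacked(arrivals, deploy_minutes)
  landed = Array.new(arrivals.size)
  free_at = 0
  i = 0
  while i < arrivals.size
    start = [arrivals[i], free_at].max
    last = i
    last += 1 while last + 1 < arrivals.size && arrivals[last + 1] <= start
    free_at = start + deploy_minutes
    (i..last).each { |k| landed[k] = free_at }
    i = last + 1
  end
  landed
end

arrivals = [0, 5, 10, 15, 20, 25] # one merge every 5 minutes
serial  = one_merge_per_deploy(arrivals, 20)
stacked = skip_stacked(arrivals, 20)
```

With a 20-minute deploy, `serial` is `[20, 40, 60, 80, 100, 120]` and `stacked` is `[20, 40, 40, 40, 40, 60]`. In this model the last merge lands an hour sooner when stacked deploys are skipped, at the cost of one deploy carrying four merges.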

Generating Release Notes

To notify the team (see below for more on that), at deploy time we generate notes about the commits about to be deployed. We use the Heroku API to get the “current” release, figure out the diff using vanilla git, and then format them for consumption. The code looks something like this:

DEPLOY_SHA=$(git rev-parse --short HEAD)
# SHA of the most recent "Deploy" release on Heroku
CURRENT_SHA=$(heroku releases -a "${APP_NAME}" | awk '/Deploy/ {print $3}' | head -n 1)
DEPLOY_NOTES="/tmp/deploy_notes.txt"
# %\" marks the structural JSON quotes so they can be told apart from
# quotes inside commit messages when the notes are escaped later.
git --no-pager log --reverse \
  --pretty=format:"{%\"author_name%\":%\"%an%\", %\"title%\": %\"%s%\", %\"title_link%\": %\"${GITHUB_URL}/commit/%h%\", %\"text%\": %\"%b%\"}," \
  --no-merges "${DEPLOY_BRANCH}...${CURRENT_SHA}" > "${DEPLOY_NOTES}"

Heroku Configuration

If you haven’t already, enable Preboot on your Heroku app. Without preboot, your app probably has anywhere from moments to minutes of downtime during a deploy. That might have been fine if you were deploying at non-peak hours once a day, but in a continuously deploying world you can count on multiple deploys per hour.

The only gotcha with preboot is there is a time period where your new servers are starting up and your old servers are still serving code. This does not mean requests are getting routed to new and old code simultaneously; instead, you should be aware that any connections your new code establishes during boot-up will be occurring while the old code still maintains its connections.

We use Postgres at Opendoor, which means we have to be especially aware of the number of concurrent database connections doubling during deploys. A connection pooler such as PgBouncer can help with this, as can a pooling proxy for any other data store your code connects to.
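
As a back-of-the-envelope check, the doubling can be written down directly. The Ruby sketch below assumes every dyno opens up to a fixed number of connections and that old and new dynos fully overlap during preboot. The helper names and any numbers passed in are hypothetical.

```ruby
# Worst-case database connections while old and new dynos overlap under preboot.
# Assumes each dyno opens up to `pool_per_dyno` connections.
def peak_connections(dynos:, pool_per_dyno:, preboot: true)
  steady_state = dynos * pool_per_dyno
  preboot ? steady_state * 2 : steady_state
end

# True when the worst case stays within the database's connection limit.
def fits_within_limit?(limit:, **opts)
  peak_connections(**opts) <= limit
end
```

For instance, 10 dynos with a pool of 5 use 50 connections at steady state but can reach 100 during a preboot deploy, so a database plan capped at 80 connections would be too small without a pooler in front of it.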

Risk Management

Much like self-driving cars, our self-driving deploys need to have safety checks at every step of the process and allow for human intervention.

After the Heroku step of the staging deployment finishes, we first check that the servers become available using the Heroku API. Something like this should work for many Heroku apps:

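require 'json'

# `app`, `started`, `timeout`, and `heroku_api_get` are defined elsewhere in the script.
down_dynos = []
while (started + timeout) > Time.now
  response = heroku_api_get("https://api.heroku.com/apps/#{app}/dynos")
  dyno_states = JSON.parse(response.body).map do |e|
    e.values_at('name', 'state')
  end
  down_dynos = dyno_states.select { |_, state| state != 'up' }
  # Stop early on a crash, or as soon as every dyno is up.
  break if down_dynos.any? { |_, state| state == 'crashed' }
  break if down_dynos.empty?
  sleep 5 # pause between polls; the interval here is an arbitrary choice
end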

Next, we check that certain critical endpoints like the homepage return 200 responses. We also have all of these checks in our testing suite at the unit and integration levels, but still perform one last triple-check before pushing to production.
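
A check like this can be quite small. The Ruby sketch below returns the paths that fail to answer with a 200. It is an illustration, not the actual check: `failing_endpoints`, the `fetch_status` hook, and any URLs passed in are hypothetical, and the hook exists only so the function can be exercised without a network.

```ruby
require 'net/http'
require 'uri'

# Returns the subset of `paths` that did not respond with HTTP 200.
# A path whose request raises (timeout, refused connection, ...) counts as failing.
# `fetch_status` takes a URL string and returns an integer status code;
# by default it performs a real GET with Net::HTTP.
def failing_endpoints(base_url, paths, fetch_status: nil)
  fetch_status ||= ->(url) { Net::HTTP.get_response(URI(url)).code.to_i }
  paths.reject do |path|
    begin
      fetch_status.call("#{base_url}#{path}") == 200
    rescue StandardError
      false
    end
  end
end
```

A deploy script could call `failing_endpoints(staging_url, ['/'])` and abort before the production step if the returned list is not empty.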

Before deployment to production starts, we check if the latest “release” on Heroku is a rollback. If we see that it’s a rollback, we take it as a sign that the code may not be in a releasable state (i.e. because it contains a regression). Our deploy script check looks like this:

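if [[ "$CI" == "true" ]]; then
  LAST_DEPLOY=$(heroku releases -a "${APP_NAME}" --json | jq -r '.[0].description')
  # A grep used as an `if` condition does not trip `set -e`, so there is no need to toggle it.
  if echo "$LAST_DEPLOY" | grep -q "Rollback to"; then
    echo "The current release is a rollback; can't deploy"
    # Status 0 means CI reports the job as successful rather than failed.
    exit 0
  fi
fi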

In the event of a rollback, an engineer must manually deploy from their machine to cause the latest release status to change and implicitly “re-activate” continuous deployment.

Finally, we can set a Circle environment variable to act as a kill-switch for production continuous deployment. We haven’t had to use this yet, but it’s the final safeguard before code is allowed to ship to production.

if [[ "$CI" == "true" && "$PRODUCTION_CD_ENABLED" != "true" && "$DEPLOY_ENV" == "production" ]]; then
  exit 0
fi

Slack Notifications

Engineers usually want to know when their code lands on production without continuously checking the status of Circle. We have a Slack-based notification system in place to help with this.

We have a general #alerts-deploy channel for automated messages sent from the deploy script and the Heroku Slack app. The default Heroku integration is easy to set up and is the “canonical” source of when your new code is deployed; however, we also need our own Slack messages within the deploy script to message Opendoor-specific errors and status updates.

For example, when our “check whether dynos started” assertion fails, we send an @oncall-flavored alert to the Slack channel for our on-call team to investigate.
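
For readers wiring up something similar, here is a minimal Ruby sketch of sending a message from a deploy script through a Slack incoming webhook. It is not the helper used in our scripts. `slack_payload`, `post_to_slack`, and the webhook URL are hypothetical; the `text` and `attachments` keys follow Slack's incoming-webhook message format.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Builds the JSON body for a Slack incoming webhook.
# `attachments` is an array of hashes (e.g. author_name, title, title_link, text).
def slack_payload(text, attachments = [])
  JSON.generate('text' => text, 'attachments' => attachments)
end

# Posts the message to the given webhook URL and returns the HTTP response.
def post_to_slack(webhook_url, text, attachments = [])
  Net::HTTP.post(
    URI(webhook_url),
    slack_payload(text, attachments),
    'Content-Type' => 'application/json'
  )
end
```

Building the payload with a JSON library means quotes and newlines in the message are escaped automatically.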

We also echo the release notes generated earlier into the Slack channel. Continuing the release notes code from earlier, we transform the notes into valid JSON and send them to Slack like this:

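# Escape every double quote, then turn the %\" markers back into plain quotes.
DEPLOY_NOTES_SAFE=$(sed 's/\"/\\"/g; s/%\\\"/\"/g' "${DEPLOY_NOTES}")
# ${VAR%?} drops the trailing comma left after the last commit entry.
slack "Deployed to production" "${DEPLOY_NOTES_SAFE%?}"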
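
The sed approach escapes double quotes, but commit bodies that contain newlines or backslashes would still need extra handling to stay valid JSON. One alternative, sketched below in Ruby, is to have git emit fields separated by control characters and let a JSON library do the escaping. This is an assumption-laden sketch rather than our script: `attachments_from_log` is hypothetical, and it expects `git log` to be run with a format such as `--pretty=format:'%an%x1f%s%x1f%h%x1f%b%x1e'`.

```ruby
require 'json'

FIELD_SEP  = "\x1f" # ASCII unit separator, emitted by %x1f in the git format
RECORD_SEP = "\x1e" # ASCII record separator, emitted by %x1e

# Turns delimited `git log` output into an array of Slack attachment hashes.
def attachments_from_log(log_output, github_url)
  records = log_output.split(RECORD_SEP).map(&:strip).reject(&:empty?)
  records.map do |record|
    author, subject, sha, body = record.split(FIELD_SEP, 4)
    {
      'author_name' => author,
      'title'       => subject,
      'title_link'  => "#{github_url}/commit/#{sha}",
      'text'        => body.to_s.strip
    }
  end
end
```

Calling `JSON.generate` on the result yields an attachments array in which quotes, backslashes, and multi-line commit bodies are all escaped correctly.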

Impact

We’ve been on continuous deployment for a few months. It has freed up engineering hours from monitoring manual deployments and even caught show-stopping bugs earlier in the process.

From a typical engineer’s perspective this process of “deploying” is decoupled from the specific platforms on which our code runs. As we think ahead to when and if we want to change infrastructure and tools, we can iterate on our platform without changing day-to-day workflows.

We’re looking for engineers of all backgrounds to build products and technologies that empower everyone with the freedom to move.

Find out more about Opendoor jobs on StackShare or on our careers site.