Release Engineering

How do you release software safely, with reliability in mind? How do you bring your development process together with SRE practices for getting that software out to your customers, without introducing unneeded complexity or fragile systems?

Scene from Google Cloud Next London 2017

I have been giving commentary on the SRE book, which can be found online. Release Engineering is Chapter 8. At the request of @asatarin, one of my Twitter followers, I am skipping ahead to it.

A lot of the content in this chapter is not a manual for how to do good releases, but rather a description of what Google does. I will try to translate from the Google-specific stuff to what you might find valuable in your own situation.


Release Engineering

Written by Dinah McNutt
Edited by Betsy Beyer and Tim Harvey

Release engineering is a relatively new and fast-growing discipline of software engineering that can be concisely described as building and delivering software [McN14a]. Release engineers have a solid (if not expert) understanding of source code management, compilers, build configuration languages, automated build tools, package managers, and installers. Their skill set includes deep knowledge of multiple domains: development, configuration management, test integration, system administration, and customer support.

The worst thing you can do is pay too little attention to release engineering. Doing it well ties up a good engineer who could be writing code or designing new systems, but this is also where the rubber meets the road: without good practices from someone who really knows what they’re doing, you might cause outages.

As per the introduction to the SRE book: “SRE has found that roughly 70% of outages are due to changes in a live system, …”

Running reliable services requires reliable release processes. Site Reliability Engineers (SREs) need to know that the binaries and configurations they use are built in a reproducible, automated way so that releases are repeatable and aren’t “unique snowflakes.” Changes to any aspect of the release process should be intentional, rather than accidental. SREs care about this process from source code to deployment.
Release engineering is a specific job function at Google. Release engineers work with software engineers (SWEs) in product development and SREs to define all the steps required to release software — from how the software is stored in the source code repository, to build rules for compilation, to how testing, packaging, and deployment are conducted.

Not all Google teams have release engineers, and you shouldn’t aspire to have one on every team: release engineers generally focus on the complex and novel cases and streamline them. Often they can set things up once and let anyone follow the process.

Some of the tools described in this chapter are very arcane, and I would never want to configure them, but am perfectly fine with pressing “go” on them. I’m confident that baroque custom build configurations are not unique to Google.


The Role of a Release Engineer

Google is a data-driven company and release engineering follows suit. We have tools that report on a host of metrics, such as how much time it takes for a code change to be deployed into production (in other words, release velocity) and statistics on what features are being used in build configuration files [Ada15]. Most of these tools were envisioned and developed by release engineers.

Envisioned and developed by release engineers, but managers and directors really care about them: release velocity on a project that is going poorly might be measured in weeks or months, and this is exactly the sort of data that a management chain should have visibility into so they can direct resources into fixing it.

A question to ask: “If I submit my fix now, how long until our customers will see that change?” Is it minutes? Hours? Days? Months? Can you measure it?
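To make that question measurable, here is a minimal sketch of computing submit-to-production lead time. The record layout and field names (`submitted_at`, `deployed_at`) are my own invention for illustration; in practice the timestamps would come from your version control and deployment logs.

```python
from datetime import datetime, timedelta
from statistics import median

def lead_times(changes):
    """Submit-to-production durations for changes that have reached production."""
    return [c["deployed_at"] - c["submitted_at"] for c in changes if c["deployed_at"]]

# Hypothetical sample data, not taken from any real system.
changes = [
    {"id": "c1", "submitted_at": datetime(2017, 5, 1, 9), "deployed_at": datetime(2017, 5, 1, 15)},
    {"id": "c2", "submitted_at": datetime(2017, 5, 2, 10), "deployed_at": datetime(2017, 5, 4, 10)},
    {"id": "c3", "submitted_at": datetime(2017, 5, 3, 11), "deployed_at": datetime(2017, 5, 4, 11)},
    {"id": "c4", "submitted_at": datetime(2017, 5, 5, 12), "deployed_at": None},  # not released yet
]

durations = lead_times(changes)
median_lead_time = median(durations)  # the "typical" wait for a fix to reach customers
worst_lead_time = max(durations)      # the slowest change is often the more telling number
print("median:", median_lead_time, "worst:", worst_lead_time)
```

Changes that have not shipped yet are left out here. A real report would probably want to count those too, since a long queue of unreleased changes is itself a signal.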

Release engineers define best practices for using our tools in order to make sure projects are released using consistent and repeatable methodologies. Our best practices cover all elements of the release process. Examples include compiler flags, formats for build identification tags, and required steps during a build. Making sure that our tools behave correctly by default and are adequately documented makes it easy for teams to stay focused on features and users, rather than spending time reinventing the wheel (poorly) when it comes to releasing software.

If I never again see a custom shell script for building the project that only works on the original developer’s desktop, it will be too soon. Having an entire job role devoted to making sure that never happens is okay by me.

Google has a large number of SREs who are charged with safely deploying products and keeping Google services up and running. In order to make sure our release processes meet business requirements, release engineers and SREs work together to develop strategies for canarying changes, pushing out new releases without interrupting services, and rolling back features that demonstrate problems.

The level of engagement varies wildly. Some SRE teams work on every single release and follow it from start to finish. But this doesn’t scale very well: when I worked in Ads, we had more than 20 development teams all releasing software, at a rate of at least one release every two weeks. So we made our development teams responsible for their release process and demanded only that it be consistent with existing releases, so that SRE didn’t have to learn a new process each time.


Philosophy

Release engineering is guided by an engineering and service philosophy that’s expressed through four major principles, detailed in the following sections.

Self-Service Model

In order to work at scale, teams must be self-sufficient. Release engineering has developed best practices and tools that allow our product development teams to control and run their own release processes. Although we have thousands of engineers and products, we can achieve a high release velocity because individual teams can decide how often and when to release new versions of their products. Release processes can be automated to the point that they require minimal involvement by the engineers, and many projects are automatically built and released using a combination of our automated build system and our deployment tools. Releases are truly automatic, and only require engineer involvement if and when problems arise.

“At scale” is such an overused term, I feel. Ideally, the number of release engineers we need to handle a number of systems or client teams should grow sublinearly with the number of those teams. An O(log N) function, if you will.

The terrifying opposite is the old-school sysadmin approach, where a company would hire 1 sysadmin per N servers it provisioned, so headcount grows linearly with the fleet. That’s the opposite of scaling your human engagements.

High Velocity

User-facing software (such as many components of Google Search) is rebuilt frequently, as we aim to roll out customer-facing features as quickly as possible. We have embraced the philosophy that frequent releases result in fewer changes between versions. This approach makes testing and troubleshooting easier. Some teams perform hourly builds and then select the version to actually deploy to production from the resulting pool of builds. Selection is based upon the test results and the features contained in a given build. Other teams have adopted a “Push on Green” release model and deploy every build that passes all tests [Kle14].
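The two release models in that paragraph can be sketched as two selection policies over the same pool of builds. This is a simplification under my own assumptions: the `Build` type is hypothetical, and the selection here looks only at test results, whereas the book says teams also consider which features a build contains.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Build:
    revision: int       # source revision the build was cut from
    tests_passed: bool  # True when every test went green

def push_on_green(builds: List[Build]) -> List[Build]:
    """Push on Green: every build that passed all tests gets deployed, oldest first."""
    return sorted((b for b in builds if b.tests_passed), key=lambda b: b.revision)

def pick_from_pool(builds: List[Build]) -> Optional[Build]:
    """Pool selection, simplified: the newest green build, or None if nothing is green."""
    return max((b for b in builds if b.tests_passed), key=lambda b: b.revision, default=None)

hourly_builds = [Build(101, True), Build(102, False), Build(103, True), Build(104, False)]
deploy_all = push_on_green(hourly_builds)
deploy_one = pick_from_pool(hourly_builds)
print([b.revision for b in deploy_all], deploy_one)
```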

High velocity, such as push-on-green, goes hand in hand with excellent QA procedures.

We discussed error budgets in earlier chapters, and acknowledged the risk that new software presents. If you are okay with a 99% uptime system, then pushing every green build might be acceptable, but if a bad push takes more than 26 seconds to detect and roll back, then you would consume the entire monthly error budget of a 99.999% system with one bad push.
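The 26-second figure comes from simple arithmetic, assuming a 30-day month and a bad push that takes the whole service down: 30 days is 2,592,000 seconds, and 0.001% of that is 25.92 seconds. A small sketch (the function name is mine) makes it easy to try other targets:

```python
def error_budget_seconds(slo: float, window_days: int = 30) -> float:
    """Seconds of total unavailability an availability SLO allows over the window."""
    window_seconds = window_days * 24 * 60 * 60
    return (1.0 - slo) * window_seconds

for slo in (0.99, 0.999, 0.9999, 0.99999):
    budget = error_budget_seconds(slo)
    print(f"{slo:.3%} availability allows about {budget:,.1f} seconds of downtime per 30 days")
```

At 99% the same calculation gives 25,920 seconds, a little over seven hours, which is why a single bad push is survivable there and fatal at five nines.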

Much more typical for high-reliability systems is a weekly or daily release cycle, moving the release out in tiny stages, with automated analysis to make sure that no regressions have crept in, and so if they take a while to detect, only a small percentage of your serving is affected.

If you have a large error budget, you can of course release more often. Having a big error budget is a luxury you should appreciate, and this is where you can spend it responsibly!

Hermetic Builds

Build tools must allow us to ensure consistency and repeatability. If two people attempt to build the same product at the same revision number in the source code repository on different machines, we expect identical results. Our builds are hermetic, meaning that they are insensitive to the libraries and other software installed on the build machine. Instead, builds depend on known versions of build tools, such as compilers, and dependencies, such as libraries. The build process is self-contained and must not rely on services that are external to the build environment.

We make our builds hermetic through two major techniques:

This is of course not a viable approach for many organisations, and is very unusual. Instead, a modern technique for making hermetic software releases is to use Docker: by building your release into a Docker image, you can be sure the version running in production is hermetic and identical to the version you built and tested.
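Whatever tooling you use, the “identical results” expectation is something you can test: build the same revision twice, on different machines if you can, and compare a digest of the outputs. A minimal sketch, assuming the build writes its outputs into a directory; `fake_build` is a stand-in I made up so the example runs on its own.

```python
import hashlib
import tempfile
from pathlib import Path

def tree_digest(root: Path) -> str:
    """SHA-256 over the relative path and contents of every file under root."""
    h = hashlib.sha256()
    files = (p for p in root.rglob("*") if p.is_file())
    for path in sorted(files, key=lambda p: p.relative_to(root).as_posix()):
        h.update(path.relative_to(root).as_posix().encode())
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()

def fake_build(out_dir: Path, banner: str) -> None:
    """Stand-in for a real build step: writes two output files."""
    (out_dir / "bin").mkdir()
    (out_dir / "bin" / "server").write_bytes(b"pretend-binary")
    (out_dir / "VERSION").write_text(banner)

with tempfile.TemporaryDirectory() as a, \
     tempfile.TemporaryDirectory() as b, \
     tempfile.TemporaryDirectory() as c:
    fake_build(Path(a), "rev 1234\n")
    fake_build(Path(b), "rev 1234\n")
    fake_build(Path(c), "rev 1234 built on my-laptop\n")  # machine detail leaked in
    reproducible = tree_digest(Path(a)) == tree_digest(Path(b))
    leaked = tree_digest(Path(a)) == tree_digest(Path(c))

print("two clean builds match:", reproducible)
print("build with leaked hostname matches:", leaked)
```

Embedded timestamps, hostnames and absolute paths are the usual reasons two builds of the same revision differ, and a check like this is a cheap way to find them.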

Rebuilding older releases when we need to fix a bug in software that’s running in production can be a challenge. We accomplish this task by rebuilding at the same revision as the original build and including specific changes that were submitted after that point in time. We call this tactic cherry picking. Our build tools are themselves versioned based on the revision in the source code repository for the project being built. Therefore, a project built last month won’t use this month’s version of the compiler if a cherry pick is required, because that version may contain incompatible or undesired features.

Cherry picks are, fortunately, very easy to do in modern version control systems and should be well understood.

Enforcement of Policies and Procedures

Several layers of security and access control determine who can perform specific operations when releasing a project. Gated operations include:
  • Approving source code changes — this operation is managed through configuration files scattered throughout the codebase
  • Specifying the actions to be performed during the release process
  • Creating a new release
  • Approving the initial integration proposal (which is a request to perform a build at a specific revision number in the source code repository) and subsequent cherry picks
  • Deploying a new release
  • Making changes to a project’s build configuration
Almost all changes to the codebase require a code review, which is a streamlined action integrated into our normal developer workflow. Our automated release system produces a report of all changes contained in a release, which is archived with other build artifacts. By allowing SREs to understand what changes are included in a new release of a project, this report can expedite troubleshooting when there are problems with a release.

Strong authentication and authorisation are of course extremely useful things to have. The principle here is to not release builds made on developer machines: have a secure build server that independently retrieves the right code from version control and builds it in a verifiable way.

You want to have a chain of trust so that you can be absolutely sure that code running on your servers is exactly what was in your version control, and doesn’t have any local edits or weird dependencies linked in.
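One link in such a chain can be sketched as follows: the build server signs the pair (revision, artifact digest), and the deploy step refuses anything whose signature does not verify. This is a simplification under my own assumptions. It uses an HMAC with a shared secret to stay within the standard library, where a real setup would more likely use asymmetric signatures so that deploy hosts cannot forge releases. The key, revision string and function names are all hypothetical.

```python
import hashlib
import hmac

def sign_release(key: bytes, revision: str, artifact: bytes) -> str:
    """Build-server side: bind the artifact's digest to the revision it was built from."""
    digest = hashlib.sha256(artifact).hexdigest()
    message = f"{revision}:{digest}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify_release(key: bytes, revision: str, artifact: bytes, signature: str) -> bool:
    """Deploy side: recompute the signature and compare in constant time."""
    expected = sign_release(key, revision, artifact)
    return hmac.compare_digest(expected, signature)

build_key = b"hypothetical-secret-held-by-the-build-server"
artifact = b"bytes of the release binary"
signature = sign_release(build_key, "rev-1234", artifact)

untouched_ok = verify_release(build_key, "rev-1234", artifact, signature)
tampered_ok = verify_release(build_key, "rev-1234", artifact + b" plus a local edit", signature)
wrong_rev_ok = verify_release(build_key, "rev-9999", artifact, signature)
print(untouched_ok, tampered_ok, wrong_rev_ok)
```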


My Thoughts

The rest of this chapter talks about specific tools to provide the above principles. I may give that a commentary, but intend to go in another direction, and talk about my own thoughts here.

A software engineering principle around simplicity has stuck with me ever since I first heard it:

“The best possible user interface has only one button, and ideally they press it in the factory for you so you don’t have to.”

An excellent release process should be like this. The more complex you make your release procedures, the more likely that corners are going to be cut.

The worst instructions I see are things like “Push the release to 10%, and check the graphs to see if everything’s okay.” — no one will do a good job of that!

With an excellent application of the above philosophy (the four principles are “Self-Service Model”, “High Velocity”, “Hermetic Builds” and “Enforcement of Policies and Procedures”), you should be able to build a release process that does the right things for your chosen reliability and is sufficiently simple.

As an SRE or release engineer who wants to scale your engagement across your organisation, you should aim to have a release process in place that safely releases your software as fast and as often as is appropriate, and automatically rolls back bad releases.

In an ideal world, you only get notified after something goes tragically wrong, because simple things that go wrong result in implicit and safe rollbacks.
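Instead of “push to 10% and check the graphs”, the check itself can be code. Here is a minimal sketch of a staged rollout that rolls back on the first failed health check. The two callbacks are hypothetical hooks into your own deployment and monitoring systems, and a real version would wait between stages to let the metrics settle.

```python
from typing import Callable, List, Sequence

def staged_rollout(
    stages: Sequence[int],
    set_traffic_percent: Callable[[int], None],
    is_healthy: Callable[[], bool],
) -> bool:
    """Advance through traffic stages; on the first failed check, roll back to 0%."""
    for percent in stages:
        set_traffic_percent(percent)
        if not is_healthy():
            set_traffic_percent(0)  # automatic rollback, no human in the loop
            return False
    return True

# Toy wiring so the sketch runs on its own: record each traffic change,
# and pretend a regression only shows up once 10% of traffic is reached.
history: List[int] = []

def set_traffic_percent(percent: int) -> None:
    history.append(percent)

def is_healthy() -> bool:
    return history[-1] < 10

succeeded = staged_rollout([1, 10, 50, 100], set_traffic_percent, is_healthy)
print("rollout succeeded:", succeeded, "traffic history:", history)
```

Only when this function returns False would someone need to be told, and by then the new version is already serving no traffic.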

Downtime For Releases

Something that’s missing here is a discussion of the disruption that releases cause: How do you account for planned downtime during the release? What about data format upgrades?

The real answers for this lie outside of release engineering: these are pieces that have to be built from the ground up in the product. These are business and product decisions like:

  • Separate your binary rollout from new feature activation: Use flags to safeguard new features, turn them on slowly, and turn them off if they don’t work.
  • Don’t accept downtime for data format upgrades: Support both formats during a migration period, even if this means you have to go through more release cycles.
  • Run multiple versions concurrently: Have multiple versions that you load balance between. If something is detected to be wrong with the release, stop serving from the newer version! This implicitly requires that rollbacks always work (because some of your customers are still on the last version).
  • Remove all single points of failure: If you have a SPOF, then you cannot upgrade that component without downtime or degraded serving. Often this requires additional cost because of the need for multiple servers and replication, but it’s worth it to get past 99% reliability.
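The first two decisions in that list can be sketched in a few lines each. Everything named here is hypothetical: the flag name, the bucketing scheme, and the two record formats are invented for illustration and are not how any particular product does it.

```python
import hashlib
import json

def bucket(flag_name: str, user_id: str) -> int:
    """Deterministically map a user to a bucket 0..99, independently per flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100

def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """The new code ships dark in the binary; the flag decides who actually sees it."""
    return bucket(flag_name, user_id) < rollout_percent

def read_record(raw: str) -> dict:
    """Accept both data formats during a migration instead of taking downtime."""
    data = json.loads(raw)
    if data.get("version", 1) == 1:
        # Hypothetical old format: a single "name" field.
        first, _, last = data["name"].partition(" ")
        return {"first_name": first, "last_name": last}
    # Hypothetical new format: the name is already split.
    return {"first_name": data["first_name"], "last_name": data["last_name"]}

users = [f"user-{i}" for i in range(1000)]
share_at_10 = sum(is_enabled("new_checkout", u, 10) for u in users) / len(users)
old = read_record('{"name": "Ada Lovelace"}')
new = read_record('{"version": 2, "first_name": "Ada", "last_name": "Lovelace"}')
print(f"share of users enabled at 10%: {share_at_10:.1%}", old == new)
```

Because the bucketing is deterministic, raising the percentage only ever adds users, and setting it back to 0 turns the feature off for everyone without a new binary release.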

Parting Thoughts

Start small: Make your builds consistent. Make it possible to deploy a new version to production with the touch of a button, and once you have good release qualification and monitoring in place (You have SLOs and Error Budgets!), make the rollouts happen automatically!

You may have a bumpy road. Remember that 70% figure: releases cause outages, but the road will get smoother and more predictable once you have a consistent and verifiable process.


I am Stephen Thorne, a Site Reliability Engineer at Google, annotating the SRE book on Medium. The opinions stated here are my own, not those of my company.