9 Steps to Better Deployments

Re-reading Joel Spolsky’s “The Joel Test: 12 Steps to Better Code” prompted me to write down a few thoughts on getting code from point A to point B. This is highly opinionated, glosses over some details, and represents just one way I’ve found to minimize many of the issues that arise in a build and release pipeline.

At the core of this is the concept of a safe system. A safe system is one whose default pathways do not allow dangerous actions.

1. Do trunk-based development

Avoid complicated branching strategies; do trunk-based development. There is only one long-lived branch, and all work branches from it. Work should be broken into small, bite-sized pieces that can be easily code reviewed. Each of these small chunks should be deployable as soon as it’s merged back into trunk. If the feature isn’t yet functional, the code should be hidden behind a feature flag or otherwise be a no-op.
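As a minimal sketch of keeping unfinished work inert on trunk, assuming flags are read from environment variables (a real system might use a config service instead): the `FEATURE_<NAME>` convention, the `new_checkout` flag, and the checkout functions are all hypothetical.

```python
import os


def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a boolean feature flag from the environment.

    Hypothetical convention: FEATURE_<NAME>=1/true/yes/on enables the flag.
    """
    raw = os.environ.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def legacy_checkout(cart: list[float]) -> float:
    # Existing, known-good code path.
    return round(sum(cart), 2)


def new_checkout(cart: list[float]) -> float:
    # In-progress rewrite (hypothetical): sums in integer cents.
    # Merged to trunk, but unreachable unless the flag is turned on.
    return sum(round(p * 100) for p in cart) / 100


def checkout(cart: list[float]) -> str:
    if flag_enabled("new_checkout"):
        return f"new:{new_checkout(cart)}"
    return f"legacy:{legacy_checkout(cart)}"


if __name__ == "__main__":
    # Takes the legacy path unless FEATURE_NEW_CHECKOUT is set.
    print(checkout([1.50, 2.25]))
```

Because the new path defaults to off, every merge is deployable even while the feature is half-built.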

I’ve seen multiple complicated, long-lived branching strategies that grew out of internal politics or attempts to overlay a business process onto git branching, and invariably they ended in expensive horror stories when the code had to be merged back into mainline.

The idea is to keep work flowing as a series of small, easily understood pieces. Most changes should be reviewable within a few minutes. This pays off further down the line in reliability and incident response: small changes make it easier to isolate errors, especially for on-call engineers who may not have made the change themselves.

2. Write tests

Continuous integration and automated (or push-button) deployments without tests are simply ways to deploy broken code faster. This doesn’t mean you need to do test-driven development (TDD), but it does mean you need enough tests to ensure the key pieces of your application are working.

The goal isn’t 100% test coverage; it’s having enough tests to be confident that you aren’t breaking your application with each change.
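For illustration, here is a small test case using Python’s built-in `unittest` module that pins down the behavior of one key function. `apply_discount` is a hypothetical stand-in for a critical piece of application logic; the point is covering the behavior that matters, not every line.

```python
import unittest


def apply_discount(total_cents: int, percent: int) -> int:
    """Hypothetical business rule: apply a percentage discount, rounding down."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return total_cents * (100 - percent) // 100


class ApplyDiscountTest(unittest.TestCase):
    def test_no_discount_leaves_total_unchanged(self):
        self.assertEqual(apply_discount(1999, 0), 1999)

    def test_discount_rounds_down_to_whole_cents(self):
        self.assertEqual(apply_discount(999, 10), 899)

    def test_full_discount_is_free(self):
        self.assertEqual(apply_discount(5000, 100), 0)

    def test_invalid_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(1000, 150)
```

Saved as a module, this runs with `python -m unittest <module>`, which is also the command a CI job would invoke on each commit.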

3. Build and test every commit

Continuous integration has many benefits, among them ensuring that every commit on trunk builds correctly in a neutral build environment, which rules out “it works on my machine” problems. Building and testing every commit is part of a strategy for finding bugs earlier in the development cycle.
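A CI service does this for you, but the core loop can be sketched in a few lines: run the build and test steps in order inside a fresh scratch directory and stop at the first failure. The `run_pipeline` and `passed` helpers are hypothetical, and the demo steps are placeholders for your project’s real checkout, build, and test commands.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class StepResult:
    name: str
    returncode: int


def run_pipeline(steps: list[tuple[str, list[str]]]) -> list[StepResult]:
    """Run (name, argv) steps in order inside a fresh scratch directory.

    Stops at the first failing step, so a broken build never reaches the
    test stage. In real CI the scratch directory would first receive a clean
    checkout of the commit under test.
    """
    results = []
    with tempfile.TemporaryDirectory() as workdir:
        for name, argv in steps:
            proc = subprocess.run(argv, cwd=workdir)
            results.append(StepResult(name, proc.returncode))
            if proc.returncode != 0:
                break
    return results


def passed(results: list[StepResult], expected_steps: int) -> bool:
    # A commit passes only if every step ran and every step succeeded.
    return len(results) == expected_steps and all(r.returncode == 0 for r in results)


if __name__ == "__main__":
    # Placeholder steps; a real pipeline would run something like
    # `git checkout <sha>`, then your build tool, then your test runner.
    demo = [
        ("build", [sys.executable, "-c", "print('building')"]),
        ("test", [sys.executable, "-c", "print('testing')"]),
    ]
    print("passed" if passed(run_pipeline(demo), len(demo)) else "failed")
```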

4. Build from source

I’ve occasionally seen teams doing manual deployments from a developer’s workstation, including live-editing code on a production web server to “deploy”. Don’t do that. Every build should come from a single source of truth: a version control repository that contains the latest trunk. This reduces the chance of user error and ensures everyone is working from the same source.

At the company that live-edited production, the deployment process was so difficult that the entire application was never deployed at once, only the subset of files touched by a given change. Eventually no one had any idea what differed between trunk in the repository and what was actually running in production. It took several weeks of effort to reconcile the two without introducing more errors.
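Drift like this is cheap to detect with file checksums. As a minimal sketch, assuming you have both a clean checkout of trunk and a copy of what is deployed on local disk, the hypothetical `drift` function below reports files that are missing, unexpected, or modified.

```python
import hashlib
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def drift(expected_root: Path, deployed_root: Path) -> dict[str, list[str]]:
    """Compare a clean checkout against what is actually deployed."""
    expected = checksums(expected_root)
    deployed = checksums(deployed_root)
    return {
        "missing": sorted(expected.keys() - deployed.keys()),
        "unexpected": sorted(deployed.keys() - expected.keys()),
        "modified": sorted(
            name for name in expected.keys() & deployed.keys()
            if expected[name] != deployed[name]
        ),
    }


if __name__ == "__main__":
    # Demo: trunk has v2 of a file, production has a hand-edited v1.
    with tempfile.TemporaryDirectory() as trunk, tempfile.TemporaryDirectory() as prod:
        (Path(trunk) / "app.py").write_text("print('v2')\n")
        (Path(prod) / "app.py").write_text("print('v1, hotfixed by hand')\n")
        print(drift(Path(trunk), Path(prod)))
```

Running a check like this regularly turns weeks of reconciliation into a list you can act on immediately.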

5. Store artifacts

Each commit that passes tests should result in an artifact that can be stored and deployed into any logical environment. This is known as continuous delivery.

Not every language produces a build artifact the way Java (JARs) or Go (binaries) does, and not everyone wants to use RPM or DEB packaging to install their applications in a given environment. The primary idea is that the artifact is a self-contained snapshot of the application and its dependencies at a point in time. Docker images do this handily, provided you can also map each image to a specific commit or point in time.
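A minimal sketch, using only the standard library, of packaging a build directory as an artifact whose name and manifest record the commit it came from. The `app-<commit>.tar.gz` naming scheme and the manifest fields are assumptions for illustration; a Docker image tagged with the commit SHA serves the same purpose.

```python
import hashlib
import json
import tarfile
import tempfile
from pathlib import Path


def build_artifact(build_dir: Path, out_dir: Path, commit: str) -> Path:
    """Package build_dir into out_dir/app-<commit>.tar.gz plus a JSON manifest.

    The manifest sits next to the tarball and records the source commit and
    the tarball's SHA-256, so any deployed artifact traces back to trunk.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    tarball = out_dir / f"app-{commit}.tar.gz"
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(build_dir, arcname=f"app-{commit}")
    manifest = {
        "commit": commit,
        "artifact": tarball.name,
        "sha256": hashlib.sha256(tarball.read_bytes()).hexdigest(),
    }
    (out_dir / f"app-{commit}.json").write_text(json.dumps(manifest, indent=2))
    return tarball


if __name__ == "__main__":
    # Demo with a throwaway build directory and a hypothetical commit SHA.
    with tempfile.TemporaryDirectory() as tmp:
        build = Path(tmp) / "build"
        build.mkdir()
        (build / "app.py").write_text("print('hello')\n")
        print(build_artifact(build, Path(tmp) / "artifacts", "abc1234").name)
```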

6. Deploy tested artifacts

At deployment time, you should be selecting from a set of artifacts that have already passed all tests. This keeps deployments short, because one artifact is simply exchanged for another and no build or test step is involved.

For organizations running multiple logical environments, metadata can be kept about these artifacts in order to “promote” them from one environment to the next, ensuring the application meets quality standards at each level.
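One way to keep that promotion metadata, sketched as an in-memory registry: an artifact may enter an environment only after it has been verified in the one before it. The environment names and the `ArtifactRegistry` class are hypothetical; in practice this state would live in your artifact repository or a database.

```python
from dataclasses import dataclass, field

# Hypothetical promotion order; each environment must be passed before the next.
ENVIRONMENTS = ["dev", "staging", "production"]


@dataclass
class ArtifactRegistry:
    # artifact id -> set of environments it has been verified in
    verified: dict[str, set[str]] = field(default_factory=dict)

    def register(self, artifact: str) -> None:
        """Record an artifact that passed CI; it may now enter the first environment."""
        self.verified.setdefault(artifact, set())

    def mark_verified(self, artifact: str, env: str) -> None:
        """Record that the artifact met quality standards in env."""
        self.verified[artifact].add(env)

    def can_promote(self, artifact: str, env: str) -> bool:
        if artifact not in self.verified:
            return False  # never passed CI, so never deployable
        index = ENVIRONMENTS.index(env)
        # The first environment only requires CI; later ones require the previous env.
        return index == 0 or ENVIRONMENTS[index - 1] in self.verified[artifact]
```

Keeping the rule in one place means the deploy button can refuse an unverified artifact instead of relying on someone to remember.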

7. Push-button deployments

If your organization hasn’t gone all the way to continuous deployment, the only manual effort in a deployment should be pushing the button. This single button should:

  • retrieve the desired artifact from an artifact repository onto each node
  • change the actively served application from old to new
  • report success or failure

Whether or not it uses the same mechanism, the ability to roll back to the last known good release is a must-have. Your rollback procedure will most likely be used in an emergency, so preemptive work to automate it pays dividends in reduced downtime and mean time to recovery (MTTR), ultimately saving money on the bottom line.
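One common way to implement “change the actively served application” and rollback on a single node is a `current` symlink pointing at one of several unpacked releases; switching releases renames a new symlink over the old one, which is atomic on POSIX filesystems. This is a sketch under those assumptions (a POSIX host, artifacts already fetched into a `releases/` directory, a hypothetical `healthy` callable standing in for your health check), and it is one technique among many.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional


def activate(base: Path, release: str) -> None:
    """Atomically point base/current at base/releases/<release>."""
    target = base / "releases" / release
    if not target.is_dir():
        raise FileNotFoundError(f"release not found: {target}")
    tmp_link = base / "current.tmp"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    tmp_link.symlink_to(target)
    os.replace(tmp_link, base / "current")  # rename(2) swaps the link atomically


def current_release(base: Path) -> Optional[str]:
    link = base / "current"
    return Path(os.readlink(link)).name if link.is_symlink() else None


def deploy(base: Path, release: str, healthy: Callable[[], bool]) -> bool:
    """Switch to release and report success; roll back if the health check fails."""
    previous = current_release(base)
    activate(base, release)
    if healthy():
        return True
    if previous is not None:
        activate(base, previous)  # roll back to the last known good release
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        for name in ("v1", "v2"):
            (base / "releases" / name).mkdir(parents=True)
        deploy(base, "v1", healthy=lambda: True)
        ok = deploy(base, "v2", healthy=lambda: False)  # simulated failed health check
        print(ok, current_release(base))  # False v1
```

Because rollback is just another `activate` call, the emergency path is the same well-exercised code as every normal deployment.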

8. Have criteria for a failed deployment

A few questions need to be readily answerable.

  • What constitutes a failed deployment?
  • How do you know it failed?
  • When do you roll back?

Ideally, the answers to these questions are expressed as code and integrated with a monitoring system that triggers a rollback if a deployment fails to meet the criteria. At the very least, detecting that a release failed and notifying the engineer who triggered the deployment (and possibly the on-call engineer) will build confidence in the system.
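As a sketch of expressing failure criteria as code: the thresholds below (an error-rate ceiling and a limit on p99 latency regression relative to the previous release) are hypothetical examples, and the metric values would come from whatever monitoring system you use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    error_rate: float      # fraction of requests that failed, 0.0-1.0
    p99_latency_ms: float  # 99th percentile latency in milliseconds


@dataclass(frozen=True)
class FailureCriteria:
    # Hypothetical thresholds; tune them for your own service.
    max_error_rate: float = 0.01
    max_latency_regression: float = 1.5  # allowed multiple of baseline p99


def evaluate(baseline: Metrics, deployed: Metrics, criteria: FailureCriteria) -> list[str]:
    """Return the reasons the deployment failed; an empty list means it passed."""
    reasons = []
    if deployed.error_rate > criteria.max_error_rate:
        reasons.append(
            f"error rate {deployed.error_rate:.2%} exceeds {criteria.max_error_rate:.2%}"
        )
    if deployed.p99_latency_ms > baseline.p99_latency_ms * criteria.max_latency_regression:
        reasons.append(
            f"p99 latency {deployed.p99_latency_ms:.0f}ms is more than "
            f"{criteria.max_latency_regression}x baseline {baseline.p99_latency_ms:.0f}ms"
        )
    return reasons


def should_roll_back(baseline: Metrics, deployed: Metrics, criteria: FailureCriteria) -> bool:
    return bool(evaluate(baseline, deployed, criteria))
```

Returning human-readable reasons rather than a bare boolean gives the notification to the deploying engineer something concrete to act on.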

9. Allow engineers to deploy their own code

If all of the above are in solid working order, the only deployable commits are changes that have passed code review and all tests, and that have a working rollback strategy with failure criteria. At this point, any engineer should be able to walk a change through to production and observe its effects on the application as a whole. If something goes wrong, they can roll back to the last known good release.

For organizations that require change management, using feature flags appropriately splits deployment and release into distinct actions: code can be deployed with the flag off and released later by turning the flag on.
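To illustrate that split: in the sketch below the code is already deployed, but who sees the new behavior is controlled by a rollout percentage that can change without a deployment. Hashing the user ID keeps each user’s experience stable as the percentage grows. The flag name and the in-memory `ROLLOUTS` table are hypothetical stand-ins for a real flag service.

```python
import hashlib

# Hypothetical flag state; in practice this lives in a flag service or config
# store, so changing it is a "release" that needs no new deployment.
ROLLOUTS: dict[str, int] = {"new_checkout": 0}  # percent of users, 0-100


def bucket(flag: str, user_id: str) -> int:
    """Deterministically map a user to a bucket in [0, 100) for this flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_released(flag: str, user_id: str) -> bool:
    return bucket(flag, user_id) < ROLLOUTS.get(flag, 0)
```

Raising the percentage from 0 to 100 is the release step, and it can go through change management on its own schedule.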

There is an enormous amount of trust implicit in this system, and it won’t be for everyone. I believe the benefits outweigh the risks once an organization has reached the level of build and release maturity that makes it possible. A system built with safety in mind is an empowering tool for engineers and pays off greatly in productivity and velocity of change.