One take on release management

A story about how software is delivered in the project I am currently working on

Andrew Howden
Y1 Digital
May 6, 2018 · 12 min read


Release Management

Defined on Wikipedia as:

Release management is the process of managing, planning, scheduling and controlling a software build through different stages and environments; including testing and deploying software releases.
https://en.wikipedia.org/wiki/Release_management

In terms of my day-to-day work it means graduating changes I have implemented from the local development environment through testing and quality assurance into the production environment. The development process, expressed as steps, looks as follows:

  1. Check out a new branch from master, which reflects what is in production. That new branch is the Story Branch.
  2. Check out a new branch from the Story Branch, which will become the Feature Branch. Add some changes that improve something in some way.
  3. Submit those changes for code review. On completion, release them to QA in a test environment spun up for the purpose.
  4. Should QA decide to approve the changes, release them into production and merge them back to master.

The requirements

This creates a number of requirements that must be met during the release process:

  1. A local development environment must exist that provides an “adequately similar” version of production.
  2. It must be possible to easily create test environments for the QA team to verify a story meets requirements and does not degrade functionality.
  3. It must be possible to deploy to production in an easy and safe way.

Additionally, at the time the technology choice was limited to:

  • A hosted service, or
  • AWS EC2 somehow

due to the breadth of knowledge that other solutions (for example, Kubernetes) would have required.

Eschewing Management

There is an abundance of managed hosting services that provide suitable tooling for releasing software; among them Magento’s official “Magento Cloud” product. Additionally, it stands to reason that those who build these services for third party consumption are inherently going to be better at building these technologies than others who have the mixed responsibility of both application and infrastructure development.

However, in my experience, hosted platforms do not provide the level of service that we can provide for the projects we implement. They are in many cases unwilling to meet the bespoke requirements I have (for example, Prometheus support, or a specific version of a given library installed). Further, during critical issues we have been required to work within the limited support offerings of hosted services, and in many cases they are simply not as invested as we are in determining the root cause of an issue.

Accordingly, in this case we opted for managing the application and environment on top of the EC2 abstraction provided by Amazon Web Services. This provided a suitable level of control (i.e. a virtualised host, but managed storage and hardware) allowing us to specify the project to our bespoke requirements.

We used Ansible as our tooling both for infrastructure provisioning (i.e. the creation of EC2s, Route53 records etc.) and for the management of the host and the application release process.
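
To give a rough flavour, a provisioning play of this kind might look something like the sketch below. The module names are assumed to come from the amazon.aws collection, and the project name, AMI, subnet, domain and env_id values are hypothetical placeholders rather than the project’s actual configuration.

```yaml
---
# provision.yml -- a sketch of provisioning one test host (all names are placeholders)
- hosts: localhost
  connection: local
  gather_facts: false
  vars:
    project: example-shop            # hypothetical project name
    env_id: a1b2c3d4                 # per-environment ID, derived from the branch name
    zone: example.com                # hypothetical hosted zone
  tasks:
    - name: Provision an EC2 instance for this environment
      amazon.aws.ec2_instance:
        name: "{{ project }}-{{ env_id }}"
        image_id: ami-0123456789abcdef0          # placeholder AMI
        instance_type: t3.medium
        key_name: deploy
        vpc_subnet_id: subnet-0123456789abcdef0  # placeholder subnet
        tags:
          Project: "{{ project }}"
          Environment: "{{ env_id }}"
          CostCentre: "{{ project }}"
        wait: true
        state: running
      register: ec2

    - name: Add a Route53 record so the new machine is discoverable
      amazon.aws.route53:
        state: present
        zone: "{{ zone }}"
        record: "{{ env_id }}.test.{{ zone }}"
        type: A
        ttl: 300
        value: "{{ ec2.instances[0].public_ip_address }}"
        overwrite: true
```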

What’s in a release

When designing this project I was also playing with Kubernetes (k8s). One of the lessons endorsed by k8s, as well as by Kief Morris in his excellent Infrastructure as Code book, is that all changes must be expressed as infrastructure as code.

We have only one control over production systems: a “deployment”. Accordingly, we would either have to invent a second control loop (perhaps called “infrastructure”) and decide on the responsibility split between infrastructure management and application management, or treat every change to production as part of a deployment.

We opted for the latter, and consider all state changes to a production system as part of a single release. This means that a deployment can mean:

  • Updating an application
  • Changing server configuration
  • Running some data migration process
  • Standing up or tearing down infrastructure

It’s all managed within a single release lifecycle. This makes the project very simple to manage, and does not hide any of the complexity of these systems behind third party services (except, perhaps, some non-automated service discovery, monitoring and the like).
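
In practice this can be as simple as a single entry point that imports the infrastructure, server and application plays in order. A sketch (the file names are illustrative, not the project’s actual layout):

```yaml
---
# deploy.yml -- one entry point for "everything that makes this project work" (a sketch)
- import_playbook: infrastructure.yml   # create or alter EC2 instances, Route53 records
- import_playbook: server.yml           # configure the host: packages, services, users
- import_playbook: application.yml      # release the application, run data migrations
```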

The software delivery pipeline

There are two software delivery pipelines in this project. The environments created by these pipelines are identical; however, the environment lifecycle itself is different, hence the two pipelines.

Additionally, the pipeline is both triggered and managed by version control. That is, the project deploys (a limited subset of) itself.

Testing

The testing pipeline is a five-step pipeline run in Bitbucket Pipelines.

There are two manual triggers, labelled with the word Initiate in the pipeline diagram. Practically speaking, the triggers are buttons in the Bitbucket UI: either a custom build target or a “manual approval” button.

Each build is executed against a specific commit, and is tied only to that commit. Additionally, testing environments use the hashed branch name as a unique ID to create and destroy the correct infrastructure.
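
A minimal sketch of how such an ID can be derived with Ansible’s built-in filters; BITBUCKET_BRANCH is the branch name Bitbucket Pipelines exports into the build environment, and the variable names are otherwise hypothetical:

```yaml
# group_vars/all.yml -- a sketch of deriving a per-branch environment ID
branch_name: "{{ lookup('env', 'BITBUCKET_BRANCH') }}"

# Hash the branch name and keep a short prefix, so the ID is safe to embed in
# hostnames, DNS records and tags regardless of which characters the branch uses.
env_id: "{{ (branch_name | hash('sha1'))[:8] }}"
```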

Additionally, while it is possible to execute a single step in isolation, this is heavily discouraged. Deployments are one single unit of “everything needed to make this project work”.

Each step is as follows:

Provision Infrastructure: Ansible requisitions an EC2 instance with the specifications supplied as part of the infrastructure declaration. The instances are tagged with the appropriate cost targets and configured not to be “safe”. Route53 records are added to make the newly provisioned machine discoverable.

Build Application Tarball: The Magento application is set up using the no-database build style possible after the 2.2 release. Code is vendored; static assets, the DI configuration and any other build steps are generated; and the resulting application is stored as a tarball in S3.
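
As a hedged sketch, that build step might look something like the following set of tasks. The Magento commands are the standard 2.2+ build commands; the bucket name, paths and env_id variable are hypothetical:

```yaml
---
# build.yml -- a sketch of the "build application tarball" step (paths and bucket are placeholders)
- hosts: localhost
  connection: local
  gather_facts: false
  vars:
    build_dir: "{{ playbook_dir }}/.."
    tarball: "/tmp/release-{{ env_id }}.tar.gz"
  tasks:
    - name: Vendor PHP dependencies
      command: composer install --no-dev --prefer-dist
      args:
        chdir: "{{ build_dir }}"

    - name: Compile the dependency injection configuration
      command: bin/magento setup:di:compile
      args:
        chdir: "{{ build_dir }}"

    - name: Build static assets
      command: bin/magento setup:static-content:deploy
      args:
        chdir: "{{ build_dir }}"

    - name: Create the release tarball
      community.general.archive:
        path: "{{ build_dir }}"
        dest: "{{ tarball }}"
        format: gz

    - name: Store the tarball in S3 for the release steps
      command: "aws s3 cp {{ tarball }} s3://example-releases/{{ env_id }}.tar.gz"
```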

Provision Server: The newly created EC2 is provisioned with an Ansible playbook. This playbook, like all other aspects of this project, is committed to the same repository. It installs and configures the server exactly as it would be in the production environment, with a couple of exceptions: test data is added to the system, and some safeties are put in place to ensure that egress from the system is impossible (such as using MailHog instead of SendGrid as the SMTP upstream).

The roles for the server are largely open source, either supplied by third party providers or published by us on GitHub. In this way we can give back to the open source community which enabled this style of management.

The role providers that I prefer are:

  • Geerlingguy
  • Juju4
  • Debops
  • Sitewards
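
Roles from these providers are pulled into the project via a requirements file and installed with ansible-galaxy; a sketch of what such a file might contain (the exact roles used in the project may differ):

```yaml
# requirements.yml -- install with `ansible-galaxy install -r requirements.yml`
# (an illustrative selection, not the project's actual list)
- src: geerlingguy.nginx
- src: geerlingguy.varnish
- src: geerlingguy.mysql
- src: geerlingguy.redis
```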

Ansible playbooks are (well, should be) idempotent. They are executed with every deployment to ensure that the server remains in the expected state, and to discourage ad-hoc changes to the server in testing or production environments.
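
For example, a task like the following (a sketch; the template and host group are hypothetical) only reports a change, and only restarts the service, when the rendered configuration actually differs from what is on the host; re-running it on every deployment is therefore cheap and safe:

```yaml
---
# a sketch of idempotent configuration management
- hosts: app
  become: true
  tasks:
    - name: Configure nginx for the store
      template:
        src: nginx.conf.j2          # hypothetical template in the repository
        dest: /etc/nginx/nginx.conf
      notify: restart nginx
  handlers:
    - name: restart nginx
      service:
        name: nginx
        state: restarted
```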

Release App Tarball: Ansible executes another playbook designed to manage the application. It runs the open sourced Magento 2 role to unpack the release into the appropriate directory on the production system, wire in any state (such as media or logs) and symlink it in so that it is production facing.

The role handles tasks such as setting up the cron jobs required for Magento, enabling maintenance mode prior to release, and releasing through a symlink swap such that downtime is extremely minimal.
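
In broad strokes, a symlink based release of this kind looks something like the tasks below. This is a simplified sketch rather than the actual role, and the directory layout, bucket and env_id are hypothetical:

```yaml
---
# release.yml -- a simplified sketch of a symlink based release (not the actual role)
- hosts: app
  become: true
  vars:
    release_dir: "/var/www/releases/{{ env_id }}"
  tasks:
    - name: Fetch the application tarball built earlier
      command: "aws s3 cp s3://example-releases/{{ env_id }}.tar.gz /tmp/release.tar.gz"

    - name: Create a directory for this release
      file:
        path: "{{ release_dir }}"
        state: directory

    - name: Unpack the application into the release directory
      unarchive:
        src: /tmp/release.tar.gz
        dest: "{{ release_dir }}"
        remote_src: true

    - name: Wire in shared state kept outside the release (media, logs)
      file:
        src: "/var/www/shared/{{ item }}"
        dest: "{{ release_dir }}/{{ item }}"
        state: link
        force: true
      loop:
        - pub/media
        - var/log

    - name: Enable maintenance mode on the currently live release
      command: bin/magento maintenance:enable
      args:
        chdir: /var/www/current
      ignore_errors: true    # there may be no live release yet on a fresh environment

    - name: Point the live symlink at the new release
      file:
        src: "{{ release_dir }}"
        dest: /var/www/current
        state: link
        force: true

    - name: Disable maintenance mode on the new release
      command: bin/magento maintenance:disable
      args:
        chdir: /var/www/current
```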

At this time, the zero downtime release made possible by the recent ability to query whether there are any database changes to be made has not been implemented. Practically, downtime is < 30s, and there is a nice error screen shown to users apologising and asking them to refresh.

Tear Down Infrastructure: Once the work has been through QA the environment is torn down. It’s no longer required, and we can save cash.
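
Tearing down is essentially the provisioning play in reverse; a sketch, with the same hypothetical names as above:

```yaml
---
# teardown.yml -- a sketch of removing a test environment once QA is finished
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Terminate the EC2 instance belonging to this environment
      amazon.aws.ec2_instance:
        filters:
          "tag:Environment": "{{ env_id }}"
        state: absent

    - name: Remove the Route53 record for the environment
      # depending on the module version, the record's current value may also
      # need to be supplied when deleting
      amazon.aws.route53:
        state: absent
        zone: example.com
        record: "{{ env_id }}.test.example.com"
        type: A
```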

Production

Production is almost the same as testing.

However, there is some nastiness inherent to staging and production: state management.

To explain: each testing environment is an exact replica of what production would be if it were provisioned at the current commit. However, production was provisioned at an older commit. Older configuration has been applied, and previously created files that are not referenced in the newer commit may still exist.

Accordingly, with a staging release we are not only testing that our server specification and application work correctly (that was already verified in the temporary testing environment), but also testing the management of state.

Additionally, unlike the testing environments, production does not get its own application build. Instead, a single application tarball is built and then deployed first to staging, then to production. In this way we can catch any build failures that the deploy process did not catch automatically.
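
In Bitbucket Pipelines terms that ordering might be expressed roughly as follows; the step names, artifact path and arguments to the build script are illustrative rather than the project’s actual configuration:

```yaml
# bitbucket-pipelines.yml (fragment) -- a sketch of "build once, deploy to staging, then production"
pipelines:
  branches:
    master:
      - step:
          name: Build application tarball
          script:
            - ./build/ci/ci.sh build
          artifacts:
            - build/artifacts/**        # hypothetical artifact path
      - step:
          name: Deploy to staging
          deployment: staging
          script:
            - ./build/ci/ci.sh deploy staging
      - step:
          name: Deploy to production
          deployment: production
          trigger: manual               # the "Continue" control
          script:
            - ./build/ci/ci.sh deploy production
```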

Lastly, we do not do infrastructure management in production. This was a calculated risk: it prevents us from making infrastructure changes through the release process, but it also prevents the footgun of accidentally deleting the production system with a careless inventory change. In future, we will probably elect to also do infrastructure management in production.

Lessons Learned

There have been a number of illustrative lessons as a result of designing this software release process. In no particular order:

Long deploys slow the iteration loop

Implementing automation on this project dramatically increased both the number and the safety of each release. In many cases I will create a test environment with an extremely simple change, and delegate the handling of this environment to QA or another third party.

However, there are some cases in which the automation itself is being tested, or a critical change must be shipped as quickly as possible (a hotfix in live). In these cases, the length of time that the automation takes is painful. A deployment to the testing environment takes ~25 minutes, and a deployment to production ~45. I have (to my peril) skipped the release process to implement a change before, and in emergencies it is possible to execute only parts of the build. However, both of these workarounds are less than ideal: the build should not take long periods of time to execute.

The goal I would hope to work towards is a releasable build in ~5 minutes. This should be possible with Magento 2 (the rate limiting step is likely the app compilation), and has been implemented in other places with simpler projects.

Limited delivery control was a blessing

The delivery pipeline has only a super limited number of controls:

  1. Start
  2. Continue
  3. Stop

This felt super frustrating at first. It was impossible to do things such as “deploy x version to y environment” or other common pipeline tasks.

However, practically speaking this meant these decisions needed to be embedded into the code and considered ahead of time. The logic (after a few iterations) is now sound, and we have the appropriate deployments in all cases.

Not being able to influence the build at build time led to a superior release cycle.

Independent tooling paid off

At Sitewards there are a number of internal services that are also designed to help manage the release of projects. This project deliberately eschewed these tools, initially because the approach to delivery was significantly different from those tried before.

However, there were a number of benefits to constructing the project with its only dependency being a CI/CD system. It allowed the project to evolve independently of other internal projects, to the point where (for this project) the delivery process we have is among the best of all projects in the company. Additionally, where the process is beginning to be duplicated we are forced to find vendoring mechanisms for our solutions (such as Ansible roles) rather than relying on centralised management of delivery. Lastly, should it be required that we hand this project over to a third party, we can hand over all tooling required for the running and maintenance of the project (after revocation of credentials).

Infrastructure as code allows large increases in visibility

Because the project is managed in code down to the Linux kernel, we can also modify its state down to the kernel. This allowed increases in visibility in two ways:

  1. It’s clear what’s happening on all machines. Granted some configuration drift, there should not be any random services that behave in a way we have long forgotten about.
  2. It’s possible to add additional tooling that will track and save application behaviour for later analysis.

In particular, this environment is heavily instrumented with Prometheus (most services having an exporter of some kind), and logs are aggregated to the central logging daemon, systemd-journald. Given any issue happening in the production environment, it is usually trivial to determine when and how the issue was caused.
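
As a small illustration, a Prometheus scrape configuration for such hosts might look like this (the target hostname is hypothetical; the ports are the conventional defaults for the node and MySQL exporters):

```yaml
# prometheus.yml (fragment) -- a sketch of scraping two of the exporters
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['a1b2c3d4.test.example.com:9100']   # node_exporter
  - job_name: mysql
    static_configs:
      - targets: ['a1b2c3d4.test.example.com:9104']   # mysqld_exporter
```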

Additionally, the replicability of the environment means that if an issue occurs in live there is an extremely high likelihood that it is also present in testing environments. Solutions can be created, tested and shipped back to production in a reliable way. Indeed, our QA team has caught bugs occurring in the production environment by testing the test systems.

Ansible was the right balance of complexity

During the requirements phase of this project several tools were evaluated in determining what the technology stack should be, including:

  • Docker (Kubernetes)
  • Managed Services
  • Chef/Puppet
  • Managed Delivery

In the end, Ansible was chosen as it was also being implemented successfully in a couple of other projects.

Ansible was a good choice for the central tooling of this project. It is not overly complex for those with some systems administration experience, and has a huge number of well written reusable components for a wide range of tasks. It’s quite common for me to decide that a service would be a good idea, and find a well written role that implements that service.

Perhaps future projects will look at other approaches of managing software, but Ansible is certainly the leading candidate for projects based on the VM abstraction.

Managing State Blows

At this time I have managed projects based both on the container abstraction (with Kubernetes) and on VMs (Ansible). Of the two, I would comfortably recommend the former for long term delivery.

The problem is that Ansible does not give any strong guidance for how to structure services or state. In the case of this project, it has:

  • Varnish
  • NGINX (as an ingress and FastCGI Proxy)
  • MySQL
  • Redis
  • Falco
  • OSQuery
  • VSFTPD

And many more components besides. These components are all housed on ${N} machines, configured in a bespoke way to ensure those services play well together. Additionally, the machines themselves have data that must be explicitly managed through backups and the like.

However, coming from an environment in which services are both completely independent and essentially overhead free on a Kubernetes cluster, and in which state must be deliberately handled outside the application lifecycle, managing the complexities of this system is not trivial. It is difficult to see dependencies between services, or when one service might interact with another.

Kubernetes has its own issues, and this is not a “Kubernetes vs” post; however, state management still sucks.

Multiple testing environments are handy

The nature of this pipeline means that we have ${N} testing environments — one for each story. This means that multiple streams of work can be tested and released completely independently, and work does not block other work. Additionally, there are no implicit dependencies between different streams of work.

Full replication of the production environment locally is a bad idea

When working on the project locally to analyse some issue, it is far more likely that the issue will be associated with the application than with the infrastructure. While it is possible to replicate the production environment locally, the production environment is ill suited to the rapid iteration that local development requires to keep our velocity as high as possible. Additionally, it is often super difficult for other developers to debug their local environments with a host of technologies they are unfamiliar with, and that they do not need to understand for their task.

Instead, an extremely limited version of the production environment is replicated locally, and its lifecycle is handled separately from testing and production.

Writing CI/CD independently of tooling paid off

The bitbucket-pipelines.yml configuration in this project is extremely simple.
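
The file itself is not reproduced here, but it has roughly the shape of the sketch below: a build image, and a handful of steps that each delegate to the one script. The image name, custom pipeline names and script arguments are hypothetical:

```yaml
# bitbucket-pipelines.yml -- a sketch of the "delegate everything to one script" shape
image: example/ci-runner:latest       # hypothetical image with Ansible and the AWS CLI

pipelines:
  default:
    - step:
        name: Build
        script:
          - ./build/ci/ci.sh build
  custom:
    provision-test-environment:       # an "Initiate" trigger for a test environment
      - step:
          name: Provision and deploy
          script:
            - ./build/ci/ci.sh test-deploy
    teardown-test-environment:
      - step:
          name: Tear down
          script:
            - ./build/ci/ci.sh test-teardown
```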

All of the complex build logic is executed in a build script at build/ci/ci.sh, written in bash. This allows a much larger set of tools to configure the build, and centralises all of the associated logic.

Additionally, it would allow easy portability between build systems.

In the future I would not recommend writing the build configuration in bash. While the language is extremely powerful, it also has quite a number of sharp edges and weird behaviours.

There needs to be an independent cleanup process

There are two ways in which this release process commonly goes wrong:

  1. A branch is deleted before an environment is torn down, or
  2. Tarballs are never deleted

In both cases this means that created resources are left dangling, and those resources incur some cost. In order to clean up after the build there should be an out-of-release reconciliation process to ensure that any dangling artifacts are cleaned up.

One potential solution is to create a new build task that simply checks for machines without corresponding branches, and deletes tarballs older than a certain date.
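
A hedged sketch of what that might look like: an S3 lifecycle rule to expire old tarballs, plus a comparison of running instances against the branches that still exist (module names assumed from the amazon.aws and community.aws collections; the bucket and tags are hypothetical):

```yaml
---
# cleanup.yml -- a sketch of an out-of-release reconciliation job
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Expire release tarballs older than 30 days
      community.aws.s3_lifecycle:
        name: example-releases          # hypothetical bucket
        rule_id: expire-old-tarballs
        prefix: ""
        expiration_days: 30
        status: enabled
        state: present

    - name: List the branches that still exist
      command: git ls-remote --heads origin
      register: heads

    - name: Find running test instances belonging to this project
      amazon.aws.ec2_instance_info:
        filters:
          "tag:Project": example-shop   # hypothetical tag
          instance-state-name: running
      register: instances

    # A full implementation would now terminate any instance whose Environment
    # tag does not match the hash of one of the branches in heads.stdout_lines.
```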

In Summary

This was the first project in which we designed a fully self contained software delivery pipeline. Its benefits have been immense, and I would consider it a requirement for any future project.

We learned a number of lessons in this project that I will take to the next. Container based delivery systems promise to relieve some of the burden still left in managing software, but I feel the jump will not be as significant as going from manual management to the software delivery pipeline.

Thanks

  • Ryan Fowler, as our conversation inspired the post and he was kind enough to review a draft of the article.
  • In no particular order: Anton Boritskiy, Zhivko Antonov, Patrick Kubica, Behrouz Abbasi, Aario Shahbany, Kelsey Hightower, Anton Siniorg, Winston Nolan, Kief Morris, Jeff Geerling, and the Google SRE book editors and authors. They have all contributed with literature, code or discussion to the above.
