Our journey to Continuous Deployment: Part 1
Overcoming the technical challenges
At Crunch, we develop our own software in-house. Over the years we’ve evolved and improved our development practices to overcome the challenges we’ve faced. We’ve been on a journey for the last three years and have finally achieved our goal of practising continuous deployment (CD), using a cutting-edge blend of technical, cultural and process improvements that allows us to automatically deploy changes to production after a developer has pushed their code.
In this two-part post we’ll explore the challenges we faced, and the solutions we came up with.
Part one begins by exploring the motivation behind starting this journey, and discusses how we made the business case for the significant investment required. We’ll then explore some of the technical challenges we faced.
In part two, we’ll focus more on how we overcame cultural challenges and dealt with process change within the Crunch Engineering department.
Once upon a time…
Three years ago, most development at Crunch was focused on our main accounting product, which at the time was deployed as a classic monolithic Java application. Teams had noticed that it was taking longer and longer to work on new features, with strange, difficult-to-track bugs appearing in unrelated features. Any change to one part of the application required the entire application to be redeployed, and having multiple teams working at the same time required significant coordination to avoid treading on each other’s toes.
Another key bottleneck was that, due to technical constraints, every deployment required around 10 minutes of downtime. To avoid disrupting clients and Crunch staff who were actively using the application, deployments could only be done out of hours, normally on a Sunday night. Deployments also required manual effort, especially if database migrations were involved.
Services, services, services
Our initial solution to these issues was to move towards a service-oriented architecture (SOA) built from microservices. Microservices are a popular modern architectural pattern in which a system is composed of many smaller subsystems (or microservices) that are cleanly separated from each other. This makes it easier to work on features in isolation, as the effects of updating a service can usually be confined to a smaller area.
There are also many tradeoffs to consider with this type of system. Whilst individual services can be deployed in isolation, with fewer dependencies and less coupling between them, deploying many services is harder and more complex than deploying a single monolith, which drove the need for greater automation of deployments. And whilst a monolith naturally has centralised logging and monitoring, microservices need additional infrastructure to collect logs in a central place. We realised we needed to build an entirely new automated platform on which to deploy our services.
Solving our existing technical challenges
Building fresh infrastructure from scratch gave us the opportunity to solve many of the issues we had with the existing infrastructure.
From the few microservices we had developed prior to starting work on the new platform, it was obvious that a manual deployment process was simply not scalable to more than a few services. Automating deployments would give us a huge number of other benefits:
- Faster deployments
- Less susceptibility to human error
- Freeing human operators from repetitive manual work (toil) so they can spend their time on more important and useful work
- The possibility of implementing much more complex deployment strategies, such as zero-downtime rolling upgrades, which are very difficult to perform with a manual process
- Deployments during business hours due to reduced downtime
The biggest benefit, however, is reduced risk. All of the above have allowed us to significantly increase the frequency of deployments (to around 10 per day on average). In turn, each deployment contains fewer changes*, which makes it less risky and makes it easier to figure out which change has broken something.
*the smallest deployment we’ve ever done was a single commit with a one-character copy change!
How did we go from an almost entirely manual process to an almost entirely automated one? Any release process, even a manual one, can be modelled as a series of tasks that need to happen, or a pipeline:
Like many other development teams, we were already automatically building and testing our repositories, which is Continuous Integration (CI). A CI server (such as Jenkins) polls the source control repository and automatically compiles the code and runs the tests every time a new change is pushed.
What we wanted to move towards initially was Continuous Delivery, which looks something like this:
Each task in this workflow is what we call a pipeline. A pipeline is composed of stages that run in sequence, which ensures that something is only deployed once everything it depends on is already in place. For example, we can’t deploy Redis unless we have a networking environment set up to deploy it into:
Note that the stages can have multiple jobs running in parallel, which speeds things up.
In our case, we decided to use GoCD for the way it models pipelines as first-class citizens. It has some neat features, such as the ability to track the path of a single commit to production in its Value Stream Map view.
Note (in the flow chart above) the presence of a manual step right before the deploy to production. This would allow us to verify that we were happy with deploying a particular build.
The ultimate goal, however, was to remove that manual step by becoming supremely confident that the pipeline would automatically block any broken build from being deployed. This would take us from Continuous Delivery to Continuous Deployment. This is how our release pipeline (for one of our many services) looks now in GoCD:
Building that confidence in our automation took a huge amount of effort and work on our processes and tooling.
Infrastructure as code
Much of this automation is fairly straightforward: for example, building a service and copying the package to a remote server in a test environment. Even setting up the remote server from scratch (e.g. installing package dependencies, user accounts and SSH keys) can easily be automated with powerful configuration management tools such as Ansible, Chef or Puppet. But how do we automate the creation of that server in the first place? This has become significantly easier with the rise of virtualisation, public clouds, and even software like OpenStack that organisations install in their own data centres. These all provide APIs and tooling so that resources can be provisioned automatically by writing scripts.
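As an illustration, a minimal Ansible playbook covering those server-setup tasks might look something like this (the host group, package names, user and key path are all hypothetical, not our real configuration):

```yaml
- hosts: app-servers
  become: true
  tasks:
    - name: Install package dependencies
      apt:
        name:
          - openjdk-8-jre-headless
          - nginx
        state: present

    - name: Create an application user account
      user:
        name: crunch-app
        shell: /bin/bash

    - name: Install the deployment SSH key
      authorized_key:
        user: crunch-app
        key: "{{ lookup('file', 'files/deploy_key.pub') }}"
```

Because the playbook is declarative and idempotent, it can be re-run safely: Ansible only changes what differs from the described state.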
An even more useful approach, which we decided to adopt, is the use of declarative “infrastructure languages” such as AWS CloudFormation and Terraform. These allow you to write templates declaring what the eventual infrastructure should look like; the tooling then works out the steps needed to bring the infrastructure to that state.
Here is an example using pseudo-Terraform. Our template specifies that we want to deploy a single server, behind a load balancer, and direct the www.crunch.co.uk DNS record at it:
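A sketch of such a template in pseudo-Terraform (the resource and variable names are illustrative, not our real configuration):

```hcl
variable "server_count" {
  default = 1
}

# The application server(s)
resource "aws_instance" "web" {
  count         = var.server_count
  ami           = var.base_image_id
  instance_type = "t3.small"
}

# Load balancer in front of the servers
resource "aws_elb" "web" {
  availability_zones = ["eu-west-1a"]
  instances          = aws_instance.web[*].id

  listener {
    instance_port     = 8080
    instance_protocol = "http"
    lb_port           = 80
    lb_protocol       = "http"
  }
}

# Point www.crunch.co.uk at the load balancer
resource "aws_route53_record" "www" {
  zone_id = var.dns_zone_id
  name    = "www.crunch.co.uk"
  type    = "CNAME"
  ttl     = 300
  records = [aws_elb.web.dns_name]
}
```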
Given this template, Terraform will create the resources for us. If we later want to add a second server, we just update the template, and Terraform will create only the new server (and point the load balancer at it):
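If the template drives the number of servers from a single count variable (an illustrative pseudo-Terraform sketch; the variable name is hypothetical), the change is one line:

```hcl
variable "server_count" {
  default = 2   # was 1: Terraform diffs the desired state against what
                # exists and creates only the one missing server
}
```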
Note that there are many variables we can alter in the template, which lets us reuse the same template across multiple environments.
Another issue we had with the monolith was that differences between the test environments and production required us to build a different binary package for each environment, which occasionally caused issues. The obvious solution was to build a single binary for all environments, but our new architecture allowed us to take this further by building a single virtual machine image for every environment. Rather than provisioning a fresh virtual machine for each environment and setting it up from scratch, we provision the server once, save a snapshot, then deploy copies of that snapshot into every environment. This has several advantages: we save time by not having to build and provision servers separately, and we eliminate as many differences between environments as possible. Having test environments running identical VM images to production means we are far more likely to find bugs and deployment issues before they reach production.
Not being able to deploy during business hours was a big blocker to more frequent deployments, so we needed to be able to seamlessly roll out new versions of a service with no user-facing impact. The usual solution is a rolling upgrade, where old versions of a service are progressively replaced with new versions, ensuring at least one replica remains in place to serve requests:
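The invariant behind a rolling upgrade (never take the last serving replica out of rotation) can be sketched as a small simulation in plain Java; this is illustrative, not our actual deployment code:

```java
import java.util.ArrayList;
import java.util.List;

public class RollingUpgrade {

    // Replace replicas one at a time. While replica i is out of rotation
    // being upgraded, the rest of the fleet keeps serving requests.
    public static List<String> upgrade(List<String> fleet, String newVersion) {
        List<String> replicas = new ArrayList<>(fleet);
        for (int i = 0; i < replicas.size(); i++) {
            int stillServing = replicas.size() - 1; // replica i is briefly down
            if (stillServing < 1) {
                // A single-replica fleet can't be upgraded without downtime
                throw new IllegalStateException("upgrade would cause downtime");
            }
            replicas.set(i, newVersion); // drain, stop, start the new version
        }
        return replicas;
    }

    public static void main(String[] args) {
        System.out.println(upgrade(List.of("v1", "v1", "v1"), "v2"));
        // prints [v2, v2, v2]
    }
}
```

The check also captures why a rolling upgrade needs at least two replicas: with only one, there is no way to avoid a gap in service.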
One reason we couldn’t do this with the existing monolith was that each replica stores user session data in memory, which means that a user has to communicate with the same replica throughout a session. We decided early on to adopt Spring Boot as our service development framework and to ensure we developed stateless services, meaning user requests can go to any replica of a service and be guaranteed to get the same response.
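To illustrate the difference (in plain Java rather than real Spring Boot code, with hypothetical handler names): a stateful handler keeps per-user data in one replica’s memory, whereas a stateless one derives its response entirely from the request, so any replica gives the same answer:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class Handlers {

    // Stateful: session data lives in this replica's memory, so a user's
    // requests must keep hitting the same replica ("sticky sessions").
    static class StatefulHandler {
        private final Map<String, Integer> sessionCounts = new ConcurrentHashMap<>();

        String handle(String sessionId) {
            int count = sessionCounts.merge(sessionId, 1, Integer::sum);
            return "request " + count + " for " + sessionId;
        }
    }

    // Stateless: the response depends only on the request itself (any shared
    // state would live in an external store), so every replica agrees.
    static class StatelessHandler {
        String handle(String sessionId) {
            return "hello " + sessionId;
        }
    }

    public static void main(String[] args) {
        StatelessHandler replicaA = new StatelessHandler();
        StatelessHandler replicaB = new StatelessHandler(); // a "different replica"
        // Both replicas return the same response for the same request
        System.out.println(replicaA.handle("u1").equals(replicaB.handle("u1"))); // true

        StatefulHandler s1 = new StatefulHandler();
        StatefulHandler s2 = new StatefulHandler();
        s1.handle("u1"); // the user's first request happened to hit s1...
        // ...so s1 and s2 now disagree about this user's session
        System.out.println(s1.handle("u1").equals(s2.handle("u1"))); // false
    }
}
```

With the stateless handler, a load balancer can route each request anywhere, which is exactly what a rolling upgrade needs.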
Having everything automated is a big help, but it’s just as important that everyone can be supremely confident that the automation won’t break things in production. A critical part of this is QA and testing. Previously we performed a lot of manual regression testing, where the team would sit down and run through a list of checks to verify that a release wouldn’t break anything. This often did catch regressions, but it was always conceivable that someone could miss one of the checks and let a bug through. And since running through the regression suite took 1–2 days, it severely limited how frequently we could release.
It’s true that developing automated tests is a big upfront investment, but the benefits are huge. Testers are freed up to actively hunt for bugs and do more exploratory checking. Developers are confident that they can push a change and let the release process catch unforeseen issues. This confidence drives development velocity.
We’ve discussed how our journey to continuous deployment started, why Crunch has invested in it and how we’ve overcome some of the technical challenges we’ve come across.
The journey has so far been a big success. The graph below shows the number of deployments we’ve been doing since the initial development of our automated platform. In the six months since moving from continuous delivery (with a manual release step) to continuous deployment, the number of development teams using the platform and the number of deployments have hugely increased!
However, getting to this stage takes more than just investment and clever technical solutions. The next post in this series focuses more on the cultural & human aspects of moving to continuous deployment, so stay tuned.
Written by Miles Bryant, DevOps Engineer at Crunch
Find out more about the Technology team at Crunch and our current opportunities here.