How we learnt to cling to the trunk and love our pipelines — my team’s journey to CI/CD goodness.

Here at John Lewis, for the past 4 years, we have been on a digital transformation journey to revamp our systems and become more agile and responsive to change. Here are a few things we learnt along the way as we built and improved our Continuous Integration/Continuous Delivery (CI/CD) processes.

Make frequent and small trunk based, test driven code changes

A trunk based, test driven development culture is the key to adopting a CI/CD process, but probably the most difficult to establish.

On our team we follow a trunk based development approach religiously and branch only by exception, when absolutely necessary (this happens at most twice a year).

This encourages small code changes that help to de-risk our work and forces us to think about simplifying and breaking complex problems into more manageable, testable chunks that can be deployed without breaking the system. It also gives us confidence in our changes, thanks to the increased investment in quality and test planning.

Another advantage of trunk based development is that it increases team collaboration and trust, encouraging people to ask questions, suggest/challenge ideas and ask for help when needed.

As everyone on the team is working on the trunk, there is more visibility of code changes (beyond the developer pair that is making them) and opportunity for more feedback, making the code more resilient and supportable.

Trunk based development

Trunk based development helps avoid merge hell. It helps the engineers spend more time developing features and providing solutions rather than trying to resolve merge conflicts.

It is important, however, to schedule wisely the pieces of work that touch the same part of the system, as multiple people changing the same piece of code may still result in some conflicts (although far fewer than when branching).

We also have a culture of rebasing rather than merging commits which helps us to keep a clean history with clear rollback points.

Adopt a Microservice approach to development

Having smaller, well defined, distributed applications with clear responsibilities accelerates the adoption of CI/CD. They are easier to work with and debug if required.

Smaller and simpler, loosely coupled microservices make changes more manageable and predictable. Reverting changes if required also becomes easier and quicker.

On our team we have various internal well-defined microservices that are leveraged to provide the final exposed API. This design keeps changes localised and manageable without impacting the exposed API.

Implement the You Build It, You Run It operating model

Having a single team to build and run applications can be beneficial and highly cost effective, reducing hand-offs and aiding ease of change.

It increases service reliability and availability and allows engineers to take more ownership of their products and champion continuous learning and improvement.

It gives engineers an understanding of how their applications are operating and performing and of the pain points that customers may have. It provides them with the freedom to decide when and how much to change/deploy, and to build in fast feedback loops, which are key to an improved design and more robust applications.

On our team, we have learnt how our microservices behave and perform on a typical day and are able to pick up on any issues and remedy them quickly. This knowledge also helps us to make functional changes to our applications faster, reducing the time to market for various requirements.

Since our team is on a 24x7 callout rota, we make sure as far as possible that the issues we are called out for are absolutely critical and have a clear action plan. This requires prior thinking and planning as a team. Embracing a continuous improvement strategy, we either fix the root cause of any unexpected issue quickly or define clear actions for it.

Since ours is a productised team, this operating model has allowed our team to have a bigger say in the direction and shape of our product (product catalogue on the John Lewis website).

Adopting this approach is easier for green-field applications, but we have used this approach heavily for digital transformation at John Lewis and found it to be quite effective.

This model is essential for a fast paced continuous integration and delivery culture and for most of the points discussed in this article.

Build a single standard way for deployment

Establishing a standard way for deploying code to various environments/regions/container clusters pays off in the long run.

It helps engineers concentrate on building features. It gives them confidence about productionising their code as the path is already established.

Some of the things discussed below can be useful.

Adopt a build once, run anywhere approach to software development. Packaging software with all its dependencies ensures that it can easily be deployed in different environments.
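In practice this usually means building a container image exactly once per commit and promoting that same image through every environment. As a minimal sketch, assuming a GitLab CI pipeline and its built-in container registry (the job and stage names are illustrative, not our exact configuration):

```yaml
# Build the image once, tagged with the commit, and push it to the registry.
# Every later deploy job reuses this exact artifact rather than rebuilding.
build-image:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```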

Create templates and set standards that can be used to deploy packaged software to any environment. For example, if you are running applications in a Kubernetes cluster, create blueprints for engineers to bootstrap their deployment configuration using standard/custom Kubernetes kinds. This helps to quickly on-board new teams and reduces lead time for changes. This could be done by the application team themselves or by a centralised Platform team, leaving application developers to concentrate on application programming.
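To make that concrete, here is a stripped-down sketch of the kind of Deployment manifest such a blueprint might generate for a service. The service name, image and probe path are hypothetical, and a real template would add resource limits, autoscaling and so on:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: product-api            # hypothetical service name
  labels:
    app: product-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: product-api
  template:
    metadata:
      labels:
        app: product-api
    spec:
      containers:
        - name: product-api
          image: registry.example.com/product-api:1.2.3   # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /healthz   # assumed health endpoint
              port: 8080
```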

At John Lewis we have a dedicated award winning Platform team who have created a paved road for us with templates we can use to bootstrap our deployments to the cloud. All our cloud infrastructure as well as permissions are controlled by this paved road pipeline. Read more about the things our Platform team have been doing here.

Deployment template yaml

Consider using on-demand container runtimes to speed up deployment and ensure there is at least one production-like environment/cluster where the code can be deployed and tested.

Since all the deployment configuration and infrastructure is defined in code, a successful rolling deployment in a non-production environment and subsequent automated testing provides strong confidence that the code will work in production.

Enforce an "only one way to deploy your application to any environment" strategy — via the pipeline. This makes every change visible and discourages manual interventions by individuals.
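As an illustration of how that might look in GitLab CI, a production deploy job can be restricted so it only ever runs from a trunk pipeline, behind a manual gate. The kubectl step and names are assumptions, not our exact configuration:

```yaml
deploy-production:
  stage: deploy
  environment: production
  rules:
    # Only pipelines on the trunk can deploy, and only when a human presses the button.
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
      when: manual
  script:
    - kubectl apply -f k8s/   # assumes cluster credentials are provided to the job
```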

Build a robust deployment pipeline

The aim for teams with a good CI/CD process is to strive for a state where -

Every green pipeline is deployable to production

To achieve this state it is key to know certain things about your change before you deploy it to production. A robust pipeline with stages that tell you quickly when your change does not work builds confidence in what you are deploying to production.

Green CI/CD pipeline

The stages should help you fail fast and shorten the feedback loop so that you can take appropriate action sooner.
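As a rough sketch of what that ordering can look like in a GitLab CI file (the stage names are illustrative, not our exact pipeline), the cheapest and fastest checks run first:

```yaml
stages:
  - build          # compile the code and run unit tests
  - integration    # black box tests across the service's bounded context
  - security       # dependency and container image scanning
  - performance    # load tests against a production-like environment
  - deploy         # rolling deployment, gated on everything above passing
```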

Here are some of the key things that the steps in our pipelines help us to answer, which you may want to consider -

1. Does the microservice/application work as expected on its own after your changes?

Running a build of the code, which includes running automated unit tests using relevant test containers and mocks, is a good indication of whether the changes work as expected and whether they break any existing functionality.

It is important to ensure that the tests are repeatable and deterministic; i.e. repeated runs of the tests (whatever type they are) on the same version of the artifact should yield the same result (success or failure).

If the results flap, i.e. sometimes pass and sometimes fail, there is a problem with either the code or the tests. When a fix is found, it should be applied to the trunk and a fresh pipeline triggered.

2. Does the change adversely affect the working of the wider service landscape?

It is important to know if the changed microservice still integrates and works well with other microservices in its own service domain to provide the right outcome.

Checking what impact a small change in one microservice in your landscape may have on the others will help to catch issues with integrations/configurations early.

These tests need not cover everything in detail; just whether the various microservices in the bounded context of the service integrate correctly. Hence they could be black box tests that trigger a process (or processes) and expect an outcome at the other end.

Most of the time, the impact can be anticipated and the tests changed or expanded accordingly. However, these tests are extremely valuable when they fail, indicating unexpected behaviour that your change may have introduced.

On our team we have a separate Gitlab project that handles these black box tests. Its pipeline is kicked off by the pipelines of almost all of our microservices in our service landscape. This ensures that a change in any microservice is tested end to end in the bounded context of our entire service landscape.
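GitLab supports this pattern directly through downstream pipelines. A sketch of the kind of job each microservice pipeline could use to kick off a shared black box test project (the project path here is hypothetical):

```yaml
black-box-tests:
  stage: integration
  trigger:
    project: our-group/service-landscape-black-box-tests   # hypothetical path
    strategy: depend   # this pipeline fails if the downstream test pipeline fails
```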

3. Does the change affect the performance of the application adversely?

Performance tests in the pipeline and performance trends can help with this.

The specifics of what you test and how will depend on the kind of application you are testing. Some of the things to think about when deciding how to test are -

  • The application type — is your application a highly available and performant API, a near real-time event driven system, a file processing system or a data insight/querying system? Depending on the type, the implementation of the performance test could vary.
  • Application auto-scaling — The purpose and approach to testing will vary depending on whether your application will auto-scale.
  • Metrics of interest — Identify the metrics that you want to capture and baseline. These will change as you learn more, but it is important to have a starting point for automating their capture.
  • Graphing and plotting performance trends — Use a tool to automate the visualisation of your test results. Performance trends over a period of time help baseline the performance of an application and identify changes or blips quickly (see the sketch after this list).
  • Action to be taken when the performance test fails — It is important to have a rough idea of where/how you would start to investigate a performance test failure, as it could be due to the application code, libraries that may have been pulled in, issues with the container itself or the cluster. Some high level thinking in the team about how to approach these would be beneficial.
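For example, a performance stage might run a load test against a production-like environment and keep the raw results so trends can be graphed over time. This is only a sketch, assuming k6 as the load testing tool and a hypothetical load-test.js script; the actual tool and thresholds will depend on your application type:

```yaml
performance-test:
  stage: performance
  image:
    name: grafana/k6:latest
    entrypoint: [""]
  script:
    - k6 run --out json=perf-results.json load-test.js   # hypothetical test script
  artifacts:
    paths:
      - perf-results.json   # kept so results can be graphed and trended over time
    expire_in: 30 days
```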

On our team, although we have applications ranging across the various types, we have found most value in graphing the performance trend for our highly available and performant API. This is because we can identify any immediate changes to the trend and investigate before it reaches production.

4. Is there a security vulnerability that should be addressed?

Vulnerabilities can surface at any time and can be in the code you write, the libraries you use or the container images your application is deployed in.

Secure by design

Having a library scanner that can tell you if you have any vulnerabilities in your code can help you to either remove/upgrade the offending libraries or suppress any false positives.

Tools like OWASP ZAP, OWASP dependency checker, Trivy or Contrast can be built into pipelines to flag vulnerabilities.
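As one possible way of wiring a scanner into the pipeline, a job like the following could scan the built container image with Trivy and fail on high or critical findings. The image tag variables follow GitLab conventions and the severity threshold is just an example:

```yaml
container-scan:
  stage: security
  image:
    name: aquasec/trivy:latest
    entrypoint: [""]
  script:
    # Fail the job (exit code 1) if any HIGH or CRITICAL vulnerability is found.
    - >
      trivy image --exit-code 1 --severity HIGH,CRITICAL
      "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```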

A process to address these flagged vulnerabilities should be set in place.

5. Does the change break anything for your consumers if you are a provider?

A great way of knowing if your changes will affect your consumers is by having consumer driven contract tests that run in your pipeline.

We at John Lewis rely heavily on contract tests to ensure our integrations work well between services/microservices. We use the PACT framework across teams and our Platform Team hosts a central PACT broker for sharing and storing consumer driven contracts and the verification results of those contracts.
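To give a flavour of how contracts reach the broker, a consumer pipeline might publish its pact files with a job along these lines. This is a hedged sketch using the public Pact CLI image; the broker URL and token variables are assumptions, not our actual set-up:

```yaml
publish-pacts:
  stage: integration
  image:
    name: pactfoundation/pact-cli:latest
    entrypoint: [""]
  script:
    # Publish the contracts generated by the consumer tests, versioned by commit.
    - >
      pact-broker publish ./pacts
      --consumer-app-version "$CI_COMMIT_SHORT_SHA"
      --broker-base-url "$PACT_BROKER_URL"
      --broker-token "$PACT_BROKER_TOKEN"
```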

We’ve also adapted the third party PactSwift package to make it easier to use for PACT testing in the mobile apps space.

6. Does the pipeline fail and send out relevant notifications when something breaks?

While it is important to have stages and steps that will give us confidence in our code changes, it is even more important to know when something breaks.

It is important to think about which steps you definitely want to fail fast on so that you can act quickly to resolve the issue.

There are integrations available between CI/CD platforms and communication tools that, if set up correctly, provide the right level of alerting to enable teams to take action.

On our team, we have Gitlab set up to notify us on specific Slack channels for pipeline failures or new commits.

Pipeline failure slack notification
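As an alternative sketch (not necessarily how our notifications are configured), a job in the pipeline's final stage can post to a Slack webhook only when an earlier job has failed; SLACK_WEBHOOK_URL here is a CI/CD variable you would define yourself:

```yaml
notify-failure:
  stage: .post            # built-in final stage, runs after all other stages
  when: on_failure        # only runs if something earlier in the pipeline failed
  image: curlimages/curl:latest
  script:
    - >
      curl -X POST -H 'Content-type: application/json'
      --data "{\"text\":\"Pipeline failed: $CI_PIPELINE_URL\"}"
      "$SLACK_WEBHOOK_URL"
```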

When a pipeline fails, we apply the change that resolves the issue to the trunk and trigger a fresh pipeline. This ensures that the artifact produced by a pipeline is immutable and gives us confidence in what we are deploying to production.

Develop an easy to use deployment/rollback process

The other important thing to consider is what you will do in case there is a problem once your changes are in production.

Rolling back to a working version of your application quickly is key to minimising impact.

Leverage functionality that your deployment platform provides you with. Build tools that aid/automate your deployment/rollback process.
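What this looks like depends on your platform. As a hedged illustration, on Kubernetes a rollback can be as simple as a manual pipeline job that rolls the Deployment back to its previous revision (the deployment name is hypothetical):

```yaml
rollback-production:
  stage: deploy
  environment: production
  when: manual              # a human decides when to roll back
  script:
    - kubectl rollout undo deployment/product-api
    - kubectl rollout status deployment/product-api --timeout=120s   # wait until healthy
```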

We use a tool built internally that can list all the commits made since the last production deploy up to the latest successful pipeline. These commits give us confidence in what we are deploying to production.

A link to the pipeline to be deployed, as well as a link to the pipeline to roll back to, are provided by the tool and are extremely useful for quick deploys and rollbacks if the need arises. This is a simple tool that calls Gitlab APIs and is used across multiple teams for production deployments.

Deployment tool output

Establish a process for dealing with anything related to production. For example, if you use Slack for alerting/communication, ensure that you can quickly find messages/alerts related to production.

In our team, we have chosen to direct all alerts related to production to a specific always-monitored Slack channel to avoid trawling through unimportant, non-production alerts. This keeps us focussed on the most urgent and important things from a support point of view.

We have also built a Slack application bot that copies specific messages from one or many Slack channels to our always-monitored production-only Slack channel.

Consider how you will know if something goes wrong

Think about errors/scenarios you want to know about in your system landscape and put in relevant messaging/alerting for those scenarios. This gives you confidence that you can react quickly if something goes wrong.

Think about —

  • Who wants to know and why?
  • What metrics need to be captured to enable alerting?
  • What is the action?
Metrics
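To make the "what metrics" question concrete, here is an example of the kind of alerting rule that might come out of it, assuming a Prometheus-style monitoring stack (not necessarily what we run) and a hypothetical service name. It pages only when the error rate has been bad enough, for long enough, that someone needs to act:

```yaml
groups:
  - name: product-api-alerts
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of requests have failed for 10 minutes.
        expr: >-
          sum(rate(http_requests_total{app="product-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{app="product-api"}[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of product-api requests are failing"
```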

We have found that at times a number of iterations may be needed to achieve the optimum level of monitoring and alerting.

To sum it up

The process of continuous integration and continuous deployment is iterative and you learn as you go.

There is no one size that fits all and, as with most things, starting small and then improving on it is the key to making it work.

It is worth investing time and effort in getting the enabling behaviours like trunk based, test driven development and microservice thinking embedded in the team early to make the CI/CD adoption smoother.

Ultimately it is about making the deploy to production less scary and the process to get there robust enough to trust it.

At the John Lewis Partnership we value the creativity of our engineers to discover innovative solutions. We craft the future of two of Britain’s best loved brands (John Lewis & Waitrose).

We are currently recruiting across a range of software engineering specialisms. If you like what you have read and want to learn how to join us, take the first steps here.
