Published in Wise Engineering

State of our CI/CD pipeline (Part 1)

The first of two articles on the state of the CI/CD pipeline at Wise, describing our developer workflow and CI approach.

Photo by Quinten de Graaf on Unsplash

Enabling product teams to ship new features quickly is key for a company’s growth and consolidation. But while iteration speed does matter, it’s important not to underestimate the negative impact a fast pace can have on overall system reliability and security (e.g. outages or security incidents).

At Wise, our Platform Engineering team has been measuring productivity metrics like Lead Time, Deployment Frequency and Change Failure Rate (see Accelerate Book and Four Key Metrics) to constantly track our performance, alongside availability SLOs to assess the impact of changes on our reliability.
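To make these metrics concrete, here is a minimal sketch of how they could be computed from a list of deployment records. The `Deployment` record, its fields and the sample data are assumptions made for illustration, not our actual data model or tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    commit_time: datetime  # when the change was committed
    deploy_time: datetime  # when the change reached production
    failed: bool           # whether the change caused a failure in production


def lead_time(deployments):
    """Median time between commit and production deployment."""
    seconds = [(d.deploy_time - d.commit_time).total_seconds() for d in deployments]
    return timedelta(seconds=median(seconds))


def deployment_frequency(deployments, period_days):
    """Average number of deployments per day over the observed period."""
    return len(deployments) / period_days


def change_failure_rate(deployments):
    """Share of deployments that resulted in a failure."""
    return sum(d.failed for d in deployments) / len(deployments)


# Made-up sample: four deployments over four days, one of which failed.
t0 = datetime(2023, 1, 2, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=2), False),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=4), True),
    Deployment(t0 + timedelta(days=2), t0 + timedelta(days=2, hours=3), False),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=1), False),
]

print(lead_time(sample))                 # median commit-to-production time
print(deployment_frequency(sample, 4))   # deployments per day
print(change_failure_rate(sample))       # fraction of failed changes
```

Using the median rather than the mean for lead time is a choice made for this sketch, so that a single slow change does not dominate the figure.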

With a growing engineering base and service fleet (both numbering in the 600s), any inefficiency or friction in the delivery pipeline gets severely amplified and can affect product teams’ velocity and impact. For this reason we decided, as did many other companies, to invest early on in developer productivity.

In this series, we will touch on our CI/CD (Continuous Integration / Continuous Delivery) pipeline for product teams, highlighting the challenges and limitations we faced, and describing how we are planning to evolve it and fill the gaps. Part 1 will cover developer workflow and CI pipeline, while in Part 2 we will dive into challenges and vision for CD.

It’s worth mentioning that the series focuses mainly on backend and frontend services, but our CI infrastructure serves our Android builds as well (and maybe iOS in the future).

Designing a CI pipeline

Our CI team, serving an average of 12k jobs a day across multiple platforms, recently migrated more than 1000 GitHub repositories from CircleCI to (self-hosted) GitHub Actions. Although this complex migration was supported by automation, it required a big coordination effort among more than 60 teams. As a result, teams now have a cohesive experience on GitHub, from opening a pull request (PR) to verifying checks and build status within the same consistent and familiar UI.

But let’s see what a typical developer workflow looks like and how we structure our CI pipeline (Fig. 1):

  1. a feature branch is created from the main branch;
  2. engineers develop and test locally and / or in cloud dev environments;
  3. when suitable, engineers push the branch to the remote and open a pull request against the main branch;
  4. CI checks, like unit / integration / functional tests, static analysis etc. start;
  5. change management policies apply and must be satisfied in order for the PR to be merged;
  6. artifacts are validated and uploaded to the artifacts repository.
Fig. 1: simplified example of our development workflow and CI pipeline

In order to better understand this pipeline, let’s break it down and cover the most important parts.

Local and cloud development

A fast feedback loop is key for effective development, so we generally use docker compose locally to spin up the containers needed for integration or end-to-end testing (e.g. database, Kafka). We’re also still using Testcontainers in our JVM services but, for performance reasons, we’ve been moving towards docker containers.

When a service needs to be tested in a wider context, engineers can also run exploratory tests by spinning up our internal, k8s-powered version of cloud development environments (a.k.a. custom environments) and interacting with the versions of other microservices currently running in production, adding custom fixtures and selecting only the required services.

As an alternative, they can connect via a local Envoy proxy to our staging environment (especially common for frontend projects), but in this case isolation is definitely not guaranteed.

Change management policies and auditability

As a regulated business in multiple countries, we strive to reduce the risk of production changes impacting our customers and us. This means enforcing standards across thousands of repositories, and defining guardrails and processes to guarantee the traceability of those changes.

Given our weak code ownership model, that meant developing an internal GitHub bot, which enforces meaningful, mandatory code reviews (at least 1, not stale) based on codeowners and defines a set of mandatory requirements (e.g. mandatory labels, successful tests, PR template, etc.) for a PR to be merged.
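As an illustration of the kind of rules such a bot evaluates, here is a minimal sketch of a merge policy check. Everything in it is hypothetical: the `PullRequest` and `Review` types, the `change-type` label, and the definition of a stale review (an approval given on a commit that is no longer the head of the branch) are assumptions for this example, not the bot's real implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Review:
    reviewer: str
    approved: bool
    commit_sha: str  # commit the review was submitted against


@dataclass
class PullRequest:
    author: str
    head_sha: str
    labels: set
    checks_passed: bool
    reviews: list = field(default_factory=list)


# Hypothetical mandatory label; real policies would be configurable.
REQUIRED_LABELS = {"change-type"}


def merge_blockers(pr, codeowners):
    """Return the list of unmet requirements; an empty list means mergeable."""
    blockers = []

    # An approval counts only if it comes from a code owner other than the
    # author and was given on the current head commit (i.e. it is not stale).
    valid_approvals = [
        r for r in pr.reviews
        if r.approved
        and r.commit_sha == pr.head_sha
        and r.reviewer in codeowners
        and r.reviewer != pr.author
    ]
    if not valid_approvals:
        blockers.append("needs a non-stale approval from a code owner")

    missing = REQUIRED_LABELS - pr.labels
    if missing:
        blockers.append("missing labels: " + ", ".join(sorted(missing)))

    if not pr.checks_passed:
        blockers.append("CI checks have not passed")

    return blockers


pr = PullRequest(
    author="alice",
    head_sha="abc123",
    labels={"change-type"},
    checks_passed=True,
    reviews=[Review("bob", True, "old456")],  # approval predates the last push
)
print(merge_blockers(pr, codeowners={"bob", "carol"}))
```

Returning the full list of blockers, rather than a single boolean, lets the bot tell the author everything that is missing in one comment.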

We also built some failsafes, like break-the-glass solutions, which allow us to bypass the checks in specific situations (e.g. incidents), but those events are monitored and then reviewed to avoid abuse.

Besides the validation / enforcement aspect, the bot was designed to automatically collect evidence as well, which can be used during external audits, streamlining the toilsome gathering process.
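The interplay between enforcement, break-the-glass bypasses and evidence collection can be sketched as follows. This is an assumption-laden example: the `merge_decision` function, the field names and the JSON layout are invented for illustration and do not describe the bot's actual evidence format.

```python
import json
from datetime import datetime, timezone


def merge_decision(pr_number, blockers, break_glass_reason=None, now=None):
    """Decide whether a PR may merge and produce an audit record.

    `blockers` is the list of unmet requirements. A break-the-glass reason
    allows the merge despite blockers, and flags the event for later review.
    """
    now = now or datetime.now(timezone.utc)
    bypassed = bool(blockers) and break_glass_reason is not None
    allowed = not blockers or bypassed

    evidence = {
        "pr": pr_number,
        "decided_at": now.isoformat(),
        "allowed": allowed,
        "blockers": list(blockers),
        "break_glass": bypassed,
        "break_glass_reason": break_glass_reason if bypassed else None,
        "needs_review": bypassed,  # bypasses are reviewed after the fact
    }
    # A real system would persist this record; here we only serialise it.
    return allowed, json.dumps(evidence, sort_keys=True)


allowed, record = merge_decision(
    42,
    ["CI checks have not passed"],
    break_glass_reason="incident mitigation",
)
print(allowed)
print(record)
```

Recording the decision at the moment it is taken is what makes the evidence cheap to produce later: an audit becomes a query over stored records rather than a manual reconstruction.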

Artifact validation and promotion

Once the mandatory checks are all satisfied, the pull request can be merged. The merge triggers the above-mentioned checks again, and also builds and uploads the docker image or library (e.g. client) to our Artifactory store.

Before making an image available to be used in the production environment, we run a promotion process, where we validate the provenance of the artifact (e.g. built from a protected branch, metadata checks, etc.). If the artifact is compliant, it will be pushed to the production registry and ready to be pulled.
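A promotion gate of this kind could look roughly like the sketch below. The metadata fields, the set of protected branches and the trusted builder name are all assumptions made for the example; the real checks and metadata are not described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactMetadata:
    """Hypothetical metadata attached to an image at build time."""
    image: str
    digest: str
    source_repo: str
    source_branch: str
    built_by: str


# Assumed policy values, for illustration only.
PROTECTED_BRANCHES = {"main"}
TRUSTED_BUILDERS = {"github-actions"}


def _is_sha256_digest(value):
    """True if the value looks like 'sha256:' followed by 64 hex characters."""
    prefix, _, hexpart = value.partition(":")
    return (
        prefix == "sha256"
        and len(hexpart) == 64
        and all(c in "0123456789abcdef" for c in hexpart)
    )


def promotion_errors(meta):
    """Return the reasons an artifact cannot be promoted (empty if compliant)."""
    errors = []
    if meta.source_branch not in PROTECTED_BRANCHES:
        errors.append(f"built from unprotected branch '{meta.source_branch}'")
    if meta.built_by not in TRUSTED_BUILDERS:
        errors.append(f"built by untrusted builder '{meta.built_by}'")
    if not _is_sha256_digest(meta.digest):
        errors.append("digest is not a valid sha256 digest")
    if not meta.source_repo:
        errors.append("missing source repository")
    return errors


candidate = ArtifactMetadata(
    image="example-service",
    digest="sha256:" + "a" * 64,
    source_repo="example-org/example-service",
    source_branch="feature/experiment",
    built_by="github-actions",
)
print(promotion_errors(candidate))
```

Only an artifact with no errors would then be copied to the production registry.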

It’s also worth noting that some teams, as part of the validation process, deploy the image to our staging environment and run some smoke tests, but this is not something enforced in our platform paved road yet.

Boundaries

Some CI solutions blur the line between CI and CD, allowing users to deploy or promote an artifact across environments. We, instead, have been adamant in defining clear boundaries between CI and CD, enforcing separation of concerns at the network level as well. This means that our self-hosted GitHub Actions runners cannot directly deploy into our production environment: we believe that the security benefits of this segregation outweigh the friction of using a separate, specialised tool for deployments.

As a consequence of this separation, our CI process ends with uploading an artifact to some storage (e.g. S3, Artifactory) and delegates the discovery to the CD tool or to some listeners, either via webhooks or queues.
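The hand-off between CI and CD can be sketched with a queue: CI publishes an event once the artifact is uploaded, and a listener on the CD side picks it up. In this example an in-process `queue.Queue` stands in for a real message queue or webhook, and the event fields (`type`, `name`, `version`, `location`) are hypothetical.

```python
import json
import queue


def publish_artifact_event(events, name, version, location):
    """CI side: announce that an artifact has been uploaded."""
    event = {
        "type": "artifact.uploaded",
        "name": name,
        "version": version,
        "location": location,  # where the CD tool can find the artifact
    }
    events.put(json.dumps(event))


def drain_events(events):
    """CD side: read every pending event without blocking."""
    received = []
    while True:
        try:
            received.append(json.loads(events.get_nowait()))
        except queue.Empty:
            return received


# Stand-in for a real queue; CI never talks to the production environment.
bus = queue.Queue()
publish_artifact_event(bus, "example-service", "1.4.2", "s3://example-bucket/example-service/1.4.2")

for event in drain_events(bus):
    print(event["name"], event["version"], event["location"])
```

The key property is that CI only writes to storage and emits a notification: the decision to deploy, and the credentials to do so, stay entirely on the CD side.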

Evolving our CI pipeline

Our CI team carried out a value stream mapping exercise in order to identify pain points for engineers. That highlighted some gaps in our metrics and processes, such as regressions in build times and a lack of notifications on job failures, just to name a few. In light of this feedback, the team is now actively working on collecting more data (via instrumentation) and addressing some of these issues.
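Once build durations are being collected, spotting a build-time regression can be as simple as comparing recent builds against a baseline. The sketch below is one possible approach, not the one our CI team uses: the function name, the use of medians and the 20% tolerance are all assumptions.

```python
from statistics import median


def build_time_regressed(baseline_secs, recent_secs, tolerance=0.2):
    """Flag a regression when the recent median build time exceeds the
    baseline median by more than `tolerance` (0.2 means 20%)."""
    baseline = median(baseline_secs)
    return median(recent_secs) > baseline * (1 + tolerance)


# Made-up durations, in seconds, for one repository's main build job.
baseline = [300, 310, 290]
slower = [380, 400, 390]
similar = [310, 320, 305]

print(build_time_regressed(baseline, slower))
print(build_time_regressed(baseline, similar))
```

A check like this could also drive the missing notifications, alerting the owning team when their builds slow down.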

Securing our pipeline is also an ongoing effort: our Application Security team is looking to integrate our separate Software Composition Analysis (SCA) processes and workflow with the CI pipeline, enabling a new set of use cases for our engineers (e.g. on demand container scanning).

But our efforts to secure our pipeline go further, as software supply chain attacks have been a hot topic in recent years, with big investments from many companies to help guarantee the integrity of the supply chain. We started looking at SLSA (Supply chain Levels for Software Artifacts), a security framework to safeguard the provenance of an artifact, and its integration with GitHub Actions but, despite being promising, it’s still early days.

We also ran an internal PoC of container signing with sigstore cosign, so we will keep an eye on the evolution of the project and the (thriving) community.

Besides the security aspect, GitHub has also recently released some interesting features, like shared private workflows, which could help us better standardise the CI configuration and improve the overall offering for our teams.

Conclusions

In this article we summarised our developer workflow and CI pipeline, outlining some of our ongoing and future efforts on securing the supply chain and improving developer productivity. In the next and final article, Part 2, we will present our journey and vision on CD.

Thanks to Lambros, Nick and Shadi for the feedback.

If you enjoyed reading this post and like the presented challenges, our Platform Engineering team is hiring! Check out our open Engineering roles here.

Massimo Pacher

@massimo_pacher on Twitter. Principal Engineer @ Wise.