State of our CI/CD pipeline (Part 2)
The second and final article on the state of the CI/CD pipeline @ Wise, describing our CD journey so far and how we plan to implement our vision.
Wise’s migration to microservices was a clear driver behind the company’s early investment in developer productivity: reducing lead time and enabling product teams to quickly deploy their changes, despite the increased architectural complexity, was key to fostering our growth.
In Part 1 of this series, we presented our developer workflow and CI pipeline, highlighting some of our future efforts around security and productivity. We also discussed how we enforced a clear separation (even at the network level) between CI and CD systems, using artifact repository (e.g. Artifactory) webhooks or queues as triggers for the CD pipeline.
In this final article, we will present our Continuous Delivery journey, describing some of the challenges we’ve been facing — serving more than 150 deployments a day — and how we plan to address them.
Designing a Continuous Delivery pipeline
With the migration of our services to Kubernetes, we developed an internal tool, called Octopus, which allows product engineers to easily deploy their images across multiple Kubernetes clusters and use advanced techniques like canary releasing to reduce the blast radius of a faulty change.
Figure 1 shows a partial snapshot of our Octopus tool for a specific product service, with custom integrations for our telemetry stack: Grafana, Prometheus, Rollbar and Kibana.
Over time, we also tried to address some friction points, providing contextual links for our tools (e.g. metrics / logs for a specific pod or for canaries only) and allowing a degree of customisation for teams (e.g. adding their own business metric dashboard).
Finally, to improve visibility over SLOs, we integrated the availability SLO error budget of each Tier 1 and Tier 2 service: as of now, we don’t enforce a strict error budget policy, so teams can still proceed with a deployment even when the budget is exhausted.
Supporting canary deployments
To implement flexible canary releasing, where services can optionally define how much traffic (% of requests) to send to canary pods, we had to extend both our in-house service mesh control plane (based on Envoy proxy) and our monitoring stack, propagating the correct metadata so we could split telemetry by version.
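For illustration, a traffic split like this is typically expressed as weighted clusters in an Envoy route configuration. The fragment below is a simplified, hypothetical example (service and cluster names are made up), not our actual control plane output:

```yaml
# Simplified Envoy route fragment: send 3% of requests to canary pods.
# Names are illustrative, not our real configuration.
route_config:
  virtual_hosts:
    - name: payments-service
      domains: ["payments.internal"]
      routes:
        - match: { prefix: "/" }
          route:
            weighted_clusters:
              clusters:
                - name: payments-primary
                  weight: 97
                - name: payments-canary
                  weight: 3
```

A control plane can then adjust the weights at runtime to ramp the canary up or down without redeploying anything.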
To ease monitoring for product teams, we also scripted a streamlined, release-focused dashboard (see Fig. 2), with the aim of helping engineers identify anomalies and reduce recovery time: to make it available both for frontend (SSR) and backend services, we abstracted the underlying runtime (e.g. Java, Node) metrics using Prometheus recording rules.
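As a sketch of that abstraction (rule and label names are illustrative, assuming Micrometer-style JVM metrics and prom-client Node metrics), a recording rule can map both runtimes onto a single, runtime-agnostic series the dashboard can query:

```yaml
# Illustrative Prometheus recording rule: expose heap usage under one
# runtime-agnostic name for both Java (Micrometer) and Node (prom-client).
groups:
  - name: release-dashboard
    rules:
      - record: runtime:heap_used_bytes
        expr: |
          sum by (service, version) (jvm_memory_used_bytes{area="heap"})
          or
          sum by (service, version) (nodejs_heap_size_used_bytes)
```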
Using feature flags for staged releases
At Wise we recognise the importance of distinguishing between release and deployment, so our engineers can use our in-house, centralised feature service to define feature flags and stage the release of a feature according to specific criteria (e.g. 10% of users, only users from a specific region, etc.). Feature flag management lives in a separate tool, as those runtime changes do not trigger any deployment, and status changes are monitored just like standard deployments.
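As a minimal sketch of how a percentage-based stage might be evaluated (an illustration in Python, not our actual feature service; `is_feature_enabled` is a hypothetical helper), a stable hash keeps each user's assignment consistent as the rollout percentage grows:

```python
import hashlib

def is_feature_enabled(feature: str, user_id: str, rollout_percent: float) -> bool:
    """Deterministically bucket a user into a staged rollout.

    Hashing feature + user keeps a user's bucket stable, so raising the
    percentage only ever adds users to the feature, never flips them out.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000  # uniform in [0, 1)
    return bucket < rollout_percent / 100.0
```

Region-based or other targeting criteria would layer additional predicates on top of the same flag lookup.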
Limitations and new requirements
Octopus has served us well so far, enabling more than 150 deployments per day across our service fleet, but the feature our engineers appreciate most, its simplicity, is also its biggest weakness.
Lack of flexibility and paved road
Given its stateless nature (it is essentially a caching wrapper around the Kubernetes API), it’s not possible to define complex or flexible workflows and release pipelines (e.g. promote to environments, canary, then roll back or roll out). It also doesn’t enforce a paved road, leaving the process completely manual and error-prone: we baked in more guardrails over time, following some incidents, but these were ad hoc actions rather than a major rethink.
Cognitive load and manual, time consuming, processes
Overall, the whole process can be quite time consuming, as engineers need to manually monitor the rollout: a context switch (e.g. to Slack) can happen at any point, inevitably delaying the rollback and affecting our MTTR.
Cognitive load is also pretty high for users: both the canary assessment and its duration depend on the expertise and knowledge of the releasing engineer, making it daunting for new joiners not yet familiar with our monitoring stack.
To add extra complexity, Octopus doesn’t support the creation of a baseline deployment (running the production version but receiving the same % of traffic as the canary), so engineers must compare the canary deployment (receiving ~3% of the traffic) with the primary one (~97%).
This imbalance in the traffic split between versions can skew results: long-running pods might behave differently from freshly created ones (think memory leaks, cold starts, etc.).
Evolving our CD strategy
Following the requirements and limitations highlighted above, we defined a tooling-independent vision for our CD pipeline, to complement our CI offering.
Figure 3 shows a simplified view of our vision, based on multi-stage validation and promotion across environments: some of the stages, e.g. chaos testing, are long-term goals, but we wanted to make sure our tooling could be extended to support them in the future.
We assessed whether we could extend our internal tool to support those use cases, but we quickly realised that it would have taken considerable effort and we would have reinvented the wheel, implementing just another CD tool.
We ran a deep requirements analysis, considering open source and commercial options, but eventually settled on Spinnaker, as it is open source (with a great, active community) and battle tested at bigger companies like Netflix and Airbnb. Many of those companies have also shared their experiences migrating to Spinnaker, highlighting caveats and the resulting impact on the overall organisation, which will hopefully help us avoid some issues down the road.
On top of those points, automatic canary analysis (ACA), flexible pipeline management and multi-target support were also big selling points for our EC2 / Kubernetes setup, allowing a unified experience and a single pane of glass.
We deployed Spinnaker on our staging Kubernetes clusters (via the open source Spinnaker Operator) and integrated it with our internal systems (e.g. service mesh, monitoring stack, etc.), with the goal of assessing whether the tool could satisfy our requirements.
We decided to test our CD vision with product teams, so we created a managed pipeline template (as code, via sponnet), standardising the way we want teams to deploy (e.g. mandatory canary step, default canary config, automatic rollback, safety checks, etc.).
We then simulated the onboarding process for a candidate service, automating it via a GitHub Actions workflow and assigning the standard pipeline template (see Fig. 4).
Finally, we defined a base canary configuration (see Fig. 5), based on the standard telemetry we collect across services, evaluating response codes, latency, error logs, resource usage, etc. We split those metrics into groups, such as Availability, Latency and Utilisation, applying a different weight to each one in the final canary score computation.
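To make the weighting concrete, here is a hypothetical sketch (in Python; the group names, weights and function are illustrative, not the actual ACA implementation) of how per-group pass/fail verdicts could be combined into a single score:

```python
# Hypothetical weights: how much each metric group contributes to the
# final canary score (values are illustrative, not our real config).
GROUP_WEIGHTS = {"availability": 0.5, "latency": 0.3, "utilisation": 0.2}

def canary_score(verdicts: dict) -> float:
    """Combine per-metric pass/fail verdicts into a 0-100 canary score.

    Each group's score is the fraction of its metrics that passed;
    the final score is the weighted sum across groups.
    """
    score = 0.0
    for group, weight in GROUP_WEIGHTS.items():
        results = verdicts.get(group, [])
        group_score = sum(results) / len(results) if results else 1.0
        score += weight * group_score
    return round(100 * score, 1)
```

A release could then be promoted automatically above a chosen threshold and rolled back below it.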
We then generated traffic in our staging environment to simulate different scenarios (e.g. a completely broken new version, a partially broken or degraded version, etc.) and assessed our pipeline’s robustness and efficiency: initial results were pretty good, confirming we were on the right track.
Customising Spinnaker and moving forward
The initial, promising results convinced us to move forward and productionise our setup: as of now, we’re actively working on deploying Spinnaker to production, making sure we tick all the boxes of our security review (e.g. RBAC, mTLS, network policies, etc.), given it will be able to deploy to any of our Kubernetes production clusters.
To satisfy some of our internal requirements, we customised the UI by extending Deck and modified a few other services, adding extra capabilities for auditability (e.g. event logs) and instrumentation (e.g. OpenTelemetry for distributed tracing). Given our internal services run on Spring Boot 2, like the Spinnaker ones, it was relatively easy for us to reuse our internal knowledge and extend them.
From an operational perspective, we’ve been automating the whole provisioning and environment configuration (via Kustomize), from databases (via Terraform) to the mTLS setup. In the future we might consider using SPIRE, as we do for our services, but we deliberately decided not to increase the complexity of the setup for the initial launch.
We’ve also sketched out a staged migration plan, defining what to measure and how to incentivise teams to move over to the new platform: Airbnb has shared interesting insights on their migration and how they approached it at scale.
Finally, we have already started contributing back to the OSS community and we intend to continue doing so as we mature our setup, so if you’re interested in these challenges and OSS in general, we have a number of roles open in Platform Engineering.
In this article we discussed our CI/CD pipeline, highlighting some of the limitations and issues we faced while growing as a business. We also outlined our continuous delivery vision and our plan to evolve how we safely release changes for our customers, introducing automation to remove toilsome, error-prone processes and enforcing standardisation and guardrails across the organisation.
Thanks to Lambros, Nick and Shadi for the feedback.