Continuous delivery is the buzzword you hear every other day. Everybody does it, Google and Facebook included, but nobody shares the nitty-gritty details of how they do it.
At Wandera, we moved from one release every 2 weeks to 25+ releases a week. In this article I will describe our path and the impact it had on teams and the organization, as each team is now responsible for releasing (and running) the features it develops. The release is done by the push of a button, almost like this:
How we released in 2017
To show you the path, let’s look at how we were releasing our product (software as a service — running in the cloud) in 2017:
We had several cross-functional Scrum teams (frontend, backend, QA) working in two-week iterations to build features, which then had to be handed over to Operations to release. At the start of 2017, the gap between these two departments was tremendous, almost as if there was a real wall between them. As a result, features with an Operations dependency were often delivered more slowly.
In 2017 we made 25 regular releases (one every sprint) plus many hotfix releases, because with releases that big (80+ tickets) something always broke.
Our backend is based on microservices, but we weren’t utilizing the full potential of what the microservices bring — most notably independent releases — as we always released all changes in bulk after the sprint.
If you finished your feature at the beginning of the sprint, you had to wait until the end of the sprint, then through a week of regression testing, and then plan a handover to Operations. Your feature was released the following week, a lead time of 3 weeks from the moment you finished it.
There must be a better way — Platform 2.0
At the end of 2017, we realized this was not sustainable anymore and embarked on a journey to revamp not only our release process but also the Platform we were running our services on.
We formed a team that started building a new platform based on Kubernetes, with these goals:
Reimagined platform that allows us to deliver features faster with less operational burden.
Allow teams to deliver features at the push of a button to Production whenever they want.
The team had to learn a lot, as nobody at Wandera had experience with Kubernetes and at that time (2017) Kubernetes was just gaining momentum. Many things, such as configuration management, monitoring, and release pipelines, had to be built with the help of open source components like Prometheus.
No wonder it took us a year to build the foundations of the Platform. Teams also had to migrate their services to the new Platform — dockerizing each service, adding integrations to the new monitoring system and configuration management, defining how much CPU and memory the service consumes and many other things.
In December 2018 we onboarded one feature team with a rather isolated subsystem onto Platform 2.0 and the new release process, which gave them the power to release features to Production by themselves. For the first time since Wandera was established, someone other than Operations could release to Production.
You release it
Throughout 2019, we worked with this team and polished the process and pipelines before we onboarded more teams to the new platform and the new way of releasing.
Now to the nitty-gritty details of what the new release process looks like.
For many companies, the releasable unit is a commit. For us, it is a JIRA ticket (user story, bug). Once the ticket is verified, the team initiates the release.
We practice Continuous Delivery, not Continuous Deployment, where changes are promoted to Production automatically. We leave it up to the team to decide when the feature is ready to be released.
Our release pipelines are defined in a commercial tool called XL Release. We started with Jenkins, but very soon found that the pipelines were unmaintainable and couldn't be restarted. XL Release does the job we need very well.
For visibility we rely heavily on Slack. The pipeline posts a notification like this, which also shows all the information we provide to the person doing the release:
How can the person performing the deployment know that the release they just pushed to Production didn't break anything? We rely on automated testing on the Staging environment, and we have a comprehensive set of monitoring dashboards in Grafana where the person can see when the release was done and how it changed the metrics:
If something is off and the release person believes it would be better to roll back the feature, he or she can initiate the rollback right from the pipeline, rolling back the service and config version for each service which was changed.
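To make the rollback step concrete, here is a minimal sketch of how the set of services to roll back could be derived. It is not our actual pipeline code: the `Deployed` record, the `plan_rollback` function, and the service names and versions are all hypothetical, and the sketch assumes the pipeline records the service and config version of every service before and after the release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployed:
    """Service and config version running for one service (hypothetical model)."""
    service_version: str
    config_version: str


def plan_rollback(before: dict[str, Deployed], after: dict[str, Deployed]) -> dict[str, Deployed]:
    """Return the pre-release versions of every service the release changed.

    Services that did not exist before the release are skipped here; removing
    them would be a separate step.
    """
    return {
        name: before[name]
        for name, current in after.items()
        if name in before and before[name] != current
    }


# Hypothetical snapshot taken by the pipeline before and after a release.
before = {
    "gateway": Deployed("1.4.0", "cfg-12"),
    "reports": Deployed("2.1.3", "cfg-7"),
}
after = {
    "gateway": Deployed("1.5.0", "cfg-13"),
    "reports": Deployed("2.1.3", "cfg-7"),
}

rollback = plan_rollback(before, after)
for name, target in rollback.items():
    print(f"roll back {name} to {target.service_version} / {target.config_version}")
```

Only `gateway` changed in this example, so it is the only service in the plan; `reports` is left alone.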
If everything is OK — the feature behaves as expected and there are no problems visible on dashboards or in logs — the person who approved the release adds ✅ as an emoji to the Slack notification, which concludes the release.
As our releasable unit is a JIRA ticket, we wanted to show in JIRA what was released or what is waiting to be released — so we added a new state called “To Release” to our workflow:
This changed the Definition of Done for every team: a ticket is Done, and counted towards the team's velocity, only once it has been released. The transition is automatic. When the pipeline for a specific ticket finishes, the ticket is moved from “To Release” to “Done” and its release date is filled in.
Throughout 2019 we onboarded every feature team in Wandera to the new release process. As teams became DevOps teams, not only building but also releasing and running their services, we wanted to support them and create truly cross-functional teams.
We embedded an Operations person in almost every team to help with monitoring, alerting, and other operational aspects of features. That person is not a silo within the team, though. Everybody on the feature team is allowed to release to Production, and developers and testers help as well by improving alerts and dashboards.
And what happened to Operations? They turned into the Platform team and, together with the team that originally worked on Platform 2.0, now improve various aspects of the infrastructure and release pipelines.
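The workflow rule can be sketched in a few lines. This is an illustration only, not our JIRA or XL Release integration: the `Ticket` class, the `on_pipeline_finished` hook, the `velocity` helper, the ticket keys, and the use of story points are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Ticket:
    """Simplified stand-in for a JIRA ticket (hypothetical fields)."""
    key: str
    status: str
    story_points: int
    release_date: Optional[date] = None


def on_pipeline_finished(ticket: Ticket, finished_on: date) -> Ticket:
    """Move a released ticket from "To Release" to "Done" and record the date."""
    if ticket.status != "To Release":
        raise ValueError(f"{ticket.key} is in '{ticket.status}', expected 'To Release'")
    ticket.status = "Done"
    ticket.release_date = finished_on
    return ticket


def velocity(tickets: list[Ticket]) -> int:
    """Only released (Done) tickets count towards velocity."""
    return sum(t.story_points for t in tickets if t.status == "Done")


sprint = [
    Ticket("ABC-1", "To Release", story_points=3),
    Ticket("ABC-2", "To Release", story_points=5),
]
on_pipeline_finished(sprint[0], date(2019, 6, 3))
print(velocity(sprint))  # ABC-2 is verified but not released, so it does not count
```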
There are no walls anymore.
Benefits of Continuous Delivery at Wandera
What has the new process brought us? As more teams started releasing by themselves, we have seen:
- increased ownership: teams watch how the feature works on Production and naturally don't want to release a faulty change, which has led to a more carefully thought-through rollout process and increased quality.
- less risky releases: before, we had 80+ tickets in a single release. If something went south, it was often very hard to identify which change caused it. Teams now release a single ticket or just a few together, which makes it much easier to diagnose problems.
- significantly reduced lead time: the last mile, the release itself, went from 3 weeks in the worst case to hours, as teams now push features to Production right after verification.
- quick turnaround for bug fixes: we have seen a bug fix released to Production 3 hours after the bug was reported.
- rollback in minutes, not hours: our new release pipelines support rollback at the click of a button, and we always roll back just a single change. Previously, we could roll back either the whole release (of 80+ tickets) or a single microservice, but nobody knew what that could cause, as we didn't test for this scenario.
- experimentation: one Product Manager wanted to see what a new widget would look like with production data. The team quickly added the widget to our frontend app, hid it behind a feature flag (so it was disabled by default) and released it.
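The feature-flag approach from the last point can be illustrated with a minimal sketch. It is written in Python for brevity and is not our frontend code: the flag name `new_widget`, the widget names, and the `is_enabled` and `render_dashboard` functions are hypothetical.

```python
from typing import Optional

# Default flag values shipped with the release; the new widget is off by default.
DEFAULT_FLAGS = {"new_widget": False}


def is_enabled(flag: str, overrides: Optional[dict[str, bool]] = None) -> bool:
    """Look up a flag, letting per-user overrides win over the defaults."""
    if overrides and flag in overrides:
        return overrides[flag]
    return DEFAULT_FLAGS.get(flag, False)


def render_dashboard(overrides: Optional[dict[str, bool]] = None) -> list[str]:
    """Return the widgets to show; the experimental one only when its flag is on."""
    widgets = ["usage_chart", "alerts"]
    if is_enabled("new_widget", overrides):
        widgets.append("new_widget")
    return widgets


print(render_dashboard())                      # regular users do not see the widget
print(render_dashboard({"new_widget": True}))  # the Product Manager opts in
```

The code is released to Production either way; the flag only decides who sees the widget, which is what makes this kind of experiment cheap.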
Areas for Improvement
We have seen not only benefits but also a few drawbacks with the new release process and are actively working on improvements in these areas:
- time spent releasing: as we decentralized release management, teams need to spend time releasing their features. Most teams take it positively, as they now have control over releases, but some see it as an additional responsibility, especially if the release takes a long time.
- speed of releases: a release usually takes 2 hours or less because of the number of tests we run to verify it. It takes even longer if tests fail and you need to investigate them. We are working hard on stabilizing and parallelizing these tests so that each release can be finished in around 15 minutes. For every team, the release should then become a quick, routine operation.
- shared services: even in our microservices architecture we have a few services that multiple teams contribute to. It has happened that these teams wanted to release at the same time but not together (because they perceived it as too risky), so they formed a queue of releases. The service in question will most probably be split in the end (per Conway's law), but until then we are still looking for the most efficient way to release shared services.
- automated validation of the release: we could automatically check metrics after the release against a baseline to confirm that there is no performance regression and that the quality of the release is what we expect, so that no human action is needed at the end of the release. Since we have only just started with the new release process and it is quite hard to define the baseline reliably, we haven't started working on this one yet.
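As a rough idea of what such an automated check could look like, here is a minimal sketch. We have not built this, so everything in it is an assumption: the `regressed` function, the 10% tolerance, the choice of comparing simple means, and the sample latency numbers are illustrative only. Defining a reliable baseline is exactly the hard part that this sketch glosses over.

```python
from statistics import mean


def regressed(baseline: list[float], after_release: list[float], tolerance: float = 0.10) -> bool:
    """Flag a regression for a lower-is-better metric (e.g. latency in ms).

    Returns True when the mean after the release exceeds the baseline mean
    by more than the given relative tolerance.
    """
    return mean(after_release) > mean(baseline) * (1 + tolerance)


# Hypothetical latency samples in milliseconds, before and after a release.
baseline_latency = [100.0, 110.0, 90.0]
healthy_release = [102.0, 98.0, 105.0]
slow_release = [130.0, 125.0, 135.0]

print(regressed(baseline_latency, healthy_release))
print(regressed(baseline_latency, slow_release))
```

A pipeline could use a check like this to decide whether to conclude the release automatically or to ask a person to look at the dashboards.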
Over the course of 2 years, we moved from 2.28 releases per sprint (including hotfix releases) with 65 people in Engineering to 50 releases per sprint with 125 people in Engineering.
This was made possible by the new platform based on Kubernetes and the process changes I described in this article. It was transformational for the whole company and well worth the investment.