Deployment Testimonials: Issues and Challenging the Architectural Orthodoxy
I lead a team of immensely talented engineers maintaining a critical application that sits at the nucleus of my organization’s IT map. The advantage of being on such a team is that you get to gauge the impact of your changes by looking at the effect they have on other teams and consumers. The major disadvantage, if you have not already guessed, is this same dependency: it brings pressure and a very thin margin for error. The application that my team manages used to be a large monolith with a single source of non-replicable data, and the downstream systems were tightly coupled to it. Breaking this down into a host of microservices was a colossal undertaking. But that would be a story for yet another fewer-meetings day.
Fast forward to a time when we are managing a suite of 60-odd loosely coupled, context-bound microservices. Effortless deployments were still a challenge that we had not fully overcome. We were still relying on after-business-hours releases, redundant release definitions and complex release cycles. With a super agile team and a massively dynamic work item backlog, an upgrade to the way we deploy was well overdue. Along with the team, I came up with a couple of options and chose to organically revamp the whole system. Putting together the details of this exercise in one place would be a gross oversimplification, but here is my first attempt at doing that.
Blue/Green
Why?
We started with simple resource swap deployments, where new production resources are tested on a set of servers before being swapped into production. This process is popularly known as Blue/Green deployment. For those who are unfamiliar with it, there are essentially two identical copies of the application’s production environment, designated “Blue” and “Green” respectively (not exactly sure why those two colors). We started seeing really good results with this approach pretty early on.
Advantages
One of these Blue/Green stacks is the external-facing set of servers, while the other stays unavailable to users. Essentially, this second set of servers becomes a smoke test environment that runs alongside the production stack. By using this inactive set of servers as our deployment target, we could test our application’s behavior with the new codebase before exposing the functionality to live users. This allowed us to reduce deployment downtime while also improving the overall resiliency of the application. The improvement in resiliency came from having an easy fallback in the event of a sudden increase in scale or the occasional botched deployment.
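The swap described above can be sketched in a few lines. This is only an illustration, not our actual release tooling: the `Stack` and `BlueGreenRouter` names and the `smoke_test` callback are hypothetical, and a real setup would flip a load balancer or DNS entry rather than two Python attributes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stack:
    name: str     # "blue" or "green"
    version: str  # release currently deployed on this stack


class BlueGreenRouter:
    """Tracks which stack serves users (live) and which one is idle."""

    def __init__(self, live: Stack, idle: Stack) -> None:
        self.live = live
        self.idle = idle

    def deploy(self, version: str, smoke_test: Callable[[Stack], bool]) -> bool:
        # New code always lands on the idle stack; live users are untouched.
        self.idle.version = version
        if not smoke_test(self.idle):
            return False  # the live stack keeps serving the old version
        # Swap: the tested stack goes live, the old live stack becomes the fallback.
        self.live, self.idle = self.idle, self.live
        return True

    def rollback(self) -> None:
        # The previous release is still running on the idle stack,
        # so falling back is just another swap.
        self.live, self.idle = self.idle, self.live
```

The point of the sketch is that a failed smoke test never touches the live stack, and a rollback costs no more than the original swap.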
Challenges
The blue/green approach can be very cost-intensive, since its core premise is the duplication of the application’s production resources. While the actual work of setting up this kind of environment has been greatly eased by advancements in cloud-resource availability, the costs of such an install are nearly double those of a single-stack, less safe option.
Canary
Canary deployments aim to address these issues by reducing the need for duplicate application infrastructure. With some tweaks to the application’s architecture and to our development practices, we could reduce the need for maintaining redundant stacks and still deliver the same level of stability. Canary releases have always been used by development teams whenever a new or big change has to be introduced. But surprising all of our stakeholders and teammates at the same time might not be in everyone’s best interest.
I have worked on teams that have used feature flags extensively. They are an ingenious way to isolate certain features. But as developers the world over grow more and more curious about upcoming releases, it becomes more and more strenuous to manage internal releases, especially if the changes belong to a company that is about to release a new product. Even an innocuous slip around code names or dead code might cost the company a lot in PR cover-up. Phased or incremental rollouts could be another way of gradually folding the changes in.
Canary releases shift the focus from releasing entire applications to releasing individual features within an application. So, instead of releasing all of the new features at once as part of a mammoth monthly release, we started releasing the code for new features while slowly scaling up the number of users who have access to each feature. By employing multiple canaries, each tagged by a different feature toggle, across various geographical regions, we were able to get early success indicators. All the while, we gradually increased the traffic allowed on the canary from under 5% to a full 100%. Another upside is that this enabled us to continuously release new features for the application without needing a specific deployment or release window.
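One common way to scale a feature from under 5% to 100% of users is deterministic bucketing: hash the user and the feature name into a stable bucket and compare it with the current rollout percentage. The sketch below assumes that approach; the function names and the feature name `new-checkout` are hypothetical, and it leaves out the regional tagging mentioned above.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature."""
    # Including the feature name gives each canary its own independent user split.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_canary(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if this user should see the feature at the current rollout level."""
    return bucket(user_id, feature) < rollout_percent


# Raising rollout_percent only ever adds users; nobody who already has the
# feature loses it as the rollout grows from 5 to 100.
enabled = in_canary("user-42", "new-checkout", rollout_percent=5)
```

Because the bucket depends only on the user and the feature, a user's experience stays consistent across requests and across rollout stages.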
Observability and Defining Success
In the process of shifting traffic to the canary, we had to identify the user sets that would receive it. What should be the criteria for defining the target user base? We started by identifying the users who are relatively more active.
Once done, we had to engage the marketing and public relations teams to seek consent and request feedback. This helps surface issues early on. For internal product teams, dogfooding is an easier approach.
Choosing success metrics for such a series of releases becomes more and more critical, since the numbers from each stage help in identifying issues early. Increased response times at an extrapolated throughput, increased error rates and other measurable signals were some of our key metrics.
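A stage gate built on these metrics could look roughly like the following. It is a sketch under assumptions: the thresholds (one percentage point of extra errors, 20% extra p95 latency, a minimum of 100 requests) are made-up defaults rather than the numbers we used, and `StageMetrics` and `canary_is_healthy` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_is_healthy(
    baseline: StageMetrics,
    canary: StageMetrics,
    max_error_rate_increase: float = 0.01,  # assumed threshold
    max_latency_ratio: float = 1.2,         # assumed threshold
    min_requests: int = 100,                # assumed threshold
) -> bool:
    """Decide whether the canary may move on to the next traffic stage."""
    if canary.requests < min_requests:
        return False  # too little traffic to judge either way
    if canary.error_rate - baseline.error_rate > max_error_rate_increase:
        return False  # errors rose more than the allowed margin
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return False  # responses slowed down more than the allowed ratio
    return True
```

Comparing the canary against the current baseline, rather than against fixed absolute limits, keeps the gate meaningful when overall load shifts during the rollout.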
Challenges
One of the major challenges, as we learned, is cleaning up after and managing multiple release versions, and constantly reducing the number of parallel versions live in production. Database changes were a real challenge to work around. The Parallel Change pattern, also known as expand, migrate and contract, can help mitigate this to a large extent. This aspect is yet to be thoroughly explored.
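To make the Parallel Change idea concrete, here is a minimal sketch of renaming a column in three separately deployable steps, using an in-memory SQLite database. The `users` table and its `name` and `full_name` columns are invented for illustration; this is not our schema, and a production migration would also have application code writing to both columns between the expand and contract steps.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Step 1: add the new column next to the old one. Old and new
    # application versions can both keep running against this schema.
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def migrate(conn: sqlite3.Connection) -> None:
    # Step 2: backfill rows that were written before the new column existed.
    conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")


def contract(conn: sqlite3.Connection) -> None:
    # Step 3: once no live version reads the old column, remove it.
    # The table is rebuilt so the sketch does not depend on DROP COLUMN support.
    conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT)")
    conn.execute("INSERT INTO users_new (id, full_name) SELECT id, full_name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada')")
expand(conn)
migrate(conn)
contract(conn)  # only safe after every canary still reading "name" is retired
```

Each step is backward compatible with the versions still live at that moment, which is what lets several canaries share one database.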
References:
https://martinfowler.com/
https://docs.microsoft.com/