Zero-downtime Blue Green Deployments for Microservices

dan twining
Aug 3, 2020

This article explores Blue Green deployments, a technique for rapidly and repeatedly releasing changes to software in production. It’s also referred to as Red Black deployments, particularly by the Netflix development community, where they present the technique in a subtly but powerfully different way; something we’ll explore below.

TL;DR - If you remember only one thing…

As you transition from the “old” to the “new” version of your software, there can never be only one. In a distributed system, there is no instantaneous switchover from “old” to “new”; for zero-downtime deployments there must always be a period when some part of the system is interacting with both the old and the new versions. You will need to reason about this, and deal with the consequences.

In the beginning

Blue Green deployments (2010) predate cloud computing (2013), microservice architectures (2014) and cloud native software patterns (2019). When we talk about Blue Green deployments in general, we’re talking about a pattern that can be applied to both monoliths and microservices, and its primary goal is to help with the cutover when switching between the “old” (blue) and “new” (green) versions of software running in production.

Getting started with Blue Green — No microservices yet

Consider a single-instance application running behind a load-balancer and calling down to a database. We’ll call this running system “blue”. Rather than pulling down this application before deploying a new version in its place, we can instead stand up a new application and database alongside our existing “blue” stack, and call this “green”. Once we’re happy that our green stack is ready, we can flip the load balancer over so that green handles all the traffic, and then clear out blue to leave only green running.
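To make that flip concrete, here’s a minimal sketch in Python. The `Stack` and `LoadBalancer` classes are invented stand-ins for your real infrastructure rather than any particular vendor’s API; the point is simply that the cutover is a routing change followed by a tear-down, not an in-place upgrade.

```python
from dataclasses import dataclass

@dataclass
class Stack:
    """One complete copy of the application and its database."""
    name: str           # e.g. "blue" or "green"
    endpoint: str       # where the load balancer should send traffic
    healthy: bool = False

class LoadBalancer:
    """Hypothetical load balancer that routes all traffic to a single active stack."""
    def __init__(self, active: Stack):
        self.active = active

    def switch_to(self, stack: Stack) -> None:
        # One atomic routing change: all *new* requests go to `stack`.
        # Requests already in flight against the old stack still need to drain.
        self.active = stack

def blue_green_cutover(lb: LoadBalancer, blue: Stack, green: Stack) -> None:
    # 1. Green is already deployed alongside blue; verify it before the flip.
    if not green.healthy:
        raise RuntimeError("green stack failed health checks; aborting cutover")
    # 2. Flip the load balancer so green takes all new traffic.
    lb.switch_to(green)
    # 3. Only once blue has drained do we clear it out, leaving only green running.
    print(f"traffic now on {lb.active.name}; decommission {blue.name} once drained")

blue = Stack("blue", "10.0.0.1", healthy=True)
green = Stack("green", "10.0.0.2", healthy=True)
lb = LoadBalancer(active=blue)
blue_green_cutover(lb, blue, green)
```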

Future deployments can use exactly the same pattern over and over again; just consider “green” to be the new “blue” and off you go. There are several variants to this; for example, you may choose to share the database rather than stand up a new one, but the general principle remains the same.

Enabling downtime

It’s not always possible to have your “green” stack 100% ready to serve customers at the point at which you disconnect “blue”. What if blue is writing state to the database that green will need in order to process subsequent requests? For applications where some downtime is required between the start and end of the cutover, we introduce a third state; some sort of “we’re down for maintenance” stub that can respond to customers gracefully whilst we make the final updates necessary to green. Then, once green is ready, we flip the load balancer over to the green stack and away we go again.

The goal here is to minimise downtime, but not eliminate it. You stand up as much of your green stack as you can prior to going into “maintenance mode”, and shrink the amount of activity required in the downtime cutover period as much as possible, but that downtime period still exists.
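A rough sketch of that three-state flow, with the same kind of invented stand-ins (the `migrate_state` callback represents whatever final data copying your application needs before green can serve):

```python
class FakeLoadBalancer:
    """Minimal stand-in that just tracks which target receives all traffic."""
    def __init__(self, target):
        self.target = target

    def switch_to(self, target):
        self.target = target

def cutover_with_maintenance_window(lb, blue, green, migrate_state):
    # Downtime starts: customers see a graceful "down for maintenance" stub.
    lb.switch_to("maintenance-stub")
    # Only the work that *must* happen inside the window goes here, e.g. copying
    # the last writes from blue's database across to green's.
    migrate_state(blue, green)
    # Downtime ends: green now serves all traffic.
    lb.switch_to(green)

lb = FakeLoadBalancer(target="blue")
cutover_with_maintenance_window(lb, "blue", "green",
                                migrate_state=lambda old, new: None)  # no-op for the demo
assert lb.target == "green"
```

The smaller you can make the body of `migrate_state`, the shorter the maintenance window.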

But we’re cloud-native! We’re building microservices! We don’t do downtime!

So how do we apply this pattern to our cloud-native architectures? As I’m sure you can imagine, the likes of Amazon, Google and Netflix don’t just apply Blue Green deployments as described above to their systems as a whole, copying the entire stack and cutting over with downtime every time that they want to update a component in production.

The properties of cloud-native architectures, and how they enable always-on systems to thrive in an environment of constant change, are best articulated in Cornelia Davis’s book on Cloud Native Patterns, and I’d encourage anyone looking for a deeper understanding of this whole cloud-native thing to start there. For the purposes of this discussion, we’ll just assume that you’ve already released the first version of your microservice, and have multiple instances of your microservice running behind a load balancer. If we call these running services “blue”, then the Blue Green deployment pattern can be easily implemented by:

  • Standing up a matching number of “green” instances of your microservice that contain the change that you wish to deploy.
  • Once you’re happy that those green instances are healthy, adding them into the load balancer so that they receive traffic.
  • Removing the blue instances from the load balancer, and once they have finished processing any inflight requests, throwing them away.

This process is shown in the animation below.
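The same three steps, sketched in Python against an invented in-memory `Pool` standing in for your load balancer’s backend set (this is illustrative, not a real load-balancer API):

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    id: str
    version: str
    healthy: bool = True
    inflight_requests: int = 0

@dataclass
class Pool:
    """Hypothetical load-balancer pool: traffic is spread across all members."""
    members: list = field(default_factory=list)

    def add(self, instance):
        self.members.append(instance)

    def remove(self, instance):
        self.members.remove(instance)

def deploy_new_version(pool, blue_instances, green_instances):
    # 1. The green instances are already running; verify them all before any traffic shifts.
    if not all(green.healthy for green in green_instances):
        raise RuntimeError("green instances failed health checks; nothing has changed")
    # 2. Add green to the pool: from here on, BOTH versions serve requests side by side.
    for green in green_instances:
        pool.add(green)
    # 3. Remove blue from the pool, let in-flight requests finish, then throw blue away.
    for blue in blue_instances:
        pool.remove(blue)                   # no new traffic reaches this instance
        while blue.inflight_requests > 0:   # a real system would poll with a timeout
            pass

pool = Pool()
blues = [Instance("app-1", "v1"), Instance("app-2", "v1")]
greens = [Instance("app-3", "v2"), Instance("app-4", "v2")]
for b in blues:
    pool.add(b)
deploy_new_version(pool, blues, greens)
print([i.version for i in pool.members])  # ['v2', 'v2']
```

Note the middle step: there is a window in which both versions are behind the load balancer at once, which is exactly the “there can never be only one” point from the TL;DR.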

Red Black: The Netflix way

Fundamentally, Red Black deployments follow exactly the same steps as Blue Green deployments, and you’ll find many articles on the web that simply describe them as “the same thing”. For me though, there’s one important difference that’s worth exploring.

With Red Black deployments, your existing “red” microservices are running behind the load balancer as before. In order to deploy your change, you stand up new instances of your microservice as more red instances, adding them into the load balancer. Old versions of your instances are then removed from the load balancer; these are the “black” instances, no longer serving traffic but being kept alive so that if anything goes wrong with the new version, they’re ready and waiting to be re-added to the load balancer. Then, once you’re satisfied that the new red instances are performing correctly, the black instances can be destroyed, leaving only the new set of red services containing your latest change.
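Here’s a purely illustrative sketch of that flow, with a plain Python list standing in for the load balancer’s serving set; the property worth noticing is that rollback is a routing change, not a redeploy.

```python
def red_black_deploy(serving, old_instances, new_instances):
    """Add the new 'red' instances, then detach (but keep) the old ones as 'black'."""
    for inst in new_instances:
        serving.append(inst)       # new instances join the red, traffic-serving set
    for inst in old_instances:
        serving.remove(inst)       # old instances go 'black': alive, but receiving no traffic
    return list(old_instances)     # held in reserve until the new version is trusted

def rollback(serving, black_instances, suspect_instances):
    # If the new version misbehaves, re-attach black and detach the suspects.
    for inst in black_instances:
        serving.append(inst)
    for inst in suspect_instances:
        serving.remove(inst)

serving = ["svc-v1-a", "svc-v1-b"]
black = red_black_deploy(serving, ["svc-v1-a", "svc-v1-b"], ["svc-v2-a", "svc-v2-b"])
# ...observe the new version in production; if it misbehaves:
rollback(serving, black, ["svc-v2-a", "svc-v2-b"])
print(serving)  # ['svc-v1-a', 'svc-v1-b']
# Once the new version has proven itself, destroy the black instances instead.
```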

The key difference between Blue Green and Red Black deployments is conceptual, and is in how they describe the point at which both versions of the microservice are running behind the load balancer; I’ve highlighted that step in the diagram below.

With Red Black deployments, the services responding to traffic are always red. Red Black focuses on the continuity of service, not the introduction of change. On the left-hand side of the diagram above, all of the instances are red because, from an operational perspective, they are all the same. Load is being balanced across all of them, all of them need to respond to the same requests from the clients, and no client-facing features in the new version can be leveraged yet, as there is no guarantee that a new version of the service will receive the request. It is only after all of the instances of the old version have been removed from the load balancer that any new feature can be leveraged by the outside world.

Blue Green, on the other hand, focuses on the incoming change; green is different, even when mixed in with the other blue services. It can give developers the idea that you switch between two standalone states, starting with blue and ending with green, rather than continually updating a running cluster of red instances. For monoliths, flip-flopping between “blue” and “green” can be an entirely appropriate pattern, and may also provide useful disaster recovery capabilities, but that is not the version of Blue Green Deployments that we want to apply within our microservice architectures. Because the same words are reused for different things, I have seen developers deploy changes expecting an instantaneous cutover, and then be surprised when the deployment hasn’t completed without interruption or failure.

Further Considerations

One other important point to remember is that everything that is true of the upstream clients is also true of any downstream dependencies. If the microservice talks to a datastore, then the new instances of the service cannot interact with that datastore in a way that isn’t compatible with the old instances running alongside them and hitting the same datastore.
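One common way to honour that constraint is a parallel-change (expand/contract) approach: the new version keeps writing data in a shape that the old version can still read, and only a later deployment, once no old instances remain, removes the old shape. A small sketch, with field names invented purely for illustration:

```python
# Version N stores a single "name" field. Version N+1 wants separate
# "first_name"/"last_name" fields, but during the rollout both versions hit the
# same datastore, so N+1 must keep writing the old field too (expand). Dropping
# "name" (contract) is deferred to a later deployment.

def write_customer_v2(store: dict, customer_id: str, first_name: str, last_name: str) -> None:
    store[customer_id] = {
        "name": f"{first_name} {last_name}",  # old field: version N still reads this
        "first_name": first_name,             # new fields: only version N+1 uses these
        "last_name": last_name,
    }

def read_customer_v2(store: dict, customer_id: str) -> tuple:
    record = store[customer_id]
    # Tolerate records written by version N, which only carry "name".
    if "first_name" in record:
        return record["first_name"], record["last_name"]
    first, _, last = record["name"].partition(" ")
    return first, last

store = {}
write_customer_v2(store, "c1", "Ada", "Lovelace")
store["c2"] = {"name": "Grace Hopper"}  # a record written by an old instance
print(read_customer_v2(store, "c1"), read_customer_v2(store, "c2"))
```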

It is precisely this consideration of having some period of time where both versions of the service are running within the system as a whole, simultaneously, and yet still interacting with both upstream and downstream services, that is the key concept to bear in mind when using Blue Green zero-downtime deployments to promote change to your microservice-based environments.

Finally, I should also mention that you’ll need to consider any limits you may have around the maximum number of resources that you’re happy to run. Rather than standing up an entire set of duplicate microservices, you may need to stand them up individually and roll them into the load balancer on a one-in-one-out basis, or in batches of x, or whatever else makes sense in your context; and if you’re using a platform like Kubernetes to do this for you, then you’ll need to understand what choices the platform offers around how to roll out your services (more on that later).
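As a purely illustrative sketch of a one-in-one-out (or batch-of-x) rollout under such a cap, again with a plain list standing in for the load balancer’s serving set, and assuming the old and new sets are the same size:

```python
def rolling_replace(serving, new_instances, batch_size=1):
    """Swap old instances for new ones in batches, never running more than
    len(serving) + batch_size instances at any point in the rollout."""
    old = list(serving)  # assumes len(new_instances) == len(old)
    for i in range(0, len(new_instances), batch_size):
        for inst in new_instances[i:i + batch_size]:
            serving.append(inst)    # batch of x in...
        for inst in old[i:i + batch_size]:
            serving.remove(inst)    # ...batch of x out

serving = ["v1-a", "v1-b", "v1-c", "v1-d"]
rolling_replace(serving, ["v2-a", "v2-b", "v2-c", "v2-d"], batch_size=2)
print(serving)  # ['v2-a', 'v2-b', 'v2-c', 'v2-d']
```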

I’m a developer, and my services are deployed using this pattern; what should I do?

This probably deserves an article in its own right, but very briefly, you will need to consider the following:

  • Write stateless apps that gracefully scale horizontally, both up and down. The mechanism that you use to scale your apps, by adding or removing instances behind a load balancer, is exactly the same mechanism that is then used to deploy new versions of your app. Apps that scale gracefully are the starting point for apps that deploy gracefully (there’s a small sketch of this after the list).
  • Think about how you apply changes to production, and how you can split a change over multiple deployments, where each deployment is compatible with the last. With a distributed system, gone are the days of being able to synchronously update both sides of an interface; you need to decompose your changes so that change is applied incrementally and the system remains healthy at every point in the deployment process.
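For the first point, the smallest concrete habit is to handle the platform’s stop signal and finish in-flight work before exiting, so that an instance can be pulled out of the load balancer without dropping requests. A minimal sketch, with the “work” simulated; a real service would stop accepting new requests and drain its own queue:

```python
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # The platform (or your deployment tooling) sends SIGTERM when this instance
    # is being taken out of rotation; note it and finish cleanly rather than
    # dying mid-request.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def main():
    # Runs until the platform asks this instance to stop.
    while not shutting_down:
        time.sleep(0.1)  # stand-in for serving a request
    print("drained in-flight work, exiting cleanly")

if __name__ == "__main__":
    main()
```

The second point is the developer-facing side of the expand/contract example shown earlier: every individual deployment has to be compatible with the one before it.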

Blue Green Kubernetes

With Kubernetes becoming the de facto operating system for microservices in the cloud, it’s worth exploring how Kubernetes handles deployments, and how that maps to the Blue Green Deployments pattern. I’ll cover that in a future article.

Further Reading

The starting point for any exploration of Blue Green Deployments is most likely Martin Fowler’s article on the subject.

For a more detailed look at how Netflix uses the Red Black model to deploy changes, I’d recommend reading the ‘Deploying the Netflix API’ article on the Netflix Tech Blog.

Finally, I have already recommended it above, but would suggest that anyone looking to build, operate and evolve anything remotely “cloud-native” should consider reading Cornelia Davis’s book on Cloud Native Patterns.

Thanks for making it this far, you deserve a cup of tea!
