Blue-Green Deployments in a Datacenter

At a client, one of the major challenges with production deployments was achieving zero downtime. In this case, the blue-green deployment model turned out to be the best option.

How does it work?

In the blue-green model, two identical application stacks are maintained (for convenience, I will refer to our application server instances and service instances as a stack; database deployments are done separately). One of the stacks is always live; let us say green is live and blue is standby. When we want to deploy a new build to production, we deploy it on the blue stack and test it thoroughly. Once the new build works as expected, the blue stack goes live and green becomes standby. This lets us roll back quickly if issues are identified during deployment or the sanity check.
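The switch and the rollback described above can be pictured with a tiny in-memory model. This is only an illustrative sketch: the class BlueGreen and its method names are made up for this post and do not correspond to any tool we used.

```python
class BlueGreen:
    """Toy model of two stacks where exactly one is live (hypothetical names)."""

    def __init__(self, live="green", version="v1"):
        self.live = live
        self.versions = {"blue": version, "green": version}

    @property
    def standby(self):
        # The standby stack is simply the one that is not live.
        return "blue" if self.live == "green" else "green"

    def deploy(self, version):
        # New builds only ever go to the standby stack.
        self.versions[self.standby] = version

    def switch(self):
        # Promote standby to live; calling it again is the rollback.
        self.live = self.standby

    def live_version(self):
        return self.versions[self.live]
```

Calling deploy("v2") followed by switch() makes the new build live, and a second switch() puts the previous build back in service without redeploying anything.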

What are the challenges?

  1. The tricky bit is switching between the blue and green stacks with almost no downtime.
  2. Maintaining two stacks is an overhead and adds to the cost.

Well, obviously we overcame them!

To solve the cost issue, instead of keeping one stack always in standby mode, we divide our production infrastructure between two stacks and keep both live. This is done by creating two Chef environments, production_blue and production_green, as shown above. During deployment, we detach the blue stack from the Live ELB and attach it to the Standby ELB, then deploy the new artifact to it. Once sanity testing is done, the blue stack is made live again. We then repeat the steps for the green stack.

Next comes achieving zero downtime. AWS takes a few seconds to start serving requests from an instance after we attach it. Our infrastructure is currently hosted in AWS, but it is supposed to be datacenter-ready, and in a datacenter we can’t reliably predict the time needed for DNS entries to propagate. To work around this, we relied on health checks: we implemented two health-check HTTP endpoints for the app, GET /live and GET /standby. The Live ELB polls /live and the Standby ELB polls /standby. On an app instance, only one health check returns 200 OK at a time, which we refer to as enabled; the other returns 500. Which one is enabled can be switched using the HTTP endpoint POST /enable/(live|standby). In the normal state, both blue and green stacks are attached to the Live ELB and have /live enabled.
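To make the health-check toggle concrete, here is a minimal sketch using only Python's standard library. It is not our actual application code: the names HealthState, Handler and serve, the 404 for unknown paths, and the default port are assumptions made for illustration. Only the three endpoints and the 200/500 behaviour come from the description above.

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthState:
    """Tracks which single health check currently returns 200 (hypothetical)."""

    def __init__(self, enabled="live"):
        self.enabled = enabled  # "live" or "standby"

    def handle(self, method, path):
        """Return the HTTP status code for a request."""
        if method == "GET" and path in ("/live", "/standby"):
            # Only the enabled check is healthy; the other one reports 500.
            return 200 if path[1:] == self.enabled else 500
        match = re.fullmatch(r"/enable/(live|standby)", path)
        if method == "POST" and match:
            self.enabled = match.group(1)
            return 200
        return 404  # assumption: anything else is treated as not found


STATE = HealthState()


class Handler(BaseHTTPRequestHandler):
    def _reply(self):
        # Delegate to the state object and send a body-less response.
        self.send_response(STATE.handle(self.command, self.path))
        self.end_headers()

    do_GET = _reply
    do_POST = _reply


def serve(port=8080):
    """Run the sketch on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

With serve() running, a POST to /enable/standby makes GET /live start returning 500 and GET /standby start returning 200, which is all a load balancer's health check needs to move the instance in or out of service.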

Say we want to deploy the new artifact on the blue stack:

  • We start by enabling the /standby endpoint on all instances of the blue stack.
  • This causes the Live ELB to take the blue instances out of service, which ensures that no new requests are sent to them while in-flight requests are still served without failure.
  • Once the in-flight requests have been served and traffic drops to zero, we detach the blue stack instances and attach them to the Standby ELB.
  • The new artifact is shipped to the blue stack instances.
  • After the deployment, the /live endpoint is enabled on the blue instances and they are re-attached to the Live ELB.
  • The same process is repeated for the green stack.
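The steps above can be sketched as a small in-memory simulation. It does not call AWS or Chef; Instance, Elb, deploy_stack, rolling_deploy and the sanity_check callback are hypothetical names used only to show the ordering of the steps, and the wait for in-flight requests to drain is reduced to a comment.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Instance:
    name: str
    version: str
    enabled: str = "live"  # which health check currently returns 200


@dataclass(eq=False)
class Elb:
    polls: str  # "live" or "standby": the health check this ELB polls
    attached: list = field(default_factory=list)

    def in_service(self):
        # An instance receives traffic only if its enabled check matches.
        return [i for i in self.attached if i.enabled == self.polls]


def deploy_stack(stack, live_elb, standby_elb, version, sanity_check):
    # Enable /standby so the Live ELB stops sending new requests.
    for inst in stack:
        inst.enabled = "standby"
    # (A real rollout would wait here until in-flight requests finish.)
    # Move the drained instances over to the Standby ELB.
    for inst in stack:
        live_elb.attached.remove(inst)
        standby_elb.attached.append(inst)
    # Ship the new artifact.
    for inst in stack:
        inst.version = version
    # Sanity-test through the Standby ELB before taking live traffic.
    if not sanity_check(standby_elb.in_service()):
        raise RuntimeError("sanity check failed; stack left on the Standby ELB")
    # Enable /live again and re-attach to the Live ELB.
    for inst in stack:
        inst.enabled = "live"
        standby_elb.attached.remove(inst)
        live_elb.attached.append(inst)


def rolling_deploy(blue, green, live_elb, standby_elb, version, sanity_check):
    # One stack at a time, so the other keeps serving live traffic.
    for stack in (blue, green):
        deploy_stack(stack, live_elb, standby_elb, version, sanity_check)
```

In this sketch a failed sanity check leaves the stack on the Standby ELB with /standby enabled, so the untouched stack keeps serving all live traffic while the problem is investigated.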

This helped us achieve practically zero downtime!