Upgrades Without Tears Part 2 — Blue/Green Deployment Step By Step on AWS
In the previous blog post of this two-part series, “Part 1 — Introduction to Blue/Green Deployment on AWS,” we covered the benefits of blue/green deployment as a software release technique and how it simplifies production system upgrades.
AWS can help startups manage such deployments in a cost-effective and low-risk way. We provide a comprehensive set of services and features designed to take the complexity out of managing environments, lower your deployment risk, and help you focus on what’s important: improving your product and growing your user base.
In this Part 2 post, we explain the blue/green deployment process and dive deeper into what happens at each step.
Stand Up the Green Environment
There are several foundational services that you will likely use to successfully implement blue/green deployments on AWS:
- Amazon Elastic Compute Cloud (EC2) provides resizable compute capacity in the cloud. These EC2 instances (virtual machines) will run your software applications.
- Elastic Load Balancing (ELB) automatically distributes incoming application traffic across multiple Amazon EC2 instances.
- Amazon Virtual Private Cloud (VPC) provides a logically isolated networking platform to deploy your Amazon EC2 instances and other resources into.
- Amazon Route 53 provides Domain Name System (DNS) resolution services for your applications.
Additionally, in the previous post, we discussed AWS Elastic Beanstalk as a great service that you can use to easily deploy and manage multiple environments. We also mentioned AWS OpsWorks as a service designed to model more complex architectures, trading off some of the ease of use of Elastic Beanstalk for increased flexibility.
Regardless of the deployment and automation solution you pick, AWS allows you to provision the resources you need on demand, within minutes. You can run both the blue and green environments side by side during the deployment, and then turn off the one you no longer need so that you stop accruing charges for its resources.
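If you use Elastic Beanstalk, standing up the green environment can be as simple as creating a second environment for the same application that runs the new application version. Below is a minimal sketch of that idea using boto3, the AWS SDK for Python; the application, environment, version, and configuration names are hypothetical placeholders.

```python
import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")

# Stand up the green environment next to the existing blue one, running the
# new application version but reusing the blue environment's saved configuration.
green = eb.create_environment(
    ApplicationName="my-web-app",
    EnvironmentName="my-web-app-green",
    VersionLabel="v2-0-0",                 # the new application version to deploy
    TemplateName="production-config",      # saved configuration shared with blue
    CNAMEPrefix="my-web-app-green",        # green gets its own URL for testing
)
print("Green environment starting:", green["EnvironmentId"])

# Once the cut-over is complete and blue is no longer needed, terminate it so
# you stop paying for its resources.
# eb.terminate_environment(EnvironmentName="my-web-app-blue")
```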
If your blue and green environments are in the same AWS region, and you don’t have to account for schema changes, you can have both environments tap into the same data sources, significantly simplifying your deployment process.
If you do need to account for schema changes, a simpler way to handle them is to decouple schema changes from code changes. With this decoupling in place, you can take one of two approaches:
- Database updates are backward compatible
- Code changes are backward compatible with the old schema
Both approaches ensure that during a blue/green deployment, both the blue and green environments can sync to the same database, which simplifies the process itself. Because the database schema is updated either at the beginning or at the end of the deployment process, you have less opportunity to test the changes in a controlled environment. This approach therefore places a heavier emphasis on thorough testing in the earlier stages of your development lifecycle: you need to verify either that the database changes are truly backward compatible, so the old code works with the new schema during the deployment, or that the new code is backward compatible, so it works with the old schema during the deployment.
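As an illustration of the first approach, here is a minimal sketch of an "expand-only", backward-compatible schema change, assuming a MySQL database accessed with the PyMySQL driver; the connection details, table, and column names are hypothetical.

```python
import pymysql

conn = pymysql.connect(host="appdb.example.internal", user="app",
                       password="example-password", database="appdb")
try:
    with conn.cursor() as cur:
        # Backward compatible: adding a defaulted column is invisible to the
        # old (blue) code, which never references it, so both environments can
        # keep writing to the same database during the deployment.
        cur.execute(
            "ALTER TABLE users "
            "ADD COLUMN marketing_opt_in TINYINT(1) NOT NULL DEFAULT 0"
        )
        # NOT backward compatible: renaming or dropping a column breaks the old
        # code immediately. Defer such "contract" steps until the blue
        # environment has been retired.
        # cur.execute("ALTER TABLE users DROP COLUMN legacy_flags")
    conn.commit()
finally:
    conn.close()
```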
If your blue and green environments are in different AWS regions, or your schemas diverge to the point where backward compatibility is not feasible, you’ll need to ensure that both environments start off with the same baseline data set. As the traffic switch-over progresses, both data sets should stay in sync, regardless of whether a given user activity occurs in the blue or green environment. The green environment needs up-to-date data because it will become the new authoritative environment; the blue environment needs up-to-date data in the event of a rollback.
The initial step of bringing both the blue and green environments to a common data baseline can be accomplished through replication. If you use a relational database such as MySQL, you can configure master/slave replication. Amazon RDS is a fully managed database service that allows you to easily provision MySQL read replicas, even deploying them in a different region from the master. With Amazon DynamoDB, you can leverage DynamoDB Streams to synchronize changes to other tables, even in different regions. Other NoSQL data store engines, such as Cassandra or MongoDB, have built-in tools to perform replication.
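For example, if the blue environment runs on Amazon RDS for MySQL, you can create a read replica (in the same region or a different one) to give the green environment an up-to-date copy of the data. A minimal sketch with boto3 follows; the instance identifiers, regions, and instance class are hypothetical.

```python
import boto3

# Create the client in the region where the replica should live.
rds = boto3.client("rds", region_name="us-west-2")

rds.create_db_instance_read_replica(
    DBInstanceIdentifier="appdb-green-replica",
    # For a cross-region replica, reference the source instance by its ARN.
    SourceDBInstanceIdentifier="arn:aws:rds:us-east-1:123456789012:db:appdb-blue",
    SourceRegion="us-east-1",          # lets boto3 sign the cross-region copy request
    DBInstanceClass="db.m3.medium",
)

# Wait until the replica is available before pointing green at it.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="appdb-green-replica")
print("Read replica is available")
```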
Once the blue and green environments both start receiving traffic, you need a system in place to manage writes so that changes are reflected in both environments. The right solution for your application, and how you implement it, depend on the complexity of your schema changes and on your application’s data synchronization and consistency requirements.
Test the Green Deployment
If you have reached the deployment stage, you typically have already completed the development, testing, and QA phases of your application’s lifecycle. However, you should still ensure that the green environment is deployed and configured according to the design specifications, that it is operating as intended, and that the testing you performed in previous stages accurately reflects this environment. From this perspective, your green environment acts as a canary, or an early warning system: if you’re experiencing issues now, they are unlikely to go away once you start sending more traffic to it.
Depending on the level of confidence you have in your testing activities from previous stages, we recommend performing an additional round of UAT, QA, and smoke or load testing. You can use tools such as Selenium or Apache JMeter to run automated tests, as in the sketch below.
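Here is a minimal smoke-test sketch using Selenium WebDriver in Python against the green environment's own URL, before any production traffic is routed to it; the URL, page title, and form fields are hypothetical.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

GREEN_URL = "http://my-web-app-green.elasticbeanstalk.com"

driver = webdriver.Firefox()
try:
    driver.get(GREEN_URL)

    # Basic sanity check: the page loads and renders the expected title.
    assert "My Web App" in driver.title, "unexpected page title on green"

    # Exercise a critical path, e.g. confirm the login form is present.
    driver.find_element(By.NAME, "username")
    driver.find_element(By.NAME, "password")

    print("Green smoke test passed")
finally:
    driver.quit()
```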
Green Becomes the New Blue
Once you are confident that your green environment is production-ready, you can begin to move traffic over from the blue environment. To do this, you use the weighted DNS routing capabilities of Amazon Route 53 to gradually switch a set percentage of the traffic over to the green environment. At the same time, you monitor the health parameters of your green environment to ensure the new application is operating correctly.
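As a sketch of what that looks like with boto3, the following creates a pair of weighted records that send roughly 90% of DNS responses to the blue load balancer and 10% to the green one; the hosted zone ID, domain, and load balancer DNS names are hypothetical.

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z1EXAMPLE"          # hypothetical hosted zone
DOMAIN = "www.example.com"

def weighted_record(identifier, target_dns, weight):
    """Build an UPSERT change for one weighted CNAME record."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": DOMAIN,
            "Type": "CNAME",
            "SetIdentifier": identifier,   # distinguishes the two weighted records
            "Weight": weight,              # relative share of DNS responses
            "TTL": 60,                     # short TTL so weight changes take effect quickly
            "ResourceRecords": [{"Value": target_dns}],
        },
    }

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Begin blue/green cut-over: 90% blue, 10% green",
        "Changes": [
            weighted_record("blue", "blue-elb-1234.us-east-1.elb.amazonaws.com", 90),
            weighted_record("green", "green-elb-5678.us-east-1.elb.amazonaws.com", 10),
        ],
    },
)
```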
So how fast do you transition the traffic over? That depends on a few factors:
- The time to live (TTL) of your DNS records: you want to give the DNS system enough time to propagate the changes, and give your users’ cached DNS values enough time to expire. Badly behaved clients may ignore your TTL altogether and continue sending traffic to the old environment for much longer.
- If you use Auto Scaling, you should give your green environment enough time to scale out to accommodate the increasing traffic.
- Similarly, if you use Elastic Load Balancing (ELB), routing a very large volume of traffic to a new load balancer in a short space of time does not give that load balancer enough time to scale out. Pre-warming your load balancer ensures that it is sized to handle the amount of traffic that you are expecting to receive, rather than the amount of traffic it is currently receiving. You can contact AWS using the support options available in your Management Console to request pre-warming.
You can simply switch over the DNS to point to the new green environment, without a gradual transition. This is the fastest option, it can be easily scripted, and it’s sufficient in certain scenarios. In fact, the zero-downtime deployment option of Elastic Beanstalk is designed to do just that. However, if your green environment starts failing under load, almost all of your users will be affected in this scenario. In contrast, the following diagram illustrates a more gradual transition.
As depicted, using Amazon Route 53 weighting, you can start by sending 10% of your traffic to the green environment, then increase the weighting to send more traffic to it, until eventually all traffic is routed to the green environment. Use the time between DNS weight changes to monitor your environment and confirm that it is operating correctly and handling the increased load at each step of the way. This option takes longer to complete, and it can also be automated by programmatically changing the Amazon Route 53 weights based on Amazon CloudWatch metrics. Amazon CloudWatch provides a comprehensive monitoring solution for your resources running in the AWS cloud. It tracks metrics such as the CPU utilization of your EC2 instances or the network load received by your load balancers, among many others.
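A minimal sketch of that automation is shown below: it checks a single CloudWatch health signal for the green environment (backend 5XX errors on its load balancer, an illustrative choice) and, if green looks healthy, shifts another 10% of the DNS weight over. All names and thresholds are hypothetical, and a real implementation would watch more than one metric.

```python
import datetime
import time

import boto3

route53 = boto3.client("route53")
cloudwatch = boto3.client("cloudwatch")

HOSTED_ZONE_ID = "Z1EXAMPLE"
DOMAIN = "www.example.com"
BLUE_DNS = "blue-elb-1234.us-east-1.elb.amazonaws.com"
GREEN_DNS = "green-elb-5678.us-east-1.elb.amazonaws.com"

def set_weights(blue_weight, green_weight):
    """UPSERT both weighted CNAME records in a single change batch."""
    def record(identifier, value, weight):
        return {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": DOMAIN, "Type": "CNAME", "SetIdentifier": identifier,
            "Weight": weight, "TTL": 60,
            "ResourceRecords": [{"Value": value}],
        }}
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [record("blue", BLUE_DNS, blue_weight),
                                 record("green", GREEN_DNS, green_weight)]},
    )

def green_5xx_count():
    """Sum of backend 5XX responses from green's load balancer over the last 5 minutes."""
    now = datetime.datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ELB", MetricName="HTTPCode_Backend_5XX",
        Dimensions=[{"Name": "LoadBalancerName", "Value": "green-elb-5678"}],
        StartTime=now - datetime.timedelta(minutes=5), EndTime=now,
        Period=300, Statistics=["Sum"],
    )
    points = stats["Datapoints"]
    return points[0]["Sum"] if points else 0.0

for green_weight in range(10, 101, 10):
    if green_5xx_count() > 10:
        print("Green looks unhealthy; pushing all traffic back to blue.")
        set_weights(100, 0)
        break
    set_weights(100 - green_weight, green_weight)
    print("Green now receives roughly {}% of DNS responses".format(green_weight))
    time.sleep(600)   # allow cached DNS answers to expire and Auto Scaling to react
```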
This approach allows you to go beyond the earlier green testing and test with live traffic, knowing that any serious issues that arise will impact only a small percentage of your users, and that you can mitigate the impact quickly by changing the weights to push traffic back to the blue environment.
Note that traffic might still flow to the blue environment for a short time, because of client-side or other DNS caching that you have no control over. So don’t tear down the old blue environment until your monitoring tools tell you that there is no more traffic going to it.
Practice Makes Perfect
Your users aren’t going to wait three months for that next feature or even one month for that important bug fix, so you need to be comfortable with upgrades. Blue/green can help. One way to get comfortable with this style of deployment is to create production-like replicas (you can do this in a different AWS account for added safety) and have everyone on your team perform deployments. You will quickly learn not to fear deployments. You’ll stop saving up all those great new features and bug fixes for that big release at the end of the month, and you’ll be able to respond to your users’ needs more quickly.
You have access to a host of affordable, quickly provisioned resources in the AWS cloud. Use this to experiment, to try out new approaches, and to validate your processes before they get near production.
Do you want to benchmark your system with 20 small web servers instead of 10 large web servers? Go for it, just try it out. Want to completely replicate your production system in QA for a day to make testing 3 million users more realistic? You can do that. Want to roll software updates out to production every day of the year? Why not?
Oops, Something Went Wrong
Things go wrong, sometimes in a big way. What happens if, despite your best efforts, you find that both the blue and green environments are broken (maybe a database schema change wasn’t properly tested)?
Here is where Amazon Route 53 health checks and DNS failover can help you. Prepare for this situation by building a static catch-all error site. You can host that site on Amazon Simple Storage Service (S3), our secure, durable, highly scalable object storage service, and configure Amazon Route 53 DNS to fail over to it. This allows you to automate the failover, communicate an outage explanation to your users, and perhaps provide them with alternate methods of reaching you while you work on bringing the site back up.
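Here is a minimal sketch of that setup with boto3, shown standalone for clarity: a health check watches the live environment, the PRIMARY record points at it, and the SECONDARY record points at the static error site on an S3 website endpoint (which requires the bucket to be named after the domain). The hosted zone, endpoints, and health-check path are hypothetical.

```python
import uuid

import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z1EXAMPLE"
DOMAIN = "www.example.com"
LIVE_DNS = "green-elb-5678.us-east-1.elb.amazonaws.com"
ERROR_SITE = "www.example.com.s3-website-us-east-1.amazonaws.com"

# 1. Health check against the live environment.
health = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTP",
        "FullyQualifiedDomainName": LIVE_DNS,
        "Port": 80,
        "ResourcePath": "/healthcheck",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)
health_check_id = health["HealthCheck"]["Id"]

# 2. PRIMARY record serves traffic while the health check passes; SECONDARY
#    takes over and serves the static error site when it fails.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": DOMAIN, "Type": "CNAME", "SetIdentifier": "primary",
            "Failover": "PRIMARY", "HealthCheckId": health_check_id,
            "TTL": 60, "ResourceRecords": [{"Value": LIVE_DNS}]}},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": DOMAIN, "Type": "CNAME", "SetIdentifier": "secondary",
            "Failover": "SECONDARY",
            "TTL": 60, "ResourceRecords": [{"Value": ERROR_SITE}]}},
    ]},
)
```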
Summary
Using blue/green deployment is a tried and tested method that reduces the risk of deploying production updates, minimizes downtime, and gets new features to your users more quickly.