Immutable Deployment @ Quorum

Published in Qbits, Apr 5, 2017 (8 min read)

Photo Credit: https://www.edureka.co/blog/what-is-ansible/

Software deployment naturally involves a lot of risk, QA testing time, and headache. When done manually, it causes downtime and usually has to be accomplished late at night when traffic is minimal. The whole process is unpredictable, unscalable, and if done incorrectly, can severely impact sales and business operations.

At Quorum, we know first-hand how manual deployment works (or breaks) because we have “been there, done that.” Much like our transition from a Django-template-dependent site to a single page application (which you can read more about here), our software deployment strategy has evolved through many stages, most notably thanks to the addition of Ansible to our stack about one year ago.

Why Ansible?

When it comes to DevOps tools for configuration management, server provisioning, software deployment, and [insert-crazy-buzzwords-you-can-put-on-your-resume], Ansible is still a relative newcomer compared to the likes of Chef, Puppet, or Salt. At Quorum, we rely heavily on open-source tools and software, but are also not afraid to dig in and make something better ourselves. Ansible not only ships with many powerful modules that can be executed immediately on remote hosts over SSH, but is also fully customizable, which lets us bend the rules and build custom modules that meet our needs. Besides all of those benefits, Ansible also:

is written in Python, the main backend language at Quorum, which facilitates the process of building and debugging custom modules.
features playbooks written in YAML, which is human-readable, has parsing libraries in most programming languages, and just so happens to be the syntax of choice for many Continuous Testing frameworks like Jenkins or Shippable.
has an active community that is growing every day (~22k ★ on GitHub).

All things considered, choosing Ansible as our main DevOps tool was a no-brainer. That being said, what really sealed the deal for us was the release of Ansible 1.8, which was dubbed the “new immutable deployment killer.”

Immutable Deployment

Amazon Web Services (AWS) was founded in 2006. What that means is that 12+ years ago, if you wanted to deploy a web application at scale, the required infrastructure would look something like this.

Silicon Valley, Season 2, Episode 5: Server Space

Today, software engineers no longer manage physical servers. They are all “in the cloud,” which in layman’s terms means you can easily spin up, configure, or tear down any server at will. Consequently, there really is no need to deploy a new application on an existing infrastructure, especially when you can easily spin up an identical environment in minutes.

The advancement in cloud infrastructure has given birth to many new DevOps approaches that would have been impossible just a few years ago. Immutable Deployment is one of those approaches, and it simply means:

Immutable: the “staging” environment, once ready to become production, doesn’t change. If we need to change something, we then deploy new code on completely new infrastructure. There are, of course, some exceptions to this (we still occasionally touch the production instances to hotfix an urgent backend bug), but we’re furiously working on reducing deployment time and eventually shutting down SSH access to existing instances.
Deployment: automating the process of taking the development team’s hard work, merging it into a release branch that is then fully tested in a “staging” environment, and finally making the new software available for our clients to use.

Step-by-Step

Before diving deeper into the Ansible code block, let’s first understand Immutable Deployment at the “pseudo-code” level.

A “staging” server is created from a release branch, typically as an AWS EC2 Spot Instance. Spot is chosen because we simply have no need for this server after deployment finishes, we don’t really mind if it gets outbid at any point (ok maybe a little bit if we’re in the process of making an AMI from this spot instance), and it’s bloody cheap compared to other purchasing options that AWS offers.
The deployment team will go through this staging server and “play” with it, otherwise known as test the sh*t out of staging and make sure nothing breaks. While this might typically be the job of a QA/testing engineer, most developers often get to wear more than one hat in a startup environment. It is fairly typical for an engineer to juggle frontend, backend, data, and QA responsibilities in the same day. (Oh, a client just requested a mobile app update? Great, go read the Big Nerd Ranch guide on iOS/Android and come back in a week with your new mobile hat.)
Once staging is in a happy place and release notes have been compiled, we switch the staging instance over to production settings.
We then make a green image from the staging instance (more on this later). This is the Amazon Machine Image (AMI) that we will then use to spin up as many production instances as necessary using AWS AutoScaling.
Now we’re ready to deploy. But first, we need to make a Launch Configuration (LC). Think of it as the blueprint for all soon-to-be-launched instances. The LC specifies the ID of the green image, an SSH key-pair, one or more security group(s), among various other settings.
A new Auto Scaling Group (ASG) is then created alongside the old one, which stays behind the Elastic Load Balancer (ELB). Nothing has changed from a client’s perspective. Once the instances in the new ASG have “warmed up” and passed the health check, we register them with the existing ELB, and traffic starts being served by those new instances. This is known in practice as blue-green deployment (also known as red-black deployment). It ensures minimal downtime because the only routing switch needed is the one from the old set of instances (blue) to the new one (green). We’ll see later that this technique also reduces risk should things go “dev-ooops.”
At the same time, we also deregister instances in the old ASG from the ELB; otherwise, clients would see two different versions of the software depending on how the load balancer routes requests. This might be the intended behavior if you are doing canary deployment, but it’s not what we are currently practicing. Note that the old ASG is still kept around.
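The cutover in the last two steps can be sketched in a few lines of Python. This is a toy in-memory model rather than our playbook: the AutoScalingGroup and LoadBalancer classes, the cut_over function, and the instance IDs are all hypothetical stand-ins for the real ASGs, the ELB, and the Ansible tasks.

```python
from dataclasses import dataclass, field


@dataclass
class AutoScalingGroup:
    """Hypothetical stand-in for an ASG: a name plus instance health."""
    name: str
    # Maps instance ID -> True if the instance passed its health check.
    instances: dict = field(default_factory=dict)

    def healthy_ids(self):
        return sorted(i for i, ok in self.instances.items() if ok)


@dataclass
class LoadBalancer:
    """Hypothetical stand-in for the ELB: the set of registered instances."""
    registered: set = field(default_factory=set)


def cut_over(elb, blue, green, desired):
    """Move traffic from blue to green once green has enough healthy instances.

    Returns True if traffic was switched. The blue group itself is left
    untouched so it can be re-registered if the release has to be reverted.
    """
    healthy = green.healthy_ids()
    if len(healthy) < desired:
        return False  # green is not warmed up yet; clients still see blue
    elb.registered |= set(healthy)         # register green
    elb.registered -= set(blue.instances)  # deregister blue
    return True


blue = AutoScalingGroup("blue", {"i-blue-1": True, "i-blue-2": True})
green = AutoScalingGroup("green", {"i-green-1": True, "i-green-2": False})
elb = LoadBalancer(set(blue.instances))

first_attempt = cut_over(elb, blue, green, desired=2)   # one green instance is unhealthy
green.instances["i-green-2"] = True                     # it now passes the health check
second_attempt = cut_over(elb, blue, green, desired=2)  # traffic moves to green
```

The first attempt leaves the ELB alone because green is not fully healthy; only the second one swaps the registered instances. The blue group keeps its instances either way, which is what makes reverting cheap.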

At this point, deployment is essentially done, assuming everything goes well. However, developers are only human, and humans make mistakes. Crazy as it sounds, a fat-finger typo took down the Internet earlier this month. If, right after deployment happens, we discover that there’s something wrong with the new release, we can easily switch back to the blue image because the old ASG running it is still around. We simply de-register the instances in the new ASG and re-register the instances in the old one to the ELB. Traffic is now served through the old set of instances, as if a new deployment never happened in the first place.

Photo Credit: http://searchitoperations.techtarget.com/definition/blue-green-deployment

The Ansible Playbook, In All Its Glory

Now that we’ve seen the pseudo-code, translating it to working YAML code is naturally the next step. Our spot EC2 instances are launched using Ansible’s ec2 module while the AMI is created with the ec2_ami module. Everything is pretty straightforward so far, so let’s go ahead and skip to where we make a new Launch Configuration and Auto Scaling Group.

Notice the use of '{{ variable }}'. As a good general programming practice, moving these configuration values into the inventory file (a place where all the variables are defined) means we can reuse them between hosts, and if we need to tweak them in the near future, we only have to do so in one place.

A few noteworthy items from the GitHub Gist:

replace_all_instances: yes Once the Launch Configuration is connected to the Auto Scaling Group, all existing instances will be replaced with what the new LC describes.
wait_for_instances: True We wait for the instances to pass the health check before switching traffic to this new set of instances.
until: ec2_asg_return.viable_instances|int >= desired_instance_size|int Ansible re-runs the task until the number of viable instances it reports is at least the number we want in production.
retries: 5 and delay: 60 If for some reason, the instances are deemed unhealthy (perhaps our wait_timeout value was not high enough, or a high error rate from AWS), we delay 60 seconds between attempts and retry up to 5 times.
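For readers less familiar with Ansible's until, retries, and delay keywords, the polling behavior they describe looks roughly like the loop below. This is a plain-Python sketch, not Ansible's implementation: poll_until is a hypothetical helper, the sequence of reported instance counts is made up, and we assume one initial attempt followed by up to `retries` retries (how Ansible counts attempts exactly is not reproduced here).

```python
import time


def poll_until(check, is_done, retries=5, delay=60, sleep=time.sleep):
    """Call check() until is_done(result) is true or retries run out.

    Assumption: one initial attempt plus up to `retries` retries, sleeping
    `delay` seconds between attempts. Returns (result, succeeded).
    """
    result = check()
    for _ in range(retries):
        if is_done(result):
            return result, True
        sleep(delay)
        result = check()
    return result, is_done(result)


# Hypothetical stand-in for the ec2_asg task: each call reports how many
# instances are currently viable in the new Auto Scaling Group.
reports = iter([0, 1, 3, 3])
desired_instance_size = 3
naps = []  # record the delays instead of really sleeping

result, ok = poll_until(
    check=lambda: {"viable_instances": next(reports)},
    is_done=lambda r: int(r["viable_instances"]) >= desired_instance_size,
    sleep=naps.append,
)
```

In this run the third report reaches the desired size, so the loop stops after two 60-second waits and never uses the remaining retries.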

Once the new ASG is created and we have attached alarms and scaling policies to it, the final step is to detach the load balancer from the old group. This is unfortunately not easily accomplished with Ansible built-in modules (as of this writing), but we can quickly write a custom module using boto3’s detach_load_balancers method for Auto Scaling Groups.
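The heart of such a module could look something like the sketch below. It is not our actual module: detach_elb_from_asg, FakeAutoScalingClient, and the "web-blue" and "web-elb" names are hypothetical. The only real API assumed is the boto3 Auto Scaling client, which exposes describe_load_balancers and detach_load_balancers (note the plural). A real Ansible module would wrap this function in AnsibleModule boilerplate and pass in boto3.client("autoscaling"); here a small fake client is injected so the example runs without AWS credentials.

```python
def detach_elb_from_asg(client, asg_name, elb_names):
    """Detach classic load balancers from an Auto Scaling Group.

    `client` is expected to behave like boto3.client("autoscaling").
    Returns True if a detach request was sent and False if none of the
    given load balancers was attached, mirroring Ansible's "changed" flag.
    """
    attached = client.describe_load_balancers(AutoScalingGroupName=asg_name)
    current = {lb["LoadBalancerName"] for lb in attached["LoadBalancers"]}
    to_detach = sorted(current & set(elb_names))
    if not to_detach:
        return False
    client.detach_load_balancers(
        AutoScalingGroupName=asg_name, LoadBalancerNames=to_detach
    )
    return True


class FakeAutoScalingClient:
    """Test double exposing the two client methods used above."""

    def __init__(self, attached):
        self.attached = attached  # ASG name -> set of ELB names

    def describe_load_balancers(self, AutoScalingGroupName):
        names = sorted(self.attached.get(AutoScalingGroupName, set()))
        return {"LoadBalancers": [{"LoadBalancerName": n} for n in names]}

    def detach_load_balancers(self, AutoScalingGroupName, LoadBalancerNames):
        self.attached[AutoScalingGroupName] -= set(LoadBalancerNames)


fake = FakeAutoScalingClient({"web-blue": {"web-elb"}})
changed_first = detach_elb_from_asg(fake, "web-blue", ["web-elb"])   # sends the detach
changed_second = detach_elb_from_asg(fake, "web-blue", ["web-elb"])  # nothing left to do
```

Checking what is attached before detaching keeps the operation idempotent, so running the playbook twice does not report a change the second time.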

Once created, we simply call this custom module in our playbook as follows.

Should things go wrong, we reattach the ELB to the blue set of instances that we set aside, then tear down the green ASG.
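The rollback path can be sketched the same way. Again, this is an illustration under assumptions rather than our playbook: roll_back, RecordingClient, and the group and ELB names are hypothetical, while attach_load_balancers, detach_load_balancers, and delete_auto_scaling_group are the boto3 Auto Scaling client calls we assume a real version would use. A production rollback would also wait for the blue instances to be back in service before detaching green; that wait is omitted here.

```python
def roll_back(client, blue_asg, green_asg, elb_name):
    """Send traffic back to the blue ASG, then remove the green one.

    `client` is expected to behave like boto3.client("autoscaling").
    The reattach request for blue is sent before green is detached.
    """
    client.attach_load_balancers(
        AutoScalingGroupName=blue_asg, LoadBalancerNames=[elb_name]
    )
    client.detach_load_balancers(
        AutoScalingGroupName=green_asg, LoadBalancerNames=[elb_name]
    )
    # ForceDelete removes the group together with its running instances.
    client.delete_auto_scaling_group(
        AutoScalingGroupName=green_asg, ForceDelete=True
    )


class RecordingClient:
    """Test double that records calls instead of talking to AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, method_name):
        # Any unknown attribute becomes a method that logs its invocation.
        def record(**kwargs):
            self.calls.append((method_name, kwargs["AutoScalingGroupName"]))
        return record


client = RecordingClient()
roll_back(client, blue_asg="web-blue", green_asg="web-green", elb_name="web-elb")
```

After the call, client.calls lists the three requests in order: attach on blue, detach on green, delete on green.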

Conclusion

No deployment strategy is perfect, and Immutable Deployment is not an exception. There are certain challenges to it, such as:

Fixing problems, specifically frontend-related problems, requires a complete deployment cycle. From bundling JavaScript files via Webpack to syncing the static assets to S3 to making a new image and then an ASG, the whole cycle could take about half an hour, not including testing time.
Immutable Deployment is difficult when a database is involved, and if done wrong, reverting to the blue set of instances alone will not solve the problem. Let’s say an engineer decides to rename a column of a database table. The migration is run right after traffic is switched to the new ASG. We then discover an undesirable bug on production, so we revert the code to the blue set of instances. At this point, the code on production does not match the database schema, and production will crash unless the migration is rolled back immediately.
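The database problem is easy to reproduce on a small scale. The sketch below uses an in-memory SQLite database purely for illustration; the users table, the full_name and display_name columns, and the blue_code and green_migration functions are hypothetical and have nothing to do with our real schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
db.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")


def blue_code(conn):
    """Query shipped with the old (blue) release: expects full_name."""
    return conn.execute("SELECT full_name FROM users").fetchall()


def green_migration(conn):
    """Migration shipped with the new (green) release: full_name -> display_name.

    The table is rebuilt rather than altered in place so the sketch runs on
    any SQLite version.
    """
    conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT)")
    conn.execute("INSERT INTO users_new SELECT id, full_name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")


before = blue_code(db)   # works: the schema matches the blue code
green_migration(db)      # traffic has moved to green and the schema changes

try:
    blue_code(db)        # code reverted to blue, but the schema was not
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)     # the old query no longer finds its column
```

Switching the load balancer back restores the old code but not the old schema, which is why the migration has to be rolled back as part of the same revert.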

With that being said, Immutable Deployment has allowed us to deploy new code to production during the day, occasionally multiple times a day, without actually taking the site down, SSH-ing into the instances, and git pull-ing the new changes. As the growth of Quorum’s user base has accelerated and expanded beyond the nation’s capital, we can no longer count on our engineers staying up past midnight waiting for the number of users to drop to zero. Our deploy cycle is now entirely automated thanks to Ansible and its powerful AWS modules.

Acknowledgements

I’d like to thank the entire Quorum team and especially our tech cofounder Jonathan Marks for entrusting me with deployment and DevOps responsibilities. I’ve learned many new things in just about 9 months since I started, and I’m sure there is still lots to learn in the near future.

A very special thanks to Leo Hentschker, who worked on deployment and DevOps-y stuff @ Quorum before me, without whom none of this would be possible.

Interested in working at Quorum? We’re hiring!