Bouncer: Simple AWS Auto Scaling Rollovers

We built bouncer to make it easier to operate AWS Auto Scaling groups (ASGs). It cycles running instances to ensure they match what is defined in the group's launch configuration. Auto Scaling groups are not just useful for responding to load; they also increase reliability. However, AWS does not offer a built-in way to cycle an Auto Scaling group unless you are using AWS CloudFormation, which is not a sufficiently robust tool for many environments. This makes deploying new launch configurations into an ASG time-consuming and boring. That is where bouncer comes in: it can replace all of the instances in an Auto Scaling group with new ones automatically. A simple bouncer invocation looks like:

bouncer serial -a <asg-name>:<desired-capacity>

Bouncer looks for instances not running the latest launch configuration, terminates each one, and waits for its replacement to launch successfully before moving on to the next instance. Any instance that fails to launch aborts the bouncer run.

Auto Scaling groups

Engineering infrastructure in the cloud means engineering around failure. In a world of transient failures, automated recovery is necessary. Auto Scaling groups provide an effective solution for recovering EC2 instances without human intervention. They can handle common workflows such as restarting an unhealthy service, replacing a degraded instance, and rolling out new versions. At Palantir we run almost every EC2 instance inside of an Auto Scaling group, including services that have only a single instance.
 
Auto Scaling groups are configured by specifying a desired number of hosts and a template known as a launch configuration. A launch configuration specifies the AMI ID, instance type, security groups, and other customizations. However, launch configurations are instructions for how to create new instances, not a requirement for what current instances should look like. Changing the launch configuration of an Auto Scaling group does not trigger any action, which means it's up to the operator to roll out the new nodes and terminate the old ones. This makes routine operations (such as rolling out an updated image) costly.
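
Launch configurations are also immutable, so updating one really means creating a new configuration and pointing the group at it. A rough sketch with the AWS CLI (all names and IDs here are hypothetical) makes the problem concrete: the second command changes what future instances will look like, but nothing happens to the instances that are already running.

# Create a new, immutable launch configuration
aws autoscaling create-launch-configuration \
  --launch-configuration-name my-service-lc-v2 \
  --image-id ami-0abc1234def567890 \
  --instance-type m5.large \
  --security-groups sg-0123456789abcdef0

# Point the existing group at it; running instances are left untouched
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-service-asg \
  --launch-configuration-name my-service-lc-v2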
 
This leaves the operator in a position where the desired state (the launch configuration) doesn't match the deployed state (the running nodes). It's also not obvious in the AWS console that your Auto Scaling group consists of a mix of new and old nodes. This process is ripe for automation in order to minimize the chance of the desired state remaining diverged from the actual state.
 
In order to automate this process, you need a solution to (a sketch of step 2 follows the list):

  1. Identify that a new launch configuration has been created
  2. Determine which nodes are old and which are up-to-date
  3. Terminate old nodes and wait for new nodes to come up healthy
  4. Report status
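
Step 2, for example, can be answered entirely from the AWS API: DescribeAutoScalingGroups returns both the group's current launch configuration name and the launch configuration each instance was created from. A minimal sketch with the AWS CLI and jq (the group name is hypothetical):

# Print the IDs of instances whose launch configuration differs from the group's current one
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-service-asg \
  | jq -r '.AutoScalingGroups[0]
      | .LaunchConfigurationName as $current
      | .Instances[]
      | select(.LaunchConfigurationName != $current)
      | .InstanceId'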

We wanted to solve this problem in a simple way that didn’t require building a stateful application or trying to juggle a convoluted set of Lambda functions, so we built bouncer.

Bouncer

Bouncer is a simple Go binary that automates transitioning all instances in an ASG to an updated launch configuration. It is designed to be run as a stand-alone binary or as part of a CD process. It can operate in two modes, serial and canary:

  • Serial mode is an in-place deployment strategy, terminating and adding instances one at a time.
  • Canary mode is a blue/green deployment strategy: first test a single new instance, then add enough new instances to double the desired capacity, and finally remove all old instances once the new ones are in service.

Serial mode works well for services that do not scale horizontally. For example, our Jenkins server uses a persistent EBS volume to store job data. Using serial mode means that the old instance has to release the EBS volume before the new instance is launched. This simplifies the automation by guaranteeing that the EBS volume is free by the time the new instance needs it.
 
Canary mode works well for services that scale horizontally. For example, when we cycle our Jenkins workers, the Jenkins server will happily accept 50 new workers and tolerate the subsequent loss of the 50 old ones.
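
Invoking a canary rollover looks just like the serial invocation above, only with the canary subcommand:

bouncer canary -a <asg-name>:<desired-capacity>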

At Scale

At Palantir, we run hundreds of services from many different sources across thousands of hosts. They all have different configuration methods, file system layouts, and lifecycle management needs. Managing all of this configuration by hand is not practical, so as a company we've standardized on Terraform and Terraform Enterprise as CI/CD for infrastructure. While Terraform manages and automates updating launch configurations across all of our services, it does not come with a built-in mechanism for cycling an AWS Auto Scaling group.
 
Prior to bouncer, service owners had their own rollover methods, usually a one-off script for cycling the nodes as desired. We wanted to incorporate our rollouts into the Terraform CI/CD pipeline. The solution was to have Terraform invoke bouncer to perform the rollover as part of the apply:

resource "null_resource" "instance_bouncer" {
triggers {
lc_change = "${aws_autoscaling_group.service.launch_configuration}"
}
  provisioner "local-exec" {
command = "./bouncerw serial -a '${aws_autoscaling_group.service.name}:${aws_autoscaling_group.service.desired_capacity}'"
}
}

The triggers portion of Terraform's null_resource is a map of interpolated strings that tells Terraform when to re-run the specified provisioners. In this case, any time the launch configuration of our Auto Scaling group changes, this null_resource is re-created and its provisioner re-run, invoking bouncer. Bouncer then iterates through each old instance one at a time, terminates it, and waits for the replacement to be launched and become healthy. This integration with Terraform works very smoothly:

  • If a Terraform apply succeeds, it means that the Auto Scaling group has been fully cycled.
  • If bouncer fails to cycle the instances, it will fail the Terraform apply.
  • The service owner can then investigate, fix the issue, and re-run the Terraform apply.
  • Bouncer is designed to smoothly handle being re-run after a failed rollout; see the README for more details.

By cycling the nodes as part of the Terraform apply, there is a much tighter loop between the declared state (code) and the deployed state. In a world where the nodes are cycled out-of-band from the code, the gap between the Terraform code and what is deployed grows.
 
Ordering. Sometimes services have many components which need to be bounced in a particular order. Terraform is designed to understand resource dependencies and resolve them in the correct order, and our calls to bouncer from a null_resource allow us to codify that order. Take, for example, our hashistack (a Consul, Vault, and Nomad cluster): Consul and Vault servers need to be bounced serially in order to maintain quorum, while Nomad workers can be cycled using the canary strategy:

resource "null_resource" "consul_server_bouncer" {
# Changes to any instance of the cluster requires re-provisioning
triggers {
lc_change = "${join(",", aws_autoscaling_group.consul_server.*.launch_configuration)}"
}
  provisioner "local-exec" {
# Redeploy all nodes in these ASGs
command = "./bouncerw serial -a '${join(",", aws_autoscaling_group.consul_server.*.name)}'"
}
}
resource "null_resource" "vault_server_bouncer" {
# Changes to any instance of the cluster requires re-provisioning
triggers {
lc_change = "${join(",", aws_autoscaling_group.vault_server_individual.*.launch_configuration)}"
}
  provisioner "local-exec" {
# Redeploy all nodes in these ASGs
command = "./bouncerw serial -a '${join(",", aws_autoscaling_group.vault_server_individual.*.name)}' -p '${join(",", formatlist("./%s/vault-step-down.sh %s %s.%s", path.module, aws_autoscaling_group.vault_server_individual.*.name, var.vault_dns_name, var.zone_name))}'"
}
  depends_on = [
"null_resource.consul_server_bouncer",
]
}
resource "null_resource" "nomad_worker_bouncer" {
# Changes to any instance of the cluster requires re-provisioning
triggers {
lc_change = "${aws_autoscaling_group.nomad_worker.launch_configuration}"
}
  provisioner "local-exec" {
# Bounce all nodes in this ASG using the canary method
command = "./bouncerw canary -a '${aws_autoscaling_group.nomad_worker.name}:${var.worker_count}'"
}
  depends_on = [
"null_resource.consul_server_bouncer",
"null_resource.vault_server_bouncer",
]
}

We leverage the depends_on attribute of Terraform resources to explicitly define the dependency graph for our service rollovers. Now we can roll out complex services in the correct order, failing when any upstream error occurs.
 
You may notice that we are invoking bouncerw instead of bouncer. It is a script that downloads the latest version of bouncer and then passes all of its arguments through to the binary. This is a pattern we use often at Palantir: it allows us to update commonly used binaries without committing the binary directly into source control, and it keeps the invocation simple by encapsulating the code to download and prepare the binary.
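
As a minimal sketch of the idea (the version, URL, and cache path below are placeholders, not the actual script), a bouncerw-style wrapper only needs to fetch the binary if it is missing and then exec it with the caller's arguments:

#!/usr/bin/env bash
# Illustrative sketch of a bouncerw-style wrapper; the URL and paths are placeholders.
set -euo pipefail

BOUNCER_VERSION="0.1.0"                                            # hypothetical pinned version
BOUNCER_URL="https://example.com/bouncer/${BOUNCER_VERSION}/bouncer"
BOUNCER_BIN=".bouncer/bouncer-${BOUNCER_VERSION}"

# Download the binary once and cache it next to the Terraform code
if [ ! -x "${BOUNCER_BIN}" ]; then
  mkdir -p "$(dirname "${BOUNCER_BIN}")"
  curl -fsSL -o "${BOUNCER_BIN}" "${BOUNCER_URL}"
  chmod +x "${BOUNCER_BIN}"
fi

# Pass every argument straight through to the real binary
exec "${BOUNCER_BIN}" "$@"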

Health Checks

When you're relying on automation to perform infrastructure rollovers, it is essential that the automation can evaluate the health of the new pieces before moving on. Automatically replacing old, healthy nodes with new, unhealthy ones is not ideal, so bouncer must be able to determine whether instances are healthy before proceeding.
 
Bouncer is not guaranteed to have network access to any of the instances it is cycling, so the health of an instance is determined using the AWS API. AWS Auto Scaling groups provide two basic ways to do this: the EC2 health check and the ELB health check. The EC2 health check is almost always too simplistic, as it essentially only confirms that the instance is running and reachable over the network, which is obviously necessary but not nearly sufficient. ELB health checks are useful in many cases because passing them means that user traffic is flowing to the instance. However, we have several clusters running in Auto Scaling groups where one or more instances are hot standbys; these instances intentionally appear unhealthy to the ELB so as not to receive user traffic. Enter Auto Scaling lifecycle hooks.
 
Lifecycle hooks pause the launch or termination of an ASG instance. We use the launch hook to wait for an instance to become healthy before the launch process continues. The contract for Auto Scaling groups that use bouncer is that once an instance reaches the "in service" state, it can be assumed to be healthy. It is imperative, then, that a failed health check is correctly reported to the hook. The health check takes the form of scripts that run on each instance and evaluate the readiness of the service(s) that are supposed to be running; these scripts are also responsible for making the API call that tells the hook whether or not to proceed. Each service can now use a customized health check, while bouncer's logic stays simple and compatible with more basic use cases.
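
The launch hook itself has to exist before instances can wait on it. As a sketch, it can be created with the AWS CLI; the hook and group names match the example below, while the heartbeat timeout and the ABANDON default (discard an instance whose health check never reports back) are illustrative choices:

aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name my-launch-hook \
  --auto-scaling-group-name my-asg \
  --lifecycle-transition autoscaling:EC2_INSTANCE_LAUNCHING \
  --heartbeat-timeout 600 \
  --default-result ABANDON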
 
Here is a simple check that waits for a Jenkins server to come up and then informs the lifecycle hook that it's safe to proceed:

# Poll until the Jenkins login page responds
while true; do
  curl --fail -sL localhost:8080/login -o /dev/null \
    && break \
    || sleep 10
done

# Tell the launch lifecycle hook it is safe to put this instance in service
aws autoscaling complete-lifecycle-action \
  --lifecycle-action-result CONTINUE \
  --instance-id $(ec2metadata --instance-id) \
  --lifecycle-hook-name my-launch-hook \
  --auto-scaling-group-name my-asg
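
As written, the loop above never gives up: if the service never becomes healthy, the hook's heartbeat timeout eventually expires and its default result applies. A variant that bounds the wait and reports the failure explicitly might look like this (the 30-attempt limit is an arbitrary choice):

# Give Jenkins up to 30 attempts (roughly five minutes) to come up
result=ABANDON
for attempt in $(seq 1 30); do
  if curl --fail -sL localhost:8080/login -o /dev/null; then
    result=CONTINUE
    break
  fi
  sleep 10
done

# Report either success or failure to the lifecycle hook
aws autoscaling complete-lifecycle-action \
  --lifecycle-action-result "${result}" \
  --instance-id $(ec2metadata --instance-id) \
  --lifecycle-hook-name my-launch-hook \
  --auto-scaling-group-name my-asg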

Wrap Up

One of the places where this workflow has had the biggest impact is rolling out security patches in a timely manner. Bouncer allows us to take newly patched AMIs and rely on Terraform Enterprise to roll them out to all of our services. If there are failures, they show up as Terraform apply errors, so an operator no longer needs to manually check each service; they simply look for failures in Terraform Enterprise. Now one person can patch dozens of services with minimal effort. If an error does occur and is not caught, it is easy for us to add new checks to our scripts, which lets us keep improving the quality of our rollout process.
 
We are passionate about operating infrastructure efficiently; if you are too, come work with us. Interested in using or contributing to bouncer? Check it out on GitHub.

Authors: Andrew K., Elliot G., Holt W., Umer S.