Reliability, high availability, and uptime are the most important things for a DevOps engineer; yet, as soon as a new technology is released, we tend to get our hands on it before anyone else does. The price to pay for that? The risk of compromising the very principles I just mentioned. Hopefully only for a short period, though.

For the last year or so, I’ve been working mostly with AWS infrastructures, Terraform and Ansible. They’re all great but, yes, they could sometimes be better; or, from time to time, we could be quicker and smarter at spotting, understanding and solving an issue before it jeopardizes the stability of our web infrastructure.

This story, my story, is about the time we decided to move from AWS’ Elastic Load Balancers (ELB) to the relatively new Application Load Balancers (ALB). It’s a smart move if you want to get rid of the clever proxy configurations that make your nginx/apache vhosts look uglier and uglier, and replace them with simple routing rules added directly to your load balancer.

NB: I’m not 100% sure that AWS forgot to work on something specific or left some bits and pieces behind but, as a result of my research, it seems like it. If anyone solved this issue before me without any hack, please, do let me know what I’ve missed. It would be much appreciated :)

Preparation

POC

Playing with AWS components is quite simple, especially through the management console UI; that was my starting point. After going through the ALB documentation, I managed to configure, deploy and test my application using the load balancer’s routing rules. It all worked like a charm; I knew I was forgetting something, although it wasn’t clear to me yet what.

Terraform

Time to automate what had been done manually: it’s time to write a new Terraform manifest.

After deploying an ALB from scratch, using the AWS console, the TF bit was very easy as I knew exactly what I was looking for and what components an ALB needs in order to work.

Here’s an example of ALB Terraform manifest:

#######
# ALB #
#######

# ALB Security Group
resource "aws_security_group" "alb" {
  name   = "alb"
  vpc_id = "${aws_vpc.main.id}"

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags {
    Name = "alb"
  }
}

# Application Load Balancer
resource "aws_alb" "alb" {
  name            = "alb"
  internal        = false
  security_groups = ["${aws_security_group.alb.id}"]
  subnets         = ["${aws_subnet.alb.*.id}"]

  tags {
    Name = "alb"
  }
}

# HTTPS listener
resource "aws_alb_listener" "alb-https" {
  load_balancer_arn = "${aws_alb.alb.arn}"
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2015-05"
  certificate_arn   = "${var.web_ssl_certificate_id}"

  default_action {
    target_group_arn = "${aws_alb_target_group.web.arn}"
    type             = "forward"
  }
}

# HTTP listener
resource "aws_alb_listener" "alb-http" {
  load_balancer_arn = "${aws_alb.alb.arn}"
  port              = "80"
  protocol          = "HTTP"

  default_action {
    target_group_arn = "${aws_alb_target_group.web.arn}"
    type             = "forward"
  }
}

# Route /api* requests to the api target group
resource "aws_alb_listener_rule" "api-https" {
  listener_arn = "${aws_alb_listener.alb-https.arn}"
  priority     = 1

  action {
    type             = "forward"
    target_group_arn = "${aws_alb_target_group.api.arn}"
  }

  condition {
    field  = "path-pattern"
    values = ["/api*"]
  }
}

# Target Groups
resource "aws_alb_target_group" "web" {
  name                 = "web-alb-tg"
  port                 = 8080
  protocol             = "HTTP"
  vpc_id               = "${aws_vpc.main.id}"
  deregistration_delay = 0

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    path                = "/health/"
  }
}

resource "aws_alb_target_group" "api" {
  name     = "api-alb-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = "${aws_vpc.main.id}"

  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 3
    path                = "/api/health"
  }
}

NB: You can’t tell a Target Group which Auto Scaling group it has to use; it’s the other way round: you must define the target_group_arns list as an attribute of your ASG.
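In Terraform that attachment looks roughly like this (the ASG, launch configuration and subnet names are illustrative, not part of the manifest above):

```hcl
resource "aws_autoscaling_group" "web" {
  name                 = "web-asg"
  launch_configuration = "${aws_launch_configuration.web.name}"
  vpc_zone_identifier  = ["${aws_subnet.web.*.id}"]
  min_size             = 2
  max_size             = 4

  # The ASG points at the Target Groups, not vice versa
  target_group_arns = ["${aws_alb_target_group.web.arn}"]
}
```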

Deployment

Terraform apply -> Staging

It’s time to see whether my new manifest works; a staging deployment sounds like the most sensible choice before going straight to production.

All the tests looked fine: the routes worked and, after switching the DNS from the ELB to the ALB, we could see the traffic coming through correctly. It all seemed perfect, although we forgot to test probably the most important bit before going live: an application deployment using CodeDeploy!

Terraform apply -> Production

The ALB gets deployed without a problem and, after switching the DNS on prod too, we start celebrating, thinking we’ve achieved a pretty smooth transition. We couldn’t have been more wrong!

A developer decides to deploy new code to the web cluster. We hear the screams, not just the developer’s but the business’ too: random Bad Gateway errors, but only during the deployment. Then they stopped!

Basically, this is what happened: during the deployment, neither the ASG nor the Target Group deregistered the instances automatically, so the old code got swapped with the new one in front of people’s eyes. Not ideal, I would say!

Troubleshooting

Help me Google, you’re my only hope!

Exactly as anyone else would do, I immediately jumped on Google and searched for aws autoscaling groups alb, aws codedeploy target groups and a lot of similar queries. Nothing, nada, zip: there was absolutely nothing out there. I had two options:

  • Roll back and use the ELBs again; or
  • Roll forward and find a sensible solution!

Given that either option could take up to a working day, or more, trying the roll-forward way seemed the obvious choice, and so I did!

Pre and Post deployment

As part of the CodeDeploy configuration, we run a couple of scripts during the deployment, specifically before and after the code gets deployed.
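For reference, these hooks are wired up in the application’s appspec.yml; a sketch along these lines (the paths, destination and timeouts are illustrative, not our actual configuration):

```yaml
version: 0.0
os: linux
files:
  - source: /
    destination: /var/www/app
hooks:
  BeforeInstall:
    - location: scripts/before_install.sh
      timeout: 300
      runas: root
  AfterInstall:
    - location: scripts/after_install.sh
      timeout: 300
      runas: root
```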

My idea was simple: AWS doesn’t do it automatically? Then I shall fill the gap, perhaps using their API through the awscli. It sounded tricky at first but, thinking about it, it was the best thing I could actually do.
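As a taste of what the awscli gives you: if you’d rather not hard-code the target group ARN, you can look it up by the Target Group’s name. A minimal sketch (the `get_tg_arn` helper is my own; the name matches the Terraform manifest above):

```shell
#!/bin/bash
# Look up a Target Group's ARN by its name
#
# Usage: get_tg_arn <target_group_name>
#
get_tg_arn() {
  aws elbv2 describe-target-groups \
    --names "$1" \
    --query 'TargetGroups[0].TargetGroupArn' \
    --output text
}
```

You would then call it as `AWS_TG_ARN=$(get_tg_arn web-alb-tg)`.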

The Scripts

Before install

Before installing the application, the script calls the AWS API to find the target group the instance is registered with, starts the deregistration, and then lets the rest of the deployment process go ahead. The script looks like this:

#!/usr/bin/env bash
set -a # no `set -e` here, or the script would exit before our error handlers run
# You could pass the target group ARN as an argument of the script,
# or fetch it with `aws elbv2 describe-target-groups`
AWS_TG_ARN=$1
# Include utils script
. $(dirname $0)/aws_utils.sh
INSTANCE_ID=$(get_instance_id)
if [ $? != 0 -o -z "$INSTANCE_ID" ]; then
  error_exit "Unable to get this instance's ID; cannot continue."
fi
if ! deregister_instance $INSTANCE_ID $AWS_TG_ARN; then
  error_exit "Unable to deregister $INSTANCE_ID from $AWS_TG_ARN; cannot continue."
fi
# You should also check the instance's state here (draining|unused|unhealthy|etc.)

The script uses a few functions defined in the file aws_utils.sh, stored in the same directory as the before_install script. Here’s what it looks like:

#!/bin/bash
export PATH="$PATH:/usr/bin:/usr/local/bin"

# Get this EC2 instance's ID from the metadata service
#
# Usage: get_instance_id
#
get_instance_id() {
  curl -s http://169.254.169.254/latest/meta-data/instance-id
  return $?
}

# Print an error message to stderr and exit with code 1
#
# Usage: error_exit <message_as_string>
#
error_exit() {
  local message=$1
  echo "[FATAL] $message" 1>&2
  exit 1
}

# Deregister instance from Target Group
#
# Usage: deregister_instance <instance_id> <target_group_arn>
#
deregister_instance() {
  local instance_id=$1
  local tg_arn=$2
  aws elbv2 deregister-targets --target-group-arn $tg_arn --targets Id="$instance_id"
  return $?
}

# Register instance to Target Group
#
# Usage: register_instance <instance_id> <target_group_arn>
#
register_instance() {
  local instance_id=$1
  local tg_arn=$2
  aws elbv2 register-targets --target-group-arn $tg_arn --targets Id="$instance_id"
  return $?
}

After install

Once the code installation/configuration is done, we have to register the instance back to the Target Group. Here’s how:

#!/usr/bin/env bash
set -a # no `set -e` here, or the script would exit before our error handlers run
# You could pass the target group ARN as an argument of the script,
# or fetch it with `aws elbv2 describe-target-groups`
AWS_TG_ARN=$1
. $(dirname $0)/aws_utils.sh
INSTANCE_ID=$(get_instance_id)
if [ $? != 0 -o -z "$INSTANCE_ID" ]; then
  error_exit "Unable to get this instance's ID; cannot continue."
fi
if ! register_instance $INSTANCE_ID $AWS_TG_ARN; then
  error_exit "Unable to register $INSTANCE_ID to $AWS_TG_ARN; cannot continue."
fi
# You should also check the instance's state here (initial|healthy|unhealthy|etc.)
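Both scripts end with a reminder to check the instance’s state in the Target Group. A minimal sketch of such a check, polling `aws elbv2 describe-target-health` until the target reaches the expected state (the function names and retry limits are my own, not part of our actual scripts):

```shell
#!/bin/bash
# Get the current state of an instance in a Target Group
#
# Usage: get_target_state <instance_id> <target_group_arn>
#
get_target_state() {
  aws elbv2 describe-target-health \
    --target-group-arn "$2" \
    --targets Id="$1" \
    --query 'TargetHealthDescriptions[0].TargetHealth.State' \
    --output text
}

# Poll until the instance reaches the expected state, or give up
#
# Usage: wait_for_state <expected_state> <instance_id> <target_group_arn>
#
wait_for_state() {
  local expected=$1
  local instance_id=$2
  local tg_arn=$3
  local tries=0
  while [ $tries -lt 30 ]; do
    if [ "$(get_target_state "$instance_id" "$tg_arn")" = "$expected" ]; then
      return 0
    fi
    tries=$((tries + 1))
    sleep 5
  done
  return 1
}
```

You could call `wait_for_state unused $INSTANCE_ID $AWS_TG_ARN` after deregistering, and `wait_for_state healthy $INSTANCE_ID $AWS_TG_ARN` after registering back.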

And that was it! A few lines, no drama, and the site up for the whole duration of the deployment.

Conclusions

What we’ve learned today:

  • Terraform is still cool;
  • There’s a rather big difference between ELBs and ALBs, and not everything is documented;
  • Google sometimes gives you the answers you’re looking for, and sometimes it doesn’t;
  • Sometimes the right, or quicker, solution is right under your nose; just don’t give up on it; and
  • To be honest, Google helped this time too: you can find some good examples developed by AWS’ engineers here.

Thanks for reading this article, and feel free to ask or suggest anything. As mentioned before, it’s very likely that I simply failed to find the proper solution in the AWS docs; then again, in computer science there are no good or bad solutions: your code, or work, is good as long as it solves your problem and, more importantly, it works :)
