ECS Service Discovery with Lambda, DNS, and HAProxy

One of the benefits of moving to a containerized architecture is the ability to dynamically expand and contract service capacity as needed. But doing so introduces the problem of Service Discovery: how to find the current set of live containers at any given time.

While there are existing service registry solutions such as Consul, etcd, and ZooKeeper, each requires its own infrastructure to be set up and managed. Instead of running a dedicated service registry, we can use the existing AWS services CloudWatch, Lambda, and Route 53 as one, with HAProxy proxying and load balancing among the current set of live containers.

Architecture
As our ECS containers change state, they emit CloudWatch events. For our Service Discovery problem, we are interested in the events that represent when a container task has started or stopped within a particular service cluster.

We route these events to a Lambda function, which registers and de-registers the EC2 instances hosting the container tasks as SRV records in a Route 53 private hosted zone associated with our VPC.

HAProxy is configured to keep its list of backend servers for our service up to date via DNS SRV records, and to proxy and load balance among the current live servers.

Let’s implement this solution using Terraform to see how it all fits together.

Private DNS
First we need to set up a Private Hosted Zone in Route 53 associated with our VPC. This is where our service discovery SRV DNS records will live. Please note the potential foot gun when Public and Private Hosted Zones have overlapping namespaces, which can break DNS resolution!

The Terraform configs for our Private Hosted Zone:
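A minimal sketch, assuming a recent Terraform and AWS provider; the variable and resource names here are illustrative:

variable "vpc_id" {}
variable "aws_region" {}
variable "zone_name" {} # e.g. "myservice.internal"

# Private Hosted Zone for our service discovery records, visible only
# from within the associated VPC.
resource "aws_route53_zone" "service_discovery" {
  name = var.zone_name

  vpc {
    vpc_id     = var.vpc_id
    vpc_region = var.aws_region
  }
}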

Obviously you will need to provide values for the variables above.

CloudWatch Event Processing
ECS emits lots of different events related to both the EC2 instances our containers run on and the container tasks themselves. For our Service Discovery problem we only care about events representing when our container tasks start and stop, so we register an Event Rule that filters out everything else.

The Terraform configs for our Event Rule:
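A sketch, again with illustrative names, rendering the event pattern from a template file:

# Matches only the ECS task state-change events we care about; everything
# else is filtered out before it ever reaches our Lambda.
resource "aws_cloudwatch_event_rule" "ecs_task_state_change" {
  name        = "ecs-task-state-change"
  description = "ECS tasks starting or stopping in our cluster"

  event_pattern = templatefile("${path.module}/ecs_events_template.tpl", {
    ecs_cluster_arn = var.ecs_cluster_arn # illustrative variable
  })
}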

The contents of the ecs_events_template.tpl file are as follows. You can read more about Event Patterns here.
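A pattern along these lines matches task state changes in our cluster (the ecs_cluster_arn template variable is the illustrative one carried over from the rule above):

{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "clusterArn": ["${ecs_cluster_arn}"],
    "lastStatus": ["RUNNING", "STOPPED"]
  }
}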

Next we need to set up a Lambda function to listen for the CloudWatch events pertaining to our service containers. You can find an example Python version here. Please note I’m not a Python guy, so this may not be idiomatic Python! :)
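If you can't follow the link, here is a heavily simplified sketch of what such a Lambda might look like. It is not the linked example: the environment variable names, the single-network-binding assumption, and the record layout (a per-instance A record plus a merged SRV record set) are all illustrative, and error handling is omitted.

import os

import boto3

ecs = boto3.client("ecs")
ec2 = boto3.client("ec2")
route53 = boto3.client("route53")

TTL = 10


def existing_srv_values(zone_id, name):
    """Return the current values of the SRV record set, or an empty list."""
    resp = route53.list_resource_record_sets(
        HostedZoneId=zone_id, StartRecordName=name,
        StartRecordType="SRV", MaxItems="1")
    for rrset in resp["ResourceRecordSets"]:
        if rrset["Type"] == "SRV" and rrset["Name"].rstrip(".") == name.rstrip("."):
            return [rr["Value"] for rr in rrset["ResourceRecords"]]
    return []


def change(action, name, rtype, values):
    return {"Action": action, "ResourceRecordSet": {
        "Name": name, "Type": rtype, "TTL": TTL,
        "ResourceRecords": [{"Value": v} for v in values]}}


def handler(event, context):
    detail = event["detail"]
    zone_id = os.environ["HOSTED_ZONE_ID"]
    zone_name = os.environ["ZONE_NAME"]  # e.g. "myservice.internal"
    srv_name = os.environ["SRV_NAME"]    # e.g. "_myservice._tcp.myservice.internal"

    # Map the task to the EC2 instance hosting it, and that instance's IP.
    ci = ecs.describe_container_instances(
        cluster=detail["clusterArn"],
        containerInstances=[detail["containerInstanceArn"]],
    )["containerInstances"][0]
    instance = ec2.describe_instances(
        InstanceIds=[ci["ec2InstanceId"]]
    )["Reservations"][0]["Instances"][0]

    target = "{0}.{1}".format(ci["ec2InstanceId"], zone_name)
    port = detail["containers"][0]["networkBindings"][0]["hostPort"]
    srv_value = "1 1 {0} {1}".format(port, target)

    # Merge with (rather than clobber) the SRV values of any other live tasks.
    current = existing_srv_values(zone_id, srv_name)
    changes = []
    if detail["lastStatus"] == "RUNNING":
        # Per-instance A record the SRV target resolves to. (Deleting it on
        # STOPPED is deliberately skipped: other tasks may share the instance.)
        changes.append(change("UPSERT", target, "A",
                              [instance["PrivateIpAddress"]]))
        changes.append(change("UPSERT", srv_name, "SRV",
                              sorted(set(current) | {srv_value})))
    else:  # STOPPED
        remaining = [v for v in current if v != srv_value]
        if remaining:
            changes.append(change("UPSERT", srv_name, "SRV", remaining))
        elif current:
            # DELETE must match the existing record set exactly.
            changes.append(change("DELETE", srv_name, "SRV", current))

    if changes:
        route53.change_resource_record_sets(
            HostedZoneId=zone_id, ChangeBatch={"Changes": changes})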

The Terraform configs for our Lambda function follow. These configure not only the Lambda itself, but also its permissions and a log group where we retain the Lambda function's logs.
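A sketch with illustrative names; the IAM policy granting the role Route 53, ECS, EC2, and CloudWatch Logs access is omitted for brevity:

# Execution role the Lambda assumes.
resource "aws_iam_role" "ecs_sd" {
  name = "ecs-service-discovery"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_lambda_function" "ecs_sd" {
  function_name = "ecs-service-discovery"
  filename      = "lambda.zip" # illustrative packaging
  handler       = "main.handler"
  runtime       = "python3.9"
  role          = aws_iam_role.ecs_sd.arn

  environment {
    variables = {
      HOSTED_ZONE_ID = aws_route53_zone.service_discovery.zone_id
      ZONE_NAME      = var.zone_name
      SRV_NAME       = "_myservice._tcp.${var.zone_name}"
    }
  }
}

# Retain the Lambda's logs for two weeks.
resource "aws_cloudwatch_log_group" "ecs_sd" {
  name              = "/aws/lambda/${aws_lambda_function.ecs_sd.function_name}"
  retention_in_days = 14
}

# Allow CloudWatch Events to invoke the Lambda.
resource "aws_lambda_permission" "cloudwatch_events" {
  statement_id  = "AllowExecutionFromCloudWatchEvents"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.ecs_sd.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.ecs_task_state_change.arn
}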

Finally we need to target our Event Rules to our Lambda function.
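A sketch:

# Deliver matching events from our rule to the Lambda.
resource "aws_cloudwatch_event_target" "ecs_sd" {
  rule = aws_cloudwatch_event_rule.ecs_task_state_change.name
  arn  = aws_lambda_function.ecs_sd.arn
}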

HAProxy
The final piece of the puzzle is HAProxy, which acts as a Service Discovery agent, as well as a proxy and load balancer for the instances in our service.

As of version 1.8, HAProxy has the ability to use DNS for Service Discovery.

First we need to add a resolvers section to our haproxy.cfg pointing at a nameserver to resolve DNS queries. The resolver should be the VPC private resolver, which is always available at the “+2” address of the VPC's CIDR block. For example, if your CIDR block is 10.0.0.0/16, the private resolver is found at 10.0.0.2. We configure this via the NAMESERVER environment variable.
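For example (the section and nameserver names here match the log output below; the retry, hold, and payload values are illustrative):

resolvers awsdns
    # The VPC private resolver, injected via the NAMESERVER env var.
    nameserver dns0 "${NAMESERVER}:53"
    resolve_retries       3
    timeout retry         1s
    hold valid            10s
    # SRV responses can be large; allow bigger DNS payloads.
    accepted_payload_size 8192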

Next we need to configure the backend for our service to use a server-template which will be used for any instances discovered via DNS, and configure it to use our private resolver from above.
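A sketch, assuming the Lambda publishes SRV records at _myservice._tcp.myservice.internal and allowing for up to 10 discovered servers:

backend myservice
    balance roundrobin
    # Provisions server slots myservice1..myservice10, filled in and kept
    # current from the SRV records returned by our private resolver.
    server-template myservice 10 _myservice._tcp.myservice.internal resolvers awsdns check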

Now when new ECS tasks are started in the cluster and registered in DNS via our Lambda, HAProxy will pick up the change and start routing traffic to the new instances. You should see output similar to this in your HAProxy log files:

Jan 24 11:54:12 localhost haproxy[1]: myservice/myservice2 changed its FQDN from (null) to i-01263310c34aca952.myservice.internal by 'SRV record'
Jan 24 11:54:12 localhost haproxy[1]: myservice/myservice2 changed its IP from to 10.0.3.187 by awsdns/dns0.
[WARNING] 023/115412 (1) : myservice/myservice2 changed its IP from to 10.0.3.187 by awsdns/dns0.

When ECS tasks are stopped in the cluster and unregistered from DNS via our Lambda, HAProxy will pick up the change and stop routing traffic to that instance. You should see output similar to this in your HAProxy log files:

[WARNING] 024/010222 (1) : Server myserver/myserver2 is going DOWN for maintenance (No IP for server ). 2 active and 0 backup servers left. 4 sessions active, 0 requeued, 0 remaining in queue.
Jan 25 01:02:22 localhost haproxy[1]: Server myserver/myserver2 is going DOWN for maintenance (No IP for server ). 2 active and 0 backup servers left. 4 sessions active, 0 requeued, 0 remaining in queue.

You can either run a single HAProxy node for the entire service cluster, or run HAProxy as a “side car” application on the same servers as the service clients. There are pros and cons to both approaches: a single node is simpler to operate but adds an extra network hop and a single point of failure, while the side car approach removes both at the cost of running and updating many more HAProxy instances.

References