Static IP Applications on AWS ECS

Here at ACL, we face several kinds of challenges: from cutting-edge development, to feature enhancements, to clever problem solving. This article is about the last of these.

Background

A bit of background is needed for context before we jump into the juicy technical details. Around mid-August this year we finished migrating most of our applications to AWS Elastic Container Service (ECS), a long project that took approximately 18 months to complete. Unfortunately, one of our applications was challenging to migrate because of a third-party component developed to run on pets instead of cattle. That is, the application assumed it would run on the same Linux host throughout its lifetime. This raised several roadblocks we had to work around for our particular application to function in a dynamic environment such as ECS.

Challenge

This third-party component, having been developed without cloud computing in mind, posed the following challenge when deployed on our shiny, new ECS infrastructure:

How do we deal with dynamic ECS tasks that change IPs and ports all the time?

More specifically, the application assumed that the IP address of the underlying host would never change, and that it would always be able to run on port 80. Neither of these assumptions is guaranteed when using ECS. On top of that, this third-party component runs on multiple servers that need to communicate with each other in a mesh fashion.

Attempt #1

Our first thought was to use two different Application Load Balancers with host-based routing listeners and specific Route 53 fully qualified domain name (FQDN) records. One load balancer would be responsible for handling communication between the different ECS tasks running inside the application; in this case, each task communicates with other internal tasks by referencing an internal DNS name, such as app1.local. The other load balancer would receive requests from our users at, for instance, app.public.com.

Unfortunately, this idea wasn’t possible because it required the same ECS Service to be associated with several Target Groups, which isn’t supported. Although we could manage the tasks manually, we didn’t want the extra overhead. Here’s a diagram of the failed idea.

Attempt #2

Our second idea involved using awsvpc with fixed IPs. What awsvpc does is attach an Elastic Network Interface (ENI) directly to your container, effectively giving it an exclusive IP address. The plan was to configure our ECS Tasks to use awsvpc and associate a private IP with the created ENI.

It turned out this attempt was also not supported, because the ENIs created by the awsvpc configuration are managed exclusively by Amazon, which means we, as users, can't do anything to them. Besides, we were also unable to use awsvpc in our ECS Task Definitions because it is incompatible with the Docker links configuration.

The Solution—Third Time’s a Charm

Based on attempt #1 (the Route 53 FQDN records idea) and attempt #2 (updating the ENI with a specific IP address), we realized there was another way: update the Route 53 FQDN records with the IP addresses ourselves. That is, as ECS starts up a new container, we find the IP address of the container's host, then modify the Route 53 FQDN record as needed.

In addition, we needed to map the application container port to host port 80 using Docker's bridge network driver, allowing incoming web traffic. (We could have used host mode, but we decided to play it safe because we hadn't tested our applications in that network mode yet.) This allowed us to talk to the application through the private IP of the EC2 instance when reaching it from inside the AWS Virtual Private Cloud (VPC). We achieved this by adding the portMappings configuration to our ECS Task Definitions, which is the equivalent of running docker run -p 80:80 our_image:latest.

"portMappings": [
{
"hostPort": 80,
"containerPort": 80
}
]
Container port is statically mapped to host port 80
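
For reference, the same mapping can also be registered programmatically. The sketch below, assuming hypothetical names such as the our-app family and the our_image:latest image, shows the relevant fragment of a register_task_definition call with boto3; a real Task Definition would of course carry more settings.

import boto3

ecs = boto3.client("ecs")

# Minimal sketch: register a Task Definition whose container port 80 is
# statically mapped to host port 80, using the bridge network mode.
ecs.register_task_definition(
    family="our-app",                     # hypothetical family name
    networkMode="bridge",
    containerDefinitions=[
        {
            "name": "our-app",            # hypothetical container name
            "image": "our_image:latest",  # image referenced earlier
            "memory": 512,
            "portMappings": [
                {"containerPort": 80, "hostPort": 80, "protocol": "tcp"}
            ],
        }
    ],
)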

Next, we needed to add the FQDN record somewhere. For this, we used a Private Hosted Zone in Route 53: in other words, a "private DNS" that only services inside the VPC can access. In our case, we associated the domain local with this Private Hosted Zone.

The Private Hosted Zone and its FQDN records
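
Creating such a zone is a one-off step. As a rough illustration, assuming a hypothetical VPC ID and region, the local Private Hosted Zone could be created with boto3 along these lines:

import uuid

import boto3

route53 = boto3.client("route53")

# Sketch only: create a Private Hosted Zone for the "local" domain and
# associate it with the VPC so that only services inside the VPC can resolve it.
route53.create_hosted_zone(
    Name="local",
    CallerReference=str(uuid.uuid4()),     # must be unique per request
    VPC={
        "VPCRegion": "us-east-1",          # hypothetical region
        "VPCId": "vpc-0123456789abcdef0",  # hypothetical VPC ID
    },
    HostedZoneConfig={
        "Comment": "Private DNS for internal ECS services",
        "PrivateZone": True,
    },
)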

So now, given that ECS Tasks are dynamic, can be placed on any EC2 instance, and change over time, how can we add the EC2 instance's private IP to the Private Hosted Zone in Route 53? CloudWatch Events and a Lambda function to the rescue.

Every time an ECS Task starts, ECS emits a CloudWatch Event called "ECS Task State Change", which contains, among other information, the Container Instance ARN, the ECS cluster name, the Task's desired status, the Task's current status, and the AWS region. The trick now is to get the EC2 instance's IP address, which is possible with a small chain of API calls across different AWS services (see the sketch after the steps below):

  1. Call describeContainerInstances from the ECS API to get the EC2 instance ID
  2. Then call describeInstances from the EC2 API with the ID returned in step #1 to get the EC2 instance IP address
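
In boto3 terms, this lookup chain boils down to something like the following sketch (the helper name is just illustrative, and error handling is omitted):

import boto3

ecs = boto3.client("ecs")
ec2 = boto3.client("ec2")


def host_private_ip(cluster, container_instance_arn):
    """Resolve the private IP of the EC2 instance hosting an ECS Task."""
    # Step 1: ask ECS which EC2 instance backs this container instance.
    response = ecs.describe_container_instances(
        cluster=cluster, containerInstances=[container_instance_arn]
    )
    instance_id = response["containerInstances"][0]["ec2InstanceId"]

    # Step 2: ask EC2 for that instance's private IP address.
    response = ec2.describe_instances(InstanceIds=[instance_id])
    return response["Reservations"][0]["Instances"][0]["PrivateIpAddress"]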

Now, having all this information, we were able to dynamically add the required FQDNs in Route 53 using a Lambda function triggered by this CloudWatch Event.
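
The Route 53 side of that Lambda function is then a single UPSERT of an A record. Here is a minimal sketch, assuming a hypothetical hosted zone ID and reusing the app1.local name from earlier; combined with the lookup shown above, this is essentially all the function has to do:

import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000000000000"  # hypothetical Private Hosted Zone ID


def upsert_record(fqdn, private_ip):
    """Create or update an A record pointing the FQDN at the EC2 host."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Changes": [
                {
                    "Action": "UPSERT",  # creates the record or replaces an old IP
                    "ResourceRecordSet": {
                        "Name": fqdn,
                        "Type": "A",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": private_ip}],
                    },
                }
            ]
        },
    )


# e.g. upsert_record("app1.local", "10.0.1.42")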

A small detail to keep in mind is that an "ECS Task State Change" event is sent every time any ECS Task transitions between states, such as pending, running, and stopped.

Therefore, because we didn't want our Lambda function triggering on every single event, but only when the task is effectively running, we created the CloudWatch Event Rule with the following pattern:

{
  "source": [
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Task State Change"
  ],
  "detail": {
    "desiredStatus": [
      "RUNNING"
    ],
    "lastStatus": [
      "RUNNING"
    ]
  }
}

This ensures that our Lambda function is only triggered when the ECS Task is indeed running.
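
For completeness, the rule itself only takes a couple of API calls to set up. Here is a rough sketch with boto3, assuming a hypothetical rule name and Lambda ARN (the Lambda also needs a resource-based permission allowing CloudWatch Events to invoke it, which is omitted here):

import json

import boto3

events = boto3.client("events")

EVENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"desiredStatus": ["RUNNING"], "lastStatus": ["RUNNING"]},
}

# Sketch: create the rule with the pattern above and point it at the Lambda.
events.put_rule(
    Name="ecs-task-running",  # hypothetical rule name
    EventPattern=json.dumps(EVENT_PATTERN),
    State="ENABLED",
)
events.put_targets(
    Rule="ecs-task-running",
    Targets=[
        {
            "Id": "update-route53-lambda",
            "Arn": "arn:aws:lambda:us-east-1:123456789012:function:update-route53",  # hypothetical
        }
    ],
)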

Here, then, is the solution presented visually (abstracting away the CloudWatch Event Rule), together with its respective steps.

Step by step
  1. The ECS task starts and has its port mapped to port 80 on the EC2 instance
  2. ECS sends the “ECS Task State Change” Event and triggers the Lambda function
  3. The Lambda function reaches out to EC2, ECS, and Route 53 to get the information needed
  4. The Lambda function adds or updates the record, depending on whether it is missing or outdated
  5. [Failure or Autoscaling] A new EC2 instance comes online
  6. [Failure or Autoscaling] The old EC2 instance is terminated

Conclusion

This was an interesting problem to solve, and we ended up learning the details of the ECS Task State Change event and how we can leverage the information available there for our needs. The solution's implementation, though, is highly likely to also work when using the awsvpc network mode, although I didn't test it myself. I'll leave this idea here in case someone wants to check it. ;)

The downside of this solution is that we had to compromise and run only one application container per EC2 instance, on port 80.