Fixing Slow AWS ECS Network Performance

Bill O'Neill
Published in The Quiq Blog
Jun 28, 2016

There is a well-known issue with Docker where two containers on the same machine have very poor network performance when communicating with each other. We have empirical evidence of a repeatable and consistent 10x slowdown. The Docker bug is documented here and here. For now, you can get around the issue by passing the --net=host option when you start your Docker container. The problem is that AWS ECS does not support --net=host. This article presents a workaround for that problem that is pretty easy to implement.

AWS ECS (Elastic Container Service) lets you put your Docker container in an ECS Task Definition, but it does not support --net=host. Without it, when your Docker containers communicate with one another on the same host in an ECS cluster, performance is horrid. This was a showstopper for us. We use a lot of microservices, and they run on arbitrary nodes in our auto-scaling ECS cluster. They often share a node with other microservices, so all communication between services on the same node is significantly impacted. There are several ways to fix this. What I am providing below is just one way, but I found it to be the most straightforward for our use.

The Fix

I found a few forks of the AWS ECS agent on GitHub that solve this problem and created my own patch based on some of that work. I forked the AWS code from master on 6/24/16 and built a Docker image that forces ECS to always use net=host.

Installing the Fix

There are three cases. For a new instance built from an ECS-enabled image, you only need to update the “User data” section when you build the instance. For a running instance, you need to replace the ecs-agent image and restart ecs-agent. If you are using auto-scaling groups with ECS, you will also need to update your own scripts. The first two cases are covered below. The third depends on how you set up your auto-scaling group, but the same steps should help.

Update Existing ECS Instance

Simply replace the ecs-agent Docker image and restart ecs-agent:

# Stop the ECS service before touching the agent
stop ecs

# Remove the installed AWS agent container and its image
docker rm ecs-agent
docker rmi amazon/amazon-ecs-agent

# Replace the AWS Docker image with the patched Docker image
docker pull quiqcorp/aws-ecs-net-host:1
docker tag quiqcorp/aws-ecs-net-host:1 amazon/amazon-ecs-agent

# Start the ECS service again so it picks up the patched image
start ecs
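Before restarting, it can be worth confirming that the retag actually took effect. The sketch below is not part of the original fix: the helper names image_id, same_image, and agent_is_patched are made up for illustration, and it assumes the docker CLI is on the PATH. It compares the image ID behind each name, since two tags that point at the same image report the same ID.

```shell
#!/bin/sh
# Hypothetical helpers for checking the retag; names are illustrative only.

# Prints the ID of the image that a name or tag currently resolves to.
image_id() {
    docker inspect --format '{{.Id}}' "$1"
}

# Succeeds when both IDs are non-empty and identical.
same_image() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Succeeds when amazon/amazon-ecs-agent points at the patched image.
agent_is_patched() {
    same_image "$(image_id quiqcorp/aws-ecs-net-host:1)" \
               "$(image_id amazon/amazon-ecs-agent)"
}
```

After sourcing these functions, running agent_is_patched && echo "agent tag points at the patched image" gives a quick yes/no answer.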

Creating New ECS Instances

When creating the instance, update the “User data” shell script so that it places the patched Docker image on the machine, or create your own AMI with the image already in place. Either way, you need to rename (tag) the patched Docker image to amazon/amazon-ecs-agent and restart the ECS agent so it picks up the new image.
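As a rough sketch, a “User data” script for this could reuse the same commands as the existing-instance steps. It makes several assumptions that you should check against your own AMI: the instance uses upstart (so stop ecs and start ecs exist), the script runs as root, and the Docker daemon is already running when user data executes. The function name install_patched_agent and the guard at the end are illustrative, not something the patched image requires.

```shell
#!/bin/bash
# Sketch of a "User data" script for a new ECS instance.
# Assumes an upstart-managed "ecs" job and a running Docker daemon.

PATCHED_IMAGE="quiqcorp/aws-ecs-net-host:1"   # patched agent image from this article
AGENT_IMAGE="amazon/amazon-ecs-agent"         # image name the ECS service starts

install_patched_agent() {
    # The agent may not be running or present yet on a fresh instance,
    # so failures of these cleanup steps are ignored.
    stop ecs || true
    docker rm ecs-agent || true
    docker rmi "${AGENT_IMAGE}" || true

    # Pull the patched image and give it the name ECS expects.
    docker pull "${PATCHED_IMAGE}"
    docker tag "${PATCHED_IMAGE}" "${AGENT_IMAGE}"

    # Start the agent so it runs from the patched image.
    start ecs
}

# Only act on a host that has both Docker and upstart available.
if command -v docker >/dev/null 2>&1 && command -v initctl >/dev/null 2>&1; then
    install_patched_agent
fi
```

If you bake your own AMI instead, the same function body can run once at image build time rather than on every boot.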

Testing

To verify the change is working, run docker inspect on a container that ECS has started and look for "NetworkMode": "host" in the output. If you see "NetworkMode": "default" instead, the fix is not working.
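To avoid reading through the full JSON, the check can be narrowed with docker inspect's --format option. The following is a minimal sketch rather than a definitive tool: mode_verdict and report_network_modes are hypothetical helper names, and it assumes the docker CLI is available on the instance.

```shell
#!/bin/sh
# Hypothetical helpers for checking NetworkMode; names are illustrative only.

# Maps a NetworkMode value to a verdict; "host" means the patch took effect.
mode_verdict() {
    if [ "$1" = "host" ]; then
        echo "ok"
    else
        echo "not patched"
    fi
}

# Prints name, NetworkMode, and verdict for every running container.
report_network_modes() {
    for id in $(docker ps -q); do
        name=$(docker inspect --format '{{.Name}}' "$id")
        mode=$(docker inspect --format '{{.HostConfig.NetworkMode}}' "$id")
        echo "$name $mode $(mode_verdict "$mode")"
    done
}
```

After sourcing the functions, call report_network_modes on the instance. A container's network mode is fixed when it is created, so containers started before the agent was replaced will still report their old mode until ECS restarts them.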

Conclusion & Disclaimer

We have tested this fix and are using it successfully in our environments. We have seen dramatic performance increases: our performance test times dropped from ~10 seconds to ~1 second. That said, this fix is neither supported nor endorsed by Amazon or by my company. Use at your own risk. This change makes all ECS Docker containers use net=host. This has not been a problem for us, but it does introduce potential problems. For example, containers in host mode share the host's network stack, so two containers on the same node cannot listen on the same port. Please understand those potential problems in your environment before using this solution. Once Docker fixes their networking issue, this will no longer be needed.
