linkerd: A service mesh for AWS ECS

linkerd has great support for Kubernetes (k8s). There are plenty of articles online explaining how to configure linkerd for different deployments, including the excellent A Service Mesh for Kubernetes, written by Buoyant itself (the creator of linkerd).

But what about AWS EC2 Container Service (ECS)? Very little content online explains how to set linkerd up there. This article aims to fill that gap without diving into the details of how linkerd itself works, as there’s plenty of that online already.

Unlike Kubernetes, AWS ECS is a container scheduler and very little more. It doesn’t provide features such as service discovery and application configuration; for those, you’ll need an external tool or something you craft yourself. Consul is a great tool for both service discovery and configuration, and linkerd provides good support for it. This article won’t cover the details of setting up a Consul cluster, but this AWS article should help.


Basic configuration

linkerd can be deployed in several configurations, each with its own advantages and disadvantages. The sidecar configuration is the easiest one to set up in AWS ECS. In this configuration, each of your ECS tasks is composed of the web service container and a linkerd container, as in the following diagram:

[Diagram: sidecar configuration]

ECS task definition

Here is an example ECS task definition:

This example uses Buoyant’s helloworld app as the main web service container, which is great for getting started and for debugging the linkerd configuration. It can also simulate slow and unstable services. The container is linked to the linkerd container via Docker links. This means that hello-web-service can talk to linkerd simply by sending requests to http://linkerd:4140, and Docker will take care of routing the request to the sidecar.
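As a rough sketch of the shape such a task definition can take, the snippet below builds one as a Python dictionary and prints it as JSON. The family name, memory figures and untagged image names are assumptions made for illustration, and the helloworld container would also need its own command-line arguments, which are left out here. The `links_are_valid` helper is hypothetical and only checks that every Docker link points at a container in the same task.

```python
import json

# All names and sizes here are illustrative assumptions, not values
# taken from a real deployment.
task_definition = {
    "family": "hello-world",
    "containerDefinitions": [
        {
            "name": "hello-web-service",
            "image": "buoyantio/helloworld",  # assumed image; pin a tag in practice
            "memory": 128,
            "essential": True,
            # Docker link: makes the sidecar reachable as http://linkerd:4140
            "links": ["linkerd"],
        },
        {
            "name": "linkerd",
            "image": "buoyantio/linkerd",  # replace with your custom image
            "memory": 230,
            "essential": True,
        },
    ],
}


def links_are_valid(task_def):
    """Return True if every link targets a container defined in the same task."""
    names = {c["name"] for c in task_def["containerDefinitions"]}
    for container in task_def["containerDefinitions"]:
        for link in container.get("links", []):
            target = link.split(":", 1)[0]  # links may be "name" or "name:alias"
            if target not in names:
                return False
    return True


if __name__ == "__main__":
    assert links_are_valid(task_definition)
    print(json.dumps(task_definition, indent=2))
```

Docker links only work in the bridge network mode, so a task like this cannot use the host network mode.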

linkerd configuration

Here’s the minimal linkerd configuration:

In this configuration, when the hello-web-service container above hits http://linkerd:4140/world (step 1 in the diagram), the service named world is looked up in Consul (if not already cached) and one of the instances registered under that name is hit (step 2). It’s of course possible to set up load balancing, circuit breaking, retries, etc. as part of this configuration; refer to the main documentation for details. A different setup would be to configure linkerd as an HTTP proxy, so that all outgoing requests are routed through it. In that case, the URL to hit the world service simply becomes http://world. The linkerd configuration is slightly different in that case:

To feed the configuration to linkerd, I came up with two possible solutions, both of which involve building a custom linkerd Docker image and pushing it to AWS ECR.

Bake your configuration into your custom linkerd Docker image:

Or download the config from S3 when the linkerd Docker container starts up:

You will also need to change your ECS task definition to point to your custom Docker image instead of Buoyant’s. The first solution is slightly easier, but with the second one you won’t need to build a new Docker image just to change the configuration.

Disadvantages

There are three main disadvantages to the sidecar configuration:

  1. High memory usage: Each linkerd container needs at least 115 MB according to this article, although I haven’t managed to run it under heavy load without crashes with less than 230 MB of reserved memory. Even if each of your ECS tasks only uses an additional 115 MB of memory, the total memory used by linkerd across the cluster will have a fairly big impact on costs.
  2. No encryption: In order to transparently encrypt communications between ECS tasks, linkerd needs to be deployed in a linker-to-linker configuration. Having no encryption might be acceptable for internal traffic in a private VPC, but it is certainly not recommended.
  3. Partial distributed tracing: Tracing requests across different services is very useful for identifying performance issues quickly. When your microservice architecture spans plenty of different services, distributed tracing is a must-have. In a sidecar configuration there is no incoming linkerd router, so there is information about when requests were sent but none about when they reached their recipients.
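To get a feel for the memory cost in point 1, here is a back-of-the-envelope calculation. The task count is a made-up example; the two per-container figures are the ones quoted above.

```python
def sidecar_overhead_mb(task_count, per_linkerd_mb):
    """Total memory reserved for linkerd when every task carries a sidecar."""
    return task_count * per_linkerd_mb


TASKS = 40  # hypothetical number of ECS tasks in the cluster

if __name__ == "__main__":
    print(sidecar_overhead_mb(TASKS, 115))  # 4600 MB at the documented minimum
    print(sidecar_overhead_mb(TASKS, 230))  # 9200 MB at the reservation that held up under load
```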

If you can live with these three things, I would recommend going for this solution, as it’s at least an order of magnitude easier than the more advanced one!


Advanced configuration

To avoid the shortcomings of the basic configuration, we want to deploy linkerd with only one linkerd container running in each ECS instance (i.e. EC2 instance running in the ECS cluster) and in a linker-to-linker configuration, like in the following diagram:

[Diagram: linker-to-linker configuration]

First problem: Schedule an ECS task exactly once per host

ECS doesn’t offer any support for exactly-once scheduling (see issue here). The closest thing to it is One Task Per Host, which allows an ECS task to run at most once per ECS instance, with no guarantee that it runs at all. The linkerd container needs to be running at all times inside the ECS instance, or none of the web services in the same instance will be reachable.

One solution is to run it outside of ECS, on the EC2 instance itself. If you have already set up an ECS cluster, you probably have a Launch Configuration too. Simply start the linkerd container as part of the Launch Configuration User Data. Example script:

Note that the “Memory available” and “CPU available” values in your ECS instances will not account for the memory and CPU used by containers that are running outside of ECS. This also means that the ECS CloudWatch Cluster Reservation and Cluster Utilization Metrics will be slightly skewed.

Second problem: Bi-directional communication from and to linkerd

For linkerd to work in a linker-to-linker configuration, the ECS tasks need to be able to communicate with the outgoing linkerd router when requests are sent, and the incoming router needs to be able to talk to the ECS tasks in the same ECS instance when requests are received. Both outgoing and incoming linkerd routers run inside the same linkerd container.

Docker makes it very easy to have bi-directional communication between containers by using User-defined Networks. Unfortunately, ECS still does not support them (see issue here). It only supports Docker Links, which were deprecated in favour of User-defined Networks around October 2015. Shame!

One (ugly) way around this issue is to make use of the ECS instance IP address. Looking at the diagram above, hello should be configured to talk to world via http://10.100.100.12:4141/world, where 4141 is the port the outgoing linkerd router is listening on. The ECS instance IP address can be found by running curl -s 169.254.169.254/latest/meta-data/local-ipv4 (see Retrieving Instance Metadata). You need a way to inject values into your app configuration at runtime (when the container starts up), as only then will you know the IP address of the ECS instance. This can be pretty inconvenient, depending on your setup.
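For apps that can run a little code at startup, the lookup can be sketched in Python as follows. This assumes the same unauthenticated metadata request as the curl command above still works on your instances (instances that enforce IMDSv2 need a session token first, which is not handled here). The function names are hypothetical.

```python
from urllib.request import urlopen

METADATA_URL = "http://169.254.169.254/latest/meta-data/local-ipv4"


def fetch_instance_ip(timeout=2.0):
    """Ask the EC2 metadata service for this instance's private IPv4 address."""
    with urlopen(METADATA_URL, timeout=timeout) as response:
        return response.read().decode("ascii").strip()


def outgoing_url(instance_ip, service, port=4141):
    """Build the URL an app uses to reach a service via the outgoing router."""
    return f"http://{instance_ip}:{port}/{service}"


if __name__ == "__main__":
    # Only works on an EC2 instance; elsewhere the request will time out.
    print(outgoing_url(fetch_instance_ip(), "world"))
```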

Similarly, the incoming linkerd router in the recipient ECS instance needs to be configured to talk to the web services on the same ECS instance. This is easily done by injecting the IP address in the linkerd config when it starts up:
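As a minimal sketch of that injection step, assuming the config is kept as a template containing the ${ECS_INSTANCE_IP_ADDRESS} placeholder, Python's string.Template can do the substitution. The YAML fragment is only an illustrative excerpt (where the transformer sits in the full config is not shown), and `render_config` is a hypothetical helper.

```python
from string import Template

# Illustrative excerpt only; a real linkerd config contains much more.
CONFIG_TEMPLATE = """\
transformers:
- kind: io.l5d.specificHost
  host: ${ECS_INSTANCE_IP_ADDRESS}
"""


def render_config(template_text, instance_ip):
    """Replace the placeholder with the instance IP known only at startup."""
    return Template(template_text).substitute(ECS_INSTANCE_IP_ADDRESS=instance_ip)


if __name__ == "__main__":
    print(render_config(CONFIG_TEMPLATE, "10.100.200.98"))
```

`substitute` raises a KeyError if the template contains a placeholder that was not supplied, which is a useful failure mode at container startup.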

linkerd configuration

As mentioned before, two different routers need to be set up as part of the same linkerd configuration, one for outgoing requests and one for incoming.

The minimal outgoing router configuration looks like this:

Here we configure a router listening on port 4141 that uses an io.l5d.port transformer. This transformer takes whatever physical endpoint has been resolved by the namer (i.e. the service discovery system) and changes its port to 4041.

So when hello makes a request to http://10.100.100.12:4141/world to hit the world service (step 1 in the diagram above), the service name will be bound to client name /#/io.l5d.consul/dc1/world and then resolved to physical endpoint 10.100.200.98:39876 by the namer (Consul in this case). The io.l5d.port transformer will transform this to be 10.100.200.98:4041, in order to hit the incoming linkerd router on the destination ECS instance (step 2).
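The effect of the port rewrite can be modelled in a few lines of plain Python. This is a toy illustration of the behaviour described above, not linkerd code; `rewrite_port` is a hypothetical name.

```python
def rewrite_port(endpoints, port):
    """Keep each endpoint's host but replace its port."""
    rewritten = []
    for endpoint in endpoints:
        host, _old_port = endpoint.rsplit(":", 1)
        rewritten.append(f"{host}:{port}")
    return rewritten


if __name__ == "__main__":
    # The endpoint Consul returned for world, redirected to the incoming router.
    print(rewrite_port(["10.100.200.98:39876"], 4041))  # ['10.100.200.98:4041']
```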

Now the incoming linkerd router on the destination ECS instance receives the request and needs to know where to route it (step 3). Let’s have a look at its configuration:

When the request is received, it has an HTTP header attached to it: l5d-dst-service: /out/world. This is how the incoming router knows where to route the request. Thanks to the io.l5d.header identifier, the incoming request is identified as /in/out/world, where the /in prefix comes from the dstPrefix configuration. Given the dtab configuration, this service name is bound to the client name /#/io.l5d.consul/dc1/world. The namer then resolves it to the physical endpoint 10.100.200.98:39876, as in the example in the diagram.

There could (and should) be more than one physical address associated with world, but we want to filter the list down to the service instance running in the same ECS instance. The io.l5d.specificHost transformer does exactly that. Note that the host value will actually be set to 10.100.200.98 once the ${ECS_INSTANCE_IP_ADDRESS} value is injected at startup, as explained above.
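The naming steps in this paragraph can be traced with a small toy model. It is plain Python rather than linkerd code, it performs a single prefix rewrite instead of linkerd's full dtab evaluation, and the dtab rule shown is an assumed example that happens to produce the client name quoted above.

```python
def identify(dst_prefix, header_value):
    """Prepend the router's dstPrefix to the name carried in the header."""
    return dst_prefix.rstrip("/") + header_value


def apply_dtab(dtab, name):
    """Rewrite a name using the last matching (prefix, replacement) rule.

    Later rules take precedence, so the list is scanned in reverse.
    Returns None when no rule matches.
    """
    for prefix, replacement in reversed(dtab):
        if name == prefix or name.startswith(prefix + "/"):
            return replacement + name[len(prefix):]
    return None


# Assumed rule for illustration; the article's real dtab may differ.
INCOMING_DTAB = [("/in/out", "/#/io.l5d.consul/dc1")]

if __name__ == "__main__":
    service_name = identify("/in", "/out/world")
    print(service_name)                             # /in/out/world
    print(apply_dtab(INCOMING_DTAB, service_name))  # /#/io.l5d.consul/dc1/world
```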
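Conceptually, the filtering step amounts to the following. This is a toy model in plain Python, not the transformer's implementation; the second endpoint address and the name `only_host` are made up.

```python
def only_host(endpoints, host):
    """Keep only the endpoints whose host part matches the given host."""
    return [e for e in endpoints if e.rsplit(":", 1)[0] == host]


if __name__ == "__main__":
    world_endpoints = ["10.100.200.98:39876", "10.100.201.14:32770"]
    # On the instance whose IP is 10.100.200.98, only the local copy remains.
    print(only_host(world_endpoints, "10.100.200.98"))  # ['10.100.200.98:39876']
```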


A word of warning

When you deploy a new version of an ECS service, ECS does a rolling deployment to avoid downtime. If the service has a load balancer attached, ECS uses the load balancer health check to verify that the new containers are healthy before killing off the old ones. If it doesn’t (likely in our setup), the deployment considers the new version healthy as long as none of the essential containers in the new ECS tasks exit. But a container that hasn’t exited is not necessarily healthy and ready to accept requests. You have two choices to solve this issue:

  1. Craft something yourself that kills the Docker container when it stops responding on a health check endpoint running inside your web service.
  2. Attach a load balancer to the service even if it is not otherwise used, just to ensure no-downtime deployments. Note that there is an associated cost to doing this.
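For the first option, one possible shape for such a watchdog is sketched below. It is only an outline under stated assumptions: the health URL, container name, interval and failure threshold are all placeholders, and it assumes the script runs somewhere with access to the Docker CLI.

```python
import subprocess
import time
from urllib.request import urlopen


def is_healthy(url, timeout=2.0):
    """Return True if the health endpoint answers with a 2xx status."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers connection errors, timeouts and HTTP error statuses
        return False


def should_kill(results, threshold=3):
    """True when the most recent `threshold` checks all failed."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)


def watch(container, url, interval=10.0, threshold=3):
    """Poll the endpoint and kill the container after repeated failures."""
    results = []
    while True:
        results.append(is_healthy(url))
        results = results[-threshold:]  # only the latest checks matter
        if should_kill(results, threshold):
            subprocess.run(["docker", "kill", container], check=True)
            return
        time.sleep(interval)
```

Requiring several consecutive failures avoids killing a container over a single slow response. Once an essential container is killed, the task stops and ECS can start a replacement.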

In the meantime, AWS is working on a solution for this problem (see issue here).


Conclusion

AWS ECS lacks features, and their absence makes the advanced linkerd configuration quite difficult. It also lags behind on some substantial Docker features. This is all quite frustrating, especially given that ECS is not open source, so all we can do is wait for AWS to build the features needed. The good news is that the linkerd team is currently having long conversations with AWS to see what can be done to make the integration easier, and this blog article will hopefully feed the discussion.

Other than that, I’m really pleased with linkerd and especially with the linkerd community. The Slack channel is very active and questions get answered quickly. It is an open source tool and pull requests are reviewed, merged and released just as quickly. For example, I made a pull request to add the io.l5d.specificHost transformer and it got merged in less than 5 days, and released in 11 days.

I hope you found this article helpful and if you have any questions you will find me in the linkerd Slack channel. Happy coding!