We recently ran into an issue where ECS was not placing new tasks on EC2 instances that had been part of the ECS service all along. In this post, I’ll go over how we discovered and subsequently resolved the issue.
This post is broken down into the following sections:
- Prior knowledge assumed
- How we discovered there was an issue
- Commands we used to narrow down the issue
- The steps we took to resolve it
Prior knowledge assumed
This post assumes some basic knowledge of:
- AWS ECS (Elastic Container Service)
- AWS ECS terminology such as Services, Clusters, and Tasks
- The fact that AWS ECS can run tasks with either of 2 launch types: EC2 or Fargate
- AWS IAM
- AWS EC2
- The AWS Console — you’ll see me mention “Terraform”, but for the purposes of this post, if you’re not familiar with Terraform, think of everything we do via Terraform as performing actions on the AWS console UI
How we discovered there was an issue
“Oh look, a thread I can pull on”
Our team at QuickFrame was knee-deep in the process of migrating one of our backend API infrastructures entirely onto Terraform. As part of that migration, several ECS Tasks and related IAM Roles, Security Groups, etc. were being updated after being brought under Terraform management.
We implemented new Security Groups, related IAM roles and permissions, as well as ECS service and task definitions. After a successful `terraform apply` (basically, after deploying these updates on AWS), we tried to delete an old, now-invalid Security Group.
When we tried to delete the Security Group via the console, the console said we weren’t allowed to delete it, as it was still attached to an ENI (Elastic Network Interface). That seemed weird: since we had updated all our infrastructure, there should not have been any ENI still carrying the older Security Groups. Ideally, every ENI ought to have been using the newer Security Groups.
“I’m in the middle of the thread I am pulling”
So we decided to pull on that thread further and see where that ENI came from. For the purposes of this article, think of an ENI as a virtual ethernet cable that someone decided to plug into our EC2 box. The description field for the ENI (on the AWS console) showed us that it had been attached by ECS. Pulling on that (virtual) cable, we realized that our API’s ECS Task was actively using that ENI attachment (the other end of our virtual ethernet cable).
Fascinating… since we had updated the ECS Service to use the new security group! Why, then, would it use an ENI that used the old security group? That is when the lead on this project noticed that the ECS Task definition revisions didn’t line up: the ECS Service was supposed to be one revision ahead of what was actually running. Furthermore, there were supposed to be 2 tasks in total, one on each of 2 EC2 instances. Instead, we had only 1 task running, one revision behind, on 1 EC2 instance. The other instance showed as “Active”, but had 0 tasks running.
“I need to disentangle this knot of threads!!”
The question we were all asking was: what is going on? We had done a massive `terraform apply` on production infrastructure, and everything “seemed to be working” from an end user’s perspective. And yet, we were running on just 1 instance, which was on an older revision. I’m glad our code was backwards compatible and we also had not done any major deployments!
At this point, our knowledge of ECS helped. ECS, like any sensible orchestration layer, rolls out a new Task (or node, for the purposes of this article) gradually: only after one instance has reached a stable state does it move on to updating the other instance.
Since one of our instances was not running a task, and the other instance was running an older task, we realized that ECS had likely started a rolling update, failed to place the new Task on the first instance, and never progressed any further.
So now we had it: the reason we were unable to delete the security group was that an ENI using it was still attached. The ENI was still attached because it belonged to an older ECS Task definition revision, likely because ECS hadn’t been able to launch the new Task definition.
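A quick way to see this kind of mismatch without clicking around the console is to compare the Task definition the service *wants* against what is actually running. This is a sketch; the cluster and service names here are placeholders:

```shell
# Which Task definition revision does the service want,
# and how do desired vs. running counts compare?
aws ecs describe-services \
  --cluster my-cluster --services my-api-service \
  --query 'services[0].[taskDefinition,desiredCount,runningCount]'

# List the tasks actually running for the service; describing them
# shows which revision each one is really on
aws ecs list-tasks --cluster my-cluster --service-name my-api-service
```

In our case, this would have shown a desired count of 2, a running count of 1, and a running task one revision behind the service’s configured Task definition.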
So why was ECS unable to launch the new Task definition on EC2 instances that had always been part of this ECS Service?
Commands we used to narrow down the issue
Perfect — time for some detective work. ECS communicates with EC2 instances via an ECS Agent. An ECS Agent is a piece of software that runs on each EC2 instance, relays system information to ECS, and executes ECS commands on the instance. First, we decided to see whether the ECS agent was working as expected.
SSH-ing into EC2 instances
To do that, we SSH-ed into the system. If you are set up for Systems Manager, you can connect via Systems Manager. If you are still using a jump host to tunnel into EC2 instances in a private VPC, use the following commands.
Sidebar: IN THE RARE EVENT that your EC2 instances are sitting in a public subnet, you aren’t sure why that is bad, and you did NOT put them there for a very specific, well-formulated business reason: please stop reading this article, do what it takes to move your EC2s to a private subnet, and read up on public vs. private subnets, security best practices, and Systems Manager.
To add keys to your local ssh agent:

```shell
ssh-add -K path/to/your/key
```
To ssh into the jump host with the intent of further ssh-ing into the EC2 instance, use:

```shell
ssh -A firstname.lastname@example.org
```

The `-A` option enables agent forwarding: it forwards the authentication keys stored by your local `ssh-agent` to the remote host. You can find more information in the ssh man pages or other articles.
Next, tunnel from the jump host into the actual EC2 instance. You can get the private IP of the instance from the AWS console.
Now you’re in the instance!
So, is the ECS agent working?
To check on the ECS agent for Linux ECS-optimized AMIs, you can use:

```shell
sudo systemctl status ecs
```
When we ran the command, the little `active (running)` message in “happy green” told me that everything was fine and dandy. However, we were still facing an issue, and the detailed log output was truncated, as the trailing ellipses showed. Owing to that, we used the `-l` flag to see the full, untruncated output:

```shell
sudo systemctl status ecs -l
```
Ah-ha — so it was not “all that happy green” after all! The log showed the ECS Agent failing to start and retrying. I ran the command again after a few seconds, and the retry delay had grown; it seems the ECS Agent does an incremental backoff between retries.
Great! So something is up with the ECS agent. This was good, since it lined up with the “symptoms” we were experiencing on the ECS console (an older Task revision being present and fewer Tasks running).
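As a side note, the ECS agent also exposes a local introspection API that is handy for this kind of check. This is a sketch, assuming the agent’s default introspection port (51678) and that you are running the commands from on the instance itself:

```shell
# Ask the agent which cluster it thinks it belongs to,
# and which tasks it currently knows about
curl -s http://localhost:51678/v1/metadata
curl -s http://localhost:51678/v1/tasks
```

When the agent is unhealthy, these calls fail or return nothing useful, which is another quick signal alongside `systemctl status`.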
Alright, what permissions does the EC2 instance have?
Next, we decided to look at the permissions for the EC2 instance — was it even allowed to talk to ECS?
To check that, we ran:

```shell
curl http://169.254.169.254/latest/meta-data/iam/info
```
Sidebar: Information about this magical `169.254.169.254` address (the EC2 instance metadata service) is hidden deep in the EC2 documentation. And if you’re like me (startup, many hats, no dedicated devops, etc.), you probably won’t come across it until you have to — that is, when production is on fire and you’re deep in documentation hell.
Anyway, in the result of that `curl`, the `InstanceProfileArn` did not match the IAM role we expected to see there; that role had been updated by our Terraform deployment. In the past, I’ve also had this `curl` command return nothing (or maybe a not-found error, if memory serves me correctly — but that was about a year ago). In any case, that discrepancy made us realize that the EC2 instance was operating with an IAM role that was no longer valid.
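One note on the metadata service: if your instances require IMDSv2 (token-based metadata access), the same lookup needs a session token first. A sketch:

```shell
# Request a short-lived IMDSv2 session token (TTL in seconds)...
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")

# ...then present it when querying the IAM info
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/iam/info
```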
The steps we took to resolve it
Nice. So next, we decided to go to the EC2 console → Select the instance (do not click into the instance row) → Actions → Security → Modify IAM Role:
FYI, you can also do all of these via CLI, Terraform etc.
From the Modify IAM Role screen, I updated the instance to the new IAM role.
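For reference, the CLI equivalent goes through the instance-profile association APIs. This is a sketch; the instance ID, association ID, and profile name are placeholders:

```shell
# Find the existing association between the instance and its current profile
aws ec2 describe-iam-instance-profile-associations \
  --filters Name=instance-id,Values=i-0123456789abcdef0

# Swap the association over to the new instance profile
aws ec2 replace-iam-instance-profile-association \
  --association-id iip-assoc-0123456789abcdef0 \
  --iam-instance-profile Name=my-new-instance-profile
```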
We went back to our SSH session on the instance and tried `curl http://169.254.169.254/latest/meta-data/iam/info` again. And voilà, the `InstanceProfileArn` value was exactly what we wanted it to be.
Next, just to be “extra safe”, I stopped the ECS agent using `sudo systemctl stop ecs` and then restarted it using `sudo systemctl start ecs`. We then ran `sudo systemctl status ecs -l` once more.
This time, there was no more “ECS Agent failed to start, retrying in xyz seconds” message! And the output stayed exactly the same when I ran `sudo systemctl status ecs -l` again after a few seconds.
With that, ECS ought to have been all set. And no surprises there (whew!): by the time we had switched back to the ECS window, the new (and correct) Task definition was already being “Provisioned”.
We repeated the same process on the other EC2 instance, and it was also running the new Task definition within a minute or so. The ENIs now showed the correct security groups, and we were finally able to delete our old, invalid security groups!
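If you’d rather confirm the rollout from the CLI than watch the console, the ECS service waiter is handy. A sketch, with placeholder cluster and service names:

```shell
# Block until the service's running count matches its desired count
# and the latest deployment has settled
aws ecs wait services-stable \
  --cluster my-cluster \
  --services my-api-service
```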
Unless you can write down a really, really good list of reasons why you need the EC2 launch type (a list that does NOT include justifications like “at some point in the future I will need more control”, “it will be better for migrations”, or “it keeps us vendor agnostic”), others on your team also strongly prefer EC2, and you see solid business value in running your own instances: just use Fargate.
In a nutshell, Fargate lets you deploy your tasks without needing to worry about the underlying EC2 or compute layer. This entire blog post, as well as the pain behind it, could have been avoided had we been using Fargate.