Monitoring the Health of ECS Service Deployments

Aaron Kaz Kaczmarek
May 1, 2019

The prominent feature of any continuous DevOps toolchain, of course, is automation. Between Amazon Web Services’ ECS offering and their suite of CI/CD products (such as CodePipeline and CodeBuild) you can set up an automated, containerized application deployment pipeline from commit to deploy virtually out-of-the-box with minimal configuration. However, while ECS does a great job on its own of managing the state of container tasks, calling health checks, registering and deregistering from load balancers, and so on, there is a weak link in the chain that will inevitably require special attention — deployment health monitoring. Here I will discuss strategies for monitoring ECS deployments and hooking in behaviors such as automated rollbacks in the case of failures.

A Brief Overview of ECS and Service Deployments

In ECS, we have three main components: clusters, services, and tasks. For simplicity, we can think of clusters as the physical server instances where containers will be hosted, services as the applications we want to host, and tasks as the instructions for starting containers and networking them. Just as container builds are, ideally, version tagged, task definitions are versioned in ECS so that snapshots of the application (container builds) correspond to specific task definition revisions. An ECS service is then associated with exactly one task definition revision, and a deployment is created by updating the service with a new one. In a typical blue/green deployment, ECS will first attempt to start the desired tasks for the new deployment on the associated cluster. Once fulfilled, it will then deregister the previous deployment’s tasks from the service-attached load balancer (if applicable) and stop them. Here’s an example flow summary starting from a new task and service, followed by a CLI sketch of the deployment step.

  1. Create cluster my-cluster
  2. Create Docker image tagged my-app:v1
  3. Create Task Definition my-app-task:1 pointing to image my-app:v1
  4. Create Service my-app-service with task definition my-app-task:1 and a desired count of 1 task
  5. [ECS downloads and starts my-app:v1 container on the cluster]
  6. Create Docker image tagged my-app:v2
  7. Revise the Task Definition to point to image my-app:v2, creating new task definition revision my-app-task:2
  8. Update service my-app-service with task definition my-app-task:2
  9. [ECS downloads and starts my-app:v2 container on the cluster]
  10. [ECS stops my-app:v1 container on the cluster]
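To make the flow concrete, steps 7 and 8 roughly translate to two CLI calls like the sketch below. The names match the flow above; the container definition JSON is abbreviated and its values are placeholders of mine, not a complete task definition.

# Register a new task definition revision pointing at my-app:v2
$ aws ecs register-task-definition \
--family my-app-task \
--container-definitions '[{"name": "app", "image": "my-app:v2", "memory": 244, "essential": true}]'

# Point the service at the new revision; this is what creates the deployment
$ aws ecs update-service --cluster my-cluster \
--service my-app-service --task-definition my-app-task:2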

Why We Need Deployment Monitoring

The big problem, or gotcha, with ECS is that it assumes the best conditions. Without human intervention, a bad deployment may attempt to place tasks (start containers on the cluster) indefinitely without ever succeeding. Under certain circumstances, a failing deployment can take down an entire cluster instance, which might be disastrous for a production cluster. On the other hand, we may also have business-critical functions that need to be called once a deployment has succeeded. In any case, ECS doesn’t provide an explicit indicator or callback action by default, though it does give us enough information to infer the state of a deployment. Thus we must take extra steps ourselves, using the tools AWS gives us, to positively assert a deployment success or failure and execute the appropriate follow-up actions.

There is No Way but to Poll

Once we issue a deployment by way of updating a service, there is no way to determine its status other than to continually poll for updates until we have enough information to make a positive assertion. A pure shell script solution, ecs-deploy, can be a good place to start to understand this full strategy. For our own solution, we’ll start by retrieving current information about our service from the AWS ECS DescribeServices API. If we query via the CLI right after issuing a new deployment, we’ll see output like this:

$ aws ecs describe-services --cluster my-cluster --services my-app-service
{
    "services": [
        {
            "serviceArn": "arn:aws:ecs:us-east-1:<aws-id>:service/my-app-service",
            "serviceName": "my-app-service",
            "clusterArn": "arn:aws:ecs:us-east-1:<aws-id>:cluster/my-cluster",
            "loadBalancers": [],
            "status": "ACTIVE",
            "desiredCount": 1,
            "runningCount": 1,
            "pendingCount": 0,
            "launchType": "EC2",
            "taskDefinition": "arn:aws:ecs:us-east-1:<aws-id>:task-definition/my-app-service:2",
            "deploymentConfiguration": {
                "maximumPercent": 200,
                "minimumHealthyPercent": 100
            },
            "deployments": [
                {
                    "id": "ecs-svc/<deployment-id>",
                    "status": "PRIMARY",
                    "taskDefinition": "arn:aws:ecs:us-east-1:<aws-id>:task-definition/my-app-service:2",
                    "desiredCount": 0,
                    "pendingCount": 0,
                    "runningCount": 1,
                    "createdAt": <timestamp>,
                    "updatedAt": <timestamp>,
                    "launchType": "EC2"
                },
                {
                    "id": "ecs-svc/<deployment-id>",
                    "status": "ACTIVE",
                    "taskDefinition": "arn:aws:ecs:us-east-1:<aws-id>:task-definition/my-app-service:1",
                    "desiredCount": 1,
                    "pendingCount": 0,
                    "runningCount": 1,
                    "createdAt": <timestamp>,
                    "updatedAt": <timestamp>,
                    "launchType": "EC2"
                }
            ],
            "events": [
                {
                    "id": "<UUID>",
                    "createdAt": <timestamp>,
                    "message": "(service my-app-service) has reached a steady state."
                },
                {
                    "id": "<UUID>",
                    "createdAt": <timestamp>,
                    "message": "(service my-app-service) has started 1 tasks: task <UUID>."
                }
            ],
            "createdAt": <timestamp>,
            "placementConstraints": [],
            "placementStrategy": []
        }
    ],
    "failures": []
}

Of particular importance to us is the $service.deployments property. This array is always ordered descending by the createdAt timestamp, so we can safely assume that the first object (status PRIMARY) is the most recent deployment and the last object (status ACTIVE) is the last stable deployment. With this information, we can start the process of resolving the state of our latest deployment. Before we can assert a success, though, we must first rule out a bad deployment.
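As a quick sketch, the CLI’s built-in JMESPath filtering can trim the response down to just the deployment fields we care about (the --query expression here is my own; adjust to taste):

$ aws ecs describe-services --cluster my-cluster --services my-app-service \
--query 'services[0].deployments[*].{id: id, status: status, desired: desiredCount, pending: pendingCount, running: runningCount, createdAt: createdAt}' \
--output table

The PRIMARY row is the deployment we are monitoring; the ACTIVE row is the revision we would fall back to.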

Identifying a Failing Deployment

There are a small handful of causes for deployment failure:

  1. The container entrypoint/command process exits suddenly and unexpectedly — we could call this a build bug.
  2. A container health check (configured in the task definition; see the example after this list) fails after startup.
  3. If attached, the container fails the configured target group health check for its respective load balancer.
  4. There are not enough resources available (CPU and/or memory) on any of the cluster instances to place the desired task.
  5. Lastly, internal AWS errors or outages can occur, although quite infrequently.
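For reference, the container health check mentioned in cause 2 lives in the container definition of the task definition. A minimal sketch of what that block might look like (the curl command and timing values are placeholders of mine, not taken from this article’s task definition):

"healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost/ || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 10
}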

The most obvious place to look for failures is in the service description at the $service.events key. ECS logs all task activity there as event messages, sorted most recent first, and they are wonderfully clear and unambiguous.

{
    "id": "<event-UUID>",
    "createdAt": <timestamp>,
    "message": "(service <service-name>) (task <task-UUID>) failed container health checks."
},
{
    "id": "<event-UUID>",
    "createdAt": <timestamp>,
    "message": "(service <service-name>) has started 1 tasks: task <task-UUID>."
}

Taking the PRIMARY deployment’s createdAt timestamp as the starting point, we can filter for all event messages logged afterward and, using regular expressions, come up with stats about our deployment’s progress. There is a full solution using this strategy outlined on the AWS Blog. Tempting as it is to stop here, there is a big gotcha that makes this strategy not ideal for us — unexpected container exits are not logged in the events array. If we had a bug in our build that prevented a successful start, we would see output like this:

{
    "id": "<event-UUID>",
    "createdAt": <timestamp>,
    "message": "(service <service-name>) has started 1 tasks: task <task-UUID>."
},
{
    "id": "<event-UUID>",
    "createdAt": <timestamp>,
    "message": "(service <service-name>) has started 1 tasks: task <task-UUID>."
},
{
    "id": "<event-UUID>",
    "createdAt": <timestamp>,
    "message": "(service <service-name>) has started 1 tasks: task <task-UUID>."
},
{
    "id": "<event-UUID>",
    "createdAt": <timestamp>,
    "message": "(service <service-name>) has started 1 tasks: task <task-UUID>."
},
{
    "id": "<event-UUID>",
    "createdAt": <timestamp>,
    "message": "(service <service-name>) has started 1 tasks: task <task-UUID>."
}

And on and on…

With no other instruction or intervention, ECS will continually retry placing and starting the buggy container within seconds of each failure, blowing up the events log with “has started 1 tasks” messages. Given a low desiredCount, we might confidently assume deployment failure from this information. However, there is still a possibility of intermittent internal errors that can quickly clear up on their own within the span of a deployment (I’ve seen it happen). Scaled to a higher task count, I think that drawing conclusions from event data alone is fuzzy and requires more statistical analysis than I want to invest in. Rather than make assumptions, though, we can go straight to the tasks to find out exactly what is going on.

Taking the $deployment.id value, we query the ListTasks API for all STOPPED tasks associated with the deployment.

$ aws ecs list-tasks --cluster <cluster-name> \
--started-by <deployment-id> --desired-status STOPPED

If tasks are failing for any reason, we’ll get output like this:

{
    "taskArns": [
        "arn:aws:ecs:us-east-1:<aws-id>:task/<task-uuid>",
        "arn:aws:ecs:us-east-1:<aws-id>:task/<task-uuid>",
        "arn:aws:ecs:us-east-1:<aws-id>:task/<task-uuid>",
        "arn:aws:ecs:us-east-1:<aws-id>:task/<task-uuid>"
    ]
}

Finally, we call DescribeTasks to get the full output and reasons for the stopped tasks:

$ aws ecs describe-tasks --cluster <cluster-name> \
--tasks <task-arn> <task-arn> <task-arn> <task-arn>

{
    "tasks": [
        {
            "taskArn": "arn:aws:ecs:us-east-1:<aws-id>:task/<task-uuid>",
            "clusterArn": "arn:aws:ecs:us-east-1:<aws-id>:cluster/my-cluster",
            "taskDefinitionArn": "arn:aws:ecs:us-east-1:<aws-id>:task-definition/my-app:2",
            "containerInstanceArn": "arn:aws:ecs:us-east-1:<aws-id>:container-instance/<uuid>",
            "overrides": {
                "containerOverrides": [
                    {
                        "name": "app"
                    }
                ]
            },
            "lastStatus": "STOPPED",
            "desiredStatus": "STOPPED",
            "cpu": "128",
            "memory": "244",
            "containers": [
                {
                    "containerArn": "arn:aws:ecs:us-east-1:<aws-id>:container/<container-uuid>",
                    "taskArn": "arn:aws:ecs:us-east-1:<aws-id>:task/<task-uuid>",
                    "name": "app",
                    "lastStatus": "STOPPED",
                    "exitCode": 1,
                    "networkBindings": [
                        {
                            "bindIP": "0.0.0.0",
                            "containerPort": 80,
                            "hostPort": 32825,
                            "protocol": "tcp"
                        }
                    ],
                    "networkInterfaces": []
                }
            ],
            "startedBy": "ecs-svc/<deployment-id>",
            "version": 3,
            "stoppedReason": "Essential container in task exited",
            "connectivity": "CONNECTED",
            "connectivityAt": 1555612613.103,
            "pullStartedAt": 1555612614.208,
            "pullStoppedAt": 1555612614.208,
            "executionStoppedAt": 1555612618.0,
            "createdAt": 1555612613.103,
            "startedAt": 1555612615.208,
            "stoppingAt": 1555612618.879,
            "stoppedAt": 1555612618.879,
            "group": "service:my-app-service",
            "launchType": "EC2",
            "attachments": []
        }
    ],
    "failures": []
}

Examining the stoppedReason property, we get a concrete answer to our inquiry (“Essential container in task exited”) and can safely stop here. Now we can easily determine what kind of deployment failure we have and act accordingly.
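Pulled together, the failure check amounts to those two calls plus a little filtering. A minimal sketch, assuming the cluster name and deployment id are already in shell variables (the variable names are mine, and a real script would add pagination and error handling):

# List tasks started by this deployment that have already stopped
STOPPED_TASKS=$(aws ecs list-tasks --cluster "$CLUSTER" \
--started-by "$DEPLOYMENT_ID" --desired-status STOPPED \
--query 'taskArns' --output text)

# If any exist, pull their stop reasons and container exit codes
if [ -n "$STOPPED_TASKS" ]; then
    aws ecs describe-tasks --cluster "$CLUSTER" --tasks $STOPPED_TASKS \
    --query 'tasks[*].{task: taskArn, reason: stoppedReason, exitCodes: containers[*].exitCode}'
fi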

Identifying Successful Deployments

Once we rule out the possibility of failure, it is really quite easy to determine a deployment success. We simply poll our service and wait for the deployment runningCount to equal its respective desiredCount. The one gotcha is that a task can very briefly show as RUNNING before an exit signal or health check stops it. If we look to the count numbers too eagerly to show success, we may get a false positive. Fortunately, if tasks are going to fail, they will do it quickly. So, as long as we leave a time buffer between updating the service and checking its status, being sure to rule out failure first, we should be safe with our success assertion.
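Putting the failure check and the success check together, the polling loop might look something like this sketch. The variable names and timing values are mine; CLUSTER, SERVICE, and DEPLOYMENT_ID are assumed to be set, and a production script would also enforce an overall timeout.

# Poll until the new deployment is running at full strength,
# bailing out early if any of its tasks have stopped
while true; do
    sleep 15  # time buffer so short-lived task failures have a chance to surface

    STOPPED=$(aws ecs list-tasks --cluster "$CLUSTER" \
        --started-by "$DEPLOYMENT_ID" --desired-status STOPPED \
        --query 'length(taskArns)')
    if [ "$STOPPED" -gt 0 ]; then
        echo "Deployment $DEPLOYMENT_ID has stopped tasks; treating as failed" >&2
        exit 1
    fi

    DESIRED=$(aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
        --query 'services[0].desiredCount')
    RUNNING=$(aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
        --query 'services[0].deployments[?status==`PRIMARY`] | [0].runningCount')
    if [ "$RUNNING" -ge "$DESIRED" ]; then
        echo "Deployment $DEPLOYMENT_ID reached its desired count"
        break
    fi
done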

Worst-case Scenario

Unfortunately, there is no way for us to cancel a deployment once it has been issued to ECS. This means we’re compelled to take some action to correct a potential deployment problem once we’re aware it exists (implying we should, at minimum, send notifications). Typically, ECS will retry indefinitely to fulfill a deployment until a new order is given by way of updating the service with a stable task definition — in which case it will cancel the previous unfulfilled deployment. In the case of not having enough resources, the simple remedy is to add instances to the cluster. As long as we’ve configured the service’s deployment configuration with a sensible minimumHealthyPercent, there is usually enough tolerance for us to find a solution while the application remains live and stable. There is one set of circumstances, however, where there is no such grace.

[Read this portion of the AWS Docs for more details on the ECS service scheduler]
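That “new order” is also the basis of an automated rollback: take the task definition from the last stable (ACTIVE) deployment and update the service with it. A minimal sketch, again with variable names of my own, assuming a previous stable deployment exists:

# Grab the task definition ARN of the last stable deployment (status ACTIVE)
STABLE_TASK_DEF=$(aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
    --query 'services[0].deployments[?status==`ACTIVE`] | [0].taskDefinition' --output text)

# Point the service back at it, which cancels the failing deployment
aws ecs update-service --cluster "$CLUSTER" --service "$SERVICE" \
    --task-definition "$STABLE_TASK_DEF"

From here, the same polling logic applies to the rollback deployment itself.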

The worst-case scenario for a deployment failure is the host instance running out of disk space. Once this happens, it is often the case that the entire instance and all of its running containers will become unhealthy and lead to service outages. If the remaining resources in the cluster are not sufficient to carry the extra burden created by those outages, the situation can go from bad to worse in a domino effect without immediate intervention. Of course, this is the primary reason we should want automated deployment rollbacks in the first place, but it also means we should consider automated strategies that prevent such conditions from ever arising and correct them when they do.

Deployment Monitoring with AWS State Machine

Conclusion

Deployment monitoring has implications beyond handling errors exclusively. As we utilize ECS for deploying microservices, creating service meshes, and managing internal load balancers, we need ever more fine-grained control and transparency between ECS task and deployment events and our own service discovery solutions. Though many turnkey, drop-in tools are available (including Amazon’s own App Mesh service, now ready for production use), the benefits of configurable black-box solutions may diminish in favor of direct control via the AWS API. In particular, the External Controller deployment type may provide a far better experience for full deployment control and monitoring. In any case, the key takeaway is this: we cannot rely on ECS alone to take care of all our needs in a real-life production environment. Using all the tools available to us in AWS services and API SDKs, we must creatively engineer our own solutions to maintain the health and proper functioning of our container infrastructure.
