Orchestrating long-running ECS tasks
It’s well documented that there are many different ways to run containers on AWS, each with its own pros and cons. However, ECS on EC2 or Fargate, and Lambda, tend to be the most favoured for their flexibility, control and ease of use. When you want to run a long-running task, ECS on Fargate may be the initial go-to option because of Lambda’s 15-minute limit. But what if you want to set a ‘max’ time, or a timeout, on your task? Tasks on ECS don’t have the ability to self-terminate after a period of time. In this post I’ll look at how we can use Step Functions to orchestrate these long-running tasks.
But first, why would you want to do this? Most processes let you set a timeout internally, but unfortunately, when dealing with third-party binaries or third-party services we don’t always have that option. Because everything fails all the time, these binaries may fail in unexpected or unknown ways, occasionally getting stuck in their own loop or failing to exit gracefully. In these scenarios it can be helpful to have something managing the running of these container tasks.
This is where Step Functions can help. One of the many features of Step Functions is the ability to stop and wait for a command telling them to continue. This is done with a callback implemented using a Task Token. We can use Step Functions to start an ECS task and then wait for that task to finish. When the ECS task has completed, it calls back to Step Functions with the Task Token and the Step Function resumes. This lets you go on to implement a ‘success’ action, which might be something like publishing an SNS notification or writing to a DynamoDB table. So how can we use this to implement a maximum runtime for an ECS task?
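Before we answer that, here’s a minimal sketch of the basic callback pattern, using boto3 to create a state machine whose only task state starts the ECS task with the `waitForTaskToken` integration and injects the Task Token into the container as an environment variable. The cluster, task definition, container name, subnet, topic and role ARNs are all placeholders and would need tailoring to your own setup.

```python
import json

import boto3

# A minimal sketch of a state machine that runs an ECS task and waits for a
# callback. All names and ARNs below are placeholders.
definition = {
    "StartAt": "RunLongTask",
    "States": {
        "RunLongTask": {
            "Type": "Task",
            # The .waitForTaskToken suffix pauses this state until the task
            # calls SendTaskSuccess or SendTaskFailure with $$.Task.Token.
            "Resource": "arn:aws:states:::ecs:runTask.waitForTaskToken",
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "my-cluster",
                "TaskDefinition": "my-long-running-task",
                "NetworkConfiguration": {
                    "AwsvpcConfiguration": {"Subnets": ["subnet-0123456789"]}
                },
                "Overrides": {
                    "ContainerOverrides": [
                        {
                            "Name": "worker",
                            # Hand the Task Token to the container so it can
                            # call back when the work is done.
                            "Environment": [
                                {"Name": "TASK_TOKEN", "Value.$": "$$.Task.Token"}
                            ],
                        }
                    ]
                },
            },
            "Next": "NotifySuccess",
        },
        "NotifySuccess": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:eu-west-1:111111111111:task-complete",
                "Message": "Long-running ECS task completed",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="long-running-ecs-task",
    roleArn="arn:aws:iam::111111111111:role/my-step-functions-role",
    definition=json.dumps(definition),
)
```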
A Step Function that is waiting for a Task Token supports receiving that token as part of a `SendTaskSuccess`, `SendTaskFailure` or `SendTaskHeartbeat` API call. The step of the workflow that is waiting can be configured to wait any given amount of time to receive one of these calls, and it’s this time window that lets us force kill an ECS task after a period of time. It requires a small amount of configuration inside the ECS task and some error handling in the Step Function, which is something they’re great at. When we start our ECS task we can override the container configuration with an environment variable containing the Task Token. Our task can then use this to emit a heartbeat at a regular interval, or to send the success or failure calls. Every time the Step Function receives a heartbeat it knows it’s safe to keep waiting for the task to complete, and it resets that step’s timeout. If the Step Function doesn’t receive a heartbeat within the window, it raises an error that we can catch and handle ourselves. In our case, the lack of a heartbeat might indicate that our binary has failed, or that the process has got into a state it cannot recover from. We can then kill the ECS task, achieving the timeout we’re after.
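Inside the container, the wrapper around the third-party binary might look something like the sketch below. It assumes the Task Token was injected as a `TASK_TOKEN` environment variable (as in the state machine above) and that the corresponding task state sets a `HeartbeatSeconds` value; the binary path and the 60-second heartbeat interval are placeholders.

```python
import os
import subprocess

import boto3

sfn = boto3.client("stepfunctions")
task_token = os.environ["TASK_TOKEN"]  # injected via the container override

# Start the third-party binary we have no control over (placeholder path).
process = subprocess.Popen(["/opt/bin/long-running-binary"])

# While it runs, keep telling Step Functions we're still alive. If these
# heartbeats stop, the state's HeartbeatSeconds window expires and Step
# Functions raises a States.Timeout error that the workflow can catch.
while True:
    try:
        process.wait(timeout=60)  # heartbeat interval; keep well under HeartbeatSeconds
        break
    except subprocess.TimeoutExpired:
        sfn.send_task_heartbeat(taskToken=task_token)

if process.returncode == 0:
    sfn.send_task_success(taskToken=task_token, output='{"status": "done"}')
else:
    sfn.send_task_failure(
        taskToken=task_token,
        error="BinaryFailed",
        cause=f"exit code {process.returncode}",
    )
```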
When Step Functions integrates with other AWS services, the step that invokes the AWS API can usually access the HTTP response from that API call. In the ECS example, the `RunTask` API would normally return the ARN of the ECS Task that has started. However, when we use the asynchronous `waitForTaskToken` approach rather than the `sync` approach, that ARN isn’t available in the Step Functions task output, so we can’t capture it at the point the Step Function calls `RunTask`. Instead, in the event of a failure or timeout, we need to query the execution’s history events for each step. These include the `TaskStateEntered`, `TaskScheduled`, `TaskSubmitted`, `TaskStarted` and `TaskSucceeded` events, each of which contains its own unique bits of detail.
To get the ECS Task ARN we query the Step Functions API using the `GetExecutionHistory` action, which returns all the events that have occurred within the current execution. By parsing the response we can find the `TaskSubmitted` event and retrieve the ECS Task ARN from it. Once we have the ARN we can call the ECS `StopTask` API. Unfortunately, we need to do this in a Lambda Function: Step Functions’ JSONPath querying can only filter on a single selector, and in the `GetExecutionHistory` response we need to filter on both `type == TaskSubmitted` and `TaskSubmittedEventDetails.ResourceType == ecs`.
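Pulled together, that cleanup Lambda might look roughly like the sketch below. It assumes the execution ARN is passed in by the state that handles the heartbeat timeout (for example via `$$.Execution.Id`), and the cluster name is a placeholder; treat the parsing of the `TaskSubmitted` event output as an illustration rather than a drop-in implementation.

```python
import json

import boto3

sfn = boto3.client("stepfunctions")
ecs = boto3.client("ecs")

CLUSTER = "my-cluster"  # placeholder; could also be passed in via the event


def handler(event, context):
    # The execution ARN is assumed to be passed in by the Catch state,
    # e.g. with "ExecutionArn.$": "$$.Execution.Id" in its Parameters.
    execution_arn = event["ExecutionArn"]

    # Walk the execution history looking for the TaskSubmitted event that
    # came from the ECS integration; its output holds the RunTask response.
    paginator = sfn.get_paginator("get_execution_history")
    for page in paginator.paginate(executionArn=execution_arn):
        for history_event in page["events"]:
            if history_event["type"] != "TaskSubmitted":
                continue
            details = history_event["taskSubmittedEventDetails"]
            if details.get("resourceType") != "ecs":
                continue

            # The RunTask response is serialised as JSON in the event output.
            output = json.loads(details["output"])
            task_arn = output["Tasks"][0]["TaskArn"]

            # Force-stop the long-running task to enforce our timeout.
            ecs.stop_task(
                cluster=CLUSTER,
                task=task_arn,
                reason="Stopped by Step Functions heartbeat timeout",
            )
            return {"stoppedTask": task_arn}

    return {"stoppedTask": None}
```

Wiring something like this Lambda up as the `Catch` target for the heartbeat timeout error on the run-task state closes the loop: a missed heartbeat triggers the catch, the Lambda finds the Task ARN in the execution history and stops the task, and the timeout is enforced.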
It is this extra bit of error handling in the Step Function that allows us to set a timeout on any ECS Task we want to spawn, despite the ECS service itself not supporting timeouts on tasks. By leveraging Step Functions we gain more control over our ECS Tasks and can use their native error handling to kill any ECS Task that has run for too long. AWS Lambda enforces a strict 15-minute timeout on every function it runs, and here we’ve looked at how to add a similar limit to ECS Tasks, managed by Step Functions. This helps us keep control over how many ECS Tasks we have running and ensures we’re not paying for compute time we don’t need.