Troubleshooting ECS: “Task failed to start”

Vincent Van Gestel
VRT Digital Products
3 min read · Sep 12, 2023
Task failed to start error message

While many different ECS Task failure scenarios are well documented by AWS in posts like this, there is one nondescript error message that can easily throw you for a loop: “Task failed to start”. That’s all the information you’re going to get from ECS. Adding to the confusion, this error can easily occur in situations where the application containers are starting up cleanly and healthily.

So, what was the cause? When working with container dependencies (multiple containers defined within a single Task and linked with a “ContainerDependency”), you can specify the “startTimeout” parameter. This timeout can be fairly confusing if you’re not paying attention. When a container (containerA in the documentation) depends on another container (containerB) being healthy before it starts, this timeout, set on containerB, is used to short-circuit the normal start sequence and give up early (presumably suspecting failure).
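To make the setup concrete, here is a minimal sketch of such a task definition registered with boto3. The container names, images and health check are hypothetical, purely to illustrate where dependsOn, startTimeout and stopTimeout live relative to containerA and containerB:

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical two-container task: "app" (containerA) waits for "db-proxy"
# (containerB) to report HEALTHY before it starts.
ecs.register_task_definition(
    family="example-task",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[
        {
            "name": "db-proxy",                  # containerB in the AWS docs
            "image": "example/db-proxy:latest",
            "essential": True,
            "healthCheck": {
                "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
                "interval": 10,
                "retries": 3,
            },
            # startTimeout is set on the *dependency* (containerB): how long
            # dependents will wait for it to become healthy.
            "startTimeout": 30,
            # stopTimeout: grace period between SIGTERM and SIGKILL (30s default).
            "stopTimeout": 30,
        },
        {
            "name": "app",                       # containerA in the AWS docs
            "image": "example/app:latest",
            "essential": True,
            "dependsOn": [
                {"containerName": "db-proxy", "condition": "HEALTHY"}
            ],
        },
    ],
)
```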

Put differently, once the timeout specified by startTimeout (defined on containerB) has passed and containerB is still not healthy, containerA gives up and transitions to a stopped state, killing the Task. In retrospect, the reason message “Task failed to start” does match the experienced failure of containerA. The lack of any mention of a timeout, however, needlessly obscures this reason for the user, especially because your first instinct will likely be to check exit codes. This is where you can easily dive into a rabbit hole.
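When you do go digging, the stopped-task details are where this reason and the misleading exit codes surface. A small sketch, assuming a hypothetical cluster name and task ARN:

```python
import boto3

ecs = boto3.client("ecs")

# Substitute your own cluster and stopped task ARN.
resp = ecs.describe_tasks(cluster="example-cluster", tasks=["<task-arn>"])

for task in resp["tasks"]:
    # This is where the unhelpful "Task failed to start" shows up.
    print("stoppedReason:", task.get("stoppedReason"))
    print("stopCode:", task.get("stopCode"))
    for container in task["containers"]:
        # Per-container exit codes and reasons; a 137 here is what sends
        # you down the OOM rabbit hole.
        print(container["name"], container.get("exitCode"), container.get("reason"))
```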

Timeline of ECS Container going from healthy to exit code 137 with reason “Task failed to start”

When the task is stopped due to the timeout, a SIGTERM signal is sent to the other containers in the task. The containers then have some time to react to this termination signal before another signal, SIGKILL, goes out to forcefully exit them. This grace period is configured with “stopTimeout” and is 30 seconds by default. Once it has passed, the exit code of the container will be 137 (128 + signal 9), which also happens to be the exit code when a container is killed due to excessive memory usage (OOM). As if that weren’t bad enough, it is possible for the container to become healthy during the stopTimeout window, completely hiding the original cause of termination.
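A container only avoids that 137 if its main process reacts to SIGTERM within stopTimeout. A minimal sketch of an entrypoint that does so (the shutdown work itself is a placeholder):

```python
import signal
import sys
import time

# If the process ignores SIGTERM, ECS follows up with SIGKILL after
# stopTimeout and the container exits with 137 (128 + signal 9), the same
# code you'd see for an OOM kill.

def handle_sigterm(signum, frame):
    # Flush state, close connections, etc., then exit cleanly so the exit
    # code reflects a graceful shutdown rather than a forced kill.
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)

while True:
    time.sleep(1)  # placeholder for the container's real work
```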

As specified in the documentation of startTimeout, once containerA has given up, the Task will end and all subsequent container state transitions no longer matter. If containerA waits for 30 seconds and containerB only becomes healthy after 31 seconds, containerB will still be forcefully killed at 60 seconds, without any apparent reason. Given that containerB’s exit code (137) matches that of an out-of-memory kill, you can easily spend quite some time going through memory usage dashboards, fruitlessly trying to pin down the moment it exceeded its limits.

If you encounter the poorly labeled ECS task failure reason “Task failed to start”, check whether you have container dependencies defined and, if so, whether you have configured an overly tight startTimeout.
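A quick way to check, again as a boto3 sketch with a hypothetical task definition family:

```python
import boto3

ecs = boto3.client("ecs")

# Substitute your own task definition family or ARN.
td = ecs.describe_task_definition(taskDefinition="example-task")["taskDefinition"]

for cd in td["containerDefinitions"]:
    # Any dependsOn entry combined with a short startTimeout is a candidate
    # for the "Task failed to start" behaviour described above.
    print(
        cd["name"],
        "dependsOn:", cd.get("dependsOn"),
        "startTimeout:", cd.get("startTimeout"),
        "stopTimeout:", cd.get("stopTimeout"),
    )
```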
