EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Troubleshooting 502 errors: ECS Checklist

A guide on what to look at if your ECS app is experiencing 502 errors

Drew Todd
Expedia Group Technology


After the start of COVID-19, Expedia’s™️ Conversation Platform experienced higher-than-normal traffic from customers trying to change or cancel their orders. This increased traffic resulted in a number of issues, including more frequent 502 errors. This post is a guide to troubleshooting and remediating the most common causes of 502s, along with suggestions for more in-depth investigations.

Note: The focus of this document is on infrastructure-related causes, specifically for Docker-based ECS apps. It does contain some suggestions for other app types, but the main goal is to eliminate any scaling, sizing, container, or ELB related issues before digging into the code.

What is a 502 error?

From Mozilla:

“The HyperText Transfer Protocol (HTTP) 502 Bad Gateway server error response code indicates that the server, while acting as a gateway or proxy, received an invalid response from the upstream server.”

This means that a connection was made to a server, and that server responded with unexpected or unparseable data. The cause of such an error may be challenging to track down, since it can be infrastructure related, code related, or simply an issue with a web browser. Luckily, there are common culprits that can be checked fairly easily; they are covered in the guide below.

ECS Configuration Checklist

Before digging into troubleshooting, run through this checklist to see if each configuration item has been considered (the checklist items are covered in more detail in the sections below):

  • Does the application have enough reserved CPU and memory?
  • Does the application have a high enough maximum task count?
  • Is the application scaling on appropriate metrics (CPU/memory)?
  • Are health checks optimized to account for application startup time?
  • Does the application have a high enough initial task count?
  • Is gradual traffic cutover enabled during releases by leveraging bleed time options?
  • Are there enough tasks running at the time of release to handle the increased load?
  • For Spring Boot Embedded Tomcat-based applications, is the thread count and min/max heap set properly?
  • Is the application’s keepAlive value higher than the ELB’s timeout value?

When the application is experiencing 502 errors, what steps can be taken to find the cause and remediate the issue?

Go through the following list and check each question to see if the current situation applies.

The format of this section is as follows: A question, a guide on how to answer the question, and a guide on how to implement changes that pertain to that question.

1) Does the application have enough resources to function properly?

How do I answer this question?

  1. Go to the ECS console in AWS
  2. Select the app cluster in which the problematic application is running
  3. Search for the affected service and select it (may require scrolling through the various pages — the AWS UI is poor in this area)
  4. Select the Auto Scaling tab
  5. Compare the “running count” value at the top of the screen to the “Maximum tasks” count under the Autoscaling tab. Are they the same value?
  6. Select the Metrics tab under the ECS service.
  7. Look at the CPU Utilization and Memory Utilization graphs, specifically the “max” and “avg” lines. Are these reaching over 80%?

If the maximum task count is equal to the current running task count AND one of the CPU Utilization or Memory Utilization graphs is very high, it’s likely that the application just needs more capacity.
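
The same numbers can also be pulled from the AWS CLI if the console is slow to navigate. Below is a minimal sketch using placeholder cluster and service names:

# Running vs. desired task count for the service
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].{running:runningCount,desired:desiredCount}'

# Min/max task counts registered with Application Auto Scaling
aws application-autoscaling describe-scalable-targets \
  --service-namespace ecs \
  --resource-ids service/my-cluster/my-service \
  --query 'ScalableTargets[0].{min:MinCapacity,max:MaxCapacity}'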

Possible problem description

The application is experiencing heavy load but isn’t allowed to scale properly as a result of a too-low maximum task count. This results in overloaded tasks and 502 errors.

Possible resolution

  • If the application requires more resources, there are two ways that can be accomplished: either increasing the task count or increasing the reserved CPU/memory.
  • This is a balance and it’s challenging to optimize perfectly.

Choosing whether to increase task count or to increase CPU/memory reservation

The general rule is as follows:

  • If there are many requests and each request takes a small amount of CPU/memory, then increase the task count. The more tasks, the easier the load can be distributed (a CLI sketch for raising the task-count ceiling follows this list).
  • If there are few requests and each request takes a large amount of CPU/memory, then increase the reserved CPU/memory. The more CPU/memory reserved per task, the easier it is for an individual task to complete a single request.
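
If the decision is to run more tasks, the scaling ceiling can be raised by re-registering the scalable target; increasing reserved CPU/memory instead means editing the task definition and redeploying. Below is a sketch with placeholder names and illustrative counts:

aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-cluster/my-service \
  --min-capacity 4 \
  --max-capacity 20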

Validating increased capacity changes

The most effective way to validate any changes to task count or capacity is to deploy the application to a performance-testing environment and simulate the same load that the application would see in production.

Questions to ask:

  • Is the average CPU/memory below 80%?
  • Is the max CPU/memory spiking over 100% consistently?

2) Does the application scale on appropriate metrics?

How do I answer this question?

  1. Go to the ECS console in AWS
  2. Select the app cluster in which the problematic application is running
  3. Search for the affected service and select it (may require scrolling through the various pages — the AWS UI is poor in this area)
  4. Select the Auto Scaling tab
  5. View scaling policies and the metrics they scale off of

Are the policies present here sufficient to handle every scaling action? Do they scale on both CPU and memory when only one of the two should be used?

Possible problem description

  • The application’s autoscaling policy doesn’t include necessary metrics and therefore the application doesn’t scale to meet demand properly, resulting in overloaded tasks and 502 errors.

Possible resolution

  • Modify the scaling policy so that it scales on the correct metric, either CPU or memory, but NOT both (a CLI sketch follows at the end of this section)

Is there any harm in keeping both auto scaling dimensions in the config?

  • Yes, it is potentially harmful because the application’s scaling behavior is no longer deterministic. The application might constantly scale up and down as the two dimensions work against each other; for example, it might scale up due to high memory utilization but immediately scale down due to low CPU utilization.
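
For reference, here is a rough sketch of what scaling on a single metric looks like when created from the CLI: a target-tracking policy on average CPU only, with placeholder names and illustrative values (most teams will manage this in their deployment templates instead):

cat > cpu-policy.json <<'EOF'
{
  "TargetValue": 60.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleOutCooldown": 60,
  "ScaleInCooldown": 300
}
EOF

aws application-autoscaling put-scaling-policy \
  --policy-name my-service-cpu-target-tracking \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my-cluster/my-service \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration file://cpu-policy.json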

3) Does the container have enough time to start up before receiving traffic?

How do I answer this question?

  • One method for determining an application’s startup time involves running cAdvisor locally to view stats about the application’s container.
  • First, run cAdvisor on a local machine with the following command:
docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:rw \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --publish=8081:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:latest
  • Next, open a web browser to localhost:8081. There should be stats available on that page to view.
  • Next, run the affected application locally. Below is an example of a command used to run one of our applications locally.
docker run \
  --name {app name} \
  --rm \
  --cpus=0.5 \
  -e "{environment variable name}={environment variable value}" \
  -v /Users/USERNAME/:/home/webapp/ \
  -p 8443:8443 \
  -p 8080:8080 \
  123456789.dkr.ecr.us-west-2.amazonaws.com/{app image name}:{version}

Keeping CPU consistent in local test

Make sure to take into account the --cpus parameter. 1 CPU in Docker corresponds to 1024 CPU units in AWS. Set the --cpus value to match the CPU provisioned for the ECS task so that the local run is as similar as possible to the deployed one. For example, a task that reserves 512 CPU units should be run locally with --cpus=0.5.

  • Once it’s up and running and (hopefully) not throwing errors, open up a web browser to localhost:8081 again
  • Select “Docker Containers”
  • Select the docker container corresponding to the application in question and click on it to drill into it
  • Scroll down the page and find the “Total Usage” graph under the “CPU” section.
  • The amount of time an application takes to start up is reflected as the difference between the beginning of the graph and when the datapoints drop to a consistent level.

Possible problem description

  • The time between when a new task is initially spun up and when it goes into service is too short, and does not give the task enough time to finish starting before it has to handle incoming traffic. This results in 502 errors.

Possible resolution

  • Allow the application more time on startup before initiating health checks. If health checks begin running immediately, it’s possible that the application won’t be able to respond to them due to its startup process. The task will then be shut down and the same process will begin on another new task, potentially leading to a perpetually-unhealthy service.
  • The amount of time an ELB waits before declaring a task as “healthy” and putting it into service is determined by how often it runs the health checks multiplied by how many successful health checks are required before the application can go into service. If the application took 2 minutes to start up and the health check interval is set to 30 seconds with no delay before health checks begin, we would need to allow at least 5 health checks for this app (4 in order to get through the 2-minute startup process, and 1 more to get an actually healthy response).

Optimize the application’s health checks by doing the following (a CLI sketch follows the list):

  • minimize interval
  • minimize healthy_threshold
  • maintain the following rule: healthy_threshold * interval > application startup time
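
If the service sits behind an ALB/NLB, these values live on the target group and can be adjusted without a redeploy. Below is a minimal sketch with a placeholder target group ARN, sized for an application that takes roughly 2 minutes to start (5 checks * 30 seconds > 120 seconds):

aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-service/0123456789abcdef \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 5 \
  --health-check-timeout-seconds 10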

4) Was the application deployed recently (within the last 10–20 minutes)?

How do I answer this question?

This one is easy. Check the time since you last deployed.

Possible problem description

The initial capacity of the application was not high enough to meet traffic demands. This means that when the application was deployed, there were too few tasks running initially to handle all of the traffic and it takes time to scale up.

Possible resolution

Short-term:

  • Wait a few minutes for the application to scale. This will not immediately resolve the 502 errors, but the application should eventually scale up to meet load demands if the max task count is high enough.
  • Do not roll back; this will likely result in more 502 errors, since the initial capacity of the previous version will likely be the same as that of the current version.

Long-term:

  • Do not deploy during peak hours. This way, the application has enough time to scale up for peak traffic hours.
  • When deploying the application, set the “desired task count” equal to your “max task count” (see the sketch after this list). This will result in the application scaling up to its maximum capacity when initially deployed, which should be able to handle the load if the max task count is high enough. The application will eventually scale back down to match traffic demands.
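
If the deployment tooling doesn’t expose an initial desired count, the same effect can be achieved manually right after the deploy. Below is a sketch with placeholder names, where 20 stands in for the service’s max task count:

aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --desired-count 20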

5) Was there a significant time gap between deploying and releasing the application?

For anyone who might not be aware of the difference between “deployment” and “release”: a deployment is when the application is pushed to the desired environment and started, while a release is when traffic is allowed to hit that application. It is possible to deploy an application and have it running in your production account without actually allowing customers to access it, which lets you do things like run tests in that environment if required.

How do I answer this question?

Check the amount of time between when the application was deployed and when it was released. If it’s greater than roughly 20 minutes (depending on the application’s health check settings), then it could be the cause of the issue.

Possible problem description

Before the release step, it is likely that there is no traffic on the machine other than /isActive health checks. Given enough time, this will cause the number of tasks for that application to drop, since they don’t need to handle any traffic. When a release occurs, if the release step is not configured to gradually cut traffic over (it is not by default), then the new version gets hit with 100% of the traffic at once. Since there are so few tasks running as a result of the previous inactivity, it will take time for tasks to scale up, and during that time 502 errors may occur.

Possible resolution

Short-term:

If there is an application that has been deployed that needs to get released into production, the number of tasks can be manually increased via the ECS console or AWS CLI prior to releasing the application.

  • Increase the number of tasks to your expected max count
  • Wait for new tasks to spin up
  • Run the release step before tasks spin back down again

Don’t wait too long between the deployment and release steps, if it can be helped.

Long-term:

If the application does require time between deployment and release (e.g. testing time, other validation, waiting for non-peak hours), there are methods that can be used to gradually shift traffic over, allowing the application to spin up as the cutover proceeds. In my organization, we have written automation that modifies the two Route53 entries for the two versions of the app we have running. This automation requires 2 pieces of input: what % of traffic should be cut over each interval, and how long each interval should be. When releasing an app, we use this automation to slowly change the “weight” value for the 2 Route53 entries to shift traffic over to the new version of the service. For example, we can configure an application to shift 5% of traffic every 5 minutes. It would take 1h40m to finish the transition, but it would give the newer version time to spin up as additional traffic comes in.
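
That automation isn’t shown here, but a single step of a weighted cutover looks roughly like the following. The hosted zone ID, record names, and ELB DNS names are placeholders; the only idea being illustrated is nudging the Weight values between the old and new records:

cat > shift-traffic.json <<'EOF'
{
  "Comment": "Shift another 5% of traffic to the new version",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "my-service.example.com",
        "Type": "CNAME",
        "SetIdentifier": "old-version",
        "Weight": 90,
        "TTL": 60,
        "ResourceRecords": [{ "Value": "old-elb.us-west-2.elb.amazonaws.com" }]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "my-service.example.com",
        "Type": "CNAME",
        "SetIdentifier": "new-version",
        "Weight": 10,
        "TTL": 60,
        "ResourceRecords": [{ "Value": "new-elb.us-west-2.elb.amazonaws.com" }]
      }
    }
  ]
}
EOF

aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000000000 \
  --change-batch file://shift-traffic.json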

6) Are the issues mainly coming from a small percentage of total containers?

How do I answer this question?

The easiest way to discover if this is the case is to compare the average CPU for the service with the max CPU. If the max CPU is significantly higher, it’s possible that one or two tasks have hit a bug and are stuck in a loop, consuming CPU and memory without responding properly to requests.

  • Log into the AWS console
  • Navigate to the ECS console
  • Find the ECS cluster that the problematic app is running in
  • Search for the application name in the services tab
  • Select the affected service and look at the “monitoring tab”
  • There should be information on max CPU and average CPU under that tab
  • The max/avg CPU comparison can also be done on the scaling alarms, which show more detailed information.

Possible problem description

Sometimes, a container/task experiences an error that causes it to become unresponsive. This may result in 502 errors for requests that hit that particular container/task.

Possible resolution

The easiest way to handle these is to stop the affected task and allow the ECS service to bring up a replacement.
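
Below is a sketch of doing that from the CLI with placeholder names; as long as the desired count is unchanged, the ECS service scheduler starts a replacement task automatically:

# Find the tasks for the service, then stop the misbehaving one
aws ecs list-tasks --cluster my-cluster --service-name my-service

aws ecs stop-task \
  --cluster my-cluster \
  --task <task-id-from-the-list-above> \
  --reason "Unresponsive task, replacing it"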

7) Is the application Spring Boot Based (Embedded Tomcat Configuration)?

How do I answer this question?

Check if the application is a Tomcat-based application.

Possible problem description

  • If none of the other options in this troubleshooting guide have helped resolve the 502 issue, it’s possible that it’s a problem with the Tomcat configuration.
  • Tomcat applications come with default configuration in most cases, and these default values may prevent an ECS task from processing as many requests as it would be able to otherwise.
  • There are 2 general configuration items that should be considered when running a Tomcat server: thread count and min/max heap settings. Let’s examine each one individually.

How bad thread count settings can cause 502s

  • By default, Tomcat servers only handle 200 threads.
  • This means that, by default, each ECS task can only handle 200 concurrent requests, even if the hardware it’s running on could support more.
  • If all threads are in use across all available ECS tasks for the given service, any new requests will be stopped by the ELB. If not handled within the timeout window, these requests will time out, resulting in 502s.

How bad heap settings can cause 502s

  • If the maximum memory available for the heap is less than the memory allocated to the task’s container, then some resources will be wasted.
  • This may increase the total number of 502s if the max task count is not high enough since the Tomcat servers running on the ECS container may not have access to the memory they require to serve requests.

Possible resolution

Resolving 502 errors caused by bad thread count settings

There are 2 ways to handle the thread count being maxed out:

  1. IF the average CPU and memory usage is very high (~80%) then it’s likely that the threads being used are taking advantage of all of the resources that are available to them within the ECS container. If this is the case, increasing the thread count will not improve performance since the resources are already in use. In this case, the resolution is very similar to Question 1: either increase the CPU and memory for a single task and then increase the thread count for Tomcat, OR simply increase the max task count to deploy more tasks. Please refer to Question 1 to determine which option is best.
  2. IF the average CPU and memory usage is low and the application is still seeing 502 errors, increase the thread count in Tomcat’s configuration settings. Instructions for modifying the thread count are here (a minimal sketch follows this list).
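
For a Spring Boot application with embedded Tomcat, the thread pool is typically controlled by a property rather than server.xml. Below is a minimal sketch, assuming the jar is launched directly; the property is named server.tomcat.threads.max in Spring Boot 2.3+ (server.tomcat.max-threads in earlier versions), and 400 is an illustrative value, not a recommendation:

# Hypothetical startup command with an increased Tomcat worker pool
java -Dserver.tomcat.threads.max=400 -jar my-service.jar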

NOTE: Determining the correct number of threads can be a complicated process and there is no general rule of thumb.

Resolving 502s caused by bad heap settings

Increase the heap until Tomcat consumes the resources available. This should be close to the total memory allocated to the container. Instructions on changing the heap size for Tomcat servers are here.
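
Below is a minimal sketch of pinning the heap at startup. The numbers are illustrative: for a task with 1024 MB of reserved memory, a 768 MB heap leaves headroom for metaspace, thread stacks, and other off-heap usage:

# Hypothetical startup command with explicit min/max heap sizes
java -Xms768m -Xmx768m -jar my-service.jar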

8) Is the load balancer still sending traffic to a task in the service, even if the TCP connection is torn down?

How do I answer this question?

Compare the ELB’s idle timeout to the application’s keepAlive value. If the ELB’s idle timeout value is greater than the application’s keepAlive value, this may cause problems.

Find the ELB’s idle timeout value by checking the settings on the load balancer itself in AWS.

How to find the application’s keepAlive value depends on what type of application it is.

Possible problem description

If the ELB timeout value is greater than the application’s keepAlive value, it can cause problems when rotating out new tasks. Let’s say traffic on a service dropped off at the end of the day, and the service tried to scale down. One of the tasks is preparing to spin down as a result of the down scaling, and continues to receive requests for the duration of the application’s configured keepAlive value. Once this time expires, the application will no longer take new requests. If the timeout value configured on the ELB is greater than the keepAlive value configured on the application, the ELB might still send requests to the task that’s scaling down. This would result in 502 errors.

Possible resolution

Short-term:

  • To test whether this is the issue, it’s possible to change the timeout value on the ELB directly without redeploying (a CLI sketch follows). This may help determine whether this issue is the cause of the 5xx errors without having to push the app all the way through a new pipeline.
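
For an ALB, the idle timeout is a load balancer attribute that can be changed in place. Below is a sketch with a placeholder ARN; 55 seconds is only an illustrative value, chosen to sit below a typical 60-second keepAlive:

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/app/my-service/0123456789abcdef \
  --attributes Key=idle_timeout.timeout_seconds,Value=55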

Long-term:

  • Decrease the timeout value in your deployment configuration (CloudFormation, Terraform, etc.). This will ensure that future deployments don’t experience this issue and is the better approach from a best-practices standpoint.

Ok, so it’s probably not infrastructure-related. Where else should I look?

Below is a list of common non-infrastructure related causes of 502 errors. It’s not comprehensive, but it might help get you started.

  • No amount of task count/memory/CPU optimization will be able to make up for a bug in the code if the bug causes the task’s CPU or memory usage to spike for a significant amount of time.
  • DNS caching is enabled, causing the task to hit an outdated IP address for a downstream service. Check here for an example of how to change the TTL in Java (a short sketch also follows this list).
  • A downstream application is throwing 502s: either implement a retry on the call to the downstream app, or go through the troubleshooting process above for the downstream app.
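
As a sketch of the DNS caching point above: the JVM’s positive DNS cache TTL is controlled by the networkaddress.cache.ttl security property (set in java.security or via Security.setProperty), and OpenJDK/Oracle JVMs also consult the implementation-specific sun.net.inetaddr.ttl system property as a fallback, so something like the following can work for a quick test:

# Caps positive DNS cache entries at 60 seconds (implementation-specific fallback property)
java -Dsun.net.inetaddr.ttl=60 -jar my-service.jar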

Credit

Thanks to the following people for all of their input on this document, and if you see anything that could be added please comment and let us know!

  • Steve Norton
  • Manik Anand
  • Tommy Orndorff
  • RajaSekhar B
