The Startup
Published in

The Startup

Troubleshoot Elastic Load Balancer HTTP 502 Bad Gateway Errors

Photo by Etienne Girardet on Unsplash

This article recorded how we fixed the HTTP 502 bad gateway errors with our web application.

The problem

We recently encountered an issue with our web application. The issue was first reported by our customers that our web application was getting slower and slower, and eventually down and not usable. Sometimes they could login but all the pages just stopped working properly and often hung there.

As we added a lot of new users to our application recently, our first reaction was that the system must have a bottleneck somewhere.

Our web application is a typical web application hosted on AWS. We use EC2 cluster to host the containerized web server and api server. Then the EC2 clusters are behind the two Application Load balancers, one for the web server www.example.com, and one for the API server api.example.com. Users login to www.example.com and access the web application.. Each web page will call the restful apis from api.example.com to retrieve and update data. We also have mobile apps access the apis of api.example.com directly. Behind the api.example.com is the mongodb database.

Fortunately we were able to duplicate the issue. We used chrome to login to www.example.com. Then we use the F12 developer tools to monitor the network traffic and check any exceptions from the console. We saw that the api calls failed randomly.

We first saw some exceptions from the console.

We were confused at first glance. How come we have CORS issues? We configured everything properly on the API server. If we really have a CORS issue, it should fail on every API call. Not just randomly failed on 1 of 50 calls.

After we understand how CORS works:

https://en.wikipedia.org/wiki/File:Flowchart_showing_Simple_and_Preflight_XHR.svg

We realized that it was not a CORS issue. It just happened that when the OPTIONS/preflight call failed, chrome thought that the failure was caused by the CORS policy. But for our case, the random failure of the API call, when it happened to OPTIONS/preflight call, chrome complained about the CORS policy, when it happened to another call like GET, chrome will just display the http 502 bad gateway error.

It was even more confusing when chrome’s DevTools network tab by default does not capture/display the OPTIONS/preflight call. We had to go to the following url:

chrome://flags/#out-of-blink-cors

and disable the [Out of blink CORS] then restart chrome. Now we could see those OPTIONS/preflight api calls failed with HTTP 502 bad gateway error.

We were also able to duplicate this issue and saw some GET API calls failed randomly.

We found an easy way to duplicate this issue. We just kept clicking the different rows of a list data that would cause the browser to keep sending the preflight call and the GET call of the API server. When I clicked quickly enough like 3 clicks per seconds, after 10–20 seconds, chrome would display a HTTP 502 bad gateway error either on the preflight call or the actual GET call.

After we excluded the possibilities of CORS issues, we started to think about the backend.

We tried one thing first which is scaling up the mongodb atlas service. We thought mightbe because we got more users recently, the mongodb might not be able to handle the traffic and became the bottleneck of the system. After we scaled up the mongodb, we found out it didn’t help.

Then we went to the load balancer.

Application Load Balancer Logs

The Application Load Balancer logs every request. The Access Logs for the Application Load Balancer are saved in the S3. We can use the AWS Athena to easily query the logs and find the useful information.

We found the requests that come from my browser by using my client ip address and the time range with where clause.

What information could we find with those two lines of logs?

After reading carefully of the explanation of each field of the log in the document, the following fields are very helpful for the troubleshooting:

request_processing_time: The total time elapsed (in seconds, with millisecond precision) from the time the load balancer received the request until the time it sent it to a target. This value is set to -1 if the load balancer can’t dispatch the request to a target. This can happen if the target closes the connection before the idle timeout or if the client sends a malformed request. This value can also be set to -1 if the registered target does not respond before the idle timeout.

target_processing_time: The total time elapsed (in seconds, with millisecond precision) from the time the load balancer sent the request to a target until the target started to send the response headers. This value is set to -1 if the load balancer can’t dispatch the request to a target. This can happen if the target closes the connection before the idle timeout or if the client sends a malformed request. This value can also be set to -1 if the registered target does not respond before the idle timeout.

response_processing_time: The total time elapsed (in seconds, with millisecond precision) from the time the load balancer received the response header from the target until it started to send the response to the client. This includes both the queuing time at the load balancer and the connection acquisition time from the load balancer to the client.

This value is set to -1 if the load balancer can’t send the request to a target. This can happen if the target closes the connection before the idle timeout or if the client sends a malformed request.

elb_status_code: The status code of the response from the load balancer.

target_status_code: The status code of the response from the target. This value is recorded only if a connection was established to the target and the target sent a response. Otherwise, it is set to -.

In our case, we have following values:

  • request_processing_time: 0
  • target_processing_time: xxx s
  • response_processing_time: -1.0
  • elb_status_code: 502
  • target_status_code: -

We had the following conclusions:

  • Load balancer did received the request from my browser
  • The 502 error was from the load balance not the api server because target_status_code always empty
  • Load balancer failed to send the request to the api server because response_processing_time was always -1

The question now became:

Why did the load balancer fail to send the request to the api server?

We searched the similar issues and found some users complained about the similar issues and fixed them by delete the load balancer and re-create the load balancer.

We tried that, but it didn’t help.

Then we went to check the api server’s logs.

Application Server Logs

Fortunately, our George and the Unbreakable Code server had a very detailed logging, and the logs were sent to the CloudWatch Logs.

We can easily query the cloudwatch logs with the CloudWatch Logs Insights.

I tried to find the logs from my browser by limiting the timestamp range.

fields @timestamp, @message| filter @message like /\/api\/tdata\/5ec94f6f3710a30019eb10e4/ AND @timestamp > 1590458407000 AND @timestamp <= 1590458587000| sort @timestamp desc| limit 100

Notice that the timestamp only accepts the epoch time. So I have to convert the time to epoch time.

But this production site had thousands of requests every minute. How could I filter the logs so that it only shows the requests coming from my browser?

I found out that the logs message fields actually logs the user agent of the http request. If I could make my request have a special user agent , then I can easily filter the logs by checking if the message has a special string.

Chrome DevTools allows us to set the custom user agent.

After I set the custom user agent with appending my name, I duplicated the issue again, and then I can use the following query to limit the result only from my browser.

fields @timestamp, @message| filter @message like /liuhongbo/ AND @timestamp > 1590458407000 AND @timestamp <= 1590458587000| sort @timestamp desc| limit 100

It worked. But, I could not find the failed 502 request in the logs. I could find the request right before the failed request and the request right after the failed request.

That just showing the request didn’t come to the API server !? But where did it go?

The ELB logs complained the api server did not respond and API server logs said it didn’t receive the request. The request got lost between the load balancer and the API server. Who should take the responsibility?

The cause

We sat back and thought this again. It was obvious that issue happened before the load balancer was able to send the request data to the API server. What would the load balancer do before sending the request?

Establish the TCP connection.

It failed to establish the TCP connection. Why did the load balancer fail to establish the TCP connection? Because the API server didn’t accept the connection. Why would the API server not accept the connection?

Thinking about the way I duplicated the issue by sending the request quick enough, I immediately thought of one possible reason: the TCP backlog.

Our api server used an open source rest api framework library. We went ahead to modify the api server by passing the backlog parameter explicitly and testing again.

It worked!

Conclusion

Not every 502 bad gateway error is caused by TCP backlog setting. But with the above mythology, I think we can locate the cause of the issue and find the root cause. And the key is the logging information from every component of the system that made the issue traceable.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store