AWS ALB, Docker, uWSGI & 502 Bad Gateway

Investigating AWS ALB 502 Bad Gateway errors with uWSGI

A few weeks ago, some of the backend services we were running in production started throwing a small number of 502 Bad Gateway errors. While the error rate was relatively low (under 1%), it was causing significant issues for customers.

This article walks through our troubleshooting steps, our mistakes, and our solution.

Troubleshooting

The backend services throwing the error are written in Python/Django and are deployed on AWS using uWSGI.

Step 1: Identifying the root cause

Usually when something unexpected happens you look at two things:

  1. Logs and monitoring
    • A 502 Bad Gateway error usually means the load balancer (ELB/ALB in our case) can’t talk to the upstream service. We checked the logs and uptime of our backend service and they didn’t show anything out of the ordinary: there were no error log or access log entries, and the service showed 100% uptime during the time of the errors.
    • The only evidence of the error we found was in the ALB logs. However, this was not very helpful; it only showed that the 502 error was happening.

  2. Changes since the last known stable version
    • We had just moved our backend service to run as a Docker container. Given that the errors started happening around the same time as this change, we knew that the problem had to do with the new way we were deploying our service.

We moved from:
• Apache web server with mod_wsgi hosting our Django application, deployed on EC2 instances. The instances sit behind the AWS Classic Load Balancer.
to
• Docker container running uWSGI, deployed on EC2 instances. The instances sit behind the AWS Application Load Balancer.

Step 2: Repro steps

Since we had an idea where to look, we needed to figure out how to look. We couldn’t reliably reproduce the error in any of our environments (prod, staging, development or local). The errors would just pop up randomly while using the application.

We turned to Google and Stack Overflow to see if other people were having the same issue. Unfortunately, we didn’t find any answers or articles that resembled our problem. The only thing we knew (from our logs and initial investigation) was that:

For a very small number of requests, our AWS ALB was seeing the upstream service as down even though it was not.
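
To put a number on "very small", the ALB access logs can be tallied by status code. The following is a minimal sketch, not the tooling we actually used. It assumes the log files have already been downloaded and decompressed, and that the first ten space-separated fields follow the documented ALB access log order (with the load balancer status code and the target status code in the ninth and tenth positions). The function names are hypothetical.

```python
from collections import Counter

# 0-based positions in an ALB access log entry, following the documented
# field order: type, time, elb, client:port, target:port,
# request_processing_time, target_processing_time,
# response_processing_time, elb_status_code, target_status_code, ...
ELB_STATUS_FIELD = 8
TARGET_STATUS_FIELD = 9


def tally_statuses(lines):
    """Count (elb_status_code, target_status_code) pairs across log lines."""
    counts = Counter()
    for line in lines:
        # The first ten fields never contain spaces, so a plain split is
        # enough to reach the two status codes.
        fields = line.split()
        if len(fields) <= TARGET_STATUS_FIELD:
            continue  # blank or truncated line; skip it
        key = (fields[ELB_STATUS_FIELD], fields[TARGET_STATUS_FIELD])
        counts[key] += 1
    return counts


def bad_gateway_ratio(counts):
    """Fraction of requests for which the load balancer returned 502."""
    total = sum(counts.values())
    bad = sum(n for (elb_status, _), n in counts.items() if elb_status == "502")
    return bad / total if total else 0.0
```

Feeding it the lines of a log file (for example `tally_statuses(open(path))`, where `path` is wherever you saved the log) gives the per-status counts. A 502 from the load balancer paired with a target status of `-` means the target never returned a response for that request.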

We ran across this article on AWS outlining a bunch of things to investigate when dealing with HTTP 502: Bad Gateway: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html

One of them caught our attention:

The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target. Check whether the keep-alive duration of the target is shorter than the idle timeout value of the load balancer.
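
The check described in that passage comes down to comparing two timeouts. As a small illustration (a sketch only, with hypothetical names and example values): an ALB's idle timeout defaults to 60 seconds, so any target that closes idle keep-alive connections sooner than that can hang up on a connection the load balancer still intends to reuse.

```python
ALB_DEFAULT_IDLE_TIMEOUT_S = 60  # default idle timeout of an Application Load Balancer


def target_may_close_first(target_keepalive_s,
                           alb_idle_timeout_s=ALB_DEFAULT_IDLE_TIMEOUT_S):
    """Return True when the target can drop an idle connection while the
    load balancer still considers that connection usable."""
    return target_keepalive_s <= alb_idle_timeout_s


# Example values are assumptions, not settings from our environment.
print(target_may_close_first(5))   # True: target hangs up long before the ALB does
print(target_may_close_first(75))  # False: the ALB closes idle connections first
```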

Since our application was memory intensive, we had configured uWSGI's cheaper process-spawning algorithm to avoid any huge memory leaks; this allows the number of workers to scale up and down based on usage. That got us thinking: what if uWSGI was killing or restarting a worker while the ALB was still trying to send a request to it, thus generating the 502?

Step 3: Test Setup

To test this assumption we ran the following scenario that simulated intermittent load:
• put high load on our environment to force the cheaper algorithm to spawn new workers
• stop the load, and wait for the idle workers to be killed
• put high load back on the servers and see what happens

To simulate high load we created a slow test endpoint that would respond in 1s-4s to keep the workers busy, and used the Siege tool to simulate concurrent users.
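
For illustration, a slow endpoint of this kind can be as small as the plain WSGI callable below. This is a stand-in sketch rather than our actual Django view: the name `application` is simply the callable uWSGI looks for by default, and the 1 to 4 second range mirrors the description above.

```python
import random
import time


def application(environ, start_response):
    """Minimal WSGI app that holds a worker for 1 to 4 seconds per request."""
    delay = random.uniform(1.0, 4.0)
    # Sleeping keeps this worker busy, so concurrent requests pile up and
    # the cheaper algorithm has a reason to spawn more workers.
    time.sleep(delay)
    body = "slept {:.2f}s\n".format(delay).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```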

Scenario:

  • ran $ docker stats — to monitor the workers (PIDS) scale up and down
  • ran $ docker logs -f <container_id> — to watch logs for uWSGI workers lifecycle
  • ran $ siege -c 50 -r 10 https://test-endpoint -v — to run 50 concurrent requests 10 times on the slow test endpoint to simulate high load
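
If Siege is not available, the same load pattern can be approximated from Python's standard library. The sketch below is an assumption-laden substitute, not what we ran: it starts 50 concurrent clients that each issue 10 sequential requests (the equivalent of `-c 50 -r 10`) and tallies the status codes, so any 502 shows up in the counts. The function names are hypothetical.

```python
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout=30):
    """Return the HTTP status code for one GET, or "error" on a transport failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses, including 502, are raised as HTTPError
    except (urllib.error.URLError, OSError):
        return "error"


def run_load(url, concurrency=50, repetitions=10):
    """Run `concurrency` parallel clients, each making `repetitions` requests."""
    def client(_client_index):
        return [fetch_status(url) for _ in range(repetitions)]

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        batches = list(pool.map(client, range(concurrency)))
    return Counter(status for batch in batches for status in batch)
```

Calling `run_load("https://test-endpoint")` returns a `Counter` keyed by status code; a non-zero count under `502` reproduces the problem.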

While running the siege command we could see the number of PIDS going up and noticed, in the logs, uWSGI spawning new workers.

When the siege command finished and no traffic was hitting the server, we saw the workers getting killed (which was the expected behavior).

As soon as we saw one of the uWSGI workers get killed we immediately kicked off the siege command again and, the second we did, we saw the 502 Bad Gateway error.

Step 4: Fixing the issue

When deploying uWSGI it is recommended to have a web server in front of it. Usually this is NGINX. While we knew this when we started our initial implementation, we wanted to see if the AWS Application Load Balancer was enough. It looked like it wasn’t, since it didn’t always know how to handle the restarting/killing of uWSGI workers.

So the obvious solution was to add NGINX between the AWS ALB and our uWSGI application. In the end, requests would flow from the ALB to NGINX, and from NGINX to uWSGI.

To achieve this we changed the way we were building and running our Docker container. We moved away from the python:2.7-alpine3.7 base image and replaced it with tiangolo/uwsgi-nginx:python2.7-alpine3.7.

Conclusion

  • Running uWSGI directly behind an AWS Application Load Balancer might throw 502 Bad Gateway errors when uWSGI workers are killed or restarted.
  • Adding NGINX between the ALB and our application fixed the issue.
  • If you are deploying uWSGI applications inside Docker containers you should check out https://github.com/tiangolo/uwsgi-nginx-docker since it comes with all the required packages and configuration.

