AWS ALB, Docker, uwsgi & 502 Bad Gateway
A few weeks ago, some of the backend services we were running in production started throwing a small number of 502 Bad Gateway errors. While the number was relatively low (under 1%) it was causing customers significant issues.
This article walks through our troubleshooting steps, our mistakes and our solution.
The backend services throwing the error are written in Python/Django and are deployed on AWS using uWSGI.
Step 1: Identifying the root cause
Usually when something unexpected happens you look at two things:
1. logs and monitoring
• a 502 Bad Gateway error usually means the load balancer (ELB/ALB in our case) can’t talk to the upstream service. We checked the logs and uptime of our backend service and they didn’t show anything out of the ordinary: there were no error or access log entries for the failed requests, and the service showed 100% uptime during the time of the errors
• the only evidence of the error we found was in the ALB logs. However, this was not very helpful: it only showed that the 502 error was happening
h2 2019-04-12T11:12:04.281739Z app/cex-prod-alb/8c2b504168120906 18.104.22.168:41156 10.0.3.116:8006 0.000 0.001 -1 502 - 77 610 "GET https://... HTTP/2.0" .....
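To sift a large ALB log for entries like the one above, a small script can help. What follows is a minimal Python 3 sketch for illustration, not part of the original investigation. It assumes the leading space-separated fields follow the order AWS documents for ALB access logs, and the sample line uses a made-up request URL (example.com) because the real one is elided above. The names `parse_alb_entry` and `is_target_side_502` are hypothetical helpers.

```python
import shlex

# Leading fields of an ALB access log entry, in documented order.
# Later fields (user agent, TLS details, ...) are ignored in this sketch.
ALB_FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request",
]

# Sample entry; the request URL is a placeholder, not the real one.
SAMPLE = (
    "h2 2019-04-12T11:12:04.281739Z app/cex-prod-alb/8c2b504168120906 "
    "18.104.22.168:41156 10.0.3.116:8006 0.000 0.001 -1 502 - 77 610 "
    '"GET https://example.com:443/ HTTP/2.0"'
)


def parse_alb_entry(line):
    """Split one log line into a dict keyed by the leading field names."""
    # shlex keeps the double-quoted request string together as one token.
    return dict(zip(ALB_FIELDS, shlex.split(line)))


def is_target_side_502(entry):
    """True when the ALB answered 502 and logged no status from the target."""
    return (
        entry.get("elb_status_code") == "502"
        and entry.get("target_status_code") == "-"
    )


entry = parse_alb_entry(SAMPLE)
print(entry["target"], entry["elb_status_code"], is_target_side_502(entry))
```

A `-` in the target status column, together with a 502 from the load balancer, suggests the ALB never received a response from the target, which is consistent with the backend logging nothing for these requests.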
2. changes since the last known stable version
We had just moved our backend service to be run as a Docker container. Given that the errors started happening around the same time as this change, we knew that the problem had to do with the new way we were deploying our service.
We moved from:
• Apache Web Server with mod_wsgi hosting our Django application, deployed on EC2 instances. The instances sat behind the Classic AWS Load Balancer.
to:
• Docker container running uWSGI, deployed on EC2 instances. The instances sit behind the AWS Application Load Balancer.
# example of Dockerfile
--env DJANGO_SETTINGS_MODULE=xxx.settings \
Step 2: Repro steps
Since we had an idea where to look, we needed to figure out how to look. We couldn’t reliably reproduce the error on any of our environments (prod, staging, development or local). The errors would just pop up randomly while using the application.
We turned to Google and Stack Overflow to see if other people were having the same issue. Unfortunately, we didn’t find any answers or articles that resembled our problem. The only thing we knew (from our logs and initial investigation) was that:
For a very small number of requests, our AWS ALB was seeing the upstream service as down even though it was not.
We ran across this article on AWS outlining a bunch of things to investigate when dealing with HTTP 502: Bad Gateway: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html
One of them caught our attention:
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target. Check whether the keep-alive duration of the target is shorter than the idle timeout value of the load balancer.
Since our application was memory intensive, we had configured the uWSGI cheaper process spawning algorithm to avoid any huge memory leaks; this allows the number of workers to scale up and down based on usage. That got us thinking: what if uWSGI was killing or restarting a worker while the ALB was still trying to send a request to it, thus generating the 502?
Step 3: Test Setup
To test this assumption we ran the following scenario that simulated intermittent load:
• put high load on our environment to force the cheaper algorithm to spawn new workers
• stop the load, and wait for the idle workers to be killed
• put high load back on the servers and see what happens
To simulate high load we created a slow test endpoint that would respond in 1s-4s to keep the workers busy and used the Siege tool to simulate concurrent users.
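The slow endpoint itself is not shown in this article. As a rough illustration, it could be as small as the WSGI callable below. This is a minimal Python 3 sketch under stated assumptions: the real endpoint was presumably a Django view, and `make_slow_app`, `MIN_DELAY_S` and `MAX_DELAY_S` are hypothetical names. The sleep and random functions are injectable only so the behavior can be checked without actually waiting.

```python
import random
import time

# Assumed bounds, matching the 1s-4s response time described above.
MIN_DELAY_S = 1.0
MAX_DELAY_S = 4.0


def make_slow_app(sleep=time.sleep, pick_delay=random.uniform):
    """Build a WSGI app that holds a worker busy for 1-4 seconds."""

    def application(environ, start_response):
        delay = pick_delay(MIN_DELAY_S, MAX_DELAY_S)
        sleep(delay)  # blocks this worker, which is the whole point
        body = ("slept %.2fs\n" % delay).encode("utf-8")
        start_response("200 OK", [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]

    return application


# uWSGI looks for a module-level callable named "application" by default.
application = make_slow_app()
```

Because each request blocks a worker for the whole delay, a modest number of concurrent users is enough to keep every worker busy and push the cheaper algorithm to spawn more.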
$ docker stats    # to monitor the workers (PIDS) scale up and down
$ docker logs -f <container_id>    # to watch logs for uWSGI workers lifecycle
$ siege -c 50 -r 10 https://test-endpoint -v    # to run 50 concurrent requests 10 times on the slow test endpoint to simulate high load
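If Siege is not at hand, the load, idle, load scenario can be approximated with the Python 3 standard library. The sketch below is an assumption-laden stand-in, not the tool we used: `fetch_status`, `run_burst` and `burst_idle_burst` are hypothetical names, and the thread pool only roughly mirrors the 50 concurrent users repeating 10 times that Siege simulates.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, timeout=30):
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx, so a 502 ends up here.
        return error.code


def run_burst(url, concurrency=50, repetitions=10, fetch=fetch_status):
    """Send concurrency * repetitions requests and count the status codes."""
    total = concurrency * repetitions
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fetch, [url] * total))


def burst_idle_burst(url, idle_seconds, fetch=fetch_status, sleep=time.sleep):
    """Load the service, let idle workers get killed, then load it again."""
    first = run_burst(url, fetch=fetch)
    sleep(idle_seconds)  # long enough for the cheaper algorithm to kill workers
    second = run_burst(url, fetch=fetch)
    return first, second
```

Calling `burst_idle_burst("https://test-endpoint", idle_seconds=60)` returns two counters of status codes; any 502 entries in the second one would match the behavior described below. The 60 second idle period is a guess and depends on how the cheaper algorithm is configured. Connection-level failures (`URLError`) are deliberately not caught in this sketch.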
While running the siege command we could see the number of PIDS going up and noticed, in the logs, uWSGI spawning new workers.
Once the siege command finished and no traffic was hitting the server, we saw the workers getting killed (which was the expected behavior).
As soon as we saw one of the uWSGI workers get killed we immediately kicked off the siege command again and, the second we did, we saw the 502 Bad Gateway error.
Step 4: Fixing the issue
When deploying uWSGI it is recommended to have a web server in front of it. Usually this is NGINX. While we knew this when we started our initial implementation, we wanted to see if the AWS Application Load Balancer was enough. It looked like it wasn’t, since it didn’t always know how to handle the restarting/killing of uWSGI workers.
So the obvious solution was to add NGINX between AWS ALB and our uWSGI application. In the end our implementation would look like this:
web-client -> AWS ALB -> NGINX -> socket -> uWSGI -> Django
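To achieve this we changed the way we were building and running our Docker container. We moved away from the python:2.7-alpine3.7 base image and replaced it with one that bundles NGINX and uWSGI (see the tiangolo/uwsgi-nginx-docker link at the end of this article):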
ENV UWSGI_INI /app/uwsgi.ini
ENV UWSGI_CHEAPER 2
ENV UWSGI_PROCESSES 4
ENV NGINX_WORKER_PROCESSES auto
# ... copy project files and install dependencies
COPY uwsgi.ini /app/uwsgi.ini
# uwsgi.ini
- Running uWSGI directly behind an AWS Application Load Balancer might throw 502 Bad Gateway errors when uWSGI workers are killed or restarted.
- Adding NGINX between the ALB and our application fixed the issue.
- If you are deploying uWSGI applications inside Docker containers you should check out: https://github.com/tiangolo/uwsgi-nginx-docker since it comes with all the required packages and configuration