AWS ALB, Docker, uwsgi & 502 Bad Gateway

Investigate AWS ALB 502 Bad Gateways with uwsgi

Gabriel M. Troy
May 1 · 5 min read

A few weeks ago, some of the backend services we were running in production started throwing a small number of 502 Bad Gateway errors. Although the error rate was relatively low (under 1%), it was causing significant issues for customers.

This article walks through our troubleshooting steps, our mistakes and our solution.


Troubleshooting

The backend services throwing the error are written in Python/Django and are deployed on AWS using uWSGI.

Step 1: Identifying the root cause

Usually when something unexpected happens you look at two things:

  1. logs and monitoring
    • a 502 Bad Gateway error usually means the load balancer (ELB/ALB in our case) can’t talk to the upstream service. We checked the logs and uptime of our backend service and they didn’t show anything out of the ordinary: there were no error log or access log entries for the failed requests, and the service showed 100% uptime during the time of the errors
    • the only evidence of the error we found was in the ALB logs. However, this was not very helpful: it only showed that the 502 error was happening
h2 2019-04-12T11:12:04.281739Z app/cex-prod-alb/8c2b504168120906 5.2.148.252:41156 10.0.3.116:8006 0.000 0.001 -1 502 - 77 610 "GET https://... HTTP/2.0" .....

  2. changes since the last known stable version
We had just moved our backend service to be run as a Docker container. Given that the errors started happening around the same time as this change, we knew that the problem had to do with the new way we were deploying our service.

We moved from:
• Apache Web Server with mod_wsgi hosting our Django application deployed on EC2 instances. The instances sit behind the Classic AWS Load Balancer
to
• Docker container running uWSGI deployed on EC2 instances. The instances sit behind the AWS Application Load Balancer.

# example of Dockerfile
FROM python:3.6-alpine3.7
....
uwsgi \
--chdir=. \
--module=xxx.wsgi:application \
--env DJANGO_SETTINGS_MODULE=xxx.settings \
--master \
--pidfile=/tmp/project-master.pid \
--processes=10 \
--cheaper=1 \
--cheaper-overload=30 \
--cheaper-step=1 \
--threads=4 \
--enable-threads \
--harakiri=60 \
--max-requests=200 \
--vacuum
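Going back to the ALB log entry from the logs check above, it can help to label its fields. The sketch below is a minimal illustration, assuming the standard ALB access log field order for the first twelve fields plus the quoted request; the helper name `parse_alb_entry` is hypothetical, and the entry is the truncated one shown earlier.

```python
import shlex

# First twelve fields of an ALB access log entry, in order, followed by the
# quoted request line. Later fields are omitted because the entry is truncated.
ALB_FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request",
]


def parse_alb_entry(line):
    """Map the leading fields of one ALB access log line to their names."""
    tokens = shlex.split(line)  # keeps the quoted request as a single token
    return dict(zip(ALB_FIELDS, tokens))


entry = parse_alb_entry(
    'h2 2019-04-12T11:12:04.281739Z app/cex-prod-alb/8c2b504168120906 '
    '5.2.148.252:41156 10.0.3.116:8006 0.000 0.001 -1 502 - 77 610 '
    '"GET https://... HTTP/2.0"'
)

for name in ("elb_status_code", "target_status_code", "response_processing_time"):
    print(name, "=", entry[name])
```

Read this way, the entry shows an `elb_status_code` of 502 with a `target_status_code` of `-`, meaning the load balancer recorded no status code from the target. That suggests the 502 was generated by the ALB itself rather than returned by our application, which is consistent with the empty backend logs.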

Step 2: Repro steps

Since we had an idea where to look, we needed to figure out how to look. We couldn’t reliably reproduce the error in any of our environments (prod, staging, development or local). The errors would just pop up randomly while using the application.

We turned to Google and Stack Overflow to see if other people were having the same issue. Unfortunately, we didn’t find any answers or articles that resembled our problem. The only thing we knew (from our logs and initial investigation) was that:

For a very small number of requests, our AWS ALB was seeing the upstream service as down even though it was not.

We ran across this article on AWS outlining a bunch of things to investigate when dealing with HTTP 502: Bad Gateway: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html

One of them caught our attention:

The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target. Check whether the keep-alive duration of the target is shorter than the idle timeout value of the load balancer.

Since our application was memory intensive, we had configured uWSGI's cheaper process-spawning algorithm to avoid any huge memory leaks; this allows the number of workers to scale up and down based on usage. That got us thinking: what if uWSGI was killing or restarting a worker while the ALB was still trying to send a request to it, thus generating the 502?
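The failure mode described in the AWS note can be sketched with plain sockets. This is only a local illustration, not how the ALB or uWSGI are implemented: a stand-in "target" accepts a connection and closes it without answering, and a stand-in "load balancer" reports what it would have to return. The function names and the mapping to 502 are assumptions made for the example.

```python
import socket
import threading


def closing_target(server_sock):
    """Accept one connection, then close it without sending a response."""
    conn, _ = server_sock.accept()
    conn.close()


def forward_request(port):
    """Play the load balancer: forward one request and report the outcome."""
    with socket.create_connection(("127.0.0.1", port), timeout=5) as upstream:
        try:
            upstream.sendall(b"GET / HTTP/1.1\r\nHost: example\r\n\r\n")
            data = upstream.recv(1024)
        except ConnectionError:
            return 502  # the target answered with a TCP RST
        # An empty read means the target sent a FIN before any response bytes.
        return 502 if data == b"" else 200


def simulate():
    """Run the target and the load balancer stand-ins against each other."""
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.bind(("127.0.0.1", 0))  # let the OS pick a free port
    server_sock.listen(1)
    port = server_sock.getsockname()[1]

    target = threading.Thread(target=closing_target, args=(server_sock,))
    target.start()
    status = forward_request(port)
    target.join()
    server_sock.close()
    return status


if __name__ == "__main__":
    print(simulate())
```

Whether the close shows up as a FIN or an RST depends on timing and the operating system, so the sketch treats both the same way: the request was outstanding and no response came back.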

Step 3: Test Setup

To test this assumption we ran the following scenario that simulated intermittent load:
• put high load on our environment to force the cheaper algorithm to spawn new workers
• stop the load, and wait for the idle workers to be killed
• put high load back on the servers and see what happens
To simulate high load we created a slow test endpoint that would respond in 1s-4s to keep the workers busy and used the Siege tool to simulate concurrent users.
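A slow endpoint of this kind can be sketched as follows. The original was part of the Django project; this version is a plain WSGI callable so that it runs with only the standard library, and the names `make_slow_app` and `application` are hypothetical.

```python
import random
import time


def make_slow_app(min_delay=1.0, max_delay=4.0, sleep=time.sleep):
    """Build a WSGI app that holds a worker busy for a random 1s-4s delay."""

    def app(environ, start_response):
        delay = random.uniform(min_delay, max_delay)
        sleep(delay)  # keep the worker occupied, like a slow request would
        body = f"slept {delay:.2f}s\n".encode()
        start_response(
            "200 OK",
            [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
        )
        return [body]

    return app


# Module-level callable that a WSGI server such as uWSGI can load.
application = make_slow_app()
```

The `sleep` parameter is injectable only so the delay can be stubbed out when exercising the app outside of a load test.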

Scenario:

  • ran $ docker stats — to monitor the workers (PIDS) scale up and down
  • ran $ docker logs -f <container_id> — to watch logs for uWSGI workers lifecycle
  • ran $ siege -c 50 -r 10 https://test-endpoint -v — to run 50 concurrent requests 10 times on the slow test endpoint to simulate high load
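If Siege is not available, a rough equivalent of the load step can be sketched in Python. This is an assumption-laden stand-in rather than what we ran: `run_load` and `fetch_status` are hypothetical helpers, and the defaults mirror `-c 50 -r 10` (50 concurrent users, 10 repetitions each).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request


def fetch_status(url, timeout=30):
    """Return the HTTP status code for one GET request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses, including 502


def run_load(url, concurrency=50, repeats=10, fetch=fetch_status):
    """Send concurrency * repeats requests and count the status codes seen."""
    total = concurrency * repeats
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fetch, [url] * total))


if __name__ == "__main__":
    # Placeholder URL taken from the siege command above.
    print(run_load("https://test-endpoint"))
```

Counting status codes makes a 502 easy to spot in the summary, for example by checking whether the returned counter has an entry for 502 after the second burst of load.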

While running the siege command we could see the number of PIDS going up, and the logs showed uWSGI spawning new workers.

When the siege command finished and no traffic was hitting the server, we saw the workers getting killed (which was the expected behavior).

As soon as we saw one of the uWSGI workers get killed we immediately kicked off the siege command again and, the second we did, we saw the 502 Bad Gateway error.

Step 4: Fixing the issue

When deploying uWSGI it is recommended to have a web server in front of it. Usually this is NGINX. While we knew this when we started our initial implementation, we wanted to see if the AWS Application Load Balancer was enough. It looked like it wasn’t, since it didn’t always know how to handle the restarting/killing of uWSGI workers.

So the obvious solution was to add NGINX between AWS ALB and our uWSGI application. In the end our implementation would look like this:

web-client -> AWS ALB -> NGINX -> socket -> uWSGI -> Django

To achieve this we changed the way we were building and running our Docker container. We moved away from the python:2.7-alpine3.7 base image and replaced it with tiangolo/uwsgi-nginx:python2.7-alpine3.7.

# Dockerfile
FROM tiangolo/uwsgi-nginx:python2.7-alpine3.7
ENV UWSGI_INI /app/uwsgi.ini
ENV UWSGI_CHEAPER 2
ENV UWSGI_PROCESSES 4
ENV NGINX_WORKER_PROCESSES auto
# .... copy project files and install dependencies
COPY uwsgi.ini /app/uwsgi.ini

# uwsgi.ini
[uwsgi]
chdir=/app/
module=xxx.wsgi:application
home=/app/env
env=DJANGO_SETTINGS_MODULE=xxx.settings
cheaper-step=2
cheaper-algo=spare
harakiri=60
vacuum=True

Conclusion

  • Running uWSGI directly behind an AWS Application Load Balancer might throw 502 Bad Gateway errors when uWSGI workers are killed or restarted.
  • Adding NGINX between the ALB and our application fixed the issue.
  • If you are deploying uWSGI applications inside Docker containers you should check out https://github.com/tiangolo/uwsgi-nginx-docker, since it comes with all the required packages and configuration.

Written by Gabriel M. Troy

Senior system architect. Privacy & regulation advisor (GDPR, CCPA, PCI, KISA, etc.)
