Timeouts

Knock! Knock! Who’s there? … … … … 504 Gateway Timeout 😱

Grégoire Paris
ManoMano Tech team
Oct 9, 2019


In network computing, a timeout occurs when a program tries to get some information from another program through a socket, but fails to do so within a predefined amount of time. Note that the second program does not have to run on a remote machine for this to occur.
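
For instance, here is a minimal sketch of client-side timeouts on a plain TCP socket in PHP; the host, port and durations are illustrative, not recommendations:

```php
<?php
// Connect timeout: give up if the TCP connection is not established
// within 2 seconds. Host and port are placeholders.
$socket = stream_socket_client('tcp://example.com:80', $errno, $errstr, 2.0);
if ($socket === false) {
    die("Connection failed: $errstr ($errno)\n");
}

// Read timeout: give up if no data arrives within 500 ms.
stream_set_timeout($socket, 0, 500000);

fwrite($socket, "GET / HTTP/1.0\r\nHost: example.com\r\n\r\n");
$response = fread($socket, 8192);

// fread() returns whatever it got; the timeout shows up in the metadata.
if (stream_get_meta_data($socket)['timed_out']) {
    echo "Read timed out\n";
}
```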

I am a web developer, and in my context this can happen in a lot of places:

  • when my computer tries to get an IP from a DNS server;
  • when nginx, my web server, tries to get a response from php-fpm, my application server;
  • when my application tries to get a response from a database;
  • or from a key-value store;
  • or from any microservice.

Every once in a while, I see the same scenario unfold in conversations between developers. It goes like this:

― “Help! I’m getting lots of timeouts on microservice X”

A few moments later:

― “Never mind, I fixed it, the solution was to change the timeout from one second to two seconds. 👍”

This makes me shiver for two reasons:

  • Microservice X is slow, and if we accept that change, the main application will be waiting for it for one second longer every time it calls it, and will become slower itself.
  • This brings back painful memories, memories of events that taught me the hard way that slowness is often just the harbinger of a big downtime, made of 504 HTTP status codes, but also of some more sinister 502 HTTP status codes.

Why 502 Bad Gateway responses, you may wonder? It’s a bit complicated, but I have a nice story to illustrate that point. Let me tell you the tale of Josiane from accounting, who single-handedly and repeatedly took down the website of my former company (a VoD company) in the middle of the day, when things were pretty quiet. Names have been changed to protect the innocent and the guilty. The application had a particular page that allowed users to export sales, and that export took quite a long time to complete. And sometimes, Josiane ran the export with a timeframe that contained too many rows for the export to complete in time.

The nginx timeout on that particular page had been set to one minute or so to allow the export to complete, and a call to set_time_limit() had been made to allow PHP to run a bit longer on that page. On seeing the resulting 504 Gateway Timeout, Josiane, bless her heart, would do as instructed by the IT team: retry with a shorter timeframe, or just press F5 if she thought the timeframe was already reasonably small. The rationale was that some caching systems might be warmer the second time, and she might be able to get a response.

That 504 did not mean that the previous export process had crashed, though. Rather, nginx had just abandoned hope of receiving a response from php-fpm. php-fpm merrily carried on its way until reaching its own timeout or simply fulfilling the request, who knows? Oh, and by the way, set_time_limit() only limits the execution time of the script itself: time spent outside script execution, such as system calls, stream operations or database queries, does not count towards it, which let the script run for a bit longer than intended.
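
A minimal sketch of that pitfall, assuming a local MySQL server (the DSN and credentials are placeholders):

```php
<?php
// The 5-second limit below only counts time spent executing PHP code.
set_time_limit(5);

// Placeholder DSN and credentials.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'secret');

// This blocks inside MySQL for 10 seconds. The wait happens outside
// PHP execution, so set_time_limit() never fires and the script runs
// well past its supposed 5-second budget.
$pdo->query('SELECT SLEEP(10)');

echo "Done after ~10 seconds, despite the 5-second limit\n";
```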

That means that after Josiane pressed F5, there were two php-fpm workers trying to fulfill the same (heavy) request. But what happens if Josiane presses F5 enough times in a short enough timespan? At that point, all php-fpm workers are busy trying to fulfill that same request, and that’s how you can get a 502 Bad Gateway 💥. I say “can” here, because that’s how things were configured on that server, but you might get a 504 instead, which is worse in my opinion, because it makes one think the issue has to do with pages of the website that are in fact not at fault.

Josiane, unknowingly performing a Denial of Service attack

That’s a bit counter-intuitive, but in that kind of situation, increasing the timeout will only make things worse: while it makes it more likely that the script will complete, it also makes the application more vulnerable to a denial of service (which is what Josiane innocently performed). A solution that does not involve development would be to dedicate a separate php-fpm pool to the slow route, so that it does not have a negative impact on the rest of the application. A better solution is to ensure the script cannot run in parallel, by using locks. But the proper solution, in my opinion, is to defer the work to a background script, by using a messaging system like RabbitMQ (see the sketches after the list below).

This ensures that:

  • the script has all the time it needs to complete;
  • tasks will not be run in parallel unless you specifically allocate more workers to them;
  • the user gets instant feedback telling them their task is running and that they will get the result by email, or on a shared drive, or whatever, and can move on with their life 🙂 instead of waiting endlessly in front of a screen that doesn’t tell them anything is happening.
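
As a minimal sketch of the lock approach, an advisory file lock around the export is enough to reject parallel runs; the lock file path and runExport() are hypothetical placeholders:

```php
<?php
// Open (or create) the lock file without truncating it.
$lock = fopen('/var/run/sales-export.lock', 'c');

if (!flock($lock, LOCK_EX | LOCK_NB)) {
    // Another worker is already running the export: bail out early
    // instead of piling up a second heavy request.
    http_response_code(409);
    exit('An export is already running, please try again later.');
}

try {
    runExport($_GET['from'], $_GET['to']); // hypothetical export logic
} finally {
    flock($lock, LOCK_UN);
    fclose($lock);
}
```

And a sketch of the deferred approach, assuming the php-amqplib package and a RabbitMQ broker on localhost (queue name and payload are illustrative):

```php
<?php
require 'vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

// Publish the export request instead of running it inline; a separate
// consumer process picks it up and takes all the time it needs.
$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('sales_exports', false, true, false, false);

$payload = json_encode(['from' => $_GET['from'], 'to' => $_GET['to']]);
$channel->basic_publish(new AMQPMessage($payload), '', 'sales_exports');

$channel->close();
$connection->close();

echo 'Your export is queued; you will receive it by email.';
```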

Ok. So long timeouts are bad. But you know what is worse than long timeouts? Long timeouts combined with locks. These can happen when issuing a lot of AJAX requests in parallel with the same session cookie. In that sort of scenario, php-fpm workers wait on the lock for a session file, while the one request that managed to lock that same session file typically performs a costly SQL query. This can be very confusing, because you can get 502 / 504 errors, log into your production server, and see low CPU/RAM usage (assuming your database is hosted on a separate server). One way to understand what is happening is to use strace on the php-fpm workers, but that is not really recommended on a production system. A good solution to that kind of issue is to make sure that AJAX calls that can be stateless actually are.
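
One cheap mitigation, for endpoints that read the session but never write it, is to release the session lock as early as possible; a sketch, with a hypothetical fetchDashboardData() helper:

```php
<?php
session_start();

// Read whatever we need from the session, then release the session
// file lock immediately so parallel AJAX requests are not serialized.
$userId = $_SESSION['user_id'] ?? null;
session_write_close();

// The slow part now runs without holding the session lock.
echo json_encode(fetchDashboardData($userId)); // hypothetical helper
```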

With all that in mind, let us examine how timeouts are configured by default on a typical PHP web app stack:
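
As a rough sketch, the out-of-the-box values look like this on stock nginx and PHP installs (exact defaults vary across versions and distributions, so double-check yours):

```
# nginx: how long to wait for a response from the FastCGI (php-fpm)
# backend before answering 504. Default: 60 seconds.
fastcgi_read_timeout 60s;
```

```
; php-fpm pool (www.conf): hard limit after which a worker is killed.
; Default: 0, i.e. disabled.
request_terminate_timeout = 0

; php.ini: script execution limit. Default: 30 seconds, but time spent
; outside script execution (system calls, stream operations, SQL
; queries) does not count towards it.
max_execution_time = 30
```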

As you can see, there is not much that is configured with sub-second response times in mind 😅. Nginx will wait a full minute for php-fpm before giving up, and by default, a php-fpm worker is allowed to continue forever as long as it spends its time in system calls rather than in PHP code.

Regarding databases, there are connect timeouts you can configure, but there does not seem to be any way to configure a read timeout for when you can connect but it takes forever to actually get a response. There are, however, ways to limit the time allotted to SQL queries, globally or per query.
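
With MySQL 5.7+, for instance, you can cap SELECT execution time per session or per query; a sketch with PDO (host, credentials, table and values are illustrative):

```php
<?php
$pdo = new PDO('mysql:host=db.example.com;dbname=shop', 'user', 'secret', [
    PDO::ATTR_TIMEOUT => 2,  // connect timeout, in seconds
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
]);

// Session-wide cap on SELECT execution time (milliseconds, MySQL 5.7+).
$pdo->exec('SET SESSION max_execution_time = 1000');

// Or per query, with an optimizer hint.
$rows = $pdo->query('SELECT /*+ MAX_EXECUTION_TIME(1000) */ * FROM sales')
            ->fetchAll();
```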

Timeout issues do not happen very often, and I think this is why people don’t know much about them, and maybe why software does not always come reasonably configured by default, or sometimes does not provide a way to configure a timeout at all. When these issues happen though, they are quite often hard to reproduce; when you do reproduce them 100% of the time, they can still be hard to troubleshoot; and when you finally pinpoint them, they can still be hard to fix (migrating to a producer-consumer setup when you don’t have an infrastructure for that can take a lot of time, for instance).

My two cents: when you do have the possibility to configure a timeout, make sure you give it some thought. This can involve having a look at your logs to see what response times you should expect, or evaluating how critical the feature you are providing is. If it isn’t critical and you think the application you are working on can live without that feature, applying the circuit breaker design pattern might be a good idea.
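
Here is a minimal, non-production sketch of that pattern, using APCu as shared state; the thresholds, key names and the callable are all illustrative:

```php
<?php
// Illustrative thresholds.
const FAILURE_THRESHOLD = 5;   // failures before the circuit opens
const COOLDOWN_SECONDS  = 30;  // how long the circuit stays open

function callWithBreaker(string $service, callable $call)
{
    if (apcu_fetch("breaker_open_$service")) {
        // Circuit is open: fail fast instead of waiting for a timeout.
        throw new RuntimeException("$service unavailable (circuit open)");
    }

    try {
        $result = $call();
        apcu_store("breaker_failures_$service", 0); // reset on success
        return $result;
    } catch (Throwable $e) {
        // Not atomic, but good enough for a sketch.
        $failures = (int) apcu_fetch("breaker_failures_$service") + 1;
        apcu_store("breaker_failures_$service", $failures);

        if ($failures >= FAILURE_THRESHOLD) {
            // The TTL doubles as the cooldown after which we retry.
            apcu_store("breaker_open_$service", true, COOLDOWN_SECONDS);
        }
        throw $e;
    }
}
```

A call site would wrap each remote call, e.g. callWithBreaker('microservice-x', fn () => $client->get(...)) with a hypothetical HTTP client, so that once the service misbehaves, users get an instant degraded answer instead of waiting out yet another timeout.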

Hopefully this blog post will help you foresee that kind of issue and not fall into the same traps as I did. Don’t hesitate to share your own war stories, advice and configuration tips in the comments, I’m very interested in having feedback on this topic!
