Tower of Hanoi Timeouts
I have worked with lots of large scale applications, and one universal truth holds: every one of them totally screws up its timeouts at some point. It always brings me back to a philosophy on timeouts that I heard very early in my career:
No child request should be able to exceed the timeout of the parent.
This is called a Tower of Hanoi Timeout. It seems super simple and obvious, but the implications run very deep. No timeout should ever be allowed to be longer than the caller’s timeout.
What’s the worst that can happen?
I have heard this reasoning many times while pointing out timeout issues. For me, the worst that could happen is serving astronomic error rates during the largest event in the history of the company. Large enough that we were featured, yet again, in news articles about our failures. Jonathan Reichhold did a good job of documenting this class of failure at Velocity.
Bad timeout management meant that the whole site was serving errors, while individual components were actually succeeding and reporting that they were healthy. This led to a situation where the frontends were serving 200's while the customers were getting 500's from our HTTP proxies.
The problem can be boiled down to a simple situation: The web servers were completing requests and serving a success after the upstream connection had already been closed and the client had been served a 500. Even worse, the 200's would take so long to serve that the client would reload, therefore having 2, 3 or more of the same request in flight at a time. This increased load on the backend which in turn made things run slower, which in turn increased the number of inflight connections, and so on.
To be a proper Tower of Hanoi Timeout, the backend should not have been able to keep processing the request after the frontend had already given up and served an error.
How to avoid this mistake.
One of the most often overlooked timeout failures is forgetting that serial timeouts are additive, as are timeouts on retries. If your app queries Memcached with a two second timeout, then MySQL with a five second timeout, and then adds an email to a queue with a one second timeout, a query can be successful and still take eight seconds. If you automatically retry once on a MySQL failure then it can take 13 seconds, etc.
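The arithmetic above is easy to check. Here is a minimal sketch, where worst_case_seconds is a made-up helper name and each step is a (timeout, retries) pair taken from the example:

```python
def worst_case_seconds(steps):
    """Longest a chain of serial calls can take and still succeed.

    steps is a list of (timeout_seconds, retries) pairs. Every retry
    can burn a full timeout again, so each step costs
    timeout * (1 + retries) in the worst case.
    """
    return sum(timeout * (1 + retries) for timeout, retries in steps)


# Memcached 2s, MySQL 5s, queue 1s, no retries.
no_retries = worst_case_seconds([(2, 0), (5, 0), (1, 0)])

# Same chain, but MySQL is retried once on failure.
one_mysql_retry = worst_case_seconds([(2, 0), (5, 1), (1, 0)])
```

If either number is larger than the timeout of whatever is calling you, the tower is already upside down.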
Now, I usually hear the argument that it’s super rare that every single backend would take the maximum time and still return successful. This is totally true, but why be lazy? An even easier way around this is to keep track of when you must reply by, and simply subtract the current time from that to get the most you can afford to wait on the next call. This way you can bail out of a query much earlier if successive, successful queries are just taking too long to finish on time.
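A minimal sketch of that bookkeeping might look like the following. The Deadline class, DeadlineExceeded, and run_steps are names invented for this illustration, and each call is assumed to accept a timeout keyword argument in seconds:

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the overall time budget has been used up."""


class Deadline:
    """Tracks the moment by which the whole request must be answered."""

    def __init__(self, budget_seconds, clock=time.monotonic):
        # A monotonic clock is not affected by wall clock adjustments.
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left before we must reply (may be negative)."""
        return self._expires_at - self._clock()

    def timeout_for(self, step_max):
        """Timeout for the next call: its own limit or what is left,
        whichever is smaller. Bails out if nothing is left."""
        left = self.remaining()
        if left <= 0:
            raise DeadlineExceeded("no time left for the next call")
        return min(step_max, left)


def run_steps(deadline, steps):
    """Run serial calls, never letting one outlive the deadline.

    steps is a list of (callable, max_timeout_seconds) pairs; each
    callable is assumed to take a `timeout` keyword argument.
    """
    results = []
    for call, step_max in steps:
        results.append(call(timeout=deadline.timeout_for(step_max)))
    return results
```

With a six second budget, a Memcached call that used its full two seconds leaves MySQL with four seconds rather than its configured five, and a request that is already out of time fails before the next backend is even contacted.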
Never, ever forget about overhead.
Every single call will have overhead. Again this seems simple, but this overhead can drastically impact your service. Say for example your backend is running Java. It can suddenly get into a long GC pause which delays the processing of the next query in the queue. If this pause takes five seconds then you may be dealing with a query that is already five seconds old before the first line of request handling starts.
This was the direct root of the issue that I mentioned earlier. One service would connect and send the whole request, including headers, which would be buffered by the kernel on the receiving machine. After a few seconds it would close the connection and serve an error. The backend process would read the headers, process the full request, and then try to write a reply only to find out that the connection was closed already. The EOF was not seen until the backend went to read the next request. When we started debugging we saw requests that started processing in the backend upwards of 40 seconds after the connection had been closed.
Try using deadline timers.
One way to handle timeouts better is to ensure that deadlines are passed from the caller to the called. If your app is calling a backend and knows that it needs to finish in 3 seconds, it can ideally pass that information along with the query (either via a header, or a query parameter) so that the receiving service knows how long it can wait for calls that it initiates to finish. This works best in a cluster where clocks are synced so you can send the absolute time at which the query will time out, but if that is not available you can still send a relative timeout with the hope that overhead will be small enough to fit in the window. The first is clearly more precise, but even relative deadlines are better than nothing.
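Here is a rough sketch of both flavors. The header names X-Request-Deadline and X-Request-Timeout-Ms are made up for this example (they are not a standard), as are the two function names:

```python
import time

# Hypothetical header names, chosen only for this illustration.
ABSOLUTE_HEADER = "X-Request-Deadline"    # epoch seconds, needs synced clocks
RELATIVE_HEADER = "X-Request-Timeout-Ms"  # milliseconds left at send time


def outgoing_headers(remaining_seconds, clocks_synced, now=time.time):
    """Caller side: tell the callee how long it has."""
    if clocks_synced:
        # Absolute deadline: time spent in transit or in queues counts
        # against the callee automatically.
        return {ABSOLUTE_HEADER: f"{now() + remaining_seconds:.3f}"}
    # Relative timeout: less precise, because overhead before the
    # callee reads the header is invisible to it.
    return {RELATIVE_HEADER: str(int(remaining_seconds * 1000))}


def remaining_from_headers(headers, now=time.time):
    """Callee side: seconds left, or None if the caller sent nothing."""
    if ABSOLUTE_HEADER in headers:
        return float(headers[ABSOLUTE_HEADER]) - now()
    if RELATIVE_HEADER in headers:
        return int(headers[RELATIVE_HEADER]) / 1000.0
    return None
```

The callee can then use the returned value as the budget for every call it makes, so the deadline keeps shrinking as it travels down the stack.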
In the example above the solution was fairly simple. We added a request header that included the time that the request started, and then in the backend we worked out the time at which our web server would time the request out. If we were within 500ms of that time we served a cheap and quick error.
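In sketch form, that check could look something like this. The header name X-Request-Start, the five second web server timeout, and the 503 status are all assumptions for the sake of the example; only the 500ms margin comes from the fix described above:

```python
import time

START_HEADER = "X-Request-Start"  # hypothetical name: epoch seconds
FRONTEND_TIMEOUT = 5.0            # assumed web server timeout, in seconds
SAFETY_MARGIN = 0.5               # the 500ms margin


def should_fail_fast(headers, now=time.time):
    """True if the web server has given up, or is about to."""
    started = headers.get(START_HEADER)
    if started is None:
        return False  # no start time sent, so process as usual
    times_out_at = float(started) + FRONTEND_TIMEOUT
    return now() >= times_out_at - SAFETY_MARGIN


def handle(headers, do_work, now=time.time):
    """Serve a cheap error instead of doing work nobody will read."""
    if should_fail_fast(headers, now):
        return 503, "Request expired before processing started"
    return 200, do_work()
```

The important part is that the check happens before any real work starts, so a request that sat in a queue for 40 seconds costs almost nothing to reject.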
Improve your error handling with proper timeouts!
Controlling timeouts means that you maintain control over your error messages. Leaving timeouts to your load balancer means that you get useless error messages that most users don’t understand. Handling all errors generically in the frontend can leave users frustrated at the lack of information. If you don’t let processing run away from you then you can craft a meaningful error message for the user right where things have gone wrong, rather than relying on generic messages.
Once you start paying attention to this philosophy you will see all the places where violating it goes wrong. In some cases software (or vendors) will actually request that you do the exact opposite. Recently the Amazon AWS support staff were insisting that the timeout on services behind ELB needs to be larger than on the ELB. If you do this then your ELB will respond with a blank page while your web server continues to process the request. Avoid this by responding with your own error messages.