A tale of unexpected ELB behavior.
[This page moved to the sreally blog: https://sreally.com/a-tale-of-unexpected-elb-behavior/]
AWS is an awesome tool that provides much of the infrastructure you need to run a large-scale service. ELB specifically lets you set up a load balancer very quickly that does a lot of heavy lifting for you. From terminating SSL to managing failover, it does a fairly good job of keeping a site running.
We started seeing random 504 errors being served via the ELB. We don’t use 504 as an error code in our application, so this had to be coming from the load balancers. Typically a 504, or “Gateway Timeout”, means that your backends are overloaded (or that they have all failed health checks), except in this case our backends were completely healthy and had each served hundreds of requests in the 2–3 second windows around the 504 errors.
Even worse, the ELB logs omitted all of the useful information from the log line:
2015-09-07T11:24:40.052149Z elb1 111.222.33.44:47449 - -1 -1 -1 504 0 99 0 "GET /" ECDHE-RSA-AES128-SHA TLSv1
Note that the log line is missing the backend that the ELB talked to (replaced by “-”), and all of the request timing fields are -1, which means we had no idea how long the request took. Even though we log every single request attempted, regardless of outcome, we couldn’t find any correlated log lines in our own logs during the request window.
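If you want to pull these out of your own access logs, here is a minimal sketch (not the tooling we used) of a filter for this failure signature. It assumes the field layout of the sample line above: the backend address is the fourth field, the three timing fields follow it, and the ELB status code comes next; the file name is hypothetical.

import shlex

def is_orphaned_504(line):
    # shlex keeps the quoted request field ("GET /") together as one token
    fields = shlex.split(line)
    backend = fields[3]        # "-" when the ELB never recorded a backend
    timings = fields[4:7]      # request/backend/response processing times
    elb_status = fields[7]
    return elb_status == "504" and backend == "-" and all(t == "-1" for t in timings)

with open("elb-access.log") as log:   # hypothetical log file name
    for line in log:
        if is_orphaned_504(line):
            print(line.rstrip())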
Amazon support kept pointing us at timeout settings, but no matter what they suggested, it seemed impossible to recreate that specific type of 504 with that log line. The suggestion from AWS was to make sure that the timeout on the backend was set higher than the timeout on the ELB. To me this seemed problematic, as it violates the Tower of Hanoi Timeout model: each layer deeper in the stack should have a shorter timeout than the layer above it, so the deepest layer gives up first and the error has room to propagate back up cleanly. I wanted a reproducible failure case so I could determine exactly what was going wrong.
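To make that concrete, here is a toy sketch of the Tower of Hanoi idea; the layer names and the half-second margin are made up for illustration. Each layer hands the layer below it a strictly smaller deadline, so the deepest layer times out first and every caller above it still has budget left to handle the failure.

MARGIN = 0.5  # seconds each layer keeps back for its own error handling

def shrinking_timeouts(total, layers):
    # Hand each successive layer a strictly smaller deadline than its caller.
    remaining = total
    for layer in layers:
        yield layer, remaining
        remaining -= MARGIN

for layer, timeout in shrinking_timeouts(10.0, ["client", "ELB", "app server", "database"]):
    print(f"{layer:>10}: {timeout:.1f}s")

AWS’s suggestion inverts that ordering: the backend, the deepest layer in this picture, ends up with the longest timeout.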
While debugging, a coworker pointed me at an article by Colin Johnson about ELB pre-opening of connections. Basically, he points out an undocumented feature of the ELB where it opens a connection to a backend early and then does nothing with it until a client makes a request, at which point the ELB sends the request over the pre-established connection. This model is nice in that it reduces the time required for new requests by already having a connection open.
The problem is that some HTTP frameworks will close a new, never-used connection using a totally different timeout model, or will close the unused connection when restarting, etc. That would be fine on its own, but it introduces a race condition where the ELB thinks the connection is open while the backend has already closed it. When a request finally arrives, the ELB attempts to write to the dead connection, the backend rejects the write, and the ELB immediately serves a 504 with the log line above rather than retrying against a different backend.
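You can reproduce the race outside of AWS entirely. Here is a toy sketch, with made-up ports and timeouts, where one socket plays the pre-opening ELB and a small server plays a backend that closes idle connections; the request written to the stale connection never gets a response.

import socket
import threading
import time

BACKEND_IDLE_TIMEOUT = 1.0   # the backend gives up on silent connections after 1s

def backend():
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 8081))
    srv.listen(1)
    conn, _ = srv.accept()
    conn.settimeout(BACKEND_IDLE_TIMEOUT)
    try:
        conn.recv(4096)              # nothing ever arrives on the pre-opened connection
    except socket.timeout:
        conn.close()                 # the backend closes what it sees as a dead connection
    srv.close()

threading.Thread(target=backend, daemon=True).start()
time.sleep(0.2)

# The "ELB" side: pre-open the connection, then sit on it past the backend's idle
# timeout before a client request finally shows up.
elb = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
elb.connect(("127.0.0.1", 8081))
time.sleep(2.0)                      # the backend has already closed its end by now

try:
    elb.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
    reply = elb.recv(4096)           # typically raises ConnectionResetError, or returns b""
    print("reply:", reply or "connection closed with no response")
except OSError as exc:
    print("request on the pre-opened connection failed:", exc)

In our case the backend half of that race was Dropwizard closing connections that had never carried a request, while the ELB half kept believing they were usable.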
Our solution was to request that AWS disable pre-opening on our ELBs until we can ensure that Dropwizard uses a different timeout for the initial, unused connection than it does for reading and writing responses.