The Day A Bug Was Fixed Only Because The CEO Called In
Ok, maybe not really "fixed."
There was a hard-to-reproduce bug on the website of a big financial company. When some users tried to deposit money into their account with a credit card, there was a chance the system would duplicate the deposit and charge the card twice. The user would see two transactions of the same amount in their balance.
After the first user reported the bug, the development team concluded the best action was to do nothing. The bug was rare, and investigating it wasn’t worth the cost. The call center could quickly reverse the transaction every time the problem happened.
Then one day the CEO called in.
He was in a taxi, using his phone to deposit money through the website. Once he entered a tunnel, the phone lost signal. When he exited, there were two deposits of the same amount in his account.
The team felt compelled to look at it.
The logs showed two requests with the same unique ID. That ruled out a bug in the application code: for it to be one, the client would have had to generate the same ID for two separate requests, which is extremely unlikely.
The HTTP/1.1 spec of the time (RFC 2616) stated that if the client saw the connection close after sending a request, it should retry the request. The astonishing thing is that it didn't restrict this behavior to idempotent requests such as PUT; it also allowed the retry for POST requests, which can have side effects.
The specification said:
If an HTTP/1.1 client sends a request which includes a request body […] and if the client sees the connection close before receiving any status from the server, the client SHOULD retry the request. […]
The bug happened around 2014 and seemed to be caused by this automatic retry of HTTP requests. The unexpected behavior also caught the attention of other developers. Fifteen years after the first RFC, another one (RFC 7230) came out as an update.
The new RFC said:
[…] The requirement to retry requests under certain circumstances when the server prematurely closes the connection has been removed. […]
To solve the issue, the team came up with a server-side middleware that read the request ID and stored it in the database. If the server received another request with the same identifier, it treated the request as a duplicate and ignored it.
The code was something like this:
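The snippet below is a reconstruction, not the team's actual code. It is a minimal sketch that assumes a Connect/Express-style `(req, res, next)` middleware signature, a request ID sent in a hypothetical `x-request-id` header, and an in-memory `Set` standing in for the database. The names `RequestIdStore` and `createDedupeMiddleware`, and the choice to answer duplicates with a 409, are illustrative assumptions.

```javascript
// Hypothetical sketch: all names here are illustrative assumptions.

// In-memory stand-in for the database table of processed request IDs.
class RequestIdStore {
  constructor() {
    this.seen = new Set();
  }

  // Returns true if the ID was newly recorded, false if it was seen before.
  // A real database would need this check-and-insert to be atomic
  // (for example through a unique constraint on the ID column).
  claim(id) {
    if (this.seen.has(id)) return false;
    this.seen.add(id);
    return true;
  }
}

// Builds a middleware with the Connect/Express-style (req, res, next) shape.
function createDedupeMiddleware(store) {
  return function dedupe(req, res, next) {
    const id = req.headers["x-request-id"]; // assumed header name
    if (!id) return next(); // no ID, nothing to deduplicate on
    if (store.claim(id)) return next(); // first time: process normally
    // Duplicate: skip the handler so the deposit is not applied twice.
    res.statusCode = 409;
    res.end("Duplicate request ignored");
  };
}

// Example: the same ID sent twice reaches the handler only once.
const demoDedupe = createDedupeMiddleware(new RequestIdStore());
let demoDeposits = 0;
for (let attempt = 0; attempt < 2; attempt++) {
  const fakeRes = { statusCode: 200, end() {} };
  const fakeReq = { headers: { "x-request-id": "deposit-123" } };
  demoDedupe(fakeReq, fakeRes, () => { demoDeposits++; });
}
console.log(demoDeposits); // 1
```

With a real database, the lookup and the insert would have to happen as one atomic step. Otherwise, two copies of the request arriving at nearly the same time could both pass the check.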
Interestingly, today if you send a POST request to a Node.js server from Chrome, Firefox, Edge, or Opera, and it stays pending for more than 120 seconds, the browser sends that same request again and then fails with an error message.
In this case, the retry happens because the default Node.js HTTP server timeout is 120 seconds (newer releases, from Node.js 13 on, disable this timeout by default). After that time, Node.js automatically destroys the socket bound to the request.
Safari and IE don’t retry the request. They fail with an error message instead.
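To observe this yourself, a server that accepts the request and never answers is enough. The following is a minimal sketch using Node.js's built-in `http` module. The helper name `createStalledServer`, port 3000, and the `--listen` flag are made up for illustration. The timeout is set explicitly instead of relying on the default, since the default differs across Node.js versions.

```javascript
const http = require("node:http");

// Hypothetical helper for reproducing the behavior; not from the original team.
function createStalledServer(timeoutMs = 120000) {
  const seen = []; // one entry per request that reached the server

  const server = http.createServer((req, res) => {
    seen.push({ method: req.method, url: req.url, at: Date.now() });
    console.log(`request #${seen.length}: ${req.method} ${req.url}`);
    // Deliberately never call res.end(): the request stays pending
    // until the socket timeout below fires.
  });

  // With no 'timeout' listener registered on the server, Node.js destroys
  // a socket automatically once it has been idle for this long.
  server.setTimeout(timeoutMs);

  return { server, seen };
}

// Only start listening when asked to, e.g. `node stalled.js --listen`.
if (process.argv.includes("--listen")) {
  createStalledServer().server.listen(3000, () => {
    console.log("POST to http://localhost:3000 and watch for a second request");
  });
}
```

If the browser retries, the same POST is logged twice, roughly one timeout apart.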
Maybe that's what happened with the CEO.
He used to travel a lot through the Harbour Tunnel, the busiest tunnel connecting the Central Business District and North Sydney. One day he initiated a deposit before entering the tunnel, the request stalled on the weak signal for more than 120 seconds, and the browser retried it. The server managed to process both requests, creating the duplicated entries.
Despite what happened, this bug was never a priority. Sometimes there are more important things worth your time than trying to find a bug that can only result in an inconvenience to a handful of people. However, because the CEO coincidentally stumbled upon it, the team felt compelled to take a look.
It’s interesting to see in practice how the behavior of a widely used protocol can be so astonishing. A developer expects the request to fail if the client doesn't receive a meaningful response from the server, not to be sent again. It makes me wonder how many bugs are out there that nobody cares about, just because very few people are affected and they are too hard to reproduce.
Sometimes, it's not worth fixing a bug that is hard to reproduce and can result only in an inconvenience to a handful of people. How many bugs like this could be out there?
If the CEO hadn’t called in, the team would never have discovered this weird browser behavior, and the bug would have fallen into oblivion.
One year after the original bug, the team received two new reports of a similar problem with deposits. This time it couldn’t be the HTTP retry; it had to be something else.
However, the team concluded once more that investigating was not worth the cost. To this date, nobody has ever taken a look at it.
Until, of course, another executive calls in to complain about it.