By Charles Li
One of the frequently asked questions from new site reliability engineers is: Where to begin when troubleshooting a problem in a cloud environment? I always tell them: You should begin with understanding the problem. Let me demonstrate the reasons and methods with a real troubleshooting case.
There are many applications behind www.ebay.com. Each application serves a unique subset of the URLs, such as /a/* and /b/* in the following example. The HTTP requests are distributed to the applications by layer 7 policies on the load balancer:
Policy A: if request = http://www.ebay.com/a/*, then send traffic to application-a
Policy B: if request = http://www.ebay.com/b/*, then send traffic to application-b
Policy C: if request = http://www.ebay.com/c/*, then send traffic to application-c
Default Policy: if request = http://www.ebay.com/any-other-url, then send traffic to application-x (the default application)
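The policy list above amounts to a prefix match on the request path with a default fallback. Here is a minimal sketch of that idea in Python, not the load balancer's actual configuration or implementation; the `POLICIES` table and the `route` function are hypothetical names introduced only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical policy table mirroring Policies A, B, and C above.
POLICIES = [
    ("/a/", "application-a"),
    ("/b/", "application-b"),
    ("/c/", "application-c"),
]
DEFAULT_APPLICATION = "application-x"  # the default application


def route(url: str) -> str:
    """Return the application that a layer 7 policy would pick for this URL."""
    path = urlsplit(url).path
    for prefix, application in POLICIES:
        if path.startswith(prefix):
            return application
    # No policy matched, so the default policy applies.
    return DEFAULT_APPLICATION


print(route("http://www.ebay.com/b/item"))
print(route("http://www.ebay.com/index.html"))
```

The key point is that the decision is made per HTTP request, not per TCP connection, so the load balancer has to find the start of every request on a reused connection.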
A client establishes a TCP connection with the virtual server on the load balancer and sends multiple HTTP transactions sequentially over that same connection. The flow is illustrated below:
Everything was fine until one day a developer reported a problem:
“Hey, I’m the owner of application-b. I deployed new code earlier today and have been monitoring the logs since then. The new code is serving the /b/* URLs without any problem. However, I noticed that my application is randomly getting other URLs such as /a/*, /c/*, and /index.html, which shouldn’t be sent to my application at all. It appears the layer 7 policies are not configured properly on the load balancer.
“Furthermore, if my application is getting other applications’ URLs, I’d assume some of my /b/* URLs could be mis-distributed to the other applications as well? If so, it might be impacting multiple data flows.”
The Site Reliability Engineering (SRE) team took the following steps to triage this issue:
Step 1: Verify whether the alarm is true. The team checked the log of application-b and did see it randomly getting the other applications’ URLs.
Step 2: Scope the client side of the problem. Are the misrouted URLs coming from the load balancer, or from other clients connecting directly to application-b and bypassing the load balancer? The team checked the log of application-b for the source IP addresses of the misrouted requests and found that all of them belonged to the load balancer. This confirmed that the misrouted URLs were indeed coming from the load balancer. The scope was narrowed down to the connections from the load balancer, and other sources were ruled out.
Step 3: Scope the server side of the problem. Is application-b the only one getting wrong URLs, or are the other applications getting them as well? The team checked the logs of the other applications behind www.ebay.com and confirmed that only application-b was getting wrong URLs, so the scope was further narrowed down to the connections between the load balancer and application-b.
Step 4: Scope the timing of the problem. The team checked the log of application-b to see when it started getting the wrong URLs. It turned out that the wrong URLs started arriving only after the new code was deployed.
With this systematic approach, the scope was narrowed down to the connection between the load balancer and application-b. The problem statement became: After deployment of new code, application-b started to receive wrong URLs from the load balancer.
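The log checks in steps 2 and 4 can be sketched as a small script. This is only an illustration under stated assumptions: the log format (`<ISO timestamp> <source IP> <path>`), the function name `scope_misrouted`, and the sample lines are all made up for this example and do not reflect the real application-b logs.

```python
from collections import Counter


def scope_misrouted(log_lines, expected_prefix="/b/"):
    """Return (source IP counts, earliest timestamp) for requests outside expected_prefix.

    Assumes each line looks like '<ISO timestamp> <source IP> <path>'.
    """
    sources = Counter()
    first_seen = None
    for line in log_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed format
        timestamp, source_ip, path = fields
        if path.startswith(expected_prefix):
            continue  # a URL that application-b is supposed to serve
        sources[source_ip] += 1
        # ISO timestamps in the same format sort correctly as strings.
        if first_seen is None or timestamp < first_seen:
            first_seen = timestamp
    return sources, first_seen


# Made-up sample lines; 192.0.2.10 is a documentation-only address.
sample = [
    "2000-01-01T10:00:00Z 192.0.2.10 /b/item",
    "2000-01-01T10:05:00Z 192.0.2.10 /a/item",
    "2000-01-01T10:06:00Z 192.0.2.10 /index.html",
]
sources, first_seen = scope_misrouted(sample)
print(sources, first_seen)
```

Comparing the resulting source addresses with the load balancer's addresses answers the step 2 question, and comparing the earliest timestamp with the deployment time answers the step 4 question.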
“Wait,” you may ask. “The flow is from client to load balancer to application. How could a downstream application attract wrong URLs from the upstream load balancer? It’s like a flood caused a hurricane, which is ridiculous, isn’t it?”
Yes, normally a downstream application couldn’t impact the decision on the upstream load balancer, just like it’s impossible for a flood to cause a hurricane. The result of the previous investigation appeared to contradict common sense.
What should the team do in this case? Well, remember during troubleshooting, the rule of thumb is this:
a) If you can’t find where the problem is, dig wider.
b) If you find something that can barely be explained, dig deeper.
In this case, the team needed to dig deeper by collecting first-hand data. They took a tcpdump of the HTTP transactions between the load balancer and application-b. Examining the raw data in Wireshark, they observed a clear pattern:
- Initially, application-b was getting the right URLs on a TCP connection.
- For certain HTTP requests, application-b sent an HTTP 302 redirect back to the client, which was by design.
- Whenever application-b sent back an HTTP 302 redirect, it began to receive wrong URLs.
It appeared that the HTTP 302 redirects triggered the problem. But how? A closer look at the raw data in the tcpdump showed that the HTTP 302 redirects were not terminated properly. Bingo, the root cause of the problem was found.
Now let’s take a look at the deciphered version of the story.
According to RFC 7230, the header section of an HTTP response must end with an empty line, that is, a line that contains nothing but CRLF (0D 0A in hexadecimal).
In the HTTP 302 redirect header generated by application-b, however, that CRLF was missing.
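To make the difference concrete, here is a minimal sketch that checks whether a raw response contains the empty line that closes the header section. The function name and the two sample responses, including the `Location` value, are assumptions for illustration and are not taken from the actual capture.

```python
CRLF = b"\r\n"  # bytes 0D 0A


def header_section_terminated(raw: bytes) -> bool:
    """True if raw contains the empty line (CRLF CRLF) that ends the header section."""
    return (CRLF + CRLF) in raw


# Hypothetical well-formed redirect: last header line, then an empty line.
good_redirect = (
    b"HTTP/1.1 302 Found\r\n"
    b"Location: http://www.ebay.com/b/new\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)

# Hypothetical broken redirect: the final empty line is missing.
bad_redirect = (
    b"HTTP/1.1 302 Found\r\n"
    b"Location: http://www.ebay.com/b/new\r\n"
    b"Content-Length: 0\r\n"
)

print(header_section_terminated(good_redirect))
print(header_section_terminated(bad_redirect))
```

A check like this could run in a unit test against the bytes an application writes to the socket, which is one way to catch this class of bug before it reaches production.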
Why does the missing CRLF cause URLs such as /a/* or /c/* to be routed to application-b? For the layer 7 policies to work properly, the load balancer must keep track of the HTTP headers in the requests and responses. The missing CRLF made the load balancer think the previous HTTP response wasn’t over, so it lost track of the HTTP headers in the subsequent requests and responses. It treated all subsequent requests as part of the previous HTTP transaction and routed them to application-b without applying any layer 7 policies, as illustrated below:
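A toy model can make this behavior concrete. The sketch below is not how any real load balancer is implemented; the `ToyLoadBalancer` class, its method names, and the sample redirect are hypothetical and only mimic the behavior described above for a single client connection.

```python
POLICIES = [("/a/", "application-a"), ("/b/", "application-b"), ("/c/", "application-c")]
DEFAULT_APPLICATION = "application-x"


def apply_policies(path: str) -> str:
    """Pick an application by path prefix, falling back to the default."""
    for prefix, application in POLICIES:
        if path.startswith(prefix):
            return application
    return DEFAULT_APPLICATION


class ToyLoadBalancer:
    """Hypothetical model of one reused client connection."""

    def __init__(self):
        # Set while the last response header looks unfinished.
        self.pinned_backend = None

    def on_request(self, path: str) -> str:
        if self.pinned_backend is not None:
            # The previous response never ended, so this request is treated
            # as part of that transaction and the policies are skipped.
            return self.pinned_backend
        return apply_policies(path)

    def on_response(self, backend: str, raw_header: bytes) -> None:
        if b"\r\n\r\n" in raw_header:
            self.pinned_backend = None  # header section ended properly
        else:
            self.pinned_backend = backend  # still waiting for the empty line


lb = ToyLoadBalancer()
first = lb.on_request("/b/item")
# A redirect whose header section lacks the final empty line.
lb.on_response(first, b"HTTP/1.1 302 Found\r\nLocation: /b/new\r\n")
second = lb.on_request("/a/item")
third = lb.on_request("/index.html")
print(first, second, third)
```

In this model the wrong URLs only ever reach the application that sent the malformed response, which matches what the team saw in step 3: only application-b was affected.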
What is the next step? “Fix the code to end the HTTP 302 with CRLF,” someone might say. In reality, however, the top priority is to mitigate the impact first. The team rolled application-b back to the previous known-good version, as this could be done much faster than fixing the code.
In parallel, the owner of application-b fixed the code, went through the QA process, and redeployed the code to production.
To wrap up the story, let’s look at the takeaways. In summary:
1. Troubleshooting begins with verifying, confirming, and reproducing the issue.
2. A systematic approach is the key to quickly converging on the scope of the problem.
3. If you can’t find where the problem is, dig wider. If you find something strange that can’t be easily explained, dig deeper.
4. First-hand raw data is essential to determining the root cause.
5. Protocol, protocol, protocol. It must be implemented and tested thoroughly. An incomplete implementation could result in unexpected problems.
6. In a production environment, the top priority of troubleshooting is to mitigate the impact rather than to fix the problem.
Well begun is half done.
Originally published at www.ebayinc.com on March 14, 2019.