By Charles Li
One of the frequently asked questions from new site reliability engineers is: Where to begin when troubleshooting a problem in a cloud environment? I always tell them: You should begin with understanding the problem. Let me demonstrate the reasons and methods with a real troubleshooting case.
There are many applications behind www.ebay.com. Each application serves a unique subset of the URLs, such as /a/* and /b/* in following example. The HTTP requests are distributed to the applications by layer 7 policies on the load balancer:
Policy A: if request = http://www.ebay.com/a/*, then send traffic to application-a