SRE: Debugging: Strategies: Triangulation
Triangulation is a debugging strategy that uses multiple perspectives to achieve both a “one-to-many” and a “client-server” partition of the debug space. It can help to quickly pinpoint the source of an issue and to identify which part of a logical transaction an error originates from. This post explains what Triangulation is and how to use it when debugging a real-world problem.
What is Triangulation?
Triangulation is the practice of consulting multiple perspectives to better understand what is true, that is, what is actually happening.
Triangulation originates from the physical world of surveyors, maritime navigators and military strategists. The process is used to determine the location of an unknown point by forming triangles to it from known points.
In software debugging, Triangulation means using the perspective of multiple clients to isolate on which side of a connection (client vs server) a problem is occurring, or among which subsets of clients. Consider a company that has a SaaS offering with a single endpoint and user tier (c’mon just pretend). A client of that service is complaining about connectivity.
Where would you start debugging with only this information? Looking at service errors for that client’s requests might be the quickest place to start. A lack of errors might indicate that the issue is on the client side. Another option, which may have a longer feedback loop, would be to work with the client to see how they are making requests and which errors they are experiencing. A single source of information leaves it ambiguous which side of the connection the errors are originating from. Next, consider that another client, from a different location, begins complaining about issues.
Finally, how much stronger an indicator is it if 3 clients are consulted and each one reports issues?
Each client (perspective) that is consulted helps to reinforce the assumption that the issue is originating from the Company Service: 3 separate clients from separate locations are reporting issues with the service. Inversely, the signal is just as strong in the absence of errors, when only Client 1 reports a problem and every other client is healthy.
In that case it is very likely that the issue is on Client 1’s end, since no other clients are experiencing issues. In the examples above, triangulation happens from the debugger’s perspective. When debugging, the unknown point is the state of the system and the origin of the issue, and the debugger uses known entities, i.e. the edges between pairs of entities, to better understand that state (more on this later).
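The reasoning above can be sketched in a few lines of Python. This is only an illustration, not a production tool: the `ClientReport` type, the `likely_fault_side` function, and the client and location names are all hypothetical, and the rule itself (all clients failing points at the server, a subset failing points at those clients) is a simplification of the judgment a debugger actually applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientReport:
    client: str    # hypothetical client name
    location: str  # hypothetical network location
    failing: bool  # True if this client is seeing errors


def likely_fault_side(reports: list[ClientReport]) -> str:
    """Guess which side of the connection a problem is on."""
    if not reports:
        return "unknown"
    failing = [r for r in reports if r.failing]
    if not failing:
        return "no issue observed"
    if len(reports) == 1:
        # A single perspective cannot separate client from server.
        return "ambiguous"
    if len(failing) == len(reports):
        # Every perspective agrees, so suspect the shared component.
        return "server"
    # Only a subset is failing, so suspect those clients.
    return "client: " + ", ".join(r.client for r in failing)


reports = [
    ClientReport("client-1", "us-east", True),
    ClientReport("client-2", "eu-west", False),
    ClientReport("client-3", "ap-south", False),
]
print(likely_fault_side(reports))  # client: client-1
```

The `location` field is not used by the rule itself; it is kept so that a reader can check that the perspectives really do come from different places.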
Example Scenario
To illustrate how to apply triangulation in the real world, we’ll use a non-computer scenario, which also shows the universality of the approach. Triangulation is also one of Ray Dalio’s core Principles for problem solving and decision making; it is useful in all debugging and decision-making scenarios, not just software!
Pretend that you’re hooking up a smartphone to an external speaker through a USB cable.
The transaction flows in one direction (a DAG): from the phone, through the USB cable, to the speaker. After hooking up the phone, no sound is coming out of the speaker, but just yesterday all 3 components were working just fine. Before we give up, throw everything away and buy 3 new components, let’s see if we can reduce the cost by employing triangulation. Triangulation works by selecting one component (listed before the colon below), swapping it out, and keeping the other two (in parentheses) fixed, so that each swap gives us a new perspective.
- Phone: (USB CABLE, EXTERNAL SPEAKER)
- Cable: (PHONE, EXTERNAL SPEAKER)
- Speaker: (PHONE, USB CABLE)
With physical components, the chances of getting 2 faulty components are sufficiently small so we will only triangulate against 2 perspectives.
We’ll use the “cheapest component first” heuristic and start with the USB Cable:
Cable: (PHONE, EXTERNAL SPEAKER):
At this point the best-case scenario is that the phone now plays and we’re only out a couple of dollars for the broken USB CABLE. If there is still no sound, our first round of triangulation has given us a reasonable amount of confidence that our USB CABLES are working fine (and we can gain further confidence that they are functioning by plugging the (PHONE, USB CABLE) pair into a different device, e.g. a charger or a computer).
Speaker: (PHONE, USB CABLE):
If this works then we can be reasonably sure that the speaker was the problem. If there is still no music playing, the final step would be to triangulate on the phone:
Phone: (USB CABLE, EXTERNAL SPEAKER)
NOTE: In my experience it is extremely likely that the source of an issue is a single component. If multiple components have failed, or there are 2 faulty sources, the above debugging methodology would not be sufficient and would require more permutations.
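The swap-one-component-at-a-time procedure can be sketched as a small Python function. Everything here is an assumption made for illustration: the `find_faulty_component` name, the component labels, and the `works` callback, which stands in for physically plugging things together and listening for sound. The sketch also assumes a single fault and known-good spares, as in the NOTE above.

```python
from typing import Callable, Optional

# Maps a role ("cable", "speaker", "phone") to the physical unit in use.
Setup = dict[str, str]


def find_faulty_component(
    current: Setup,
    spares: Setup,
    works: Callable[[Setup], bool],
    swap_order: list[str],
) -> Optional[str]:
    """Swap one component at a time for a spare, keeping the others fixed.

    Returns the role whose replacement makes the chain work, or None if
    no single swap fixes it (for example, when two components are faulty).
    """
    if works(current):
        return None  # nothing is broken
    for role in swap_order:
        trial = {**current, role: spares[role]}  # change exactly one part
        if works(trial):
            return role
    return None


# Simulated scenario: the old speaker is the (hypothetical) broken part.
current = {"cable": "cable-old", "speaker": "speaker-old", "phone": "phone-old"}
spares = {"cable": "cable-new", "speaker": "speaker-new", "phone": "phone-new"}
broken = {"speaker-old"}


def works(setup: Setup) -> bool:
    return not (broken & set(setup.values()))


# "Cheapest component first": cable, then speaker, then phone.
print(find_faulty_component(current, spares, works, ["cable", "speaker", "phone"]))  # speaker
```

Passing the swap order as a parameter keeps the “cheapest component first” heuristic explicit and easy to change.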
Heuristics
Number of Sources
For software, I’ve found that triangulating with 3 sources, from at least 2 network locations, is a good number. If all 3 sources are in the same network and a network partition occurs, all 3 could experience issues, which would wrongly suggest a server problem. 1 source is too biased and 2 sources don’t allow for partitions, but 3 sources reporting a problem is a good indicator that there is a real issue.
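As a rough sketch, the heuristic can be written as a check on the set of sources before trusting their verdict. The function name `enough_perspectives`, its default thresholds (taken from the numbers above), and the source and network names are hypothetical.

```python
def enough_perspectives(
    source_networks: dict[str, str],
    min_sources: int = 3,
    min_networks: int = 2,
) -> bool:
    """Check that the sources are numerous and spread out enough.

    source_networks maps a source name to its network location.
    """
    distinct_networks = set(source_networks.values())
    return (
        len(source_networks) >= min_sources
        and len(distinct_networks) >= min_networks
    )


# Three sources, but all in one network: a partition could fool all of them.
print(enough_perspectives({"a": "office", "b": "office", "c": "office"}))  # False
# Three sources across two networks satisfies the heuristic.
print(enough_perspectives({"a": "office", "b": "office", "c": "cloud"}))  # True
```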
Conclusion
Triangulation is a common debugging strategy. If you’ve ever swapped out a cable when a phone or computer hasn’t worked, then you’ve triangulated! Triangulation is powerful enough to be useful in many different situations. I have found it to be the first debugging strategy I reach for when there is an SRE problem. Whenever a service or client is reporting a problem, I begin by asking: “are other services/clients experiencing this problem as well?”