Distributed tracing tracks actions (transactions) as they travel through a system and across multiple subsystems. It provides high-cardinality observability, which allows for performance tuning and failure analysis. Because it dynamically generates real-time data, distributed tracing can have a profound impact on engineering organizations. This impact comes from the centralization and democratization of information: tracing provides real-time, living documentation and architectural knowledge. Distributed tracing provides an information layer that removes silos between teams and makes it possible to drill into an individual client’s experience. This post looks at a general journey through onboarding, developing and operating a service, and at how distributed tracing positively affects each of these stages. It assumes familiarity with the distributed tracing (OpenTracing) data model.
Distributed tracing benefits onboarding by dynamically generating a real time view of the software system. For many organizations the source of system truth is the source code. Tracing provides a centralized view into the system as it is functioning, including all major services, components, protocols and requests.
Engineer onboarding often involves learning one or more services in depth. This means learning where services sit in relation to each other and understanding specific service transactions. Hopefully there is some documentation available for the service topology and system architecture. Onboarding is focused on forming a mental model of the system, its transactions and its dependencies. Traditional documentation has severe limitations; namely, keeping it up to date is usually a manual process and therefore time consuming.
Static documentation is dead documentation because it is difficult to establish a feedback loop which enforces that it remains up to date. Because of this, documentation is often best effort and quickly drifts out of date. Tracing, on the other hand, is captured from actual running systems and is dynamically kept up to date:
The image above shows traditional documentation on the left. Documentation is a best effort point in time description of the system because it is missing a dynamic feedback loop. Tracing on the other hand (on the right) has this feedback loop ensuring that at any given moment it is a faithful representation of the system.
Since distributed tracing stores information as a graph (DAG) it becomes possible to dynamically generate service topologies:
The image above is from Trace by RisingStack but most tracing platforms offer this functionality. With a traditional non-tracing approach the source code would have to be analyzed, or an engineer familiar with the system would have to be consulted. Tracing is able to dynamically build a topology graph by inspecting transactions to see which services are involved in communication and which direction the dependencies go.
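To make the idea concrete, here is a minimal sketch of how a topology could be derived from span data. It assumes a simplified, hypothetical `Span` record with a parent reference and a service name; real tracing platforms have their own storage formats and query APIs, so treat the names below as illustrative only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not a real tracer's API)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span of a trace
    service: str
    operation: str


def build_topology(spans):
    """Return {caller_service: {callee_service: call_count}}."""
    by_id = {(s.trace_id, s.span_id): s for s in spans}
    edges = defaultdict(lambda: defaultdict(int))
    for span in spans:
        if span.parent_id is None:
            continue
        parent = by_id.get((span.trace_id, span.parent_id))
        # Skip orphaned spans and calls that stay inside one service.
        if parent is None or parent.service == span.service:
            continue
        edges[parent.service][span.service] += 1
    return {caller: dict(callees) for caller, callees in edges.items()}


# Made-up sample data for illustration.
SPANS = [
    Span("t1", "a", None, "gateway", "GET /checkout"),
    Span("t1", "b", "a", "checkout", "create_order"),
    Span("t1", "c", "b", "payments", "charge"),
    Span("t1", "d", "b", "checkout", "validate_cart"),
    Span("t2", "a", None, "gateway", "GET /checkout"),
    Span("t2", "b", "a", "checkout", "create_order"),
]

topology = build_topology(SPANS)
for caller, callees in topology.items():
    for callee, count in callees.items():
        print(f"{caller} -> {callee} ({count} calls)")
```

The direction of each edge falls out of the parent/child relationship between spans, which is why no source code analysis is needed.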
Another key part of engineer onboarding is learning the transactions and flows that support client requests. This involves understanding entry points, protocols, collaborators and upstream services. In traditional onboarding this is time intensive and requires developing a thorough understanding of the software in order to mitigate risk and understand the performance and infrastructure implications of changes. Once again this is a largely manual process based on documentation, test suites and hands-on learning.
Once again all of this information is available dynamically in a centralized place using tracing:
Without tracing, getting this information would require tribal knowledge, or potentially searching repos for where and when dependencies are called within the context of a transaction. The worst case (which I find to be the normal case) is to find this information out from the source of truth, by grepping the source code. With tracing, not only is all of this information available, but it’s also possible to narrow in on an individual client’s experience (using tracing tags), something that is much more difficult to develop an understanding of by analyzing source code and tests.
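As a rough illustration of narrowing in on one client, the sketch below finds the traces carrying a given tag and prints one of them as a call tree. The `Span` shape and the `client.id` tag name are assumptions made for this example, not part of any particular tracer.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    """Hypothetical, simplified span record with tags."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    tags: Dict[str, str] = field(default_factory=dict)


def traces_for_tag(spans, key, value):
    """Trace IDs in which at least one span carries the tag key=value."""
    return sorted({s.trace_id for s in spans if s.tags.get(key) == value})


def render_trace(spans, trace_id):
    """Indented call tree for one trace, children listed under their parent."""
    children = {}
    for s in spans:
        if s.trace_id == trace_id:
            children.setdefault(s.parent_id, []).append(s)
    lines = []

    def walk(parent_id, depth):
        for s in children.get(parent_id, []):
            lines.append("  " * depth + f"{s.service}: {s.operation}")
            walk(s.span_id, depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


# Made-up sample data; "client.id" is an assumed tag name.
SPANS = [
    Span("t1", "a", None, "gateway", "GET /checkout", {"client.id": "acme"}),
    Span("t1", "b", "a", "checkout", "create_order"),
    Span("t1", "c", "b", "payments", "charge"),
    Span("t2", "a", None, "gateway", "GET /search", {"client.id": "globex"}),
    Span("t2", "b", "a", "search", "query"),
]

for trace_id in traces_for_tag(SPANS, "client.id", "acme"):
    print("\n".join(render_trace(SPANS, trace_id)))
```

The entry point, collaborators and call order for that client’s request are all readable from the output without opening a repo.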
The largest benefits to development are derived from tracing’s ability to centralize information:
The image above shows a non-tracing environment on the left: each software service stands alone, and it’s up to engineers to form mental models of how they interact (and potentially encode those models in tests). With distributed tracing, the real-life interactions are captured and recorded. This has a profound impact on developing mental models and inventorying the state of the system.
Without a true representation of the system, engineers are forced to establish their own mental models. Engineers with better mental models end up being disproportionately more effective at understanding the system than engineers with poorer mental models. This often manifests as a strict dependency on these experienced engineers during incidents or code reviews, resulting in centralized knowledge and long feedback loops (shown on the left of the image below):
Distributed tracing, on the other hand, provides an up-to-date, high-fidelity representation of the system. Information about the state of the system, its collaborators and its protocols is available to everyone. This accurate base model democratizes information and makes system understanding accessible beyond the most experienced engineers.
Have you ever had to update a shared dependency, or change an API interface, without fully understanding which services depended on it? Or had to audit a system to see which versions of a dependency services were using? Traditionally this information might be available from a metadata store, and if not, it would have to be extracted from source code. Distributed tracing makes this information accessible through tags. For example, consider the case where a team wants to audit which versions of a client library are being used:
If each service records the library version as a span tag, client.version would be available for audit. This benefit is not unique to tracing, but is a property of having information centralized and queryable.
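A minimal sketch of such an audit, assuming spans have been exported as plain dictionaries with a `service` name and a `tags` mapping (a made-up shape for this example; a real tracing backend would expose its own query interface):

```python
from collections import defaultdict

# Made-up exported spans; the dictionary layout is an assumption.
SPANS = [
    {"service": "checkout", "tags": {"client.version": "1.4.2"}},
    {"service": "checkout", "tags": {"client.version": "1.4.2"}},
    {"service": "search", "tags": {"client.version": "1.2.0"}},
    {"service": "search", "tags": {"client.version": "1.4.2"}},
    {"service": "gateway", "tags": {}},  # does not report the tag at all
]


def audit_tag(spans, tag):
    """Map each observed tag value to the sorted list of services reporting it."""
    services_by_value = defaultdict(set)
    for span in spans:
        value = span["tags"].get(tag)
        if value is not None:
            services_by_value[value].add(span["service"])
    return {value: sorted(names) for value, names in services_by_value.items()}


report = audit_tag(SPANS, "client.version")
for version, services in sorted(report.items()):
    print(version, services)
```

Note that the audit only covers services that actually emit the tag; in this sample the gateway would be invisible to it.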
The largest benefit to operations comes from tracing’s ability to provide context by centralizing information. Cindy Sridharan recently had a number of suggestions on how tracing can be leveraged to shorten feedback loops within the context of incident response.
In traditional (non-tracing) environments, long feedback loops exist at the team level (tribal knowledge) and software level (source of truth). Incidents may involve many engineers from many teams just to establish a base understanding of what’s going on.
Contrast this with tracing, which contains a cross-service and cross-team representation of the system. Any team is able to quickly gain a base context on all services without having to coordinate across teams. Distributed tracing provides a living system representation, instead of relying on engineers’ mental models or rooting through the source code. If the chart on the left above looks complicated, it’s because it is.
Think of the last time you were involved in a cross-team incident: what was involved in gaining a system understanding? Multiple service dashboards, multiple engineers from different services, everyone working together to develop a cross-service view of the system and get context around the issue? Distributed tracing provides a base view of the system that can drastically shorten these feedback loops and reduce the number of people involved in incidents. Tracing provides much more context, allowing an on-call responder to better understand the system and get farther on their own before involving more people.
Having centralized system knowledge gives tracing a lot of leverage to identify anomalies and dig into their root causes. One of my favorite features around this is Lightstep’s “correlations”. Correlations are an amazing step in providing a default hypothesis for incident response:
While this is still in its infancy, having operational data centralized opens up possibilities for anomaly detection, correlation, and other advanced automated analysis techniques that shorten debugging. Anomaly detection techniques have been offered by monitoring products like Datadog and New Relic, but they are often focused on single-dimension time series. Having a centralized store for all operational data provides access to much more context and significantly increases the impact of anomaly detection.
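To give a feel for what a tag correlation might look like, here is a deliberately naive sketch. It is not how Lightstep or any other vendor implements the feature; it simply compares how often each tag value appears on slow spans versus fast ones. The span layout, the tag names and the latency threshold are all assumptions for illustration.

```python
from collections import Counter


def correlate_tags(spans, threshold_ms):
    """Rank (tag, value) pairs by how over-represented they are in slow spans.

    Score = share of slow spans with the pair - share of fast spans with it,
    so 1.0 means "only on slow spans" and -1.0 means "only on fast spans".
    """
    slow = [s for s in spans if s["duration_ms"] >= threshold_ms]
    fast = [s for s in spans if s["duration_ms"] < threshold_ms]
    if not slow or not fast:
        return []  # nothing to compare against
    slow_counts = Counter(kv for s in slow for kv in s["tags"].items())
    fast_counts = Counter(kv for s in fast for kv in s["tags"].items())
    scores = []
    for kv in set(slow_counts) | set(fast_counts):
        score = slow_counts[kv] / len(slow) - fast_counts[kv] / len(fast)
        scores.append((kv, round(score, 2)))
    # Highest score first; ties broken by tag name for stable output.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Made-up sample data.
SPANS = [
    {"duration_ms": 40, "tags": {"region": "us-east", "client.version": "1.4.2"}},
    {"duration_ms": 45, "tags": {"region": "us-west", "client.version": "1.4.2"}},
    {"duration_ms": 50, "tags": {"region": "us-east", "client.version": "1.4.2"}},
    {"duration_ms": 60, "tags": {"region": "us-west", "client.version": "1.4.2"}},
    {"duration_ms": 900, "tags": {"region": "us-east", "client.version": "1.5.0"}},
    {"duration_ms": 950, "tags": {"region": "us-west", "client.version": "1.5.0"}},
]

ranked = correlate_tags(SPANS, threshold_ms=500)
for (tag, value), score in ranked:
    print(f"{tag}={value}: {score:+.2f}")
```

In this sample the region tags score zero while the newer client version scores highest, which is the kind of default hypothesis a responder could start from. A score like this shows association only, not cause.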
Distributed tracing provides up-to-date, dynamic “living documentation” which democratizes information within an organization. This provides significant benefits in terms of onboarding documentation, centralized information about a system, and context. Furthermore, modeling systems as graphs allows system structures to be represented accurately. This post explored the benefits of tracing for a software organization but largely ignored the costs (which are very real). In my single experience rolling out tracing, the costs were absolutely worth it for the reasons outlined above.