Signal, Noise, and a Service Diagram That Actually Answers Questions

LightstepHQ

February 12, 2019 | Karthik Kumar & Danton Rodriguez

Being on call can be frustrating. Getting paged for a service or operation you’ve never even heard of can be even worse. You don’t know where to start investigating, who to contact, or which services are even involved. Scouring your favorite metrics, recent logs, and starred dashboards can quickly become overwhelming and counterproductive. Even looking at a full system topology of your application doesn’t help in such situations: you need tools that allow you to focus your analysis on the subset of your architecture that’s relevant to the actual problem you’re trying to solve.

In a microservices environment, the situation described above is unfortunately all too common. A single component in your application can have many dependencies and investigating each one soon becomes intractable, especially while firefighting. The system architecture changes continuously, and no single engineer is able to maintain an accurate, up-to-date view of all the service interdependencies. Relying solely on intuition or giant, sprawling, per-service dashboards is costly and counterproductive during software outages.

Introducing: Service Diagram

Today, we’re excited to introduce LightStep’s Service Diagram, a new tool to expedite root-cause analysis and reduce mean time to resolution. Service Diagram provides a visual, interactive, and hierarchical representation of a system’s behavior that is filtered, constructed, and annotated to shed light on user-specified performance questions, all in real time.

Traditional full-system maps are a static, high-level view of sampled system state with limited interaction capabilities. They can be overwhelming and distracting, and they are usually irrelevant or even misleading during a root-cause investigation.

LightStep’s Service Diagram is uniquely designed to guide the user towards components undergoing a performance regression. A user can start an investigation by searching for a specific service, filter by a latency range, a high-cardinality tag, or an operation, and then view an interactive diagram that clearly explains the latency bottleneck. By starting with a symptom (“Why is my service slow?”) and proceeding to a visual explanation (“One of the descendant services is returning errors”), LightStep provides the context the user needs to gain insight during their investigation.

LightStep’s novel Satellite Architecture is what makes this all possible: the focused analysis is constructed just-in-time from thousands of distributed traces assembled in response to the user’s query. This allows LightStep to suppress the distractions from unrelated services and components, reducing the noise and further amplifying the signals generated from the actual issue under investigation.
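To make the idea of a query-driven diagram concrete, here is a minimal Python sketch. It is not LightStep's implementation: the Span record, the matches and service_edges helpers, and the auth, billing, and ledger service names are hypothetical, and the traces are assumed to be already assembled in memory. The point is only that traces which do not match the query never contribute an edge.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    # Hypothetical span record; real tracing formats carry more fields.
    span_id: str
    parent_id: Optional[str]
    service: str
    tags: frozenset = frozenset()


def matches(trace, service, tag=None):
    """True if some span in the trace belongs to `service` and, when a
    tag is given, that span also carries the (key, value) tag."""
    return any(
        s.service == service and (tag is None or tag in s.tags)
        for s in trace
    )


def service_edges(traces, service, tag=None):
    """Count caller -> callee service edges using only matching traces."""
    edges = Counter()
    for trace in traces:
        if not matches(trace, service, tag):
            continue  # unrelated traffic never reaches the diagram
        by_id = {s.span_id: s for s in trace}
        for s in trace:
            parent = by_id.get(s.parent_id)
            if parent is not None and parent.service != s.service:
                edges[(parent.service, s.service)] += 1
    return edges


customer = ("customer", "acme")  # hypothetical high-cardinality tag
traces = [
    [
        Span("a", None, "api-proxy"),
        Span("b", "a", "api-server", frozenset({customer})),
        Span("c", "b", "auth"),
    ],
    [
        Span("x", None, "billing"),
        Span("y", "x", "ledger"),
    ],
]

edges = service_edges(traces, "api-server", tag=customer)
for (caller, callee), count in sorted(edges.items()):
    print(f"{caller} -> {callee}: {count}")
```

In this toy data set the billing and ledger services never appear in the result, because their trace does not touch the queried service.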

Visualize dependencies between services

Example of Service Diagram’s hierarchical layout and filtering capabilities

Service Diagram intelligently organizes the services in your system to match the flow of requests. It annotates nodes with important data, highlighting services that contribute to the latency of a transaction or services experiencing errors. Service Diagram helps you easily visualize complex system architecture, identify troublesome services, and narrow the search space of possible root causes.

In the example above, the user has queried for a single service, a tag for a single customer, and a specific latency region. Service Diagram builds a diagram to answer this very explicit question about application performance. It also clearly distinguishes between api-proxy, which is sending traffic to api-server, and the few services that are receiving traffic from api-server.
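One way to picture the distinction between services that send traffic to the queried service and services that receive traffic from it is a breadth-first walk in each direction. The Python sketch below is an illustration under assumptions, not the product's layout algorithm: the layers function is hypothetical, the edge list is made up, and only api-proxy and api-server come from the example above.

```python
from collections import deque


def layers(edges, focus):
    """Assign each service a level relative to `focus`: negative levels
    are callers (upstream), positive levels are callees (downstream)."""
    callers, callees = {}, {}
    for src, dst in edges:
        callees.setdefault(src, set()).add(dst)
        callers.setdefault(dst, set()).add(src)

    def walk(neighbors, step):
        # Breadth-first search so each service gets its shortest distance.
        depth = {}
        seen = {focus}
        queue = deque([(focus, 0)])
        while queue:
            node, d = queue.popleft()
            for nxt in neighbors.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    depth[nxt] = d + step
                    queue.append((nxt, d + step))
        return depth

    level = {focus: 0}
    level.update(walk(callers, -1))
    # If a service is both upstream and downstream, this simple sketch
    # keeps its downstream level.
    level.update(walk(callees, +1))
    return level


# Hypothetical caller -> callee edges.
edges = [
    ("api-proxy", "api-server"),
    ("api-server", "auth"),
    ("api-server", "storage"),
    ("storage", "disk"),
]

level = layers(edges, "api-server")
for service, lvl in sorted(level.items(), key=lambda kv: (kv[1], kv[0])):
    print(lvl, service)
```

A renderer could then place level -1 to the left of the queried service and levels 1 and 2 to the right, which mirrors the flow of requests.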

Identify services contributing to latency

Example of Service Diagram identifying problematic services

Service Diagram’s distinct design surfaces performance bottlenecks in complex, multi-layered architectures. In the example above, Service Diagram guides the user’s focus towards the two services highlighted in yellow. The size of the halos is proportional to the latency observed in each service, helping the user confidently narrow the scope of their investigation.
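As a rough illustration of sizing a halo by latency, the Python sketch below attributes to each service its "self time": span duration minus the time spent in direct child spans. This is an assumed attribution rule chosen for the example, not a description of how Service Diagram computes latency. It also assumes child spans run sequentially without overlap; the span dictionaries, the 40-pixel maximum radius, and the auth and storage service names are hypothetical.

```python
def self_time_by_service(trace):
    """Sum, per service, each span's duration minus its children's durations."""
    child_total = {}
    for span in trace:
        parent = span["parent_id"]
        if parent is not None:
            duration = span["end_ms"] - span["start_ms"]
            child_total[parent] = child_total.get(parent, 0.0) + duration

    totals = {}
    for span in trace:
        duration = span["end_ms"] - span["start_ms"]
        # Clamp at zero in case children overlap and exceed the parent.
        self_ms = max(duration - child_total.get(span["span_id"], 0.0), 0.0)
        totals[span["service"]] = totals.get(span["service"], 0.0) + self_ms
    return totals


trace = [
    {"span_id": "a", "parent_id": None, "service": "api-server",
     "start_ms": 0.0, "end_ms": 100.0},
    {"span_id": "b", "parent_id": "a", "service": "auth",
     "start_ms": 10.0, "end_ms": 30.0},
    {"span_id": "c", "parent_id": "a", "service": "storage",
     "start_ms": 30.0, "end_ms": 95.0},
]

totals = self_time_by_service(trace)
total_ms = sum(totals.values())

MAX_RADIUS_PX = 40  # hypothetical drawing constant
halo_px = {svc: round(MAX_RADIUS_PX * ms / total_ms) for svc, ms in totals.items()}
print(halo_px)
```

With these made-up numbers, storage accounts for 65 of the 100 milliseconds and so receives the largest halo.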

Service Diagram is fully interactive, enabling users to analyze system performance across any service, operation, or tag. Users can iteratively refine their query, select any interesting portion of the Latency Histogram, and view a diagram built from only the traces that match the given query and filter. This is especially useful when investigating issues impacting unfamiliar services or operations.
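Selecting a portion of a latency histogram can be thought of as bucketing traces by end-to-end duration and keeping only those inside the chosen range. The Python sketch below shows that idea under stated assumptions: each trace is reduced to a list of (start_ms, end_ms) span intervals, and the trace_latency_ms, histogram, and select_range helpers and the 50 ms bucket width are hypothetical.

```python
from collections import Counter


def trace_latency_ms(trace):
    """End-to-end latency: latest span end minus earliest span start."""
    return max(end for _, end in trace) - min(start for start, _ in trace)


def histogram(traces, bucket_ms):
    """Count traces per latency bucket, keyed by the bucket's lower bound."""
    return Counter(
        (trace_latency_ms(t) // bucket_ms) * bucket_ms for t in traces
    )


def select_range(traces, lo_ms, hi_ms):
    """Keep traces whose latency falls in the half-open range [lo_ms, hi_ms)."""
    return [t for t in traces if lo_ms <= trace_latency_ms(t) < hi_ms]


# Each trace is a list of (start_ms, end_ms) span intervals.
traces = [
    [(0, 40), (5, 20)],
    [(0, 45)],
    [(0, 120), (10, 110)],
    [(0, 480), (20, 470)],
]

buckets = histogram(traces, bucket_ms=50)
slow = select_range(traces, 400, 500)
print(sorted(buckets.items()), len(slow))
```

A diagram built from the selected traces alone would then reflect only the slow requests, rather than the service's typical behavior.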

Formulate, validate, and propagate root cause hypotheses

Service Diagram is built from the traces captured in Snapshots, which means it can be captured and shared across an organization. This allows users to share both historical and real-time observations while triaging a performance issue. For organizations that have recently adopted microservices, Service Diagram can be an invaluable resource to document architectural and performance changes over time.

Service Diagram demystifies some of the complexity commonly associated with microservices. It aggregates the rich information available in tracing data and surfaces it in an easy-to-visualize way. We’ve received positive feedback from customers with early access to this feature, and we’re excited to see how Service Diagram will be used to make being on call a little less harrowing.

Are you adopting microservices? Do you relate a little too well to the on-call story above? Try [x]PM and see how we can help you.

Originally published at lightstep.com on February 12, 2019.

Lightstep enables teams to detect and resolve regressions quickly, regardless of system scale or complexity.