Snapshots: Detailed System Behavior, Always Saved and Always Shareable

LightstepHQ

October 18, 2018 | Talia Moyal, Dennis Chu

When you’re on call and get paged (hopefully not at 2 am on a Saturday, but it happens), your immediate priorities are likely to assess the impact, mitigate the incident, restore affected services to an operational state, and go back to bed, in that order and as quickly as possible. Chances are, you’re much less inclined to comprehensively chronicle the actions you took in that hectic, sleep-deprived moment for the eventual post-mortem or for handing tasks off to others.

Today, we’re excited to introduce Snapshots. Snapshots for LightStep [x]PM simplify the process of accurately describing and documenting the complex behaviors of distributed systems and facilitate more effective cross-team communication and collaboration.

Your new source of truth

Snapshots are automatically created whenever queries are made in Explorer (formerly Live View). Each one contains the complete results of a particular query at a given point in time. This information includes a detailed latency histogram that characterizes different system behaviors for a specified service, operation, and/or tag values; historical layers that provide context relative to the Snapshot creation time; and hundreds of relevant example traces to help explain the symptoms observed.

Snapshots simplify the process of accurately describing and documenting complex system behaviors at a specific point in time.

Snapshots have unique URLs, so they can be easily referenced and shared with team members. You can also access a history of your Snapshots, with corresponding timestamps, in the Snapshots dropdown, so you can always revisit past debugging work even if you didn’t think to save it at the time. [x]PM durably persists all of the aggregate statistics and trace data used to derive these Snapshots, so whenever you examine a particular Snapshot, the data presented will remain constant no matter how much time has passed or how much your system behavior has changed since it was captured. Snapshots also allow you to filter the example traces, so you only see the ones that fall within a selected latency range.
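To make the idea concrete, here is a minimal sketch of how an immutable Snapshot and its latency-range filter could be modeled. This is not the [x]PM data model or API: the `Snapshot` and `ExampleTrace` classes, their field names, the sample IDs, and the query text are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExampleTrace:
    # Hypothetical stand-in for one example trace attached to a Snapshot.
    trace_id: str
    latency_ms: float


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical model: frozen, so fields cannot be reassigned after
    # creation, mirroring the "data stays constant" property described above.
    snapshot_id: str                  # would back the unique, shareable URL
    created_at: str                   # creation timestamp (ISO 8601 string)
    query: str                        # the query that produced this Snapshot
    traces: Tuple[ExampleTrace, ...]  # a tuple, so the contents cannot change

    def traces_in_range(self, min_ms: float, max_ms: float) -> Tuple[ExampleTrace, ...]:
        """Return the example traces whose latency lies in [min_ms, max_ms]."""
        return tuple(t for t in self.traces if min_ms <= t.latency_ms <= max_ms)


snap = Snapshot(
    snapshot_id="abc123",
    created_at="2018-10-18T02:00:00Z",
    query="service=checkout (hypothetical query text)",
    traces=(
        ExampleTrace("t1", 12.0),
        ExampleTrace("t2", 480.0),
        ExampleTrace("t3", 950.0),
    ),
)

# Keep only the traces in the slow part of the latency distribution.
slow = snap.traces_in_range(400.0, 1000.0)
print([t.trace_id for t in slow])  # prints ['t2', 't3']
```

Freezing the record is the design choice that matters here: anyone who opens the same Snapshot later sees exactly what the original investigator saw, and filtering produces a new view rather than modifying the stored data.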

Designed to streamline workflows and solve problems

During a production outage or similar crisis, it’s often difficult to precisely recount the investigative steps you took or articulate the exact symptoms you observed, especially given the frantic nature of these incidents. When teams are left to speculate and piece together what really happened on their own, cross-functional communication and collaboration deteriorate, often making matters even worse.

Snapshots automatically “record as you go,” so you can focus solely on firefighting rather than recordkeeping. The detailed nature of these Snapshots also makes it simple to share with your teammates what you’ve already examined and what you’re currently seeing, so you can quickly bring them up to speed and parallelize your investigative efforts, ultimately reducing your mean time to resolution (MTTR). Snapshots can also be an invaluable resource for the post-mortems that follow these incidents because they let you revisit the actual system behavior at the time of the incident rather than relying on the recollections of people who were working under stress.

Finally, because the scope of a Snapshot can be as coarse or fine-grained as necessary (e.g. every operation for a particular service, or only the traces that match a high-cardinality tag), they can be used to compare performance changes over time or validate correct behavior for any aspect of your system. When a significant code or service change leads to an unexpected result, you can immediately examine comparable traces from before and after the change to understand what needs to be improved or optimized.
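As a rough illustration of that before-and-after comparison, the sketch below computes latency percentiles for two sets of trace latencies and reports the change. It is only an assumption about how such a comparison could be done by hand, not how [x]PM does it: the `percentile` and `compare` helpers are hypothetical, the latency values are made up, and the nearest-rank method is just one common way to define a percentile.

```python
import math
from typing import Dict, Sequence


def percentile(latencies_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency value")
    ordered = sorted(latencies_ms)
    # Nearest-rank: the smallest value with at least pct% of the data at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compare(before_ms: Sequence[float], after_ms: Sequence[float],
            pcts: Sequence[float] = (50, 90)) -> Dict[float, float]:
    """Return the change (after minus before) in latency at each percentile."""
    return {p: percentile(after_ms, p) - percentile(before_ms, p) for p in pcts}


# Made-up latencies (ms) standing in for the traces of two Snapshots,
# one captured before a deploy and one captured after it.
before = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 170.0, 180.0, 190.0]
after = [100.0, 115.0, 125.0, 135.0, 150.0, 170.0, 200.0, 260.0, 340.0, 500.0]

deltas = compare(before, after)
print(deltas)  # prints {50: 10.0, 90: 160.0}
```

In this made-up data the median barely moves while the 90th percentile grows by 160 ms, the kind of tail regression that an average would hide and that comparable traces from before and after a change can help explain.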

Looking ahead

This feature release represents an important milestone for our team because Snapshots are the building blocks for several significant innovations we will be releasing in the coming months. Snapshots allow us to not only segment but also aggregate related trace data for a higher level of system analysis. This enables our platform to correlate and compare complex system behaviors in a meaningful way, and it lets us surface insights that would otherwise be impossible to derive, helping you reduce MTTR and accelerate root cause analysis. Stay tuned for more to come!

Try the LightStep [x]PM demo to see for yourself how it can help you to manage your complex systems.

Originally published at lightstep.com on October 18, 2018.
