LightStep [x]PM Architecture Explained

January 31, 2018 | Parker Edwards

LightStep [x]PM has made an incredible impact at some of the world’s most innovative companies. It provides an unprecedented level of visibility into the production performance of their highly distributed applications. When we say unprecedented, we mean it: we analyze 100% of the performance data flowing through these enterprise systems. These analyses include a large number of customizable facets, and we provide real-time, end-to-end distributed traces with no up-front sampling at all. Our users can see their applications and services in entirely new ways, which we’ll discuss a bit later. But first, it’s important to explore how we can analyze a near-limitless volume of data with no scaling, cardinality, or overhead concerns.

Measure everything, diagnose anything

Our unique data collection architecture allows us to collect and analyze the large volume of production data that our enterprise customers generate. LightStep was founded by pioneers in the distributed tracing space who realized that end-to-end traces are the holy grail of performance data. These traces provide visibility into exactly how separate services and parts of an application interact with each other.

Furthermore, time-series data that represents latency, throughput, and error rate for operations, services, and entire transactions is necessary to enable Service Level Objective (SLO) and root cause analysis capabilities in any modern performance monitoring solution.

The gap in existing solutions can be seen in the granularity and availability of both distributed tracing and time-series data. Early in the design of [x]PM, our team realized that in order to reliably provide this data, we couldn’t do any heavy lifting on the application hosts. We wanted to provide a new way to measure and analyze time-series and trace data, but analyzing either kind of data at this granularity is computationally expensive. So instead of doing that work on the hosts, we built a new way of collecting and analyzing this data.

Granular timeseries and trace data can be collected from any facet of a distributed environment, at any scale

Our Satellite Architecture collects the performance data of individual operations in a service through the OpenTracing libraries. OpenTracing is a vendor-neutral API that defines both how to measure the performance of an operation in a system and how to piece that measurement together with other related distributed operations. The OpenTracing libraries in conjunction with LightStep are extremely lightweight.
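To make the idea of measuring an operation and relating it to its caller concrete, here is a minimal sketch in plain Python. It is not the OpenTracing API or LightStep's client library; the `traced` helper, the `finished_spans` list, and the span field names are all hypothetical, chosen only to illustrate timing an operation, attaching key/value tags, and recording a parent link with very little work on the host.

```python
import itertools
import time
from contextlib import contextmanager

_ids = itertools.count(1)   # hypothetical span id generator
finished_spans = []         # stand-in for a buffer that would be shipped off-host


@contextmanager
def traced(operation, parent=None, **tags):
    """Time one operation and record it as a span (illustrative only)."""
    span = {
        "span_id": next(_ids),
        "parent_id": parent["span_id"] if parent else None,
        "operation": operation,
        "tags": dict(tags),
    }
    start = time.perf_counter()
    try:
        yield span
    except Exception:
        span["tags"]["error"] = True   # mark failed operations, then re-raise
        raise
    finally:
        span["duration_ms"] = (time.perf_counter() - start) * 1000.0
        finished_spans.append(span)


# Usage: a parent operation with one child operation.
with traced("authenticate", login_method="sso") as root:
    with traced("lookup_user", parent=root):
        pass
```

The host only records timestamps and a few fields per operation; everything heavier is left to whatever receives `finished_spans`.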

[x]PM was designed from the outset to have no measurable performance overhead, and LightStep as a company has a “first, do no harm” policy: performance transparency has been a first-principles priority here from day one. As a result, 100% of our customers run LightStep 100% of the time in production. We make no attempts to do any kind of intensive processing on the app host, making overhead concerns a distant memory. Our Satellite Architecture can also use log translations to extract the necessary information from the system.

That data then flows to our satellites, which sit on premises within your hosted datacenter or cloud environment. These stand-alone satellites are a key component of our high performance stream processing system. The satellites store and analyze the performance data (extracted from your system) for about 3–5 minutes. This gives us enough time to index the entire volume of performance data against historical norms, user-defined thresholds, and other metrics.
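The short retention window described above can be pictured as a time-bounded buffer. The sketch below is an assumption about the general shape of such a buffer, not a description of how the satellites are implemented; the `Span` and `RecallWindow` names and the 300-second default (the upper end of the 3–5 minute range) are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Span:
    operation: str
    start_s: float       # start time in seconds
    duration_ms: float   # measured latency


class RecallWindow:
    """Keeps only spans that started within the last `window_s` seconds."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self._spans = deque()

    def add(self, span: Span, now_s: float) -> None:
        self._spans.append(span)
        self._evict(now_s)

    def _evict(self, now_s: float) -> None:
        # Assumes spans arrive roughly in start-time order.
        while self._spans and now_s - self._spans[0].start_s > self.window_s:
            self._spans.popleft()

    def snapshot(self, now_s: float) -> list:
        """Return the spans still inside the window at time `now_s`."""
        self._evict(now_s)
        return list(self._spans)
```

Everything inside the window is available unsampled for analysis; anything older is dropped from the buffer.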

For example, let’s say you’re measuring the performance of an authentication transaction. That transaction might rely on several services, each with its own datastore. Our satellites will receive the performance data of each operation, across every service, and automatically analyze the performance of each segment against historical performance, error rates, and throughput.

LightStep [x]PM segments performance data across two VIP customers, without front-end sampling or data-smoothing

But let’s say you want to analyze that performance data along a couple of dimensions. Maybe you want to track the performance of your authentication service by identity management provider or deployment version. Maybe you want to track SSO functionality versus traditional username and password logins. LightStep (and OpenTracing) model those dimensions as key/value pairs called “tags.” Once they’ve been added to LightStep [x]PM, our satellites will automatically partition that data out and provide the time-series and trace data for each segment without you having to worry about the cardinality of your performance data.
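As a rough illustration of partitioning by tag, the sketch below groups span latencies by the value of one tag and summarizes each group. The span dictionaries, the `login_method` and `version` tag names, and both helper functions are hypothetical; this is only a picture of the idea, not LightStep's implementation.

```python
from collections import defaultdict
from statistics import median


def partition_by_tag(spans, tag_key):
    """Group span latencies by the value of one tag."""
    groups = defaultdict(list)
    for span in spans:
        value = span["tags"].get(tag_key, "<untagged>")
        groups[value].append(span["duration_ms"])
    return dict(groups)


def summarize(groups):
    """Per-segment count, median, and worst-case latency."""
    return {
        value: {
            "count": len(durations),
            "median_ms": median(durations),
            "max_ms": max(durations),
        }
        for value, durations in groups.items()
    }


spans = [
    {"operation": "authenticate", "duration_ms": 40.0,
     "tags": {"login_method": "sso", "version": "v2"}},
    {"operation": "authenticate", "duration_ms": 60.0,
     "tags": {"login_method": "sso", "version": "v1"}},
    {"operation": "authenticate", "duration_ms": 25.0,
     "tags": {"login_method": "password", "version": "v2"}},
]

by_method = summarize(partition_by_tag(spans, "login_method"))
```

The same spans can be re-partitioned by `version` or any other tag without changing how they were recorded.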

We take this one step further by offering the ability to set your own SLOs for each dimension. So now, you can track the performance of any arbitrary segment of your application, set different performance thresholds for each, and receive context-rich alerts — complete with moment-in-time traces relevant to those segments.
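A per-segment SLO check might look something like the following sketch. The function names, the choice of a nearest-rank 99th percentile, and the example thresholds are assumptions made for illustration; the article does not say how [x]PM evaluates SLOs.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def slo_violations(latencies_by_segment, slo_ms_by_segment, pct=99):
    """Return {segment: observed latency} for segments exceeding their SLO."""
    violations = {}
    for segment, latencies in latencies_by_segment.items():
        threshold = slo_ms_by_segment.get(segment)
        if threshold is None or not latencies:
            continue  # no SLO defined, or no data for this segment
        observed = nearest_rank_percentile(latencies, pct)
        if observed > threshold:
            violations[segment] = observed
    return violations


# Hypothetical latencies (ms) and a different threshold for each segment.
latencies = {"sso": [40, 60, 500], "password": [25, 30]}
slos = {"sso": 200, "password": 100}
alerts = slo_violations(latencies, slos)
```

Because each segment carries its own threshold, a slow SSO path can raise an alert while password logins stay within their objective.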

The operation started with a front-end user interaction but the “problem” was a long-running DB call over a hundred layers deep
See the entire request (and response) payload that was generated when the user first clicked “Get Historical Matches”

And this is only the beginning. Because we’re storing all of your performance data for that 3–5 minute window of omniscience, we can tell you not only when and where a performance issue occurs in your stack, but also everything that happens both downstream and upstream from that issue. So, if you’re trying to diagnose a 1-in-1,000 database issue that is only affecting users of a certain deployment, starting with a client-side operation, we can tell you exactly what that environment looked like, and every call leading from the front-end to the datastore and back.
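Walking upstream and downstream from a problem span is possible when every span in the window records its parent. The sketch below assumes spans stored as dictionaries keyed by id with a `parent_id` field; the span names and both helper functions are hypothetical and only illustrate the traversal.

```python
from collections import defaultdict


def upstream(spans_by_id, span_id):
    """Ancestors of `span_id`, nearest parent first."""
    path = []
    parent = spans_by_id[span_id]["parent_id"]
    while parent is not None:
        path.append(parent)
        parent = spans_by_id[parent]["parent_id"]
    return path


def downstream(spans_by_id, span_id):
    """All descendants of `span_id`, in no guaranteed order."""
    children = defaultdict(list)
    for sid, span in spans_by_id.items():
        if span["parent_id"] is not None:
            children[span["parent_id"]].append(sid)
    found, stack = [], [span_id]
    while stack:
        current = stack.pop()
        for child in children[current]:
            found.append(child)
            stack.append(child)
    return found


# A front-end click that fans out to an API, an auth service, a cache, and a DB.
spans_by_id = {
    "click": {"parent_id": None},
    "api": {"parent_id": "click"},
    "auth": {"parent_id": "api"},
    "cache": {"parent_id": "api"},
    "db": {"parent_id": "auth"},
}

callers_of_db = upstream(spans_by_id, "db")
```

Starting from the slow `db` span, `upstream` recovers the chain of callers back to the client-side operation, and `downstream` from any caller lists every call made beneath it.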

This is a glimpse of what is possible when you have 100% visibility into the performance of your production distributed system. With LightStep [x]PM, you get a fine-grained, objective view into your system, so you no longer have to react with partial information to the latest unexpected system failures; instead, you can focus on building, improving, and tracking the core value your system was built to deliver.

Originally published at lightstep.com on January 31, 2018.