
Performance is a shape, not a number

Applications have evolved – again – and it’s time for performance analysis to follow suit

Ben Sigelman
May 8, 2018 · 6 min read

Phase 1: Bare Metal and average latency (~2002)

The stack (2002): a monolith running on a hand-patched server with a funny hostname in a datacenter you have to drive to yourself.
Recent average latency for an important internal microservice API call at LightStep

Phase 2: Cloud VMs and p99 latency (~2012)

The stack (2012): a monolith running in AWS with a few off-the-shelf services doing special-purpose heavy lifting (Solr, Redis, etc).
Recent p95 latency for the same important internal microservice API call at LightStep
Recent p99.9 latency for the same microservice API call. Now we see some instability.
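
To make the contrast between these three charts concrete, here is a minimal sketch of how an average can hide what a high percentile reveals. The latency values are invented for illustration and are not LightStep data; the `percentile` helper is a simple nearest-rank implementation written for this example.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 9,980 fast requests and 20 slow ones.
latencies_ms = [10.0] * 9980 + [500.0] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
p999_ms = percentile(latencies_ms, 99.9)

print(f"mean={mean_ms:.2f}ms p95={p95_ms}ms p99.9={p999_ms}ms")
```

In this made-up sample the mean and p95 both sit near the fast requests, while p99.9 lands squarely on the slow ones, which is the kind of instability the last chart surfaces.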

Phase 3: Microservices and detailed latency histograms (2018)

The stack (2018): A few legacy holdovers (monoliths or otherwise) surrounded — and eventually replaced — by a growing constellation of orchestrated microservices.
  1. Triage to determine which modes we care about: consider both their performance (latency) and their prevalence
  2. Explain the behaviors that characterize these high-priority modes
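
The triage step can be sketched roughly as follows. This is an illustration under stated assumptions, not LightStep's method: modes are detected with a deliberately crude heuristic (runs of adjacent non-empty fixed-width buckets), and they are ranked by estimated total time spent, which is one possible way to weigh latency and prevalence together. The sample data and the `find_modes` name are hypothetical.

```python
from collections import Counter

def find_modes(latencies_ms, bucket_ms=10):
    """Bucket the samples, treat each run of adjacent non-empty buckets as
    one mode, and rank modes by estimated total time spent in them."""
    counts = Counter(int(v // bucket_ms) for v in latencies_ms)

    # Split the sorted bucket indices wherever there is a gap.
    runs, current = [], []
    for b in sorted(counts):
        if current and b != current[-1] + 1:
            runs.append(current)
            current = []
        current.append(b)
    if current:
        runs.append(current)

    total = len(latencies_ms)
    summary = []
    for buckets in runs:
        n = sum(counts[b] for b in buckets)
        summary.append({
            "range_ms": (buckets[0] * bucket_ms, (buckets[-1] + 1) * bucket_ms),
            "prevalence": n / total,
            # Approximate each sample by the midpoint of its bucket.
            "time_spent_ms": sum(counts[b] * (b + 0.5) * bucket_ms for b in buckets),
        })
    return sorted(summary, key=lambda m: m["time_spent_ms"], reverse=True)

# Hypothetical sample: a common fast mode, a slower mode, and a rare slow tail.
latencies_ms = [12, 14, 15, 18] * 200 + [95, 102, 108] * 50 + [480, 495] * 5
ranked = find_modes(latencies_ms)
for mode in ranked:
    print(mode)
```

With this sample the middle mode ranks first: it is neither the slowest nor the most common, but it accounts for the most total time, which is why both dimensions matter during triage.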
A real-time view of latency for a particular API call in a particular microservice. We can clearly distinguish distinct modes (the “bumps”) in the distribution; if we want to restrict our analysis to traces from the slowest mode, we filter interactively.
Given 100% of the (unsampled) data, we can isolate and zoom in on any feature, no matter how small. Here the user restricts the analysis to project_id 22, then project_id 36 (which have completely different performance characteristics). The same can be done for any other tag, even those with high cardinality: experiment ids, release ids, and so on.
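
As a rough sketch of what restricting the analysis to one tag value means, the snippet below groups span latencies by an arbitrary tag and summarizes each group. The record layout (`latency_ms`, `tags`), the `release` tag, and all values are assumptions made for this example; only the `project_id` values 22 and 36 come from the caption above.

```python
from collections import defaultdict
from statistics import median

# Hypothetical span records; field names and values are illustrative only.
spans = [
    {"latency_ms": 11, "tags": {"project_id": 22, "release": "r1"}},
    {"latency_ms": 13, "tags": {"project_id": 22, "release": "r2"}},
    {"latency_ms": 12, "tags": {"project_id": 22, "release": "r2"}},
    {"latency_ms": 40, "tags": {"project_id": 36, "release": "r1"}},
    {"latency_ms": 95, "tags": {"project_id": 36, "release": "r2"}},
    {"latency_ms": 109, "tags": {"project_id": 36, "release": "r2"}},
]

def latencies_by_tag(spans, tag):
    """Group latencies by the value of one tag, skipping spans without it."""
    groups = defaultdict(list)
    for span in spans:
        if tag in span["tags"]:
            groups[span["tags"][tag]].append(span["latency_ms"])
    return dict(groups)

by_project = latencies_by_tag(spans, "project_id")
summary = {
    pid: {"count": len(values), "median_ms": median(values), "max_ms": max(values)}
    for pid, values in by_project.items()
}
print(summary)
```

The same function works for any other tag name, such as `release` here. Note that it only gives trustworthy answers for rare tag values if the underlying data has not been sampled away.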
To attempt end-to-end root cause analysis, we need end-to-end transaction traces. Here we filter to outliers for project_id 36, choose a trace from a few seconds ago, and realize it took 109ms to acquire a mutex lock: our smoking gun.
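
One way to look for that kind of smoking gun programmatically is to compute each span's self time, meaning its duration minus the time spent in its direct children. The sketch below assumes a trace is a flat list of spans with parent links and that children of a span run one after another (parallel children would need interval arithmetic instead). The span names and all durations except the 109ms lock acquisition are hypothetical.

```python
from collections import defaultdict

# Hypothetical trace; only the 109ms mutex wait comes from the example above.
trace = [
    {"id": "a", "parent": None, "name": "api/handle_request", "duration_ms": 131},
    {"id": "b", "parent": "a", "name": "db/query", "duration_ms": 125},
    {"id": "c", "parent": "b", "name": "mutex/acquire", "duration_ms": 109},
    {"id": "d", "parent": "b", "name": "disk/read", "duration_ms": 9},
]

def self_times(trace):
    """Return {span name: duration minus the summed duration of its direct
    children}, assuming children execute sequentially."""
    child_total = defaultdict(int)
    for span in trace:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["duration_ms"]
    return {
        span["name"]: span["duration_ms"] - child_total[span["id"]]
        for span in trace
    }

times = self_times(trace)
slowest = max(times, key=times.get)
print(slowest, times[slowest])
```

Ranking by self time rather than raw duration keeps the root span, which is always the longest, from masking the operation that is actually responsible for the delay.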

Stepping back and looking forward…

As practitioners, we must recognize that countless disconnected timeseries statistics are not enough to explain the behavior of modern applications. While p99 latency can still be a useful statistic, the complexity of today’s microservice architectures warrants a richer and more flexible approach. Our tools must identify, triage, and explain latency issues, even as organizations adopt microservices.


LightStepHQ

Lightstep enables teams to detect and resolve regressions quickly, regardless of system scale.

Written by Ben Sigelman

Co-founder and CEO at LightStep, Co-creator of @OpenTelemetry and @OpenTracing, built Dapper (Google’s tracing system).
