Making Continuous Delivery “Global” via Observability

Ben Sigelman
LightstepHQ
Apr 7, 2021


Originally posted as a Twitter thread.

0/ The easier part of Continuous Delivery (“CD”) is, well, “continuously delivering software.”

The harder part is doing it reliably.

This is a thread about the critical differences between what we’ll refer to as “local CD” and “global CD,” and how observability fits in. 👇

1/ Let’s begin by restating the conventional wisdom about how to do “Continuous Delivery” for a single (micro)service:

i) <CD run starts>

ii) Qualify release in pre-prod

iii) Deploy to prod

iv) If the deployed service is unstable, roll back the deploy

Safe, right? Not really.
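The local-only loop above can be sketched in a few lines of Python. This is purely an illustration: `qualify`, `deploy`, `is_stable`, and `roll_back` are hypothetical callbacks standing in for whatever your pipeline actually runs, not a real CD API.

```python
def local_cd_run(service, qualify, deploy, is_stable, roll_back):
    """One 'Local CD' cycle: qualify in pre-prod, deploy to prod,
    and roll back only if the deployed service *itself* is unstable."""
    if not qualify(service):        # ii) qualify release in pre-prod
        return "blocked-in-preprod"
    deploy(service)                 # iii) deploy to prod
    if not is_stable(service):      # iv) local stability check
        roll_back(service)
        return "rolled-back-local"
    return "deployed"               # succeeds even if *other* services regressed
```

Notice that nothing in this loop ever asks how the rest of the system is doing. That gap is the whole point of the next few tweets.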

2/ The above is what I would term “Local CD” — we are checking whether *the deployed service itself* is (locally) stable… but that’s it.

The problem is that a *majority* of production incidents are due to planned deployments in *other* services. “Local CD” cannot find those.

3/ So what does “global CD” look like?

i) <CD run starts>

ii) Qualify release in pre-prod

iii) Deploy to prod

iv) If the deployed service is unstable, roll back (Local check)

v) If the deployed service *causes instability elsewhere*, roll back (Global check)
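Sketching steps i-v as code makes the difference concrete. As before, the callbacks are hypothetical stand-ins; in particular, `causes_instability_elsewhere` represents whatever observability signal your pipeline can actually query.

```python
def global_cd_run(service, qualify, deploy, is_stable,
                  causes_instability_elsewhere, roll_back):
    """One 'Global CD' cycle: the local stability check, plus a
    global check that watches the *rest* of the system."""
    if not qualify(service):                   # ii) qualify in pre-prod
        return "blocked-in-preprod"
    deploy(service)                            # iii) deploy to prod
    if not is_stable(service):                 # iv) local check
        roll_back(service)
        return "rolled-back-local"
    if causes_instability_elsewhere(service):  # v) global check
        roll_back(service)
        return "rolled-back-global"
    return "deployed"
```

Everything interesting (and hard) hides inside that one extra predicate.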

4/ Seems simple and sensible enough — so why aren’t we all doing it already?

… Well, because most orgs cannot automatically determine if a service deployment is *the reason* that some other service becomes unhealthy.

Why is this hard?

5/ For “Global CD,” we need to be able to understand the (causal) relationship between a *planned* change (the deployment in question) and potential *unplanned* changes, be they downstream or upstream.

It looks like this:

6/ The first picture is just the *Baseline.*

This is how your system normally looks… or at least how it looked prior to the CD run in question.

7/ Baselines are *really* important: distributed systems do bizarre things All The Time, and human beings can’t reliably distinguish between an “old but benign bizarre behavior” and a “new and problematic bizarre behavior.”

… Such as, say, a bad deploy with global side-effects!
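One simple way to make that distinction mechanical is to compare the current signal against a baseline window, flagging a reading only when it falls well outside the baseline's historical spread. A toy sketch (the three-sigma threshold and the sample numbers are illustrative, not a recommendation):

```python
from statistics import mean, stdev

def deviates_from_baseline(baseline_samples, current_value, n_sigmas=3.0):
    """Flag current_value as anomalous if it sits more than n_sigmas
    standard deviations away from the baseline's mean."""
    mu = mean(baseline_samples)
    sigma = stdev(baseline_samples)
    return abs(current_value - mu) > n_sigmas * sigma

baseline = [101, 98, 103, 99, 100, 102, 97, 100]  # e.g., p99 latency in ms
deviates_from_baseline(baseline, 100.5)  # old-but-benign wiggle -> False
deviates_from_baseline(baseline, 250.0)  # new-and-problematic -> True
```

Real systems use far more sophisticated baselines (seasonality, percentiles, per-segment models), but the shape of the question is the same: is this bizarre, or just *normally* bizarre?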

8/ Once we have a baseline, we can start looking at *changes* to that baseline: this is where this concept of “Global CD” really gets powerful (and interesting).

(And observability is about *explaining changes* — more here: https://twitter.com/el_bhs/status/1349406421226459136)

9/ Whereas “Local CD” only detects regressions in the specific service that’s just been deployed, “Global CD” detects related regressions in services that may be far away in the service graph.

10/ In this diagram, we see how the transactions passing through “Version N+1” are notably slower, and how, since they’re on the critical path for Service A, they directly contribute to its overall latency.

11/ With observability that can separately segment “Service A” performance based on which *version* of “Service X” is involved, it’s easy to detect that this deployment would be disastrous for Service A’s latency.

And so: the deploy is blocked! Local change, global regression.
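A toy sketch of that segmentation, assuming each trace records Service A's end-to-end latency plus the version of each downstream service it touched. The trace shape here is invented for illustration; real observability tooling would pull this from span attributes.

```python
from collections import defaultdict
from statistics import mean

def latency_by_downstream_version(traces, downstream="Service X"):
    """Group Service A request latencies by which version of the
    downstream service each trace passed through."""
    buckets = defaultdict(list)
    for trace in traces:
        version = trace["versions"].get(downstream)
        buckets[version].append(trace["latency_ms"])
    return {version: mean(ls) for version, ls in buckets.items()}

traces = [
    {"latency_ms": 120, "versions": {"Service X": "N"}},
    {"latency_ms": 115, "versions": {"Service X": "N"}},
    {"latency_ms": 480, "versions": {"Service X": "N+1"}},
    {"latency_ms": 510, "versions": {"Service X": "N+1"}},
]
latency_by_downstream_version(traces)
# "N+1" traces are dramatically slower than "N" traces -> block the deploy
```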

12/ To sum up, any organization that wants to ship software quickly needs to do CD.

But with 70% of production incidents boiling down to bad deploys, we need CD that leverages observability to act based on the global picture, not just the local picture.

PS/ This is not just theoretical!

If you want to play around with some related functionality, check out the @LightstepHQ Sandbox… it’s basically a collection of guided demos illustrating some important workflows.

Link: http://lightstep.com/sandbox

For more threads like this one, please follow me here on Medium or as el_bhs on Twitter!



Co-founder and CEO at LightStep, Co-creator of @OpenTelemetry and @OpenTracing, built Dapper (Google’s tracing system).