What made SLOs so messy (and what we can do about it)

Ben Sigelman
Dec 7, 2021 · 4 min read

Originally posted as this twitter thread.

0/ In recent years, SLOs have graduated from being an “SRE 201” advanced topic to an outright buzzword. But despite their promise, SLO deployments today are messy and often unsuccessful. Why? And what can we do about it?

Thread: 👇

1/ All good engineers care about their users. In the olden days of monolithic software apps, engineers even got to deploy software that touched those users directly!

But given the depth of modern architectures, the user is often many, many hops away from that (good) engineer. 😢

2/ At the tippity-top of the stack, users are fiddling with their phones or clicking on websites, etc. In cloud-native apps, these activities create transactions that propagate from service to service, consuming resources along the way.

(More detail here: “Resources” and “Transactions”: a fundamental duality in observability)

3/ As a transaction descends deep into the service hierarchy, the user’s actual intent is obfuscated: we have the local service-to-service requests, but the context about user intent and experience is lost (at least without distributed tracing).

4/ The connection between engineers and users is indirect. Namely:

  • engineers control resources…
  • that power services…
  • that process requests…
  • that make up transactions…
  • that serve users.

SLOs are exciting because they offer a more direct link between eng and users.

5/ SLOs — “Service-Level Objectives” — are essentially goals about the behavior of a specific set of transactions scoped to a specific set of resources.

In this way, SLOs help engineers verify that they are doing their part to create a satisfying end-user experience.
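To make that concrete, here is a minimal sketch of an SLI/SLO check: the SLI is the fraction of successful requests over a window, and the SLO is a target for that ratio. All names and numbers below are illustrative, not taken from any real system.

```python
# A hypothetical availability SLI and a 99.9% SLO target.

def availability_sli(successes: int, total: int) -> float:
    """SLI: fraction of requests in the window that succeeded."""
    return successes / total if total else 1.0

SLO_TARGET = 0.999  # objective: 99.9% of requests succeed

sli = availability_sli(successes=99_950, total=100_000)
error_budget_remaining = sli - SLO_TARGET  # negative means budget blown

print(f"SLI={sli:.4f}, SLO met: {sli >= SLO_TARGET}")
```

The important part is the scoping: the counts feeding the SLI come from a specific set of transactions hitting a specific set of resources, which is what ties the number back to an owning team.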

6/ SLOs are nested, and as we unwind the SLO stack we eventually end up back at the end-user. If all the SLOs are met, we know that the user is having a reliable if not downright delightful experience. Nice!

Or that’s the theory, anyway…

7/ In practice, it rarely works out this way. In most organizations, that’s because SLIs and SLOs are determined locally, not globally.

SLOs should roll up to ensure reliable global transactions, but nothing validates that SLOs across the stack are even mutually consistent!
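One way to see the inconsistency problem with latency SLOs: if an upstream service's latency budget is tighter than the sum of the budgets of the services it calls serially, the stack cannot all be met at once. A hedged sketch of that check (service names and budgets are hypothetical):

```python
# Hypothetical p99 latency budgets (ms) for a chain where "frontend"
# calls "auth" and then "db" serially. In most orgs, nothing runs a
# check like this across teams.

latency_slo_ms = {"frontend": 300, "auth": 150, "db": 200}
serial_deps = {"frontend": ["auth", "db"]}

def consistent(service: str) -> bool:
    """An upstream latency SLO must cover its serial downstream SLOs."""
    budget = latency_slo_ms[service]
    spent = sum(latency_slo_ms[dep] for dep in serial_deps.get(service, []))
    return spent <= budget

print(consistent("frontend"))  # 150 + 200 = 350 > 300 -> False
```

Each team's local SLO can look perfectly reasonable in isolation while the set as a whole is unsatisfiable, which is exactly the failure mode described above.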

8/ This leads to SLOs that feel arbitrary because they are arbitrary. Setting and maintaining targets in the absence of a dynamic and global analysis is somewhere between difficult and impossible. So SLIs are chosen poorly, SLOs are SWAGs, and/or SLOs get padded.

9/ In some cases, teams enjoy the optimization work and set needlessly aggressive targets. At Google — at least before Dapper — I saw teams spend multiple engineer-quarters making performance improvements that weren’t even on the critical path! A literal waste of time.

10/ The end result of all of the above is an erosion of organizational trust in SLOs. In some of the more tragic cases, this has led eng orgs to roll back entire SLO programs. :-(

11/ What we need are SLIs and SLOs that are informed by the global requirements of the application and its users. With observability built upon tracing, we can automate this process, guiding engineers towards local SLIs and SLOs that fulfill the global promise to the user.
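As a sketch of what "globally informed" could mean in practice: given an end-user latency objective and each service's share of the traced critical path, local budgets can be allocated top-down instead of guessed locally. The target and shares below are illustrative assumptions.

```python
# Allocate local latency budgets from a global end-user target,
# proportionally to each service's (hypothetical) measured share of
# the critical path observed in distributed traces.

GLOBAL_TARGET_MS = 400  # end-user latency objective

critical_path_share = {"frontend": 0.25, "auth": 0.15, "db": 0.60}

local_budget_ms = {
    svc: round(GLOBAL_TARGET_MS * share)
    for svc, share in critical_path_share.items()
}

print(local_budget_ms)
```

By construction, the local budgets sum to the global target, so meeting every local SLO actually implies meeting the user-facing one; services off the critical path would get no aggressive budget at all, avoiding the wasted optimization work from 9/.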

12/ Ultimately, SLOs can be transformative: they allow engineers to take accountability for the end-user experience no matter how deep and distributed the application may be.

But effective SLOs must be globally informed, and that’s only possible with great observability.

PS: I’m thinking about this topic because we at @LightstepHQ are running a private beta for some next-gen SLO functionality a la this thread — reply here or DM me if you’d like to participate and/or provide feedback on our designs.

PPS: For lots more detail and rigor about SLOs, I highly recommend Alex Hidalgo’s book on the subject!

For more threads like this one, please follow me here on Medium or as el_bhs on twitter!

LightstepHQ

Lightstep enables teams to detect and resolve regressions quickly, regardless of system scale.


Ben Sigelman

Written by

Co-founder and CEO at LightStep, Co-creator of @OpenTelemetry and @OpenTracing, built Dapper (Google’s tracing system).
