Published in LightstepHQ

DevOps, Observability, and the need to tear down organizational boundaries

Originally posted as this twitter thread.

0/ Like most organizational innovations, DevOps is powerful because it allows people to be more productive by crossing fewer org boundaries.

The first set of boundaries is the org chart itself.

The second set of boundaries is less obvious, yet just as important.

1/ So, about that first set of org boundaries: by segmenting Dev and Ops roles, then by taking things a step further and creating separate orgs for each, we guaranteed that software deployment would always require human beings to wait for each other. Not good.

2/ And it’s worse than that… Since goals and incentives naturally align to the org chart, the healthy tension between velocity and reliability is lost at the team level, and what should feel like “a productive push and pull” can instead turn into a corporate political battle.

3/ Anecdote: at Google, the leader in their “parallel” SRE org told me that Monarch must have “100% availability.” When I asked him to be more realistic and “specify how many 9’s,” he just paused and repeated: “One Hundred Percent.”

Truly a portrait of unhealthy org tension.

4/ So, yes: by integrating aspects of Dev and Ops and by tearing down artificial org boundaries, DevOps engineers can automate everything, deploy continuously, and deploy independently. Right?

Actually, no, and that brings us to our second set of org boundaries!

5/ There’s this stubborn myth that DevOps teams operate independently of each other. It’s as if we forget that the services they develop and operate are *part of a larger whole*, and that — since the *services* in an app depend on each other — the DevOps teams do, too.

6/ A few years ago, we at LightstepHQ hosted the (wonderful) Vijay Gill for a talk where he explained how, inevitably, you “ship your org chart.” So, when our diagnoses traverse system boundaries, they’re also traversing org boundaries.

You can’t hide from Conway’s Law. ⚖️

7/ Anyway, about 70–80% of all unplanned changes (read: “production incidents”) are actually due to a planned change elsewhere in the system! And so, in this way, the DevOps teams on both sides of this incident are blocked on each other — how can we fix that?

8/ Unfortunately this one requires even more than merging job descriptions or refactoring the org chart: this time we need technology to help address the communications issue.

In particular, we need observability to hop across org boundaries and unblock the human beings.

9/ For example…

When Team A makes a planned change to Service X, they create an “unplanned change” (i.e., “an incident”) in Service Y, which wakes up Team B.

Observability can and should tell Team A and Team B what’s just happened, providing evidence along the way.
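To make that concrete, here is a minimal sketch of the correlation step, not a description of how any particular product does it. It assumes you already have a service dependency map and a log of planned changes; every name here (`DEPENDS_ON`, `Change`, `suspect_changes`, the service and team names, the 30-minute window) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A planned change (e.g., a deploy). Hypothetical record shape."""
    service: str
    team: str
    deployed_at: datetime


# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "service-y": ["service-x"],
    "service-x": [],
    "service-z": [],
}


def transitive_dependencies(service, depends_on):
    """Return every service that `service` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def suspect_changes(incident_service, incident_at, changes, depends_on,
                    window=timedelta(minutes=30)):
    """List planned changes to dependencies that landed shortly before the incident.

    The window is an arbitrary assumption; the result is a list of leads
    to investigate, newest first, not proof of cause.
    """
    deps = transitive_dependencies(incident_service, depends_on)
    suspects = [
        c for c in changes
        if c.service in deps
        and timedelta(0) <= incident_at - c.deployed_at <= window
    ]
    return sorted(suspects, key=lambda c: c.deployed_at, reverse=True)


# Team A deploys Service X; ten minutes later Service Y pages Team B.
changes = [
    Change("service-x", "team-a", datetime(2024, 1, 1, 12, 0)),
    Change("service-z", "team-c", datetime(2024, 1, 1, 12, 5)),
]
suspects = suspect_changes(
    "service-y", datetime(2024, 1, 1, 12, 10), changes, DEPENDS_ON
)
for c in suspects:
    print(f"{c.team} changed {c.service} at {c.deployed_at:%H:%M}")
```

The point of the sketch is the join across team boundaries: the incident belongs to Team B, but the lead it surfaces belongs to Team A. Real evidence would come from traces and metrics rather than timestamps alone.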

10/ Once all parties understand what happened and who’s responsible, mitigating the issue is the easy part. Inter-team discovery and communication are “the second boundary” impeding productivity, and another reason why observability must play a central role in DevOps.

Follow me (el_bhs) on twitter for more threads like this one.

Lightstep delivers unified observability, with visibility across multi-layered architectures, enabling teams to detect and resolve regressions quickly, regardless of system scale or complexity.

Ben Sigelman

Co-founder and CEO at LightStep, Co-creator of @OpenTelemetry and @OpenTracing, built Dapper (Google’s tracing system).
