DevOps, Observability, and the need to tear down organizational boundaries
Originally posted as this twitter thread.
0/ Like most organizational innovations, DevOps is powerful because it allows people to be more productive by crossing fewer org boundaries.
The first set of boundaries is the org chart itself.
The second set of boundaries is less obvious, yet just as important.
1/ So, about that first set of org boundaries: by segmenting Dev and Ops roles, then by taking things a step further and creating separate orgs for each, we guaranteed that software deployment would always require human beings to wait for each other. Not good.
2/ And it’s worse than that… Since goals and incentives naturally align to the org chart, the healthy tension between velocity and reliability is lost at the team level, and what should feel like “a productive push and pull” can instead turn into a corporate political battle.
3/ Anecdote: at Google, the leader in their “parallel” SRE org told me that Monarch must have “100% availability.” When I asked him to be more realistic and “specify how many 9’s,” he just paused and repeated: “One Hundred Percent.”
Truly a portrait of unhealthy org tension.
4/ So, yes: by integrating aspects of Dev and Ops and by tearing down artificial org boundaries, DevOps engineers can automate everything, deploy continuously, and deploy independently. Right?
Actually, not right: and there’s our second set of org boundaries!
5/ There’s this stubborn myth that DevOps teams operate independently of each other. It’s as if we forget that the services they develop and operate are *part of a larger whole*, and that — since the *services* in an app depend on each other — the DevOps teams do, too.
6/ A few years ago, we at LightstepHQ hosted the (wonderful) Vijay Gill for a talk where he explained how, inevitably, you “ship your org chart.” So, when our diagnoses traverse system boundaries, they’re also traversing org boundaries.
You can’t hide from Conway’s Law. ⚖️
7/ Anyway, about 70–80% of all unplanned changes (read: “production incidents”) are actually due to a planned change elsewhere in the system! And so, in this way, the DevOps teams on both sides of this incident are blocked on each other — how can we fix that?
8/ Unfortunately this one requires even more than merging job descriptions or refactoring the org chart: this time we need technology to help address the communications issue.
In particular, we need observability to hop across org boundaries and unblock the human beings.
9/ For example…
When Team A makes a planned change to Service X, they create an “unplanned change” (i.e., “an incident”) in Service Y, which wakes up Team B.
Observability can and should tell Team A and Team B what’s just happened, providing evidence along the way.
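To make that concrete, here is a minimal sketch of the kind of cross-team correlation I have in mind. Everything in it is hypothetical: the service and team names, the `PlannedChange` record, the dependency map, and the one-hour lookback window are assumptions for illustration, not how any particular observability product works. A real system would lean on traces and richer evidence than timestamps alone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PlannedChange:
    service: str           # service that was changed
    team: str              # team that owns that service
    deployed_at: datetime  # when the change shipped


def upstream_services(service, depends_on):
    """Return every service that `service` depends on, directly or transitively."""
    seen = set()
    stack = [service]
    while stack:
        for dep in depends_on.get(stack.pop(), ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def suspect_changes(incident_service, incident_at, changes, depends_on,
                    window=timedelta(hours=1)):
    """Planned changes to upstream services that shipped shortly before the incident."""
    upstream = upstream_services(incident_service, depends_on)
    suspects = [
        c for c in changes
        if c.service in upstream
        and timedelta(0) <= incident_at - c.deployed_at <= window
    ]
    # Most recent change first: a starting point for discussion, not a verdict.
    return sorted(suspects, key=lambda c: c.deployed_at, reverse=True)


# Hypothetical data: Service Y (Team B) calls Service X (Team A).
depends_on = {"service-y": ["service-x"], "service-x": ["service-z"]}
changes = [
    PlannedChange("service-x", "team-a", datetime(2024, 1, 1, 12, 0)),
    PlannedChange("service-z", "team-c", datetime(2024, 1, 1, 8, 0)),
]
incident_at = datetime(2024, 1, 1, 12, 20)

suspects = suspect_changes("service-y", incident_at, changes, depends_on)
for c in suspects:
    print(f"{c.team} changed {c.service} at {c.deployed_at:%H:%M}")
```

With this sample data, only Team A's change to Service X falls inside the window, so Team B gets a lead that points across the org boundary instead of starting from scratch.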
10/ Once all parties understand what happened and who’s responsible, mitigating the issue is the easy part. Inter-team discovery and communication are “the second boundary” impeding productivity, and another reason why observability must play a central role in DevOps.
Follow me (el_bhs) on twitter for more threads like this one.