Steel Threads — Turning Site Reliability Upside Down

Published in

Site Reliability Engineering Leadership

Feb 13, 2024

When most of us think about SRE, we think about hiring individual contributors who will help software engineers deliver more resilient applications and services. We find people with the right skills to make that happen. We embed them in the teams, and then we wait for the expected improvements to take place. Hire and hope, as I call it.

SRE should not be this banal. As leaders, we should be thinking about how we drive better outcomes. I do not care if someone’s service is available when a customer is not able to use it. Having a bunch of people in an incident standing around saying “I’m up” is frustrating to anyone trying to create a customer or user experience. It doesn’t matter that you’re up; it matters that the customer or user is not able to use your system. Imagine being a large airline where nobody can book a flight because an external dependency, such as your network provider, isn’t available. What would you do then? You’re up, but your capability is unavailable and it’s costing your business millions.

I had heard the term “Steel Threads” mentioned recently, and I found the concept compelling. A coworker with better Googling skills than me (thanks, Marina!) found the white paper by Narayanappa, Bae, Alkobaisi, and Debnath which discusses tracing an experience or capability through a complex architecture. Where it gets interesting from an SRE perspective is that the approach can be used to “Mitigate and provide mechanism for addressing technical risks associated with the architecture.”

This means that instead of having individual SREs work with service teams and hoping that we somehow improve the overall reliability of the experiences we’re trying to create, we should also be focusing on tracking the resilience of the core experiences themselves. This takes the “I don’t care that your service is up, the customer can’t use our product” stance to its logical conclusion.

We need to turn SRE upside down. Yes, we still need the Bottom-Up embedded SREs doing the necessary work to eliminate toil, create dashboards, etc. But we could also use SRE concepts to create an operational view of a capability: how well we deliver that key customer experience. We need Top-Down SREs who work with product leaders and architects to map the key flows of the most important business capabilities for the enterprise.

Approaches like Event Storming are good for understanding conceptual flows, but you then must break the conceptual flow down into the physical architecture and its internal and external dependencies. Once a flow is mapped and tagged in systems, collecting the metrics should be straightforward. Then you would be able to track how well that key capability is being delivered. You could visualize an outage along with its impact on the capability and its cost to the business. You could identify the areas in the flow that are least resilient or have the highest latency, and target investments to improve them. Applying an SRE lens to the way a business operates would be hugely powerful in making data-driven executive decisions.
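To make the idea of tracking a mapped flow a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `Step` type, the step names, and the per-minute health samples are hypothetical, and a real implementation would pull this data from whatever tagging and telemetry you already have. The point it tries to show is that a capability is only usable in the minutes where every step in its thread worked, so its availability can be much lower than that of any single service.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Step:
    """One hop in a hypothetical steel thread (all names are illustrative)."""
    name: str
    external: bool               # True for dependencies outside our control
    healthy: Tuple[bool, ...]    # one sample per minute; True = step worked


def step_availability(step: Step) -> float:
    """Fraction of sampled minutes in which this single step worked."""
    return sum(step.healthy) / len(step.healthy)


def capability_availability(steps: Sequence[Step]) -> float:
    """Fraction of minutes in which every step in the thread worked."""
    minutes = len(steps[0].healthy)
    if any(len(s.healthy) != minutes for s in steps):
        raise ValueError("all steps need the same number of samples")
    usable = sum(all(s.healthy[m] for s in steps) for m in range(minutes))
    return usable / minutes


def least_resilient(steps: Sequence[Step]) -> Step:
    """The step with the lowest individual availability."""
    return min(steps, key=step_availability)


def samples(total: int, failed_minutes: set) -> Tuple[bool, ...]:
    """Build per-minute samples that are False only in the failed minutes."""
    return tuple(m not in failed_minutes for m in range(total))


# Made-up data for a "book a flight" capability over ten minutes.
book_flight = [
    Step("flight_search", external=False, healthy=samples(10, set())),
    Step("payment_service", external=False, healthy=samples(10, {3})),
    Step("external_network", external=True, healthy=samples(10, {6, 7})),
]

availability = capability_availability(book_flight)
weakest = least_resilient(book_flight)
print(f"capability availability: {availability:.0%}")
print(f"least resilient step: {weakest.name}")
```

In this made-up data no single step is worse than 80% available, yet the capability itself is usable only 70% of the time, because the failures land in different minutes. That gap is what the top-down view is meant to surface.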

For what it’s worth, this idea isn’t entirely new. A friend pointed out to me that it is very similar to using a SIPOC diagram to document a business process in the Six Sigma world. But just documenting the process isn’t enough. What I want to do is understand the operational performance of the capability from an SRE point of view. Knowing the current status of the key capabilities is the part where I’m attempting to innovate. The external network dependency could now be evaluated by how often it causes the customer experience to be unavailable or degraded, and by whether a redundant network solution is a worthwhile investment for the business.
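The investment question can be framed as simple arithmetic once the capability-level data exists. The sketch below is only that, a sketch: the function name and every figure in it are hypothetical, and it assumes you can attribute outage minutes to a single dependency and put a rough cost on each minute the capability is unavailable.

```python
def redundancy_net_benefit(
    outage_minutes_per_year: int,
    outage_minutes_with_redundancy: int,
    cost_per_outage_minute: int,
    redundancy_cost_per_year: int,
) -> int:
    """Yearly outage cost avoided by redundancy, minus what redundancy costs.

    A positive result suggests the investment pays for itself; a negative
    one suggests it does not. All inputs are estimates supplied by the caller.
    """
    avoided_minutes = outage_minutes_per_year - outage_minutes_with_redundancy
    return avoided_minutes * cost_per_outage_minute - redundancy_cost_per_year


# Hypothetical figures: the network provider took the booking capability
# down for 120 minutes last year, and a second provider is expected to
# cut that to 12 minutes.
net = redundancy_net_benefit(
    outage_minutes_per_year=120,
    outage_minutes_with_redundancy=12,
    cost_per_outage_minute=10_000,
    redundancy_cost_per_year=500_000,
)
print(f"net yearly benefit of a redundant network: {net:,}")
```

With these invented numbers the redundant network would come out ahead by 580,000 a year. The useful part is not the figure itself but that the decision is now argued from how the dependency affects the customer experience.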

I am concerned about one aspect of this effort. If we give this level of visibility into how well the business provides key customer capabilities, the data could easily be used by nefarious leaders looking to gain from twisting the information in their favor. The data should not be shared across the organization unless the culture supports it, because there’s too much incentive for bad people to use the data to do bad things. If your culture is not supportive and blameless from the top down, it’s best that only key stakeholders have visibility into the data for decision making.

We use SRE principles to help us understand how we can create more resilient services and systems. We should also use the same approach to understand how well we deliver the most important business capabilities in our organizations.

Written by Jamie Allen