SRE and Product

Jamie Allen
Site Reliability Engineering Leadership
7 min read · Jul 10, 2022

A post about the importance of product in SRE, and how to make product and SRE first-class citizens in your Software Development Lifecycle.

As Site Reliability Engineers and leaders, it’s easy to get bogged down in the details of what we’re doing — are we releasing fast enough, are we resilient enough, are we prepared for the next inorganic burst of traffic, etc. What is interesting about each of those questions is that they are product-focused. SRE itself is a product concern, and it requires that we partner cross-functionally with product leadership to define success and make decisions about how reliable the system should be.

I’ve previously worked as a leader in the “Reactive” systems world (I was global head of training and consulting for the company that made Scala and Akka), where we focused on building responsive systems through message-driven architectures that supported resilience and elasticity (through fan-out). At the time, I posited that while these concerns were mostly non-functional, they were product-driven as well. It is a product decision to choose the best architecture to support the requirements of the system.

SRE is similar. Product leadership needs to help us define our SLOs and decide how we behave if and when we break our error budgets: do we always pivot to focus on system resilience, or are there times when we continue to push features despite lower-than-acceptable resilience? Product also needs to help us decide what to do when we DON’T break our error budgets: do we deliberately take ourselves out in production to see the downstream impacts, do we push more features, or should we consider adjusting our SLO?
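To make the error budget questions concrete, here is a minimal Python sketch. It assumes a request-based SLO, and the `ErrorBudget` class, the `recommend` function, and the 25% "low burn" threshold are all hypothetical names and values chosen for illustration. The real policy is whatever product, engineering, and SRE leadership agree on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    slo_target: float     # e.g. 0.9999 means 99.99% of requests must succeed
    total_requests: int   # requests served in the SLO window
    failed_requests: int  # requests that violated the SLI in that window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def consumed(self) -> float:
        # Fraction of the budget used; above 1.0 means the budget is broken.
        if self.allowed_failures == 0:
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / self.allowed_failures


def recommend(budget: ErrorBudget, low_burn: float = 0.25) -> str:
    """Hypothetical policy; the thresholds are placeholders, not a standard."""
    if budget.consumed > 1.0:
        return "budget broken: product decides whether to pause features for resilience work"
    if budget.consumed < low_burn:
        return "budget largely unspent: consider game days, more features, or adjusting the SLO"
    return "within budget: continue as planned"
```

The point of the sketch is that the code can only report how much budget is left. Which branch the organization actually takes is a product decision.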

It is a given in the SRE world that SREs will partner with Software Engineers (SWEs) to deliver an application, service, or experience together. Every project team requires a triumvirate representing the concerns of Product (what is success), Software Engineering (how we will build it), and Site Reliability (how we will measure and improve experiences) leadership.

The Product-Focused SDLC

I’m going to digress a bit from SRE, but I will come back to why it’s relevant later. I’ve previously had the good fortune to create the Software Development Lifecycle (SDLC) for a large enterprise organization (Fortune 125), with the mandate to “act as a startup” and not to allow existing behaviors and processes to impact what we did; to think outside the box, as it were. While it was not my intention at first, I ended up building an auditable, product-focused process for software delivery that made every activity product-first.

I was fortunate that the organization had an excellent product team to begin with. I was able to partner with product executives to drive our roadmap and define feature dependencies. We used Event Storming and Domain-Driven Design to define the workflows, ubiquitous language, and conceptual architecture across bounded contexts (giving us clearly delineated micro-services). Engineers and architects used that to define the physical, cloud-based architecture to implement, while product owners and business analysts wrote Epics and User Stories. User Stories were written in Gherkin to be executed as Behavior-Driven Development integration tests, and we built a custom platform for executing them. That platform let us guarantee consistency between our User Stories and what we tested: if a Story changed, the test was automatically broken until we met the new specification. Business Analysts would sign off that a User Story was delivered, and Product Owners would sign off on the delivery of Epics. And as we pushed new features and capabilities, we always knew immediately when we broke existing functionality, because our integration tests were regressions as well. When we went live for the first time, we had very high confidence that we had delivered exactly what was required, and it was the most successful delivery of my career.
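I am not describing how our custom platform was implemented here, but one way to get the "a changed Story breaks its test" behavior is to bind each test to a fingerprint of the Story text it was written against. The Python sketch below is an assumption for illustration only; `fingerprint`, `StoryChanged`, and `run_story_test` are hypothetical names, not part of any Gherkin tooling.

```python
import hashlib


def fingerprint(story_text: str) -> str:
    # Normalize whitespace so that reformatting alone does not break a test.
    normalized = " ".join(story_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class StoryChanged(Exception):
    """Raised when a User Story no longer matches the text a test was written against."""


def run_story_test(current_story: str, approved_fingerprint: str, test) -> bool:
    # Fail before running if the Story has changed since the test was approved.
    if fingerprint(current_story) != approved_fingerprint:
        raise StoryChanged("story text changed; update the test to the new specification")
    return bool(test())
```

In this sketch, someone has to update the test and record the new fingerprint before the suite goes green again, which keeps the Story and the test from drifting apart silently.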

What is more interesting is that I did not embed Scrum Masters or agile leaders into the teams. Initially, I thought I would have liked to, but we needed to get to work and we didn’t have any around. As we evolved, I came to realize that product leaders began to run the teams directly. Instead of having Scrum Masters lead daily stand-ups, track burn-down, measure velocity, etc. and report that to product and engineering leadership, we had product leaders and engineering team leads partner to drive the teams’ activities and their progress.

This led to some interesting dynamics. When an engineering team took on a User Story, it might switch out the Business Analysts (BAs) it worked with during that sprint. Before each sprint, the engineering team met to discuss which User Stories were coming to it in the backlog for upcoming sprints, which introduced the team to the BAs it would be working with and gave it visibility into future work.

I’m sure there are good Scrum/Agile leaders who will not appreciate this approach and who will argue that they bring value to software delivery. Even so, we successfully delivered amazing new features and experiences without them. I think a primary reason is that we made product a first-class citizen of software development, not an external concern. Because product leaders were so invested in delivery and so close to the engineering teams, they were always engaged and excited about the work we delivered together. By contrast, I’ve also worked in organizations where Backlog Grooming was perfunctory at best, and where product leaders rarely showed up, which made the engineers they supposedly partnered with feel that the organization didn’t value them. This kind of dysfunction leads to attrition.

Product-Driven Development

In this model, we see the following responsibilities:

  • Technical Product Manager and Enterprise Architect are the outward-facing leaders of the effort, ensuring that communication with other external leaders is constantly taking place.
  • The Product Owners and Solution Architects (there may be more than one of each, depending on the scope of the effort) are inward-facing, working at the Epic level to ensure that we understand how subsystems will work together to deliver customer experiences.
  • The Business Analysts and Engineering Team Leads are responsible for implementing the User Stories in accordance with the Product Owners and Solution Architects they partner with.

This team and organization structure still exists to this day, and drives some of the most well-known and respected customer digital experiences in the world.

Layering in SRE

You might be asking what all of this has to do with SRE. At the time, I hadn’t yet transitioned my career to become infrastructure- and SRE-focused. As a team, we did set up dashboards, create runbooks, and own our deliveries in production via on-call rotations. But we weren’t thinking holistically about resilience as a first-class citizen in our SDLC, even though we were focused on building “Reactive” implementations.

If I had to do this all over again, I would have added a third column to the product delivery teams to focus on SRE. The idea is to make resilience, like product itself, a first-class citizen in the way organizations deliver software.

Product-Driven Development with SRE

SREs would be included at the development team level, working to define and provide historical visibility into SLIs/SLOs/SLAs, tracking MTTF/MTTR, reducing toil, managing capacity, and helping with incident management and preparation. The SRE Lead would ensure that the non-functional requirements of customer experiences at the Epic level are being met, and would partner with senior product leadership to address any issues. The SRE Organization Lead would own SRE across the strategic project effort: making sure that MTTF and MTTR are tracking in the right direction, making high-level product-driven decisions about how to handle Error Budgets, scheduling Game Day activities, deciding whether or not to adopt Chaos Engineering, ensuring that SRE teams addressing toil communicate so that solutions can be reused, and so on.
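As a rough illustration of what "tracking MTTF/MTTR" can mean in practice, here is a minimal Python sketch. It assumes incidents are recorded with a start and a resolution timestamp. The `Incident` type and both function names are hypothetical, and real teams often define these metrics with more nuance (detection time, partial outages, and so on).

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import NamedTuple


class Incident(NamedTuple):
    started: datetime
    resolved: datetime


def mttr(incidents: list[Incident]) -> timedelta:
    # Mean time to recovery: average duration from start to resolution.
    return timedelta(seconds=mean((i.resolved - i.started).total_seconds() for i in incidents))


def mttf(incidents: list[Incident]) -> timedelta:
    # Mean time to failure: average healthy time between one incident's
    # resolution and the next incident's start. Needs at least two incidents.
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [(b.started - a.resolved).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return timedelta(seconds=mean(gaps))
```

Computed per quarter, for example, these two numbers give the SRE Organization Lead a simple trend to check: MTTF should be growing and MTTR shrinking.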

Who Decides When To “Chase Resilience?”

When we’re deciding on SLOs for our systems, we should have historical data that helps us define the achievable value for that SLO. The Product Owner works with the Software Engineers and SREs to set that value. As an example, maybe 99.99% of all requests should return without an error (either a 500 server error or the wrong data). That means that no more than 1 request in 10,000 should return an error.
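One simple way to turn historical data into a starting point for that conversation is to ask how many 9s the service has met in every past window. The Python sketch below assumes request and error counts per window are available and that every window saw traffic. `achievable_nines` is a hypothetical helper, and taking the worst window is only one possible rule.

```python
from fractions import Fraction


def success_ratio(total: int, errors: int) -> Fraction:
    # Exact ratio, so boundary cases such as 9999/10000 compare correctly.
    return Fraction(total - errors, total)


def achievable_nines(windows: list[tuple[int, int]], max_nines: int = 6) -> int:
    """Largest count of 9s (2 = 99%, 4 = 99.99%) met in every historical window.

    Each window is a (total_requests, error_requests) pair.
    """
    worst = min(success_ratio(total, errors) for total, errors in windows)
    nines = 0
    while nines < max_nines and worst >= 1 - Fraction(1, 10 ** (nines + 1)):
        nines += 1
    return nines
```

The result is a candidate, not a decision. The Product Owner may still choose a looser or tighter target based on what customers actually need.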

What would be the cost and benefit to the business of chasing 5 9s (99.999%, or 1 error in every 100,000 requests)? Is it something that could be achieved rather easily, maybe because many of the errors are caused by concurrent connection issues in a piece of pluggable infrastructure? Or would it require a complete redesign and reimplementation of the service to use a different database solution? A product owner is responsible for working with us to figure out whether or not chasing additional 9s has enough value to the business to make it worthwhile.
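The arithmetic behind that question is small enough to sketch. Assuming a request-based SLO, the hypothetical Python helpers below show how many errors each target allows and how many of today's errors would have to be eliminated to reach it. The cost of eliminating them is the part only the team and the product owner can estimate.

```python
def allowed_errors(total_requests: int, nines: int) -> int:
    # Four 9s allow 1 error in 10**4 requests; five 9s allow 1 in 10**5.
    return total_requests // 10 ** nines


def errors_to_eliminate(total_requests: int, observed_errors: int, nines: int) -> int:
    # How many of today's errors would have to disappear to meet the target.
    return max(0, observed_errors - allowed_errors(total_requests, nines))
```

For a service handling 50,000,000 requests with 4,000 observed errors, four 9s allow 5,000 errors, so the service already complies. Five 9s allow only 500, so 3,500 errors would have to go.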

We Cannot Have SRE Without Product Input

It is easy for some infrastructure or platform teams to say that they have no product leadership. That is short-sighted, in my opinion. Even if your team provides an infrastructure capability to the rest of the organization, such as containerization and orchestration as a service, you should still have product owners to help you define what success looks like for your users and how to design experiences that best meet their needs. If you don’t have someone doing that, you are likely doing it yourself or ignoring the definition of success altogether. Neither is optimal.

A very famous programmer, someone many of you have heard of for the OSS projects they’ve created and the public speaking they’ve done, essentially became a product owner at another company. And they loved it. Their job was to figure out how to solve an organization-wide problem about how we deployed software at massive scale, and redesign how we would do so in the future. More of us should consider taking on technical product ownership roles to help solve the biggest problems our organizations face.

As SREs, make sure you partner with product to address everything that you do. You will be surprised how much product, and the definition of customer success that they provide, influence the work that you deliver.


Jamie Allen
Site Reliability Engineering Leadership

SRE CTO. Ex-Software engineering leader behind Starbucks Rewards and MOP. Ex-Facebook SRE leader.