Let service teams own the service operations instead of the SRE

Manish Maheshwari
Jan 1, 2022

In the ‘You build it, you run it’ model, the development team owns operations and service management. Organizations such as Amazon and Netflix operate on this model; having developers own operational responsibilities has greatly improved service availability for customers, with a positive impact on the business.

SRE responsibilities

The SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).

Who owns availability

In many organizations, availability is co-owned by the Product and SRE teams. The two teams interface with each other through Service Level Objectives (SLOs) and error budgets. The business or product team defines the availability targets, but responsibility for meeting them is shared with the SREs. To ensure that the product meets its availability targets, availability should be owned by the Product team alone, because it has the strongest incentive to keep the product reliable. SREs can only help the Product team meet its targets by helping it work through reliability challenges.
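To make the SLO and error budget interface concrete, here is a minimal sketch of the arithmetic, assuming a time-based availability SLO measured over a rolling window. The names `Slo`, `error_budget_minutes`, and `budget_remaining` are hypothetical and chosen only for illustration; real teams often measure budgets on request counts rather than minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical availability SLO: a target ratio over a rolling window."""
    target: float      # e.g. 0.999 for "three nines"
    window_days: int   # length of the rolling window in days


def error_budget_minutes(slo: Slo) -> float:
    """Minutes of downtime the SLO tolerates within one window."""
    if not 0.0 < slo.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    total_minutes = slo.window_days * 24 * 60
    return total_minutes * (1.0 - slo.target)


def budget_remaining(slo: Slo, downtime_minutes: float) -> float:
    """Budget left after the downtime already observed (negative if overspent)."""
    return error_budget_minutes(slo) - downtime_minutes


if __name__ == "__main__":
    slo = Slo(target=0.999, window_days=30)
    print(f"budget: {error_budget_minutes(slo):.1f} min")
    print(f"left after a 10 min outage: {budget_remaining(slo, 10.0):.1f} min")
```

With a 99.9% target over 30 days, the budget works out to about 43 minutes of downtime. Whoever owns availability owns the decision about how to spend that budget.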

Traditional SRE model does not scale

Traditionally, organizations had SysAdmins responsible for deploying, scaling, and maintaining servers in a datacenter or in the cloud. As teams modernized their tech stacks, they moved to a DevOps model in which Site Reliability Engineers (SREs) own day-to-day service operations and availability monitoring for code they have not written. As teams and the business grow, the problem with this centralized model becomes visible: the SRE team has to prioritize critical services and is still understaffed.

On-call ownership

Many organizations operate on a model in which SREs are responsible for managing production incidents and are on call 24x7 on behalf of the service team. SRE talent is better used to enable service teams to engineer and operate reliable systems. Expecting SREs to be permanently on call is like keeping a guard on duty; it demotivates them and increases churn on the SRE team.

Several organizations now have an incident response team that owns on-call for the key products. The incident response team is responsible for orchestrating the incident response and post-mortems. Service teams have their own on-call rotations, in which developers own the on-call responsibilities for their microservice(s). In these organizations, SREs are responsible for enabling all service teams to build reliable services.

Principles for a new SRE model

  • Automation over manual efforts. Build tools to reduce toil and automate all repetitive tasks.
  • Consistent processes and procedures. All services operate using uniform operational standards and quality.
  • Codifying best practices. The operating best practices can easily be shared and adopted by all development teams as part of their service design.
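As an illustration of what codifying best practices can look like, the sketch below checks a service's metadata against a shared list of operational standards. The standard names and the `missing_standards` helper are assumptions made up for this example; they are not taken from any specific company's tooling.

```python
from typing import Dict, List

# Hypothetical company-wide operational standards every service should meet.
REQUIRED_STANDARDS = (
    "slo_defined",
    "runbook_url",
    "oncall_rotation",
    "alerts_configured",
)


def missing_standards(service: Dict[str, object]) -> List[str]:
    """Return the standards that are absent or empty in the service metadata."""
    return [key for key in REQUIRED_STANDARDS if not service.get(key)]


if __name__ == "__main__":
    checkout = {
        "slo_defined": True,
        "runbook_url": "",            # empty value counts as missing
        "oncall_rotation": "team-checkout",
    }
    print(missing_standards(checkout))
```

A check like this could run in a build pipeline, so that every team gets the same feedback without an SRE reviewing each service by hand.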

Speed is your biggest advantage

Speed is of the essence in technology product development. The most critical step toward moving fast is to have cross-functional product teams. A product development team cannot operate swiftly if it depends on separate functional teams such as Test, SysAdmins, or SRE. Your product teams must have those roles and skills themselves. One of the most frustrating aspects of an application release is waiting on external teams to complete tasks that are blocking you. Cross-functional teams eliminate this bottleneck.

Solution — Hybrid SRE organization

I have seen a few companies, such as Airbnb, with a hybrid SRE organization made up of two teams, Central and Embedded SREs, in a hub-and-spoke model. SREs do not own the availability goals for a service team, and they do not carry the pager on behalf of the service team. Instead, embedded SREs take on-call rotation turns with the developers on the team.
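A shared rotation of this kind can be sketched as a simple round-robin over developers and embedded SREs. This is only an illustration under assumed names (`build_rotation`, the member labels); real schedules also account for time zones, leave, and escalation tiers.

```python
from itertools import cycle, islice
from typing import List, Sequence


def build_rotation(
    developers: Sequence[str],
    embedded_sres: Sequence[str],
    shifts: int,
) -> List[str]:
    """Assign on-call shifts round-robin across developers and embedded SREs."""
    members = list(developers) + list(embedded_sres)
    if not members:
        raise ValueError("at least one team member is required")
    # cycle() repeats the member list; islice() takes the first `shifts` slots.
    return list(islice(cycle(members), shifts))


if __name__ == "__main__":
    schedule = build_rotation(["dev-a", "dev-b"], ["sre-x"], shifts=5)
    for week, person in enumerate(schedule, start=1):
        print(f"week {week}: {person}")
```

The embedded SRE takes a turn like any other team member and does not act as a standing first responder for the team.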

The Central SRE team’s goal is to define consistent, company-wide processes and procedures for service operations, build tooling that supports developers in following them, and provide consulting as needed. Embedded SREs join a service team like consultants; their goal is to ensure that the service team can successfully adopt the processes and procedures and make use of the supporting tools. Embedded SREs own a program with clear entry and exit criteria. You can think of them as Solution Architects who help an organization adopt a specific technology to improve reliability.

Although this looks like a centralized organization in which decision making can be slow, the day-to-day decisions are made by the service teams themselves. Embedded SREs focus on service-specific reliability challenges.

Conclusion

SREs are best leveraged to build highly reliable technology products. Hybrid SRE organizations can help scale SRE by enabling all development teams to adopt best practices. A centralized incident response team can own on-call. This strategy can significantly improve the availability of your technology products.

Disclaimer: All opinions in this post are my own and are in no way affiliated with any organization, including Airbnb.
