Scrum For SRE Teams

Hemendra Singh
3 min read · Jul 24, 2020

Introduction
Traditionally, SRE (Site Reliability Engineering) teams have used Kanban to manage their work. One drawback is that the tools and products they build for internal consumption by other teams tend to be released less often during the release cycle.

Kanban is good for operations, but remember that SRE is not only about operations; it is also about treating operations as if it were a software problem.

Some of the goals of an SRE team can be stated as follows:

  1. Own Product Reliability and Availability.
  2. Carry out Performance Engineering for the Products.
  3. Ensure product SLOs and SLAs are met.
  4. Help the products achieve the required HA (High Availability) & DR (Disaster Recovery) expectations.
  5. Use Chaos Engineering to make the products reliable under stress.
  6. Incident Management.
  7. Remove Toil from the existing Processes.
  8. Improve Product Observability.

SRE teams can adopt Scrum in 2 ways:

  1. Have one Scrum board and one Kanban board: In this approach, the Scrum board is used for delivering product increments, i.e. the products, tools, and automation the team works on. The ideal sprint length is 2 weeks. The Kanban board is used for On-Call tickets, which are created by the SRE team members currently engaged in the day-to-day operations of supporting the products onboarded to the SRE team. Since SRE owns reliability, availability, etc., managing On-Call tickets well is key to success. One demerit of this approach is the lack of observability into what is going on with the On-Call tickets.
  2. Have only one Scrum board: In this approach, a single Scrum board covers both development and operations. This increases observability of operations work as well as development work, and the team tends to be more Agile and more focused on closing tickets and delivering a product increment. By contrast, because Kanban has no fixed closure dates the way Scrum does, tracking On-Call tickets on a Kanban board can decrease the SRE team's capacity to handle them.

How to plan the SRE Sprint?
Careful sprint planning is essential for an SRE team, because you need to account for the On-Call part of SRE and for the Ad-Hoc requests that will come your way during the sprint.

Part of the capacity should be dedicated to On-Call, and some should be left for the Ad-Hoc tasks that other teams will request from you. Ad-Hoc tasks can include architecture consultations, help with using SRE tools, help with understanding dashboards, meetings for SLO agreement, high-priority work, troubleshooting production issues, etc.
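As a rough illustration of this split, the sketch below reserves a share of the sprint's hours for On-Call and Ad-Hoc work and leaves the rest for planned sprint items. The function name, the 6 focus hours per day, and the 25% / 15% shares are assumptions made up for the example, not recommendations; each team should pick its own numbers from past sprints.

```python
from dataclasses import dataclass


@dataclass
class SprintCapacity:
    total_hours: float      # all focus hours available in the sprint
    on_call_hours: float    # reserved for On-Call tickets
    ad_hoc_hours: float     # reserved for Ad-Hoc requests from other teams
    planned_hours: float    # what is left for planned sprint backlog items


def plan_sprint_capacity(members, sprint_days=10, focus_hours_per_day=6.0,
                         on_call_pct=25, ad_hoc_pct=15):
    """Split sprint capacity into On-Call, Ad-Hoc and planned work.

    All defaults are hypothetical: 10 working days matches a 2-week
    sprint; the hours and percentages are placeholders.
    """
    if on_call_pct < 0 or ad_hoc_pct < 0 or on_call_pct + ad_hoc_pct > 100:
        raise ValueError("reserved percentages must be >= 0 and sum to <= 100")

    total = members * sprint_days * focus_hours_per_day
    on_call = total * on_call_pct / 100
    ad_hoc = total * ad_hoc_pct / 100
    return SprintCapacity(total, on_call, ad_hoc, total - on_call - ad_hoc)


if __name__ == "__main__":
    # Example: a team of 5 in a 2-week sprint.
    print(plan_sprint_capacity(members=5))
```

Only the `planned_hours` figure would then be used when committing to sprint backlog items, so On-Call and Ad-Hoc work does not silently eat into the sprint goal.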

How to Scale SRE services and practices using Scrum?
To scale its services, the team should focus on delivering SRE as a Service by shipping product increments during each Scrum Sprint. The aim should be to identify the areas that require manual effort or human intervention and to automate them by delivering tools that do the job instead of an SRE team member. For example, when the SRE On-Call member receives an alert about an incident with one of the products that SRE has onboarded, the first level of troubleshooting can be automated: the software checks whether it can handle the incident and, if action is needed, auto-heals the system. Only if the software cannot handle the incident on its own should it escalate to the On-Call member. This helps the SRE team avoid toil and stay focused on delivering productive tools that make the systems more reliable.
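The first-level triage described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a real alerting integration: `Alert`, `handle_alert`, the remediation registry, and the `page_on_call` callback are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Alert:
    product: str   # product onboarded to the SRE team
    kind: str      # hypothetical alert type, e.g. "disk_full"


# A remediation tries to heal the system and returns True on success.
Remediation = Callable[[Alert], bool]


def handle_alert(alert: Alert,
                 remediations: Dict[str, Remediation],
                 page_on_call: Callable[[Alert, str], None]) -> str:
    """Try automated remediation first; page a human only if it fails."""
    remediation = remediations.get(alert.kind)
    if remediation is None:
        page_on_call(alert, "no automated remediation registered")
        return "escalated"

    try:
        healed = remediation(alert)
    except Exception as exc:  # a broken runbook must not hide the incident
        page_on_call(alert, f"remediation raised an error: {exc}")
        return "escalated"

    if healed:
        return "auto_healed"

    page_on_call(alert, "remediation did not resolve the incident")
    return "escalated"


if __name__ == "__main__":
    pages = []
    registry = {"disk_full": lambda alert: True}  # pretend cleanup succeeded
    notify = lambda alert, reason: pages.append((alert.kind, reason))

    print(handle_alert(Alert("billing", "disk_full"), registry, notify))
    print(handle_alert(Alert("billing", "db_down"), registry, notify))
    print(pages)
```

Each remediation registered this way is a candidate product increment for a sprint: every alert type that moves from "escalated" to "auto_healed" is toil removed from the On-Call rotation.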

Conclusion
SRE and Scrum go hand in hand in delivering value, eliminating toil, improving observability, and delivering better reliability for products.

By adopting Scrum, SRE teams can be more aligned with the core principle of SRE, which is “treating operations as if it were a software problem”.
