Shared On-Call Is Where the SRE Magic Happens
TL;DR: Software engineers and SREs should share a single on-call rotation as part of a single team, because that is where they build empathy for each other.
Organizations adopt SRE in ways that suit them best, and that makes complete sense to me. Not every company is Google or Facebook, and it’s important to tailor your implementation to the specific needs of your organization. However, there is one implementation detail that I think is very important and should not be overlooked.
Some organizations have adopted an SRE model where SREs are responsible for the entire on-call rotation, often handling some combination of L1-L2 support to triage issues and handing off L3 to software engineers (SWEs). In other cases, SREs are not involved in application on-call rotations at all and work at a platform level below the SWEs, detecting and addressing platform-level issues on behalf of everyone who uses the platform.
While opinions vary about what the word “DevOps” means, I think of DevOps as the journey your team undertakes to “shift left” the ownership of applications and services in production: SWEs do more of the work to push code to production and operationalize it until, at some point, they own their code in production entirely. I use the word journey intentionally. It is not a change teams should attempt overnight, as it is very much an Organizational Change Management concern.
If an organization adopts the “embedded SRE” model (where SREs are part of a product delivery team alongside SWEs, testers, product owners, etc., and become domain experts over time), the true magic of SRE is revealed in a shared on-call rotation. SWEs and SREs in the team are registered in the same rotation for an application or service and take shifts as their turn comes up. Because they have the same duties during those shifts, they build empathy for the role each engineering function plays in delivering resilient software.
SWEs will understand the importance of runbooks and of keeping them up to date, and they will make sure that dashboards supporting new features and capabilities are part of the Definition of Done before a release. They will understand the value of the Toil backlog the SREs maintain, which is hopefully groomed with their visibility (if not their input). Similarly, SREs will come to understand the difficulty of architecting and engineering transactional distributed systems, which may not be a core skill for someone with a SysAdmin or Linux kernel background.
Both groups will feel the pain of production incidents and understand the value of fixing the specific issues that make on-call shifts difficult. And they will work together to fix them, with each remembering vividly why it matters.
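To make "registered in the same rotation" concrete, here is a minimal sketch of one way a shared schedule could be built. Everything in it is an assumption for illustration: the `Engineer` type, the made-up names, the one-week shift length, and the choice to interleave roles are mine, and in practice this would usually live in your paging tool's schedule configuration rather than in code.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from itertools import zip_longest


@dataclass(frozen=True)
class Engineer:
    name: str
    role: str  # "SWE" or "SRE"


def build_shared_rotation(swes, sres):
    """Interleave SWEs and SREs into one ordered rotation.

    Interleaving is an illustrative choice; any single ordering that
    contains both roles would be a shared rotation.
    """
    rotation = []
    for swe, sre in zip_longest(swes, sres):
        if swe is not None:
            rotation.append(swe)
        if sre is not None:
            rotation.append(sre)
    return rotation


def schedule(rotation, start, weeks):
    """Assign one-week shifts round-robin, beginning on `start`."""
    if not rotation:
        raise ValueError("rotation must contain at least one engineer")
    return [
        (start + timedelta(weeks=i), rotation[i % len(rotation)])
        for i in range(weeks)
    ]


# Hypothetical team: names and team size are made up for illustration.
swes = [Engineer("Ana", "SWE"), Engineer("Bo", "SWE"), Engineer("Cy", "SWE")]
sres = [Engineer("Dee", "SRE")]

rotation = build_shared_rotation(swes, sres)
shifts = schedule(rotation, date(2024, 1, 1), weeks=6)

for shift_start, engineer in shifts:
    print(shift_start.isoformat(), engineer.role, engineer.name)
```

The point of the sketch is that the schedule does not distinguish between roles once the rotation is built: whoever is next takes the shift, with the same duties as everyone else.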
If possible, and if you’re leveraging the embedded model, try to move your teams into a shared SWE and SRE on-call rotation as part of your DevOps journey.