“Support squads”: An agile approach to supporting large systems
A few years ago, I returned from parental leave to discover that my teams had grown by 25 percent. Amazing! And very much needed — our support rotations had been in need of relief for some time.
Our group was organized into three scrum teams, each focused on a specific set of features and stakeholders. One was particularly frontend-oriented and handled the majority of our inbound user questions (we didn’t have a designated support team for these). Another specialized in infrastructure projects and handled any automated alerts from our monitoring system. As our product and user base grew, this operational work took up more and more time. Engineers were getting burned out, and development velocity on those two teams was suffering.
I wanted to spread the load of this work across more engineers. But in my previous experience, large rotations can be problematic:
- If the surface area of the system being supported is too large, it can be difficult to deeply understand all the relevant details, and ongoing training to keep engineers up to date can be time-consuming.
- If individual engineers are only on-call for a system once per quarter, they don’t have a strong incentive to spend time making the experience better.
So, we concluded it was best to split up the responsibilities. But this introduced a new problem: Our code lived in a monolith, and we hadn’t yet established fully clean technical or product boundaries between our teams. How should we organize support for a system that spanned multiple teams, while still avoiding the pitfalls of large rotations? Our solution was a model that we call “Support Squads.”
Introducing “Support Squads”
A support squad is a group of five to seven engineers that exists outside of any individual scrum team and is tasked with servicing and improving one aspect of supporting our systems. Each support squad has a charter, defined success metrics, and a squad leader who is accountable for running it. Each engineer in our group chooses a few squads to join; in our case, the math worked out to two squads per person.
To scope each squad, we looked for tasks that needed ongoing engineering investment but were not clearly owned by a single product team. This led us to identify four kinds of support that we wanted to staff for our first iteration:
- A pager duty squad that was on-call to handle automated alerts,
- A first-line support squad that responded to user issues and questions,
- A release engineering squad that handled our weekly production deployments, and
- A performance engineering squad that specialized in troubleshooting performance issues across our stack.
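To make the staffing arithmetic concrete, here is a rough Python sketch. It assumes all four squads fall in the five-to-seven range and that every engineer joins the same number of squads; the function names and the group size of 12 are hypothetical illustrations, not figures from our actual roster.

```python
def memberships_per_engineer(num_squads: int, squad_size: int, num_engineers: int) -> float:
    """Average number of squads each engineer must join to fill every seat."""
    total_seats = num_squads * squad_size
    return total_seats / num_engineers


def headcount_range(num_squads: int, min_size: int, max_size: int, squads_each: int) -> tuple[float, float]:
    """Group sizes that fill squads of min_size..max_size when everyone joins squads_each squads."""
    return (num_squads * min_size / squads_each, num_squads * max_size / squads_each)


# Hypothetical group: 12 engineers staffing four squads of six.
print(memberships_per_engineer(4, 6, 12))   # 2.0
# Four squads of five to seven engineers, two squads per person.
print(headcount_range(4, 5, 7, 2))          # (10.0, 14.0)
```

Under those assumptions, a group of roughly 10 to 14 engineers lands on two squads per person. A much larger or smaller group would need a different number of squads, a different squad size, or a different number of memberships per engineer.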
Among this model’s advantages:
- It allows us to increase our support bandwidth without requiring everybody to learn everything.
- Specific people are now accountable for keeping each rotation healthy. This also creates new leadership opportunities for our senior ICs.
- Engineers feel more invested in helping out with support, since they get to choose how they spend their time based on what skills they would like to practice.
- We can keep things new and engaging by rotating engineers across squads over time.
You can’t judge a decision if you don’t measure the results, so we made sure to gather data on how well our support squads were operating.
And the result: success! Engineer satisfaction with support work (measured with a quarterly survey) increased significantly, and overall reliability of our systems improved across the board, thanks to continuous investment and process improvements from the squads.
This model provides us with another long-term benefit: As our product teams evolve, the support squads can change with them. Over the years, we have adapted squad missions and structures to fit new needs for our teams. The goal, though, will always remain the same: to ensure a smooth and productive support experience for our users and engineers alike.