Building a Successful SRE Team

Sven Hans Knecht
7 min read · Jun 12, 2023

Successful techniques to ensure your SRE team delivers value

[Image: an iceberg, bigger under the water than on top]

Introduction

When I joined Mission Lane, I was one of two Site Reliability Engineers (SREs) hired; the other was my boss, the manager of the SRE team and the eventual Director of the Platform Organization. We were given a mandate to build out an SRE team that would own Observability and Developer Experience. This began a three-year journey that built a successful SRE organization and delivered immense value to Mission Lane. The SRE team was made up of four individuals who supported 250 microservices, 130 developers, hundreds of releases to various environments each day, and nearly a billion logs and traces each day.

The SRE team would go on to:

  • Build a standardized helm chart for Mission Lane Developers to use
  • Manage automatic distributed tracing
  • Create an observability stack that would process half a million logs and traces per second
  • Handle automatic canary releases for applications that wanted it significantly lowering
  • Manage automatic dependency updates
  • Build out a useful service catalog

And along with our sister teams, Cloud Platform Engineering (CPE) and DevSecOps (DSO), we’d also build an almost entirely self-service developer platform that allowed a developer to go from idea to production with only three PRs (and some discussion on naming).

This was also the second time that I was part of building a successful SRE team, the first being at Capital One. Here are four of the lessons I learned that should help you build a successful SRE organization.

  1. Focus on Developer Training
  2. Focus on the Right Abstractions
  3. Focus on Self-Service
  4. Automate Yourself Out of a Job

Focus on Developer Training

When you spend all day working on a particular piece of technology or platform, you become an expert in it. And when you become an expert in a piece of technology, you become used to the quirks, problems, and edge cases of your platform. Further, you quickly run into the power user problem where you want as much customization as possible. You know exactly what is available, how it all works/is connected, and have a great mental model for how things work.

However, the developers –your customers– do not. They are focused on delivering business value via their services. Development teams vary in engineering maturity, and with it in how much space they have to focus on reliability, tooling, etc. versus their immediate needs. They need to be able to deliver value as quickly and easily as possible, while not compromising on security, reliability, or scalability. This is where developer training becomes critical.

At Mission Lane, we had a central SRE team that supported some 20 product teams owning about 250 microservices. Every quarter, we would pick two to four product teams and embed an SRE into the team for a month. During this month, the SRE had a checklist of health items to focus on, which kept the scope limited, but their primary goal was to train the developers, learn how they interacted with the platform, learn about all the little problems they ran into, and generally make their lives easier by teaching them to fish. Here are some examples of problems SREs would help solve:

  1. Fixing tracing that isn’t working for a particular uncritical –yet deeply annoying– endpoint
  2. Helping developers understand how canary releases using Flagger and Istio worked
  3. Helping developers add to a dashboard, create an alert, or tune or silence a noisy alert
  4. Helping developers deploy from local to the dev cluster

This was a program we started about one year into having an SRE team and it was wildly successful. Developers loved having a short, focused interaction with an SRE. It allowed the SRE to build a connection with the product teams, it showed us a variety of problems or concerns developers ran into, and it helped build the general knowledge of the engineering organization.

Focus on the Right Abstractions

The DevOps culture shift focused primarily on shifting left. As organizations have matured, we’ve realized that we also needed to shift down. We need to write the correct abstractions and help teams do more with less.

Early on, Mission Lane had no abstractions over our Kubernetes (k8s) cluster. At the time, we had a really powerful GitOps pipeline built on ArgoCD, GKE, and CNRM. However, developers needed to write k8s manifests by hand, which were then applied with kustomize via ArgoCD to the k8s clusters. While this resulted in a lot of YAML duplication, the real problem came when we needed to apply a mass update. Need to set a securityContext for all deployments? Need to swap to a new Ingress? Want to apply an environment variable or annotation? You’d have to go edit hundreds of YAML files. And while some of that could be automated via your favorite programming language, the community had already solved this problem.

About nine months after the SRE team was formed, we released version 1.0.0 of the ML Service Helm Chart. This chart would eventually be used by 95% of services running at Mission Lane and would see hundreds of releases. It allowed teams to get up and running reliably and securely in our clusters. It allowed for extreme customization, in most cases following the k8s API spec exactly and allowing every one of our settings to be overridden, while providing sane defaults that promoted good application health and practices.
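To make the "sane defaults, overridable everywhere" idea concrete, here is a minimal Python sketch of the merge behavior a chart like this relies on: Helm layers user-supplied values over the chart's defaults, merging nested maps key by key and replacing lists wholesale. The keys and default values below are hypothetical, chosen for illustration; they are not the contents of the ML Service Helm Chart.

```python
from copy import deepcopy

# Hypothetical org-wide defaults; not the actual ML Service Helm Chart values.
CHART_DEFAULTS = {
    "replicaCount": 2,
    "securityContext": {"runAsNonRoot": True, "readOnlyRootFilesystem": True},
    "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}},
}


def merge_values(defaults, overrides):
    """Return defaults with overrides layered on top.

    Nested dicts are merged key by key; any other value (including lists)
    is replaced outright, which mirrors how Helm treats values files.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# A service team overrides only what it cares about.
service_values = {"replicaCount": 5, "resources": {"requests": {"cpu": "500m"}}}
rendered = merge_values(CHART_DEFAULTS, service_values)
print(rendered["replicaCount"])                     # 5
print(rendered["resources"]["requests"])            # {'cpu': '500m', 'memory': '128Mi'}
print(rendered["securityContext"]["runAsNonRoot"])  # True
```

With this shape, changing a default in a new chart version changes every service that has not explicitly overridden that key, while teams that did override it keep their choice.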

This Helm chart allowed us to solve problems for the entire organization. When we found a setting that needed to be added, we could do so for the entire organization by publishing a new version of the Helm chart with updated configuration. We strictly followed the Helm version of Don’t Break User Space by writing extensive tests and managing breaking API changes automatically.

This paradigm of shifting down rather than shifting left shows up in tooling, the abstractions we write, and the way we talk about developers. Treat them as experts in their field, recognize they probably aren’t experts in your field, and see how you can help them be successful in a self-service manner.

Focus on Self-Service

An SRE team should be a force multiplier for an engineering organization. If you have to hire an SRE for each product team created, you’ve failed. Letting developers make self-service decisions, and only come ask for help when something goes wrong, is what makes that force multiplication possible. Amazon and Google don’t force you to talk to support every time you want to turn on a service, or to talk to an engineer to launch a new product. Rather, they enable you via APIs and UIs to be self-service and only come talk to them when you can’t figure it out. If you’ve written good documentation and have an intuitive process/API, an effectively unlimited number of developers can be helped, which is not the case if you have to hop on a call with each one.

This philosophy existed at Mission Lane before the SRE team existed. It’s a credit to the CPE team that the underlying GKE cluster and automation tooling were so good when SRE was created. Focusing on self-service allowed us to run what was basically an office hours model, where once a week developers would show up and ask questions, alongside a Slack support channel where developers could ask questions and get answers.

Use tooling that allows a developer to safely interact with the cluster and service. Make sure that your process prevents misconfiguration without limiting choice. When issues arise, evaluate the process or tooling to see where it could have been better. Trust your developers. Developers are like users: they are rarely trying to do the wrong thing; they probably just have an XY problem going on.
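One way to read "prevent misconfiguration without limiting choice" is a check that blocks only on settings that are almost certainly mistakes and merely comments on unusual but legitimate ones. The Python sketch below illustrates that split under stated assumptions: the value keys and every rule in it are hypothetical examples, not Mission Lane's actual policies or tooling.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    level: str    # "error" blocks the change, "warning" only comments on it
    message: str


def check_values(values):
    """Review a service's (hypothetical) deployment values.

    Hard errors are reserved for settings that are almost certainly mistakes;
    unusual but legitimate choices only produce warnings.
    """
    findings = []
    replicas = values.get("replicaCount", 1)
    if replicas < 1:
        findings.append(Finding("error", "replicaCount must be at least 1"))
    elif replicas == 1:
        findings.append(Finding("warning", "a single replica has no redundancy"))
    if not values.get("resources", {}).get("requests"):
        findings.append(Finding("warning", "no resource requests set"))
    if values.get("securityContext", {}).get("privileged"):
        findings.append(Finding("error", "privileged containers are not allowed"))
    return findings


def should_block(findings):
    """A change is blocked only if at least one finding is an error."""
    return any(f.level == "error" for f in findings)


# A single-replica service is allowed through, with a nudge.
unusual = check_values({"replicaCount": 1, "resources": {"requests": {"cpu": "100m"}}})
print([f.message for f in unusual], should_block(unusual))

# A privileged container is stopped before it reaches the cluster.
risky = check_values({"replicaCount": 2, "securityContext": {"privileged": True}})
print([f.message for f in risky], should_block(risky))
```

Run as part of a PR check, the warnings become review comments and the errors become a failing status, so the feedback arrives where the developer is already working.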

Use tooling that gives feedback quickly and in the same location. Early on, CPE decided that the PR was going to be the center point of all feedback. Using Trunk Based Development and having all the tools interact with a PR meant that there was a consistent mental model and teams could self-service with only the use of codeowners and reviews. Nearly every tool the Platform Teams introduced used the PR as the interface mechanism and this greatly improved the developer speed and feedback.

Automate Yourself Out of a Job

Eliminating toil is a key tenet of being an SRE. It’s a foundational aspect of the job and the philosophy that underpins it. If you are doing the same task over and over, remediating the same problem every time it occurs, or even writing really good runbooks, those are signs that you aren’t eliminating enough toil. While machines should not provide approvals, nearly everything else can be automated away. Opening certain PRs, templating repos and files, and even responding to certain errors can all be automated.

This was a core tenet of SRE at Mission Lane. We strove to automate ourselves out of a job as much as possible. Whether this meant using tools like Cortex.io to build out templates and scorecards, writing our own db-analyzer to help teams tune their connection settings and database sizing, writing tools that would auto-remediate Elasticsearch issues, or using off-the-shelf analyzers in PRs to help catch common issues, we strove to eliminate the need for us to do a particular task. This allowed us to focus on ever higher-order and bigger organization-wide problems. It also allowed us to scale and support a quick-moving development organization with a limited number of SREs.
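As an illustration of the kind of question a connection-tuning tool can answer without an SRE in the loop, here is a small Python sketch that compares a service's worst-case connection demand against what its database allows. It is not the db-analyzer mentioned above, whose internals are not described here; the function name, parameters, and the reserved-connection default are all assumptions made for the example.

```python
def connection_budget(replicas, pool_size_per_replica, db_max_connections, reserved=10):
    """Compare worst-case client connections against what the database allows.

    `reserved` is a hypothetical allowance for admin and maintenance sessions.
    """
    usable = db_max_connections - reserved
    demand = replicas * pool_size_per_replica
    return {
        "demand": demand,    # connections if every pool fills up
        "usable": usable,    # connections actually available to the service
        "fits": demand <= usable,
        # Largest per-replica pool that still fits at this replica count.
        "max_pool_size_per_replica": usable // replicas if replicas > 0 else 0,
    }


# 12 replicas with pools of 20 against a database capped at 200 connections.
report = connection_budget(replicas=12, pool_size_per_replica=20, db_max_connections=200)
print(report)
# {'demand': 240, 'usable': 190, 'fits': False, 'max_pool_size_per_replica': 15}
```

A check like this can run on every PR that changes replica counts or pool settings, which turns a recurring support question into automated feedback.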

Conclusion

Building a successful SRE team is hard. You need extremely good engineers who know how to focus on the right problems and are excellent communicators. But it’s absolutely worth it in the end. An SRE team will be a force multiplier for the entire organization, reducing incidents, improving the developer experience, and improving the quality of the code. Just remember:

  1. Focus on Developer Training. Improving your developers’ knowledge and expectations carries massive benefits.
  2. Focus on the Right Abstractions. Shift down, not just left. Abstract away the things that don’t matter for your customers.
  3. Focus on Self-Service. Developers should be able to interact with the platform completely autonomously. They shouldn’t need to talk to you to do anything standard.
  4. Automate Yourself Out of a Job. Keep pushing. Don’t let yourself be complacent. Push to continuously improve the automation so you can do new and interesting things!

Sven Hans Knecht

SRE/Platform Engineer Professional. Amateur Analytics and Sports Enthusiast