Starting an SRE Team? Stay Away From Uptime.

Published in Dev Interrupted · 5 min read · Dec 1, 2021

This article was written for Dev Interrupted by Conor Bronsdon

A good SRE will tell you your service is never down. A great SRE will tell you that’s not what you should be measuring. In fact, they’ll tell you their job is customer service.

Site Reliability Engineering (SRE) has grown immensely popular, with many of the world’s largest tech companies, like Netflix, LinkedIn and Airbnb, employing SRE teams to keep their systems reliable and scalable.

Along the way, SRE has become one of the most sought-after engineering roles in tech.

The role is traditionally understood as ensuring that services are reliable and unbroken, but reliability and uptime aren’t perfect metrics. Perhaps what organizations should be asking themselves is what their customers think of their service.

Wandering down to your engineering department and asking your SRE team about customer satisfaction is a good place to start.

Their answer just might surprise you.

History of SRE

In practice, Site Reliability Engineering has been around for a while. In the past, its functions were covered by roles with names like production ops, disaster recovery, testing or monitoring. The rise of cloud computing created a need for more engineers in production. The complexity only grew as more organizations moved from monolithic architectures to distributed microservices.

Modern Site Reliability Engineering originated at Google in 2003 with the work of Benjamin Treynor, who is seen as the “father” of what we now simply call SRE. Treynor, who coined the term, was a software engineer placed in charge of running a production team. With the goal of making Google’s website as reliable and serviceable as possible, he asked that his team spend half their time on operations tasks so they could better understand software in production. This team would become the first-ever SRE team.

“Ben Treynor said, I’m paraphrasing, ‘[SRE] is essentially like throwing a software engineer at an operations problem’, right? Because you come from that developer mindset, that design and, you know, you think about all of these things. So think about it as a developer but apply it to an operational type of problem.” — Brian Murphy on the Dev Interrupted podcast at 4:26

Why not uptime?

So why shouldn’t you be too concerned about your uptime metrics? In reality, SRE can mean different things to different teams, but at its core it’s about making sure your service is reliable. After all, it’s right there in the name.

Because of this, many people assume that uptime is the most valuable metric for SRE teams. That logic is flawed.

For instance, an app can be “up”, but if it’s incredibly slow or its users don’t find it practically useful, then the app might as well be down. Simply keeping the lights on isn’t good enough, and uptime alone doesn’t account for things like degraded performance or pages that fail to load.
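To make that gap concrete, here is a minimal Python sketch with made-up numbers. Everything in it is an assumption for illustration: the probe results, the request latencies, and the 1-second threshold for an "acceptable" response are hypothetical, not measurements from any real system.

```python
# Hypothetical data: every health probe succeeded, so the app counts as "up".
probe_results = [True] * 10

# Hypothetical latencies (seconds) for real user requests in the same window.
request_latencies = [0.2, 0.3, 4.8, 5.1, 0.4, 6.0, 7.2, 0.3, 5.5, 4.9]

# Assumed threshold: anything slower is treated as a bad experience.
ACCEPTABLE_SECONDS = 1.0


def uptime_ratio(probes):
    """Fraction of health probes that succeeded."""
    return sum(probes) / len(probes)


def good_request_ratio(latencies, threshold):
    """Fraction of requests answered within the threshold."""
    good = sum(1 for latency in latencies if latency <= threshold)
    return good / len(latencies)


uptime = uptime_ratio(probe_results)
good = good_request_ratio(request_latencies, ACCEPTABLE_SECONDS)
print(f"uptime: {uptime:.0%}, requests served acceptably: {good:.0%}")
# prints "uptime: 100%, requests served acceptably: 40%"
```

With these numbers the uptime figure looks perfect while most users are waiting several seconds, which is exactly the situation uptime alone cannot reveal.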

It may sound counterintuitive, but SRE teams are in the customer service business. Customer happiness is the most important metric to pay attention to. If your service is running well and your customers are happy, then your SRE team is doing a good job. If your service is up and your customers aren’t happy, then your SRE team needs to reevaluate.

A more holistic approach is to view your service in terms of health.

The Four Golden Signals

As defined by Google, these are the four golden signals of SRE. If these can be managed effectively, then you probably have a healthy system.

  • Latency: The time it takes to service a request.
  • Traffic: A measure of the demand being placed on your system. E.g. how many messages are you getting, and can you handle them?
  • Errors: The rate of requests that fail. E.g. an HTTP server that is returning a lot of 500s is a bad sign.
  • Saturation: A way of thinking about the capacity of your system. E.g. is your service being overwhelmed?
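As a rough illustration, the four signals can be computed from a window of request records. The sketch below is one possible way to do it, not Google's definition in code: the `Request` record, the `golden_signals` function, and the `in_flight` and `capacity` inputs are hypothetical names, and it simplifies by counting only HTTP 5xx responses as errors and by using the 95th percentile as the latency figure.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code returned


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one time window into the four golden signals.

    Assumes at least one request in the window. `in_flight` and `capacity`
    are hypothetical inputs: requests being handled right now, and the
    most the service is believed to handle at once.
    """
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float drift.
    rank = (95 * len(durations) + 99) // 100
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_s": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": in_flight / capacity,
    }


# Hypothetical 10-second window: 18 fast requests, one slow, one failing.
sample = (
    [Request(0.1, 200)] * 18
    + [Request(0.5, 200)]
    + [Request(2.0, 500)]
)
signals = golden_signals(sample, window_s=10, in_flight=30, capacity=100)
print(signals)
```

In practice these numbers usually come from a monitoring system rather than hand-rolled code, but the arithmetic behind each signal is about this simple.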

Establishing system health

“The best way to get started is just measuring stuff, you know, just getting the baseline of what’s healthy, what’s not healthy, what looks like health, and then you can start working from there.” — Brian Murphy on the Dev Interrupted podcast at 10:49

It can be difficult to know whether or not your organization should consider forming an SRE team, or what your next steps are if you’ve already made the decision.

Again, think of your decision in terms of a holistic approach, not just your uptime. If you have high uptime, that’s fantastic, but what you should be establishing is a benchmark.

Using the four golden signals to guide you, establish what you think a healthy system should look like and set your benchmark. Keep measuring over time and you will begin to see the areas that are good or require more work.
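One simple way to turn measurements into a benchmark is sketched below. It is only an example under stated assumptions: the readings are invented, the helper names `build_baseline` and `needs_attention` are hypothetical, and the rule of flagging anything more than three standard deviations above the healthy mean is a common heuristic chosen here for illustration, not something prescribed by the golden signals.

```python
from statistics import mean, stdev

# Hypothetical daily p95 latency readings (seconds) from a healthy period.
baseline_readings = [0.40, 0.42, 0.38, 0.41, 0.39, 0.40, 0.43]


def build_baseline(readings):
    """Summarize healthy readings as a mean and a sample standard deviation."""
    return {"mean": mean(readings), "stdev": stdev(readings)}


def needs_attention(value, baseline, tolerance=3.0):
    """Flag a reading more than `tolerance` standard deviations above the mean."""
    return value > baseline["mean"] + tolerance * baseline["stdev"]


baseline = build_baseline(baseline_readings)
for today in (0.41, 0.90):
    status = "needs work" if needs_attention(today, baseline) else "healthy"
    print(f"p95 latency {today:.2f}s: {status}")
```

The same pattern applies to traffic, error rate, and saturation: record what healthy looks like, then compare new measurements against it.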

These measures will help inform all of your future decisions. Perhaps your organization is ready to roll out new features or make choices around expanding your service.

Critically, the health you establish provides insight into customer happiness. If things look good, you probably have happy customers.

Internal customers

When done right, SREs aren’t just making customers happy; they’re making the lives of developers easier too. Nothing is worse than having to stop because there’s a problem in production. Good SRE teams can shield dev teams by focusing on major hotspots. If fires are managed before they get out of control, developers can keep pushing out features. That even gives them the freedom to keep breaking things, if necessary!

When things do break, or require a slowdown, a dialogue can occur. A good SRE understands that the developer who wrote a piece of code knows it better than anyone. The model for good internal customer service is an SRE who brings in a developer, gives them ownership of the code they created, and offers to help them fix it.

Happy customers are the best customers

Whether you already have an SRE team or are thinking about forming one, remember to think beyond the engineering — think about the customer.

Ask yourself if your customers are happy and if you would describe your service as healthy. Remember to think about your own teams as well; your developers will thank you for it.

Originally published at https://devinterrupted.com on December 1, 2021.
