SRE — Principles and Practices

Avina Jain
Published in Avina’s blog
3 min read · Jun 30, 2021

Whether you are a software engineer, DevOps engineer, IT consultant, or system admin, I am sure Site Reliability Engineering sounds intriguing to you. Let’s discuss what SRE is, along with its principles and practices.

What is Site Reliability Engineering?

SRE started at Google when Ben Treynor Sloss set out to answer the question: “What happens when you ask a software engineer to design an operations function?” As more people discussed and adopted SRE, it became a discipline in its own right. Today, Site Reliability Engineering is an engineering field dedicated to helping organizations achieve the right degree of reliability in their systems and services over time.

The keyword here is reliability. Systems are rarely reliable 100% of the time, and reaching 100% reliability should not even be the goal. Instead, stakeholders specify the level of reliability needed based on the kind of application being managed.

Let’s discuss some key SRE principles and practices to better understand reliability and, in turn, site reliability engineering.

  • SLI — SLIs, or Service Level Indicators, are the indicators of your service’s health. There can be many indicators and many ways to track them. A good SLI should not be based on noisy data from monitoring tools; instead, it should show whether the service is working as users expect. For example, one indicator could be the proportion of requests for which the service is available to users and returns a 200 status code rather than a 5XX server error code.
  • SLO — SLOs, or Service Level Objectives, are well-understood goals for the reliability of the service. Once we have indicators showing how the service is functioning, we should establish what level of reliability we want from it. Set an SLO on something that can be accurately measured and represented in your monitoring system. For example: 90% of requests succeed, meaning the service returns a 200 status code and is available to users 90% of the time.
  • Error Budget — An error budget is the difference between the theoretical perfect reliability (100%) of a service and the required reliability. For the above example, the error budget is 100% − 90% = 10%. This means we have a budget of 10% unreliability that we can spend until it is exhausted. We can, say, develop and test another feature for the service as long as we stay within that 10% budget. The steps to take when the budget is exceeded and the SLO is violated should be agreed upon when the error budget is created.
  • Blameless Postmortem — Blameless postmortem, as the name implies, is a method of analyzing an unfavourable occurrence in hindsight without pointing fingers. The emphasis is on the failure of technology rather than on specific individuals. This ensures that we seek methods to enhance the systems or processes rather than ways to penalise people who inadvertently contributed to the outage.
  • Toil — Toil is the kind of work that is largely manual, repetitive, without enduring value, and that grows as a service grows. A major focus of SRE work is removing as much toil from the service as possible, typically through automation.
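
To make the SLI, SLO, and error budget arithmetic above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a standard implementation: the names `availability_sli`, `budget_report`, and `BudgetReport` are hypothetical, a request is assumed to be "good" when it does not return a 5XX code, and the sample window of status codes is made up.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    sli: float               # measured fraction of good requests
    slo_met: bool            # True while the SLI is at or above the SLO
    budget_remaining: float  # share of the error budget left; negative once overspent


def availability_sli(status_codes):
    """SLI: fraction of requests that did not return a 5XX server error (assumption)."""
    if not status_codes:
        raise ValueError("at least one request is needed to compute an SLI")
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


def budget_report(status_codes, slo=0.90):
    """Compare the measured SLI against the SLO and report error-budget usage."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    sli = availability_sli(status_codes)
    error_budget = 1 - slo                    # e.g. 1 - 0.90 = 0.10
    budget_spent = (1 - sli) / error_budget   # share of the budget already used
    return BudgetReport(sli=sli, slo_met=sli >= slo, budget_remaining=1 - budget_spent)


# Hypothetical window: 1000 requests, 40 of which failed with a 503.
window = [200] * 960 + [503] * 40
report = budget_report(window, slo=0.90)
print(report)
```

With these sample numbers the measured SLI is 96%, so the 90% SLO is met and roughly 60% of the error budget is still unspent. In a real setup the status codes would come from your monitoring system rather than a hard-coded list.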

After reading the above, you may be wondering how DevOps and SRE differ and whether or how they are related. People in the industry are still working out how to present the differences clearly. What we do know is that they are separate, parallel disciplines, both involving automation and monitoring of operations. SRE, however, started from a software engineering mindset and should not be seen as the future of DevOps; the two have to be implemented differently.

Hopefully, this article has given you a good start in your understanding of SRE! 😀
