Site Reliability Engineering and how it relates to the DevOps approach

Published in Just another buzzword, Sep 30, 2020 (5 min read)

Site Reliability Engineering, or SRE for short, is intended to help determine and ensure the appropriate level of reliability for systems, services and products. What counts as “appropriate” depends on the requirements of the business in question. SRE originated at Google in 2003 and can be roughly summarized in one sentence: “SRE is what you get when you treat operations as if it’s a software problem”. It is a fundamental strategy for service management:

To ensure daily operation
To avoid errors through forward-looking planning
To reflect on errors as they occur and learn from them

Reliability is a core objective

The central term in this definition is reliability. Consider a web application: building it took effort and money. If it is not reliable, for example because it is available only 50% of the time, all the initial effort spent on designing it well is wasted. So you have to plan in advance how to ensure reliability during operation. This includes error handling in case of failure: the better the error messages are, the faster and more easily a problem is identified and fixed.
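To make the 50% figure concrete, an availability target can be translated into the downtime it permits. The following is a minimal sketch; the function name `allowed_downtime` and the 30-day window are assumptions chosen for illustration, not part of any standard.

```python
from datetime import timedelta


def allowed_downtime(availability: float, window: timedelta) -> timedelta:
    """Return how much downtime an availability target permits within a window.

    `availability` is a fraction between 0 and 1 (for example 0.999).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return window * (1.0 - availability)


if __name__ == "__main__":
    month = timedelta(days=30)  # hypothetical measurement window
    for target in (0.5, 0.99, 0.999):
        print(f"{target:.1%} available -> {allowed_downtime(target, month)} of downtime")
```

Seen this way, the gap between targets becomes tangible: each additional "nine" cuts the permitted downtime by a factor of ten, which is why the right target depends on what the business actually needs.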

Sustainability through human aspects and team composition

Sustainability depends on the people who develop these systems and processes. The basic approach in SRE is to give these people the freedom to schedule maintenance windows and development work in a way that preserves an overall work-life balance. This is the only way to lay the foundation for sustainability, no matter how it is ensured technically later on. It also makes sense to compose the team so that it covers the skills that service management will need later. One example is a software engineer who also has the knowledge of a system administrator, such as UNIX system internals and networking (Layer 1 to Layer 3).

In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning of its services.

Basic principles of SRE

SRE rests on several important principles that must be observed.

Embracing risk: Error budgets should be properly assessed, managed and used, because they give service management a useful, neutral basis for decisions. This covers risk measurement, risk handling and risk tolerance, including the associated costs.
Use of service level objectives (SLOs): Service level indicators measure how the service actually behaves, while the objectives set the target values for those indicators. The two should be kept separate.
Minimizing toil: Toil is the type of work associated with operating a production service that tends to be manual, repetitive, automatable, tactical and without lasting value, and that scales linearly with the growth of a service. It should be avoided as much as possible.
Use of monitoring and alerting methods tailored to the application.
Simplicity is key: The application and the processes surrounding it should always be kept as simple and understandable as possible.
Release engineering: This can be briefly described as building and delivering software. Release engineers have a solid understanding of source code management, compilers, build configuration languages, automated build tools, package managers and installers. Release engineering is crucial for the stability of the overall system, since most failures can be traced back to a change that was pushed in some way. It is also the best way to ensure that versions are consistent.
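The relationship between indicators, objectives and error budgets in the principles above can be sketched in a few lines. This is an illustrative calculation, not a prescribed implementation: the class name `SloReport`, the request counts and the 99.9% target are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    """Summary of one measurement window for a request-based SLO (hypothetical)."""

    slo_target: float      # objective, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int    # all requests served in the window
    failed_requests: int   # requests that violated the indicator

    @property
    def sli(self) -> float:
        """Measured indicator: fraction of successful requests."""
        if self.total_requests == 0:
            return 1.0  # no traffic, so nothing failed
        return 1.0 - self.failed_requests / self.total_requests

    @property
    def error_budget(self) -> float:
        """Number of failed requests the objective tolerates in this window."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Budget left in this window; negative once the objective is missed."""
        return self.error_budget - self.failed_requests

    @property
    def slo_met(self) -> bool:
        """True while the failures stay within the error budget."""
        return self.failed_requests <= self.error_budget


if __name__ == "__main__":
    report = SloReport(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(
        f"SLI {report.sli:.4%}, budget remaining {report.budget_remaining:.0f} requests, "
        f"SLO met: {report.slo_met}"
    )
```

Keeping the measured value (`sli`) apart from the target (`slo_target`) mirrors the separation of indicators and objectives: the indicator says what happened, the objective says what was acceptable, and the error budget is the difference that a team can spend on risk.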

SRE and DevOps

DevOps is a loose collection of practices, policies and cultures that aim to break down silos between IT development, operations, networking and security. Its fundamental aspects are:

No “knowledge silos”: Expertise is shared among several people, so that no single expert holds exclusive knowledge of an area.
Mistakes are normal: Accidents are not just the result of isolated actions by an individual, but rather the result of missing safety precautions for when things inevitably go wrong.
Change should be gradual: Small incremental changes instead of large update blocks.
Tools and culture are interconnected: Tools are an important part of DevOps, especially given the emphasis on correct change management, which today relies on very specific tools.
Performance measurement is crucial.
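The idea of gradual change from the list above can be illustrated with a staged rollout that exposes a new version to a growing share of traffic and stops as soon as the observed error rate gets too high. This is a sketch under stated assumptions: `staged_rollout`, the stage percentages, the threshold and the `measure_error_rate` callback are all hypothetical, and a real deployment system would supply its own measurement and rollback mechanics.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    measure_error_rate: Callable[[int], float],
    max_error_rate: float,
) -> int:
    """Advance through traffic percentages; return the last healthy percentage.

    `measure_error_rate(percent)` is a hypothetical callback that reports the
    error rate observed while `percent` of traffic runs the new version.
    Returns 0 if even the first stage is unhealthy.
    """
    last_healthy = 0
    for percent in stages:
        if measure_error_rate(percent) > max_error_rate:
            break  # stop early: only a limited share of traffic was exposed
        last_healthy = percent
    return last_healthy


if __name__ == "__main__":
    # Made-up measurements: the new version degrades once it serves 50% of traffic.
    observed = {1: 0.0005, 10: 0.0008, 50: 0.02, 100: 0.02}
    reached = staged_rollout([1, 10, 50, 100], observed.__getitem__, max_error_rate=0.001)
    print(f"last healthy stage: {reached}% of traffic")
```

The design choice worth noting is that each step is small and measured before the next one starts, so a faulty change is caught while it affects only part of the traffic rather than after a large update has reached everyone.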

As the previous section shows, DevOps is a broad set of principles for collaboration between operations and product development throughout the entire life cycle. SRE, by contrast, is a job role and a set of practices. Its basic aspects are:

Operation is a software problem: The basic idea of SRE is that operating a service well is a software problem. SRE should therefore use software engineering approaches to solve it.
Manage by service level objectives (SLOs): SRE does not attempt to make everything 100% available. The product team and the SRE team select a suitable availability target for the service and its user base, and the service is then managed to this SLO.
Minimize manual effort (toil): The principle is that if a machine can perform a desired operation, then a machine usually should.
Automation: The real work in this area is determining what to automate, under what conditions and how.
Move fast by reducing the cost of failure.
Developers also share “ownership” of an application.
Recognize and use the importance of tooling in all areas.

Differences and similarities between the two approaches

It is important to say that SRE and DevOps are NOT mutually exclusive but rather complement each other very well and coexist in parallel.

Both approaches see that changes are needed to improve processes, systems and applications
SRE focuses on reliability, whereas DevOps is a cultural movement that usually arises where development and operations are separate organizations.
Change management should be incremental and done in small “bites”.
Tooling is important for both approaches
SRE tends to be more prescriptive, DevOps is intentionally not. The almost universal adoption of continuous integration/continuous delivery and of the Agile principles comes closest in this respect.
For SRE, SLOs are dominant in determining the measures to improve the service.
For DevOps, the act of measuring is often used to understand what the results of a process are, how long the feedback loops take, and so on.
In SRE and DevOps, errors are part of the process and it is important to learn from them rather than looking for someone to blame.
For both DevOps and SRE, better team velocity and collaboration is the desired result

Conclusion

SRE’s two core concepts are sustainability and reliability. SRE and DevOps coexist. To do the work that is necessary to improve the reliability of a system, SREs must have their time allocated so that they do not spend all of it fighting fires or processing tickets. They must have the time to write code, build automation, evaluate tools and develop ways to make the service and staff more efficient.