Site Reliability Engineering Demystified

Mar 7, 2017

What happens when you ask a software engineer to design an operations function

The inspiration for this article came from a recent debate with one of my friends about the real meaning of the term Site Reliability Engineering (SRE) in the cloud-native world. The focus of this article is to demystify the basic concepts of SRE.

I tried my best to provide references and links to original sources and authors.

What is an SRE?

In the words of Ben Treynor, VP of Engineering at Google: "Fundamentally, it's what happens when you ask a software engineer to design an operations function."

The organization was partially composed of folks who were software engineers and who were inclined to use software to solve problems that had historically been solved by hand. So when it was time to create a formal team to do this operational work, it was natural to take the “everything can be treated as a software problem” approach and run with it.

A set of known standards is enforced in a systematic manner, which makes the team more effective at running efficient, highly available, large-scale systems.

SRE team

The SRE team is responsible for latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Typically, a 50–50 mix of people who have more of a software background and people who have more of a systems background makes a really good SRE team.

Myths and Misunderstandings

Myth: 100% is the right reliability target for basically everything. It is not, and that is why there is an error budget
Going from 5 9s to 100% reliability is not easy. The gain isn’t noticeable to most users, and it requires tremendous effort
Myth: SRE will take care of every aspect of application reliability. Instead, set a goal that acknowledges the trade-off and leaves an error budget
The goal of the SRE team isn’t “zero outages”: SRE and product devs are incentive-aligned to spend the error budget to get maximum feature velocity
Monitoring should never require a human to interpret any part of the alerting domain. Humans only act on its output. Alerts : a human needs to take action immediately. Tickets : a human needs to take action eventually
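
To make the error budget idea concrete, here is a minimal Python sketch of the arithmetic. The function names, the 30-day window and the "ship only while budget remains" rule are my own assumptions for illustration, not something prescribed by the SRE book.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability target allows over a window.

    `slo` is a fraction such as 0.999; the 30-day default is an assumption.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def can_ship(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Hypothetical release gate: allow launches only while budget is left."""
    return downtime_minutes < error_budget_minutes(slo, window_days)


if __name__ == "__main__":
    # Each extra nine shrinks the budget by a factor of ten.
    for slo in (0.99, 0.999, 0.9999, 0.99999):
        print(f"{slo:.3%} -> {error_budget_minutes(slo):.2f} min per 30 days")
```

Note that a 100% target leaves a budget of zero, so `can_ship(1.0, 0.0)` is false: with no budget to spend, no change is ever safe to release.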

19-factor standard for SRE

Humans add latency when problems are solved by hand. Systems that don’t require humans to respond will have higher availability due to lower MTTR (mean time to recovery).
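
The link between MTTR and availability can be sketched with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The numbers below (one failure every 30 days, a 60-minute human response versus a 2-minute automated one) are made-up assumptions for illustration only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and MTTR."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    mtbf = 30 * 24  # assumed: one failure every 30 days
    human = availability(mtbf, 60 / 60)      # assumed: a human takes 60 minutes
    automated = availability(mtbf, 2 / 60)   # assumed: automation takes 2 minutes
    print(f"human response:     {human:.5%}")
    print(f"automated response: {automated:.5%}")
```

With the failure rate held constant, the only thing that changes between the two cases is how long recovery takes, which is exactly the latency a human adds.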

Having hero generalists who can respond to everything works, but having playbooks for the aspects below (the 19-factor standard for SRE) works better. Additions will be made to the 19-factor standard over time as new approaches and design patterns are adopted.

Change management : 70% of outages are due to changes in a live system
Embracing risk : If a user is on a smartphone with 99% reliability, they can’t tell the difference between 99.99% and 99.999% reliability
Service level objectives : Different classes of services should have different indicators
Eliminating toil : If a human operator needs to touch your system during normal operations, you have a bug
Monitoring distributed systems : Rules that generate alerts for humans should be simple to understand and represent a clear failure
Evolution of automation : Automation is a force multiplier, not a remedy
Release engineering : Release engineers work with SWEs and SREs to define how software is released
Simplicity (Stability vs. agility) : We can’t tell what happened if we release 100 changes together
Alerting from time-series data : Common data format for logging
Being on-call : People sometimes get really rough on-call rotations a few times in a row, and the load only balances out randomly over the course of a year
Emergency response drill : Test-induced emergency
Effective troubleshooting : Tailing system logs will increase MTTR. The right set of tools is needed to perform effective troubleshooting
Postmortem culture : Blameless postmortems and learning from failure
Load balancing in the datacenter : Naive flow control for unhealthy tasks
Addressing cascading failures : Don’t do work where the deadline has already been missed (a common theme in cascading failures). Using deadlines instead of timeouts is great
Data integrity : 99.99% good bytes in a 2 GB file means 0.01% of roughly 2,000,000,000 bytes, or about 200 KB, is corrupt. Probably not OK for most apps
Defense in depth : If a hacker gains access to a system, defense in depth minimizes the adverse impact and gives engineers time to deploy updated countermeasures to prevent recurrence
Testing for reliability : Stability of measurement over time
Integrate via interconnection points : Securely connecting with partners, customers and multi-cloud using interconnection points
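
The "deadlines instead of timeouts" point from the cascading failures item can be sketched as below. This is a minimal illustration under my own assumptions: `run_stages`, `DeadlineExceeded` and the injectable `clock` are hypothetical names, not an API from the SRE book or any library. The idea is that every stage shares one absolute deadline set by the caller, and a stage is skipped once that deadline has passed.

```python
import time
from typing import Callable, List


class DeadlineExceeded(Exception):
    """Raised when the caller's deadline has already passed."""


def run_stages(stages: List[Callable[[float], str]],
               deadline: float,
               clock: Callable[[], float] = time.monotonic) -> List[str]:
    """Run stages in order, all sharing one absolute deadline.

    A per-stage timeout restarts the clock at every hop. An absolute
    deadline lets each stage see how much of the caller's budget is left.
    """
    results = []
    for stage in stages:
        remaining = deadline - clock()
        if remaining <= 0:
            # The caller has already given up, so more work only adds load.
            raise DeadlineExceeded(f"deadline passed before {stage.__name__}")
        results.append(stage(remaining))
    return results


if __name__ == "__main__":
    def lookup(remaining: float) -> str:
        return f"lookup ran with {remaining:.2f}s left"

    def render(remaining: float) -> str:
        return f"render ran with {remaining:.2f}s left"

    # The caller grants the whole request half a second in total.
    print(run_stages([lookup, render], deadline=time.monotonic() + 0.5))
```

Passing the remaining time into each stage also lets a stage hand a shorter budget to its own downstream calls, so an overloaded backend stops receiving work nobody is waiting for.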

In future articles, we’ll look at practical steps for moving towards SRE and at the role of the 19-factor standard for SRE. Stay tuned.

References:

Site Reliability Engineering: How Google Runs Production Systems

Interconnection Oriented Architecture

-Venkat

“I am an employee of Equinix. The opinions expressed here are my own and do not necessarily reflect the opinions of Equinix”

Written by Venkat Rangasamy