Site Reliability Engineering Demystified

What happens when you ask a software engineer to design an operations function

The inspiration for this article came from a recent debate with one of my friends on “what is the real meaning of the term Site Reliability Engineering (SRE)” in the cloud-native world. The focus of this article is to demystify the basic concepts of SRE.

I tried my best to provide references and links to original sources and authors.

What is SRE?

“Fundamentally, it’s what happens when you ask a software engineer to design an operations function.” (Ben Treynor, VP of Engineering, Google)

SRE grew out of an organization partially composed of software engineers who were inclined to use software to solve problems that had historically been solved by hand. So when it was time to create a formal team to do this operational work, it was natural to take the “everything can be treated as a software problem” approach and run with it.

A set of known standards is enforced in a systematic manner, which makes the team more effective at running efficient, highly available, large-scale systems.

SRE team

The SRE team is responsible for latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Typically, a 50–50 mix of people with more of a software background and people with more of a systems background works well for an SRE team.

Myths and Misunderstandings

  • Myth: 100% is the right reliability target for basically everything. It is not, which is why SRE works with an error budget
  • Going from five 9s to 100% reliability is not easy. The improvement isn’t noticeable to most users, and it requires tremendous effort
  • Myth: SRE will take care of every aspect of application reliability. Instead, set a goal that acknowledges the trade-off and leaves an error budget
  • The goal of an SRE team isn’t “zero outages”. SRE and product developers are incentive-aligned to spend the error budget to get maximum feature velocity
  • Monitoring should never require a human to interpret any part of the alerting domain. An alert means a human needs to take action immediately; a ticket means a human needs to take action eventually
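The error budget in the list above is simple arithmetic: whatever the reliability target leaves over is the time a service is allowed to be unavailable. Here is a minimal sketch of that calculation. The function names `error_budget` and `budget_remaining`, the 30-day window, and the sample targets are my own assumptions for illustration and do not come from the referenced sources.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Downtime allowed within `window` by an availability target `slo`.

    `slo` is a fraction, e.g. 0.999 for "three 9s" (hypothetical helper).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta,
                     downtime_so_far: timedelta) -> timedelta:
    """Budget left to spend; a negative value means the budget is overspent."""
    return error_budget(slo, window) - downtime_so_far


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    for slo in (0.99, 0.999, 0.9999, 0.99999, 1.0):
        print(f"target {slo:.3%}: allowed downtime {error_budget(slo, window)}")
```

With these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of budget, a 99.999% target leaves roughly 26 seconds, and a 100% target leaves nothing at all, so there would be no room for releases that carry any risk.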

19-factor standard for SRE

Humans add latency when problems are solved by hand. Systems that don’t require humans to respond will have higher availability due to lower MTTR (mean time to recovery).
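To see why a lower MTTR raises availability, one common steady-state approximation is availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures. The sketch below uses that approximation. The formula choice, the `availability` helper, and the sample numbers (one failure every 30 days, one hour for a human response, one minute for an automated one) are assumptions for illustration, not figures from the referenced sources.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and
    mean time to recovery, both in hours (hypothetical helper)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    mtbf = 30 * 24                # assumed: one failure every 30 days
    human_mttr = 1.0              # assumed: a paged human needs one hour
    automated_mttr = 1.0 / 60.0   # assumed: automation recovers in one minute

    print(f"human response:     {availability(mtbf, human_mttr):.5%}")
    print(f"automated response: {availability(mtbf, automated_mttr):.5%}")
```

Under these assumed numbers the failure rate is identical in both cases. Only the recovery time changes, and that alone moves availability from roughly 99.86% to roughly 99.998%.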

Having hero generalists who can respond to everything works, but having playbooks for the aspects below (the 19-factor standard for SRE) works better. Additions will be made to the 19-factor standard over time as new approaches and design patterns are adopted.

  • Change management : Roughly 70% of outages are due to changes in a live system
  • Embracing risk : If a user is on a smartphone with 99% reliability, they can’t tell the difference between 99.99% and 99.999% service reliability
  • Service level objectives : Different classes of services should have different indicators
  • Eliminating toil : If a human operator needs to touch your system during normal operations, you have a bug
  • Monitoring distributed systems : Rules that generate alerts for humans should be simple to understand and represent a clear failure
  • Evolution of automation : Automation is a force multiplier, not a remedy
  • Release engineering : Release engineers work with SWEs and SREs to define how software is released
  • Simplicity (Stability vs. agility) : You can’t tell what happened if 100 changes were released together
  • Alerting from time-series data : Common data format for logging
  • Being on-call : People sometimes get really rough on-call rotations a few times in a row, but the load tends to balance out randomly over the course of a year
  • Emergency response drill : Test-induced emergency
  • Effective troubleshooting : Tailing system logs will increase MTTR. You need the right set of tools to troubleshoot effectively
  • Postmortem culture : Blameless postmortems and learning from failure
  • Load balancing in the datacenter : Naive flow control for unhealthy tasks
  • Addressing cascading failures : Don’t do work whose deadline has already been missed (a common theme in cascading failures). Using deadlines instead of timeouts works well
  • Data integrity : 99.99% good bytes in a 2GB file means about 200K corrupt bytes. Probably not OK for most apps
  • Defense in depth : If a hacker gains access to a system, defense in depth minimizes the adverse impact and gives engineers time to deploy updated countermeasures to prevent recurrence
  • Testing for reliability : Stability of measurement over time
  • Integrate via interconnection points : Securely connecting with partners, customers and multi-cloud using interconnection points
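The cascading-failures item in the list above contrasts deadlines with timeouts. Here is a minimal sketch of the idea: a deadline is one absolute point in time shared by every stage of a request, so a stage can refuse work the caller has already given up on. The names `Deadline`, `DeadlineExceeded`, and `handle_request` are hypothetical and written for this illustration only; they are not taken from any particular library.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when work is attempted after the request's deadline."""


class Deadline:
    """An absolute point in time by which the whole request must finish."""

    def __init__(self, budget_seconds: float, clock=time.monotonic):
        # `clock` is injectable so the behaviour can be tested without sleeping.
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self) -> float:
        """Seconds left before the deadline (negative once it has passed)."""
        return self._expires_at - self._clock()

    def check(self) -> None:
        """Raise if the deadline has already been missed."""
        if self.remaining() <= 0:
            raise DeadlineExceeded("deadline already missed; dropping work")


def handle_request(deadline: Deadline, stages):
    """Run each stage in order, handing the same deadline to every stage."""
    results = []
    for stage in stages:
        deadline.check()  # skip work nobody is waiting for any more
        results.append(stage(deadline))
    return results
```

A per-hop timeout restarts at every hop, so three hops with a one-second timeout each can keep working for three seconds after the user has stopped waiting. Passing one `Deadline` through all stages means later stages see only the time that is actually left and drop the request instead of adding load to an already struggling system.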

In future articles, we’ll look at practical steps toward SRE and the role of the 19-factor standard for SRE. Stay tuned.

Reference:

Site Reliability Engineering: How Google Runs Production Systems

Interconnection Oriented Architecture

-Venkat

“I am an employee of Equinix. The opinions expressed here are my own and do not necessarily reflect the opinions of Equinix”