Site Reliability Engineering Demystified

What is an SRE?

SRE team

Myths and Misunderstandings

  • Error budget : 100% is the wrong reliability target for basically everything
  • Going from five 9s (99.999%) to 100% reliability is not easy. The improvement isn’t noticeable to most users and requires tremendous effort
  • It is a myth that SRE will take care of every aspect of application reliability. Instead, set a goal that acknowledges the trade-off and leaves an error budget
  • The goal of an SRE team isn’t “zero outages”; SRE and product devs have aligned incentives to spend the error budget to get maximum feature velocity
  • Monitoring should never require a human to interpret any part of the alerting domain. Its outputs should be explicit: an alert means a human needs to take action immediately, and a ticket means a human needs to take action eventually
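To make the error budget idea concrete, here is a minimal sketch of the arithmetic, assuming an availability SLO measured over a 30-day window. The function names `error_budget_minutes` and `budget_remaining` and the 30-day default are illustrative assumptions, not part of any SRE tooling.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        # A 100% target leaves nothing to spend on releases or experiments.
        raise ValueError("a 100% SLO leaves no error budget to spend")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Each extra nine shrinks the budget by a factor of ten.
    for slo in (0.99, 0.999, 0.9999, 0.99999):
        print(f"{slo:.3%} -> {error_budget_minutes(slo):.2f} min per 30 days")
```

With these assumptions a 99.9% target allows roughly 43 minutes of downtime in 30 days, and a 100% target allows none, which is why it leaves no room to ship changes.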

19-factor standard for SRE

  • Change management : 70% of outages are due to changes in a live system
  • Embracing risk : If a user is on a smartphone with 99% reliability, they can’t tell the difference between 99.99% and 99.999% reliability
  • Service level objectives : Different classes of services should have different indicators
  • Eliminating toil : If a human operator needs to touch your system during normal operations, you have a bug
  • Monitoring distributed systems : Rules that generate alerts for humans should be simple to understand and represent a clear failure
  • Evolution of automation : Automation is a force multiplier, not a remedy
  • Release engineering : Release engineers work with SWEs and SREs to define how software is released
  • Simplicity (Stability vs. agility) : We can’t tell what happened if we release 100 changes together
  • Alerting from time-series data : Common data format for logging
  • Being on-call : People sometimes get really rough on-call rotations a few times in a row, but these tend to balance out randomly over the course of a year
  • Emergency response drill : Test-induced emergency
  • Effective troubleshooting : Tailing system logs will increase MTTR. You need the right set of tools to troubleshoot effectively
  • Postmortem culture : Blameless postmortems and learning from failure
  • Load balancing in the datacenter : Naive flow control for unhealthy tasks
  • Addressing cascading failures : Don’t do work whose deadline has already been missed (a common theme in cascading failures). Using deadlines instead of timeouts helps here
  • Data integrity : 99.99% good bytes in a 2 GB file still means about 200 KB are corrupt (0.01% of 2 GB). Probably not OK for most apps
  • Defense in depth : If a hacker gains access to a system, defense in depth minimizes the adverse impact and gives engineers time to deploy updated countermeasures to prevent recurrence
  • Testing for reliability : Stability of measurement over time
  • Integrate via interconnection points : Securely connecting with partners, customers and multi-cloud using interconnection points
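The cascading-failure point about deadlines can be sketched as follows. This is a simplified illustration, not a real RPC framework: `handle`, `remaining`, `DeadlineExceeded` and the injectable `now` clock are all hypothetical names. The idea shown is that one absolute deadline travels with the request, and each stage checks it before starting work instead of starting a fresh timeout.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a request's absolute deadline has already passed."""


def remaining(deadline: float, now=time.monotonic) -> float:
    """Seconds left before the absolute deadline (may be negative)."""
    return deadline - now()


def handle(request_id: str, deadline: float, stages, now=time.monotonic) -> list:
    """Run stages in order, refusing to start work once the deadline is gone."""
    results = []
    for stage in stages:
        if remaining(deadline, now) <= 0:
            # The caller has already given up; doing the work only adds load.
            raise DeadlineExceeded(
                f"{request_id}: deadline missed before {stage.__name__}"
            )
        # Pass the same absolute deadline downstream, not a fresh timeout.
        results.append(stage(deadline))
    return results


if __name__ == "__main__":
    def lookup(deadline: float) -> str:
        return f"lookup ok with {remaining(deadline):.2f}s left"

    print(handle("req-1", time.monotonic() + 0.5, [lookup, lookup]))
```

A per-hop timeout restarts the clock at every service, so a request can keep consuming resources long after the original caller has timed out. A shared absolute deadline lets every hop drop that work early.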


Software technologist, experience in leading, designing, architecting software products for humans & machines.

Venkatachalam Rangasamy
