Rebuilding SRE, from Memory

Steve McGhee
2 min read · Oct 2, 2018

At the new gig, there is a desire for SRE. We have the book, and the new book. And now a third, even. But what are we missing?

Processes, forms, checklists.
Norms, tools,
ways of thinking.

omg omg omg stay cool

And so I find that I’m writing a lot of documents. At last count, I’m at about 150, but honestly a lot of those are meeting notes. Maybe even some postmortems :)

I think, though, it would be helpful to adapt a few of these for public consumption. So, here is a list of some things I’ve written, am writing, or intend to write Real Soon Now.

Notably, these aren’t really technical. They’re people-stuff.
But they’re SRE-people stuff. I suppose not a big surprise, if you know me :)

In no particular order:

  • Design Doc Template
  • Postmortem Template
  • Interviewing Template, Grading Rubric
  • Meeting Notes Template
  • SRE Org Structure Guidelines, Options
  • A Common Understanding of Production — “Problem explanation, discovery, recurrence, validation, prioritization, prevention”
  • Bug Response SLO
  • Postmortem Review Process
  • Release Guidelines
  • Blackbox Monitoring (Theory, Implementation)
  • Risk Analysis Template (h/t @xleem)
  • Production Reliability Principles (hey, already done!)
  • A Service Maturity Matrix
  • SLOs: Theory, Practice
  • Oncall: comp, norms, principles
  • Affecting Production While Avoiding Doom: or “using math for risk”
  • Review Boards (Product, Engineering, Production)
  • Launch Checklist (“Am I ready for traffic?”)
  • Monitoring Theory, Practice
  • Operational Norms
  • Cloud Observation Requirements
  • Debugging Distributed Systems
  • Understanding CAP, ACID, BASE
  • Asynchronous Jobs
  • Technical Debt and how to Make Progress
  • Escalating to Management: why it is a Good Idea
  • How to OKR and Why
  • Intro to Kubernetes (k8s)
  • Intro to Istio
  • Basic Cloud Topology and its Consequences
  • Intro to Cloud Load Balancing
  • Service Ownership: Beyond the NOC
  • Basic Capacity Planning for Cloudy Services
  • Documentation Requirements: Collaborating about Production

I hope, in time, to be able to publish (and improve) these so they might be helpful to the broader community.

If anything looks particularly appetizing, please let me know so I have a place to start.
