SRE Toolkit: Service Reliability Calculator and Scoring Scale

Tammy Butow
Site Reliability Engineering
2 min read · Sep 29, 2020

Overview

Understanding the relative reliability of your services will enable you to prioritize your reliability efforts. Determining your critical path, understanding all services along this path, and then calculating their reliability scores will enable you to identify the top services you need to focus on improving immediately.

Which Services Should I Score First?

I recommend that you first roll this out as a pilot with your critical services. If you don’t yet know which services are critical, I recommend spending a day in a workshop-style sync mapping out your critical path and all the services along it. These will generally be user-impacting services, revenue-related services, and the services that are critical dependencies of either.

How to Map Out the Critical Path for Your Services

I recommend using flow chart software and starting from the first moment the user takes an action with one of your services. Then map out the actual services that are on that critical path. You may identify services that sit on the critical path but ideally would not. This is ok. It’s great to have the simplest critical path possible, because a simpler path has fewer potential failure modes.
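Once the flow chart exists, the same map can be kept as data so the list of critical-path services is easy to regenerate. The following is a minimal sketch, assuming you record each service's hard dependencies by hand. The service names and the function `services_on_critical_path` are hypothetical, made up for illustration only.

```python
from collections import deque

def services_on_critical_path(dependencies, entry_point):
    """Return every service reachable from the entry point, in breadth-first order.

    `dependencies` maps a service name to the services it calls directly.
    Assumption: every listed dependency is a hard one, so anything reachable
    from the user's first action counts as being on the critical path.
    """
    ordered = [entry_point]
    seen = {entry_point}
    queue = deque([entry_point])
    while queue:
        service = queue.popleft()
        for dep in dependencies.get(service, []):
            if dep not in seen:
                seen.add(dep)
                ordered.append(dep)
                queue.append(dep)
    return ordered

# Hypothetical example map: the user's first action hits "checkout-web".
example_dependencies = {
    "checkout-web": ["cart-api", "payments-api"],
    "cart-api": ["inventory-db"],
    "payments-api": ["fraud-check", "inventory-db"],
    "recommendations": ["inventory-db"],  # not reachable from checkout-web
}

critical_services = services_on_critical_path(example_dependencies, "checkout-web")
```

In this made-up map, "recommendations" never appears in the result because nothing on the checkout flow calls it. A service that shows up in the result but has no business being there is a candidate for removal from the critical path.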

Service Scores

Now I will walk you through how to score the critical services you have decided to focus on. It’s important to re-score services each quarter. I recommend starting off by doing this manually in an interview-style sync with the service owner.

A System That Is A+ Grade:

  • Capacity Plan
  • Regular chaos engineering (daily)
  • No SEV 0s for 12+ months
  • No post-mortem action items open
  • 24/7 On-Call Rotation
  • Monitoring
  • Alerting
  • Disaster Recovery Plan

A System That Is A Grade:

  • Capacity Plan
  • Regular chaos engineering (weekly)
  • No SEV 0s for 6 months
  • No post-mortem action items open
  • 24/7 On-Call Rotation
  • Monitoring
  • Alerting
  • Disaster Recovery Plan

A System That Is B Grade:

  • Capacity Plan
  • Regular chaos engineering (monthly)
  • No SEV 0s for 3 months
  • 24/7 On-Call Rotation
  • Monitoring
  • Alerting
  • Disaster Recovery Plan

A System That Is C Grade:

  • Capacity Plan
  • Regular chaos engineering (quarterly)
  • No SEV 0s for 3 months
  • Business Hours On-Call Rotation
  • Monitoring
  • Alerting
  • Disaster Recovery Plan

A System That Is D Grade:

  • Capacity Plan
  • Regular chaos engineering (quarterly)
  • Business Hours On-Call Rotation
  • Monitoring
  • Alerting
  • Disaster Recovery Plan

A System That Is E Grade:

  • This is a bucket grade for services that do not meet the criteria for an A+ to D grade
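If you later want to move from interview-style scoring to something repeatable, the scale above could be encoded roughly as follows. This is a sketch, not a definitive implementation, and it makes several assumptions the scale does not spell out: a more frequent chaos engineering cadence also satisfies a less frequent requirement, a 24/7 rotation also satisfies a business hours requirement, and "No SEV 0s for N months" means at least N months since the last SEV 0. The `ServiceProfile` fields and the `grade` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Chaos engineering cadence, ranked from most to least frequent.
CADENCE_RANK = {"daily": 4, "weekly": 3, "monthly": 2, "quarterly": 1}

@dataclass
class ServiceProfile:
    capacity_plan: bool
    monitoring: bool
    alerting: bool
    disaster_recovery_plan: bool
    chaos_cadence: Optional[str]   # "daily", "weekly", "monthly", "quarterly", or None
    months_since_sev0: float       # months since the last SEV 0
    open_postmortem_items: int     # open post-mortem action items
    on_call: Optional[str]         # "24/7", "business_hours", or None

def grade(s: ServiceProfile) -> str:
    """Map a service profile to a grade on the A+ to E scale."""
    # Every grade from A+ to D requires these four items.
    baseline = all([s.capacity_plan, s.monitoring, s.alerting, s.disaster_recovery_plan])
    cadence = CADENCE_RANK.get(s.chaos_cadence or "", 0)
    if not baseline or cadence == 0 or s.on_call not in ("24/7", "business_hours"):
        return "E"  # bucket grade

    full_time = s.on_call == "24/7"
    no_open_items = s.open_postmortem_items == 0

    if full_time and no_open_items and cadence >= 4 and s.months_since_sev0 >= 12:
        return "A+"
    if full_time and no_open_items and cadence >= 3 and s.months_since_sev0 >= 6:
        return "A"
    if full_time and cadence >= 2 and s.months_since_sev0 >= 3:
        return "B"
    if s.months_since_sev0 >= 3:
        return "C"
    return "D"

# Hypothetical service: weekly chaos experiments, 7 months since the last SEV 0.
example = ServiceProfile(True, True, True, True, "weekly", 7, 0, "24/7")
example_grade = grade(example)
```

Checking the strictest grade first means a service always receives the highest grade it qualifies for. Running this each quarter against an updated profile gives you a consistent starting point for the conversation with the service owner.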
