SRE Public Resources for GCP Customers

Ayelet Sachto

2 min read · Apr 1, 2021

General Resources:

SRE books (SRE, Workbook, Building Secure & Reliable Systems)
A Practical Guide to Moving to Cloud
SRE success starts with getting leadership on board
Four steps to jumpstarting your SRE practice
Anatomy of an incident
[Coursera] Developing a Google SRE Culture
[Coursera] Site Reliability Engineering: Measuring and Managing Reliability
Art of SLOs classroom: The Art Of SLOs

How can you get started with SRE as an engineer/practitioner?

Intro:

[Recording] Getting Started with SRE by Jennifer Petoff
[YouTube playlist] Class SRE implements DevOps (Alternative: blog post)

Start by understanding the key concepts and terminology (reliability, CUJ, SLO, SLI, error budget):

[Blog] SRE fundamentals: SLIs, SLAs and SLOs
[Recording] ‘Achieving Resiliency on Google Cloud’ by Ben Treynor Sloss
SLOs Cheatsheet
Run The Art of SLOs workshop internally (SLOs Public resources)
Apply what you learned to your own workload and create your first SLI/SLO

Start simple and iterate: begin with 1–3 CUJs and 1–3 SLIs/SLOs for each, use the data you already have, and close the gaps you identify along the way.
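To make the terminology concrete, here is a minimal sketch of how an SLI and the remaining error budget could be computed for a single CUJ from request counts you already collect. The `SLO` class, the function names, the "checkout" journey and the request counts are all hypothetical, chosen only for illustration; they are not taken from any of the linked resources or from a Google Cloud API.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """A hypothetical availability SLO for one critical user journey (CUJ)."""
    cuj: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def sli(good_events: int, total_events: int) -> float:
    """SLI expressed as the ratio of good events to total events."""
    if total_events == 0:
        return 1.0  # assumption: no traffic is treated as meeting the objective
    return good_events / total_events


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers only: 1,000,000 requests, 500 of them failed.
checkout = SLO(cuj="checkout", target=0.999)
good, total = 999_500, 1_000_000

print(f"SLI: {sli(good, total):.4%}")
print(f"Error budget remaining: {error_budget_remaining(checkout, good, total):.0%}")
```

With a 99.9% target over one million requests, roughly 1,000 failures are allowed, so 500 failures spend about half of the budget. Starting with a ratio-based SLI like this keeps the first iteration simple; the measurement window and the definition of a "good" event can be refined later.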

Additional: Reliability in GCP 5x9 in Google by Brad Calder

GCP Operations (formerly Stackdriver):

[Demo] Measuring Reliability: Step by Step SLO creation, Cloud OnAir
[Blog] Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operations Sandbox, a step-by-step guide to get you started
[Qwiklabs] Measure Site Reliability using Cloud Operations Suite
[Blog] Setting up Cloud Operations for GKE and Troubleshooting services on GKE, by Yuri Grinshteyn
[Coursera] Logging, Monitoring and Observability in Google Cloud

Advanced:

NALSD Resources:

Software Engineering Advice from Building Large-Scale Distributed Systems by Jeff Dean
Introducing Non-Abstract Large System Design (SRE book, Ch. 12)
NALSD workshops: sre.google/classroom
Distributed Log-Processing Design Workshop
Borg cluster management for distributed computing
Google distributed file system.
Numbers everyone should know [Recording] | [PDF]

Incident Management:

You can find more of Google’s SRE content on the DevOps & SRE Cloud Blog and in CRE Life Lessons.