SRE Public Resources for GCP Customers
Apr 1, 2021
General Resources:
- SRE books (SRE, Workbook, Building Secure & Reliable Systems)
- A Practical Guide to Moving to Cloud
- SRE success starts with getting leadership on board
- Four steps to jumpstarting your SRE practice
- Anatomy of an incident
- [Coursera] Developing a Google SRE Culture
- [Coursera] Site Reliability Engineering: Measuring and Managing Reliability
- Art of SLOs classroom: The Art Of SLOs
How can you get started with SRE as an engineer/practitioner?
Intro:
- [Recording] Getting Started with SRE by Jennifer Petoff
- [YouTube playlist] Class SRE implements DevOps (Alternative: blog post)
Start by understanding the key concepts and terminology (reliability, CUJ, SLO, SLI, error budget):
- [Blog] SRE fundamentals: SLIs, SLAs and SLOs
- [Recording] ‘Achieving Resiliency on Google Cloud’ by Ben Treynor Sloss
- SLOs Cheatsheet
- Run The Art of SLOs workshop internally (SLOs Public resources)
- Apply what you learned to your own workload and create your first SLI/SLO
Start simple and iterate: begin with 1–3 CUJs and 1–3 SLIs/SLOs for each, use the data you already have, and close the gaps you identify along the way.
Additional: Reliability in GCP 5x9 in Google by Brad Calder
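To make the "start simple" advice concrete, here is a minimal sketch of a first availability SLI and its error budget, computed from request counts you may already collect. The `SLO` class, the function names, the `checkout-availability` journey, and the numbers are all hypothetical and chosen only for illustration; they are not taken from any of the resources listed here.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical SLO definition for one critical user journey (CUJ)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of events must be good


def sli(good_events: int, total_events: int) -> float:
    """SLI as the ratio of good events to all valid events."""
    if total_events == 0:
        # Assumption: with no traffic, treat the objective as met.
        return 1.0
    return good_events / total_events


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # No budget at all: untouched if nothing failed, otherwise blown.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Illustrative numbers only: 1,000,000 requests, 500 of them failed.
checkout = SLO(name="checkout-availability", target=0.999)
good, total = 999_500, 1_000_000
print(f"{checkout.name} SLI: {sli(good, total):.4%}")
print(f"Error budget remaining: {error_budget_remaining(checkout, good, total):.0%}")
```

With a 99.9% target, one million requests allow roughly 1,000 failures, so 500 failures spend about half of the budget. Starting with a ratio like this for one or two CUJs is usually enough to find where your data has gaps.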
GCP Operations (formerly Stackdriver):
- [Demo] Measuring Reliability: Step by Step SLO creation, Cloud OnAir.
- [Blog] A step-by-step guide to get started: Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox.
- [Qwiklabs] Measure Site Reliability using Cloud Operations Suite
- [Blog] Setting up Cloud Operations for GKE, Troubleshooting services on GKE, by Yuri Grinshteyn
- [Coursera] Logging, Monitoring and Observability in Google Cloud
Advanced:
- How Google’s SRE debug production issues
- The Calculus of Service Availability
- [Recording] Designing for Reliability in Production (Podcast)
- Are your SLOs realistic? How to analyze your risks like an SRE
NALSD Resources:
- Software Engineering Advice from Building Large-Scale Distributed Systems by Jeff Dean
- Introducing Non-Abstract Large System Design (SRE Workbook, Ch. 12)
- NALSD workshops: sre.google/classroom
- Distributed Log-Processing Design Workshop
- Borg cluster management for distributed computing
- Google distributed file system.
- Numbers everyone should know [Recording] | [PDF]
Incident Management:
- [Recording] Maintaining reliable systems, Conf42
- Incident Response
- Postmortem Culture: Learning from Failure
- Postmortem Action Items: Plan the Work and Work the Plan
- Shrinking the impact of production incidents using SRE principles — CRE Life Lessons
- Shrinking the time to mitigate production incidents using SRE principles — CRE Life Lessons
You can find more of Google’s SRE content on the DevOps & SRE Cloud Blog and in CRE Life Lessons.