I put out a call on Twitter asking what I should write about. Two of the responses resonated with me.
These are things I’ve been thinking about a lot because I’m just about to launch into a new role: once I get back to London after my vacation, I will join the new Google Customer Reliability Engineering team. A large part of that team’s job is taking SRE principles and convincing those who have never experienced them to apply them.
The opinions stated here are my own, not those of my company.
Getting an organisation to accept SRE principles starts with measuring your customers’ experience. How well is your critical application performing right now? Is it broken? Is it up? How often is it down?
Defining my terms, which I have talked about at length in my Commentary on Site Reliability Engineering:
- SLI (Service Level Indicator): Measurement of your application, indicating how well it is performing.
- SLO (Service Level Objective): The goal for your SLI, beyond which your customers will be unhappy with your service.
- Error Budget: An SLO applied over a time period, which can be ‘spent’ when out of SLO, and accumulated when availability is better than the SLO.
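To make the three terms concrete, here is a minimal sketch of how they relate for a request-based availability SLI. The class name, the field names and the choice of counting good requests over a fixed window are my assumptions for illustration; your own SLI may well be latency-based or time-based instead.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: an SLO applied to one measurement window."""

    slo_target: float   # e.g. 0.999 means 99.9% of requests should be good
    total_events: int   # all requests seen in the window
    good_events: int    # requests that met the SLI's definition of "good"

    def __post_init__(self) -> None:
        if not 0 < self.slo_target < 1:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_events <= 0:
            raise ValueError("total_events must be positive")

    @property
    def sli(self) -> float:
        """The indicator itself: fraction of good events in the window."""
        return self.good_events / self.total_events

    @property
    def allowed_bad_events(self) -> float:
        """How many bad events the SLO tolerates over this window."""
        return (1 - self.slo_target) * self.total_events

    @property
    def budget_remaining(self) -> float:
        """Fraction of the budget left; negative means it is overspent."""
        bad_events = self.total_events - self.good_events
        return 1 - bad_events / self.allowed_bad_events

    @property
    def within_slo(self) -> bool:
        return self.sli >= self.slo_target
```

For example, `ErrorBudget(slo_target=0.999, total_events=1_000_000, good_events=999_500)` describes a window where 500 of an allowed 1,000 bad requests have been used, so about half the budget is left.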
Measuring these is not technically hard; gaining consensus on how to measure them is extremely difficult.
Expect getting consensus and sign-off to take weeks or months. Be firm! You want sensible indicators that show when customers experience real pain! You might find yourself talking to a void, free to make all your own decisions, or stuck in a room with too many decision makers.
The biggest misstep I have made in an SRE team is not being personally and collectively invested enough in having good quality SLIs. This should be a whole team effort and is literally the most important thing you will ever do.
Work at it, be patient, read lots, analyse your data. You might end up living with these decisions for a long time. Don’t stop until you’re truly happy.
Once you have these indicators, which are the biggest and hardest part of the job, things become easier because every decision your SRE group makes should center around those indicators.
On-call response actions: Entirely centered around your service-level indicators. If you have “downtime” that didn’t get detected by your SLIs, go back and fix your SLIs, and don’t let things that don’t affect the SLI be emergency actions!
Error Budgets: Celebrate being within error budget (i.e. staying within SLO), and when you run out of your budget, go exploring for the biggest cause and have meetings to address how to make sure that won’t happen again.
Project Planning: Embed your SLIs directly into the project planning. Treat any project that will address an error budget line-item as having a value equivalent to how much that will help you stay within budget. Don’t allow these to descend into qualitative terms like “too unreliable” or “that system is unstable”. Demand quantitative justifications: “This system cost us 20–30% of our error budget every month for the last year and we can get that to <1%.”
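One way to get to that kind of statement is to attribute out-of-SLO time to a cause and express each cause as a share of the budget. The sketch below assumes you already record incidents as (cause, bad minutes) pairs and that your budget is expressed in minutes; the function name and the data shape are hypothetical, not something any particular tool provides.

```python
from collections import defaultdict


def budget_share_by_cause(incidents, allowed_bad_minutes):
    """Return {cause: fraction of error budget consumed}, largest first.

    incidents: iterable of (cause, bad_minutes) pairs for one budget window.
    allowed_bad_minutes: the error budget for that window, in minutes.
    """
    if allowed_bad_minutes <= 0:
        raise ValueError("allowed_bad_minutes must be positive")

    # Sum the out-of-SLO minutes attributed to each cause.
    totals = defaultdict(float)
    for cause, bad_minutes in incidents:
        totals[cause] += bad_minutes

    # Express each total as a share of the budget and rank by impact.
    shares = {cause: minutes / allowed_bad_minutes
              for cause, minutes in totals.items()}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))
```

With a 99.9% SLO over 30 days the budget is roughly 43.2 minutes, so a system responsible for about 10.8 bad minutes in that window has cost you around a quarter of the budget, which is exactly the kind of number a project proposal should lead with.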
Company Reports: Be visible, be quantifiable. If you can link your SLIs directly to company-wide performance metrics such as profit, growth and customer happiness, and report to your company leadership where reliability shortfalls are hurting the company, then logically you should be able to get your COO and CFO “on your side” in any heated discussion about priorities, even if other technical departments disagree.
If you would like to expand on any of the points above or want to give me a muse, please either comment here or on Twitter.