Realistic SLAs

Experiences that Prepared Me for the Cloud DevOps Engineer Exam, Part II

Aron Eidelman
Google Cloud - Community
Nov 29, 2022

Disclosure: I am a Google employee. The ideas reflected in this post are personal and do not reflect my employer’s views. The following is Part II of my three-part blog series, Experiences that Prepared Me for the Cloud DevOps Engineer Exam.

In my experience, setting objectives was one of the most challenging topics I encountered on the exam.

The challenges involve the motivations and ideas of the people setting the objectives rather than just the technology used to obtain them.

“Hope is not a strategy.” - Traditional SRE saying

Let’s start with SLAs.

A few years ago, I was consulting for a small startup. They had an SLA on their website for “99.99% availability.” I asked what justified this number, and it turned out someone had picked it arbitrarily because it seemed like the industry norm and coincided with the SLAs from their cloud provider.

The risk was obvious enough: if 99.99% was purely aspirational, there was a good chance they’d fail to meet it, which would cost the business.

We decided to find a more realistic number.

Alex Ewerlöf has an excellent overview of how to calculate a composite SLA. Needless to say, once we looked more closely at the architecture, reviewed the SLAs of the cloud provider’s services, and observed the application itself, we ended up with a number more humble than 99.99%.
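To make the composite calculation concrete, here is a minimal sketch in Python. It assumes component failures are independent, that components in a chain must all be up, and that redundant replicas only need one survivor. The component names and availability figures are hypothetical and are not the startup's real numbers.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain where every component must be up."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability of redundant replicas where any one being up is enough."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures for illustration only.
load_balancer = 0.9999
app_tier = redundant_availability([0.999, 0.999])  # two replicas
database = 0.9995

composite = serial_availability([load_balancer, app_tier, database])
print(f"Composite availability: {composite:.4%}")
```

Under these assumptions, the composite figure always lands below the weakest component in the serial chain, which is why a provider's 99.99% rarely survives once an application is built on top of it.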

To everyone’s relief, this lower SLA didn’t hurt the startup’s sales.

What their customers cared about wasn’t the precise number of minutes of downtime a month before they got a refund, but what would happen if there were an outage.
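For a sense of scale, the "minutes a month" behind an availability target can be worked out directly. This is a rough sketch that assumes a 30-day window; real SLAs define their own measurement periods and exclusions.

```python
def allowed_downtime_minutes(availability, days=30):
    """Minutes of downtime permitted in the window at a given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability)


# Example targets chosen for illustration.
for target in (0.9999, 0.999, 0.995):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%}: {minutes:.1f} minutes per 30 days")
```

At 99.99%, that is roughly 4.3 minutes in a 30-day month, which leaves very little room for a human to notice and respond to an incident.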

The service could block users from logging in, so it needed to fail open in case of an outage. Finding a graceful way to handle failure was a common topic of conversation, whereas the SLA was an afterthought.
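As a rough illustration of what "fail open" can mean in code, the sketch below treats an unreachable dependency differently from a real denial. The function names and the policy check are hypothetical, and whether failing open is acceptable depends entirely on what is being protected.

```python
def is_allowed(user_id, check_policy, fail_open=True):
    """Ask a policy check for a decision; fall back if the check itself fails."""
    try:
        return check_policy(user_id)
    except (TimeoutError, ConnectionError):
        # The dependency is down, not the user: use the configured default.
        return fail_open


def broken_check(user_id):
    """Hypothetical stand-in for a policy service that is unavailable."""
    raise TimeoutError("policy service unavailable")


print(is_allowed("alice", broken_check))                   # fail open
print(is_allowed("alice", broken_check, fail_open=False))  # fail closed
```

Making the fallback an explicit parameter keeps the decision visible, so it can be discussed with stakeholders rather than buried in an exception handler.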

While any company consuming a service has probably bumped into an SLA, it’s important to note that some global services do not have or need them:

Google Search is an example of an important service that doesn’t have an SLA for the public: we want everyone to use Search as fluidly and efficiently as possible, but we haven’t signed a contract with the whole world. Even so, there are still consequences if Search isn’t available — unavailability results in a hit to our reputation, as well as a drop in advertising revenue. Many other Google services, such as Google for Work, do have explicit SLAs with their users. Whether or not a particular service has an SLA, it’s valuable to define SLIs and SLOs and use them to manage the service. (SRE Handbook, Ch. 4 — Service Level Objectives)

The topics covered in the exam tend to focus on SLOs, or the SLIs they’re based on. They are framed much like the experience above, which showed me the need for a two-way understanding between business stakeholders and engineering.

Engineering should look to business stakeholders to define what customers need at the “service level.” Otherwise, engineers may set objectives that are irrelevant to the business and overcommit resources to them.

In my example, we discovered after some trial and error that a slightly lower SLA didn’t affect sales. Predicting this type of impact in advance is possible, even when setting internal SLOs for services that don’t have SLAs.

The understanding has to be mutual. Business stakeholders should look to engineering to understand the cost and complexity tradeoffs of raising their objectives, especially the effect on release velocity.

In my example, suppose the business insisted on sticking to its initial 99.99% claim. How much more would that have cost, and how much would it hold up development?
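One way to frame that question is as an error budget: the share of requests allowed to fail before the objective is missed. The sketch below uses a hypothetical traffic figure and a simple request-based budget, which is an assumption on my part rather than how the startup measured anything.

```python
def error_budget(total_requests, slo):
    """Number of requests allowed to fail in the window at a given SLO."""
    return total_requests * (1 - slo)


monthly_requests = 10_000_000  # hypothetical traffic volume

for slo in (0.9999, 0.999):
    budget = error_budget(monthly_requests, slo)
    print(f"SLO {slo:.2%}: about {budget:.0f} failed requests allowed")
```

Moving from 99.9% to 99.99% cuts the budget by a factor of ten, so every risky release or migration has a tenth of the room to go wrong.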

Recognizing that the risk existed and that a more modest claim would be a safer bet was ultimately sound business reasoning. It took engineers to validate the logic and identify the hidden tension.

Determining how a technical decision affects customers, your business, and the teams building and delivering the product will help with these topics more than narrowly focusing on how systems behave internally.

Do you have similar stories to share? Check out Reliability Engineering, which has a lean coffee format where you can propose topics to discuss, and people can vote on their favorites. A discussion about SLOs in that group gave me a great mental model that helped me during the exam and helped me come up with my post on why to prioritize symptoms over causes.

In my final post coming next week, we’ll look at what to do in the inevitable case of failure.


DevSecOps at Google, Board Chair at Azure Printed Homes, Dadalorian at Home