SRE - Difference Between SLAs, SLOs and SLIs

Hemant Jain
DevOps and SRE Learning
9 min read · Jun 4, 2022

The world of Site Reliability Engineering is filled with acronyms — especially ones that start with S.

In addition to SRE (which can stand for both Site Reliability Engineering and Site Reliability Engineer), there are three other essential S acronyms to know: SLA, SLO and SLI.

Understanding what SLA, SLO and SLI mean, and how they relate to SRE, can be a bit tricky. The differences between the three terms are small, yet important, and you don’t want to make the mistake of conflating these terms.

In this article, I will explain exactly what SLA, SLO and SLI mean, and discuss the similarities and differences between them.

Every tech company providing a service, whether it be free or paid, shares one similar objective: Deliver the best possible experience in order to attract and retain users. After all, without the users there is no reason (or money for that matter) for the service to exist.

When using a service, you want to be able to trust it will perform as promised. If Google suddenly became notorious for outages and slowdowns we’d likely see a mass exodus of users looking for a new search engine. Yet because of Google’s ability to consistently meet user expectations and deliver (at least) 99.99% uptime month-after-month, the search engine giant continues to dominate with over 70,000 searches every second.

Maintaining these high uptime percentages isn’t just something Google “shoots for” every month because it looks good. Their Monthly Uptime Percentage is a key indicator that is measured in order to determine whether or not they’re delivering on the promises made to their users — in this case, a search engine that works as planned 99.99% of the time. Not bad, Google, not bad at all.

These different promises or agreements that tech companies make with their customers are often defined within a Service Level Agreement (SLA). An SLA consists of one or more Service Level Objectives (SLOs), which are tracked and monitored by measuring specific Service Level Indicators (SLIs).

Companies define, track, and monitor these SLAs, SLOs and SLIs with the goal of creating a more reliable service for their customers. But what exactly do these terms mean and how do they relate to one another?

The goal of all three things is to get everybody — vendor and client alike — on the same page about system performance. How often will your systems be available? How quickly will your team respond if the system goes down? What kind of promises are you making about speed and functionality? Users want to know — and so you need SLAs, SLOs, and SLIs.

Ref: https://www.atlassian.com

What is a Service Level Agreement (SLA)?

An SLA, or Service Level Agreement, is an agreement made between a company and the users of a given service. The SLA defines the promises that the company makes to users regarding specific metrics, such as service availability. For example, Google’s SLA from our earlier example promises the user a Monthly Uptime Percentage of at least 99.99%.

What are the SLA Challenges?

Unfortunately SLAs are often written by a company’s business or legal team with little to no input from the tech team. Without involving tech in the writing process, an SLA can end up leaving out important aspects and be extremely difficult to measure.

For example, an SLA may promise that teams will resolve reported issues with Product X within 24 hours. But that same SLA doesn’t spell out what happens if the client takes 24 hours to send the answers or screenshots your team needs to diagnose the problem. Has the team’s 24-hour window been eaten up by the client’s delay, or does the clock stop and restart based on when the client responds? SLAs need to answer these questions, but they often fail to do so, which has created a lot of animosity toward them among IT managers.

For many experts, the answer to this challenge is, first and foremost, that tech should be involved in the creation of SLAs. The more the business and legal teams collaborate with IT and DevOps when writing SLAs, the more those SLAs will reflect real-world scenarios, such as clients delaying their own issue resolution.
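
One way to make the "does the clock stop?" question concrete is to write the rule down as code. The sketch below is only an illustration, not how any particular ticketing tool works: the function name `elapsed_sla_time`, the ticket timestamps, and the rule that time spent waiting on the client is not counted are all assumptions I made for the example.

```python
from datetime import datetime, timedelta


def elapsed_sla_time(opened, now, client_waits):
    """Time counted against the SLA, excluding periods spent waiting on the client.

    client_waits is a list of (start, end) datetimes, assumed not to overlap.
    """
    paused = timedelta()
    for start, end in client_waits:
        # Clip each waiting period to the [opened, now] window.
        start = max(start, opened)
        end = min(end, now)
        if end > start:
            paused += end - start
    return (now - opened) - paused


# Hypothetical ticket: opened 30 hours ago, 20 of those hours spent waiting on the client.
opened = datetime(2022, 6, 1, 9, 0)
now = datetime(2022, 6, 2, 15, 0)
waits = [(datetime(2022, 6, 1, 12, 0), datetime(2022, 6, 2, 8, 0))]

counted = elapsed_sla_time(opened, now, waits)
print("Counted against the SLA:", counted)
print("Within 24-hour window:", counted <= timedelta(hours=24))
```

With a pausing clock this ticket is still inside its 24-hour window; with a wall-clock rule it would already be in breach. Whichever rule you pick, the SLA should state it explicitly.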

Who needs an SLA?

An SLA is an agreement between a vendor and a paying customer. Companies providing a service to users for free are unlikely to want or need an SLA for those free users.

What is a Service Level Objective (SLO)?

An SLO, or Service Level Objective, is the promise that a company makes to users regarding a specific metric such as incident response or uptime. SLOs exist within an SLA as individual promises contained within the full user agreement.

The SLO is the specific goal that the service must meet in order to be in compliance with the SLA. According to Google Product Managers Jay Judkowitz and Mark Carter, an SLO should “define the lowest level of reliability that you can get away with for each service.” In Google’s SLA that promises a 99.99% Monthly Uptime Percentage, the SLO is 99.99%.

What are the SLO Challenges?

A common problem with SLOs is that they are too vague, too complicated, or impossible to measure. SLOs should always be simple, clearly defined, and easily measured, so that it is obvious whether or not the objective is being fulfilled. This will also keep your engineers from hitting roadblocks when something doesn’t make sense.

SLOs get less hate than SLAs, but they can create just as many problems if they’re vague, overly complicated, or impossible to measure. The key to SLOs that don’t make your engineers want to tear their hair out is simplicity and clarity. Only the most important metrics should qualify for SLO status, the objectives should be spelled out in plain language, and, as with SLAs, they should always account for issues such as client-side delays.

When to use SLOs?

Where SLAs are only relevant in the case of paying customers, SLOs can be useful for both paid and unpaid accounts, as well as internal and external customers.

Internal systems, such as CRMs, client data repositories, and intranet, can be just as important as external-facing systems. And having SLOs for those internal systems is an important piece of not only meeting business goals but enabling internal teams to meet their own customer-facing goals.

SLAs vs. SLOs: What’s the Difference?

SLAs are used externally to define an agreement between a company and the paying users of its service. SLOs are objectives that are measured internally to determine whether the SLA is being met. If an SLO is at risk of being violated, teams must respond quickly to avoid breaching the SLA.

These SLOs are measured by closely monitoring key Service Level Indicators (SLIs).

What is a Service Level Indicator (SLI)?

An SLI, or Service Level Indicator, is a key metric used to determine whether or not the SLO is being met. It is the measured value of the metric described within the SLO. So, where Google’s SLO is 99.99%, the SLI is the actual measured value at the time. In order to remain in compliance with the SLA, the SLI’s value must always meet or exceed the value determined by the SLO. A good incident response plan is critical to quickly resolving any moments of downtime when they do happen.
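
To see how the SLI relates to the SLO, here is a minimal sketch that computes an availability SLI from request counts and compares it to a 99.99% target. The request numbers are made up, and the function names and the choice to treat "no traffic" as 100% available are my own assumptions, not part of any standard.

```python
def availability_sli(good_events, total_events):
    """Measured availability, as a percentage of successful events."""
    if total_events == 0:
        return 100.0  # assumption: no traffic means nothing failed
    return 100.0 * good_events / total_events


def meets_slo(sli, slo_target):
    """The SLO is met when the measured SLI meets or exceeds the target."""
    return sli >= slo_target


SLO_TARGET = 99.99  # the objective, in percent

# Hypothetical month: 1,000,000 requests, 50 of which failed.
sli = availability_sli(good_events=999_950, total_events=1_000_000)
print(f"SLI = {sli:.3f}%, meets SLO: {meets_slo(sli, SLO_TARGET)}")
```

The SLO is the fixed target; the SLI is whatever the measurement turns out to be at that moment.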

What are the SLI Challenges?

To prevent over-complicating things, it’s important to keep things simple and choose the right key metrics to monitor. Tracking too many metrics will just make for more work that makes little difference to the user.

Create a detailed disaster recovery plan

What will you do when downtime strikes? If you don’t already know the answer to that question, the default answer will be “waste precious time figuring out what to do.”

The better your incident response plan, the quicker and more effectively your teams will handle incidents. Which is why the first step of any new incident management program should be process and planning.

When to Use SLIs?

Any company measuring their performance against SLOs needs SLIs in order to make those measurements. You can’t really have SLOs without SLIs.

Ref: https://www.atlassian.com

SLA, SLO, and SLI best practices

1. Craft SLAs around customer expectations

Every part of your customer agreement should be crafted around what matters to the customer. On the back end, an incident may mean addressing 10 different components. But in the client’s view, all that matters is that the system functions as expected.

Your SLAs and SLOs should reflect this reality. Don’t overcomplicate things by drilling down to a granular level and making individual promises for each of those 10 components. Keep your promises confined to the high-level, user-facing functionality. This will keep clients happier and less confused and simplify the lives of IT pros responsible for making good on your SLA promises.

2. Use plain language in SLAs

Clients won’t always ask for clarification, so if your SLA language is complicated, you’re probably setting yourself up for some painful misunderstandings down the line. The simpler your language, the less likely client conflict is in your future.

3. With SLOs, less is more

Not every metric is vital to client success, which means not every metric should be an SLO. Commit to as few SLOs as possible and focus on the ones that matter most to customers.

4. Not every trackable metric should be an SLI

Similarly, tracking performance on 10 components for each of 10 SLOs can get unwieldy very quickly. Instead, strategically choose which metrics actually matter to your core SLOs and put your energy into tracking those effectively.

5. Include factors outside the IT team’s control

What happens when the client is the one slowing down time to resolution? If you aren’t clear on this in your SLA, your team may be held to the impossible standard of resolving client issues without client involvement.

6. Build in an error budget

Leaving room for failures not only protects the business from SLA violations and hefty consequences, it also leaves room for agility — for the team to make changes quickly and have the space to try innovative new solutions that might fail.

Google actually recommends using leftover error budget for planned downtime, which can help you identify unforeseen issues (e.g. services using servers inappropriately) and maintain appropriate expectations from your clients.

7. Don’t shoot for the moon

Just because your team can probably maintain 99.99% uptime doesn’t mean that 99.99% should be your SLO number. It’s always better to under-promise and overdeliver. This is especially true for agile teams who want to launch early and often and need an error budget to keep up that quick pace.
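
The last two points are easier to weigh with numbers in front of you. The sketch below converts a time-based availability SLO into an error budget, expressed as minutes of allowed downtime. The 30-day window and the function name are assumptions for illustration; use whatever window your SLO is actually defined over.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime, in minutes, for a time-based availability SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for target in (99.0, 99.9, 99.99):
    budget = error_budget_minutes(target)
    print(f"{target}% over 30 days leaves {budget:.2f} minutes of error budget")
```

Each extra nine cuts the budget by a factor of ten, which is why a slightly lower SLO can leave a team far more room for launches and experiments.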

How does this impact SREs?

For those of you following Google’s model and using Site Reliability Engineering (SRE) teams to bridge the gap between development and operations, SLAs, SLOs, and SLIs are foundational to success. SLAs help teams set boundaries and error budgets. SLOs help prioritize work. And SLIs tell SREs when they need to freeze all launches to save an endangered error budget — and when they can loosen up the reins.
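
As a rough illustration of that last point, the sketch below decides whether to freeze launches based on how much of a request-based error budget is left. The 10% threshold, the function names, and the traffic numbers are all hypothetical; a real policy would be agreed between the SRE and product teams.

```python
def budget_remaining(slo_percent, total_requests, failed_requests):
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    if allowed_failures == 0:
        return 0.0  # a 100% SLO, or no traffic, leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


def should_freeze_launches(remaining, threshold=0.1):
    """Freeze when less than `threshold` of the budget is left (threshold is an assumption)."""
    return remaining < threshold


# Hypothetical window: 2,000,000 requests against a 99.9% SLO (about 2,000 allowed failures).
healthy = budget_remaining(99.9, 2_000_000, failed_requests=500)
endangered = budget_remaining(99.9, 2_000_000, failed_requests=1_900)

print(f"healthy: {healthy:.0%} left, freeze = {should_freeze_launches(healthy)}")
print(f"endangered: {endangered:.0%} left, freeze = {should_freeze_launches(endangered)}")
```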

Real-World Performance Benchmarks

Keep in mind that the above numbers are simply for demonstration purposes. One interesting resource for real-world figures is API.Expert, a service that queries popular APIs and posts weekly performance statistics. Since APIs are the heart of many UI-based platforms (and our digital economy at large), these benchmarks stand as a good indicator of average uptimes and latencies in the industry.

For example, at the time of writing, API.Expert’s Enterprise APIs collection ranked Pivotal Tracker at the top, with a 100.00% pass rate and a 248 ms latency. On the other end of the spectrum, GitHub is at 99.93% with a 244 ms latency and Box is at 99.99% with a 406 ms latency.

SLIs, SLAs and SLOs — Oh My!

Although it may sound good to an unpracticed ear, an SLA of 99.99% still equates to 52 minutes and 36 seconds of downtime per year. That’s nearly an hour of downtime in which customers are left scratching their heads or, worse, searching for other options. For traumatic health care situations, a loss of connectivity could be a matter of life and death.
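
If you want to check that figure yourself, the arithmetic is short. The sketch below assumes an average year of 365.25 days, which is what yields 52 minutes and 36 seconds; with a 365-day year the result comes out about two seconds lower. The function name is my own.

```python
def yearly_downtime(availability_percent, days_per_year=365.25):
    """Allowed downtime per year as (minutes, seconds), rounded to the nearest second."""
    seconds_per_year = days_per_year * 24 * 3600
    downtime_seconds = seconds_per_year * (100.0 - availability_percent) / 100.0
    return divmod(round(downtime_seconds), 60)


minutes, seconds = yearly_downtime(99.99)
print(f"99.99% availability allows {minutes} min {seconds} s of downtime per year")
```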

Although creating SLAs and SLOs is important to gauge system health, the reality is that it can be challenging to track and enforce them. “These agreements — generally written by people who aren’t in the tech trenches themselves — often make promises that are difficult for teams to measure,” according to the Atlassian knowledge center.

In summary, SLIs demonstrate the real behavior of software systems. These metrics inform the creation of SLAs, which must be met to honor B2B agreements. These SLAs often reference particular service-level objectives (SLOs) that must be met, which usually give more breathing room around SLIs. Lastly, in a digital economy with accelerating digital expectations, it makes sense to monitor internal SLOs and improve baselines over time.

Please contact me (Hemant Jain) if you have any queries or need more details on this.

Please feel free to share your feedback and I would very much welcome it :) Thanks for reading. Have a nice day!!!

Hemant Jain
DevOps and SRE Learning

Sr. SRE at Oracle, Ex-PayPal, Ex-RedHat. Professional Graduate Student interested in Cloud Computing and Advanced Big Data Processing and Optimization.