
Navigating Service Level Objectives Series: A practical guide to reliability in tb.lx’s transportation world

tb.lx
tb.lx insider


Hello there! I am Adrien, Principal Operations Engineer at tb.lx, and I’m excited to kick off this series of articles. Here at tb.lx, we’re on a mission to create a world of connected and sustainable transportation. As we are ramping up our DevOps maturity, we also want to implement the Site Reliability Engineering (SRE) discipline in our product teams. This series of articles is all about implementing Service Level Objectives (SLOs), a crucial aspect of both DevOps and SRE.

Embracing Reliability: Our journey adopting SLOs

Our journey adopting SLOs started modestly, by merely tracking the daily availability of our REST APIs. This initial setup, though simple, was our first step into understanding the health and performance of our products.

However, the real turning point came when one of our product teams was tasked with overhauling a legacy system that had become notorious for its instability. This system had reached a critical tipping point in scale, where it was not performing adequately, leading to a dissatisfied and frustrated user base.

For the team, the mission was two-fold. First, to understand the performance of different aspects of the legacy system, identifying which parts were functioning well and which were not. Second, to bootstrap a brand-new platform that would not only match the feature set of the existing system but would prioritize reliability from the get-go.

Implementing a comprehensive framework for SLOs became an essential part of this mission. It allowed us to treat reliability as a first-class citizen throughout the entire lifecycle of the new platform. By discussing objectives with stakeholders, ensuring that the developed solutions met the SLOs, and handling incidents efficiently to avoid breaching SLOs, we managed to build a system that addresses the major pain points affecting our users.

The impact of this transformation was profound. It not only improved the product and satisfied our stakeholders but also ensured the satisfaction of our clients. Encouraged by this success, we decided to make using SLOs a standard practice across all our teams.

This series of articles is a reflection of our journey. We will share the lessons we learned and the challenges we faced, and provide a detailed account of how we implemented SLOs. Our hope is that, by sharing our experience, we can help other organizations embark on their own journey towards adopting SLOs, and ultimately build more reliable and robust systems.

From Ideas to Practice

SLOs are like a promise we make to ourselves and our users. They’re a way to say: “This is how well our services should work, and we won’t settle for less”.

Beyond service quality, SLOs have a critical impact on resource optimization, incident management, and customer satisfaction, making them an essential tool for any organization aiming for operational excellence.

We’ll start at a high level, explaining what SLOs are and why they are important. However, we’re not just focusing on theory in this series. We’re rolling up our sleeves to show you how SLOs can turn into action. It’s not about using fancy words — it’s about making sure our products work, and we’ll share the tools and tricks we’re using.

We won’t leave you guessing. You’ll get a clear picture of how SLOs fit into different parts of our work, whether it’s handling data, managing APIs, or more. We’re here to make it all make sense.

First Part: Demystifying Service Level Objectives

In this first article, our goal is to set the terms straight. Service Level Objectives are a core concept of Site Reliability Engineering and are associated with other concepts such as Service Level Indicators, Service Level Agreements, or Error Budgets. Understanding these concepts is key to implementing SLOs effectively.

It’s important to note that the concepts surrounding SLOs are not new. They are well defined in Google’s comprehensive SRE book. For more detail, and to verify the concepts discussed below, refer to its chapter on Service Level Objectives.

First, we will define each concept, and then we will illustrate them, using a simple, relatable example.

Concepts

Service Level Indicators (SLIs)

SLIs are the metrics employed to assess the quality of our service as perceived by our users. Accurately identifying and measuring SLIs is fundamental to setting meaningful SLOs and ultimately ensuring customer satisfaction.

For example, an SLI could be the time taken to process a request (latency) or the frequency of errors in a service (error rate).
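To make the two example indicators concrete, here is a minimal Python sketch that derives them from a handful of request records. The `Request` type, its field names, and the sample values are hypothetical and only serve the illustration; real SLIs would typically come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request, for illustration only."""
    latency_ms: float
    succeeded: bool


def error_rate(requests):
    """Fraction of requests that failed, between 0.0 and 1.0."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.succeeded)
    return failed / len(requests)


def fraction_faster_than(requests, threshold_ms):
    """Fraction of requests processed in less than threshold_ms."""
    if not requests:
        return 1.0
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# Made-up sample: four requests, one failure, one slow response.
sample = [
    Request(latency_ms=120, succeeded=True),
    Request(latency_ms=280, succeeded=True),
    Request(latency_ms=450, succeeded=True),
    Request(latency_ms=90, succeeded=False),
]

print(error_rate(sample))                 # 0.25
print(fraction_faster_than(sample, 300))  # 0.75
```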

Service Level Objectives (SLOs)

SLOs are the targets that we set for our SLIs. They define the level of service we aim to provide in order to make our users happy. In essence, they represent the goals that our service aspires to achieve in terms of performance and reliability.

For example, an SLO could be that “99.99% of requests to a service should be processed in less than 300ms”.
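As a rough sketch, checking an SLO boils down to comparing a measured SLI against its target. The function name and the request counts below are assumptions chosen to mirror the 99.99% example; they are not taken from a real system.

```python
def slo_is_met(good_events, total_events, target):
    """Return True when the share of good events reaches the target.

    target is a fraction, e.g. 0.9999 for 99.99%.
    With no events at all, there is nothing to breach.
    """
    if total_events == 0:
        return True
    return good_events / total_events >= target


# Hypothetical counts: requests processed in less than 300ms vs. all requests.
print(slo_is_met(good_events=9_999, total_events=10_000, target=0.9999))  # True
print(slo_is_met(good_events=9_998, total_events=10_000, target=0.9999))  # False
```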

Service Level Agreements (SLAs)

SLAs are formalized agreements between the service provider and the user. They outline the expected level of service, including the SLOs, and often come with penalties for the service provider if the SLOs are not met.

It is important to note that SLAs are legally binding agreements, and failing to meet the agreed-upon levels of service can have financial or other contractual repercussions.

SLAs can be the same as SLOs, or more relaxed. For example, an SLA could be that “99.9% of requests to a service should be processed in less than 300ms”.

Error Budget

The error budget represents the amount of acceptable downtime or errors for a service over a specific period, calculated based on the SLO.

As the service operates, any downtime or errors will consume the error budget. If the service operates perfectly, the error budget will remain intact.

Once the error budget is depleted, it indicates that the service is not meeting its agreed-upon objectives (meaning the SLO has been breached).
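For a time-based view, the error budget can be expressed as the downtime an SLO tolerates over a period. The sketch below assumes a 99.9% availability target over 30 days; both values, and the function names, are illustrative and do not describe an actual tb.lx objective.

```python
from datetime import timedelta


def allowed_downtime(slo_target, window):
    """Total downtime the SLO tolerates over the given window."""
    return window * (1.0 - slo_target)


def remaining_budget_fraction(slo_target, window, downtime_so_far):
    """Share of the error budget still unspent (negative once breached).

    Assumes a target below 100%, otherwise the budget is zero.
    """
    budget = allowed_downtime(slo_target, window)
    return (budget - downtime_so_far) / budget


window = timedelta(days=30)
budget = allowed_downtime(0.999, window)
print(budget.total_seconds() / 60)  # roughly 43.2 minutes

# After 10 minutes and 48 seconds of downtime, about 75% of the budget is left.
spent = timedelta(minutes=10, seconds=48)
print(remaining_budget_fraction(0.999, window, spent))
```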

Burn Rate

The burn rate is the rate at which the error budget is being consumed.

If the error budget is consumed too quickly (e.g., half of the monthly error budget is consumed in a day), this indicates that the service is not operating as expected and corrective actions may be needed.

This is especially useful for operations because you can build alerting on top of it. For more details on how to build alerting on burn rates, refer to this Google SRE workbook.
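One common way to express the burn rate is as a multiplier: a value of 1 means the budget would last exactly until the end of the window, and higher values mean it runs out earlier. The sketch below applies this to the example of half a monthly budget consumed in one day. The function names are hypothetical, and the projection assumes consumption continues at a constant rate.

```python
def burn_rate(budget_consumed_fraction, elapsed_days, window_days):
    """How fast the budget is consumed relative to an even spend.

    1.0 means the budget lasts exactly the whole window.
    """
    return budget_consumed_fraction * window_days / elapsed_days


def days_to_exhaust_full_budget(rate, window_days):
    """Days a full budget would last at a constant burn rate."""
    return window_days / rate


# Half of a 30-day error budget consumed in a single day.
rate = burn_rate(budget_consumed_fraction=0.5, elapsed_days=1, window_days=30)
print(rate)                                   # 15.0
print(days_to_exhaust_full_budget(rate, 30))  # 2.0
```

An alert could then fire when the burn rate stays above a chosen threshold; the threshold itself is a judgment call for each service.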

Rolling Windows

SLOs are usually associated with a rolling window, which is a continuously moving time window. For example, if we have a rolling window of 30 days, then at any given point, we are calculating the Availability and Error Budget based on the data from the past 30 days.

Using large rolling windows, such as 30 days or 90 days, for SLO calculation helps smooth out short-term fluctuations, avoid abrupt changes and unnecessary alerts, and provide a more stable and consistent view of service quality compared to instantaneous monitoring over very small windows of time.
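A rolling window can be sketched with a bounded queue of daily counts: adding a new day automatically drops the oldest one. The class name and the numbers below are made up for illustration, and a 3-day window is used only to keep the example short.

```python
from collections import deque


class RollingAvailability:
    """Availability over the most recent window_days daily buckets."""

    def __init__(self, window_days):
        # A deque with maxlen discards the oldest entry when full.
        self._days = deque(maxlen=window_days)

    def record_day(self, good, total):
        """Store one day's count of good and total events."""
        self._days.append((good, total))

    def availability(self):
        good = sum(g for g, _ in self._days)
        total = sum(t for _, t in self._days)
        return good / total if total else 1.0


tracker = RollingAvailability(window_days=3)
tracker.record_day(good=50, total=100)   # a bad day
tracker.record_day(good=100, total=100)
tracker.record_day(good=100, total=100)
before = tracker.availability()          # 250 / 300, the bad day still counts

tracker.record_day(good=100, total=100)  # the bad day leaves the window
after = tracker.availability()
print(before, after)
```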

Illustrating the concepts: a practical example

To better illustrate these concepts, which can seem a bit abstract, let’s consider a straightforward example. Imagine this article as a “service” and you, the readers, as its users. Given that this article is centered around SLOs, one approach to satisfy you would be to correctly write ‘SLO’ throughout the article (well, for the most part).

Let’s take the following illustration of one attempt to write SLO correctly many times:

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLX SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

SLO SLO SLO SLO SLO SLO SLO SLO SLO

Service Level Indicators (SLIs)

There are two straightforward metrics: the total number of attempts to write ‘SLO’ and the number of incorrect attempts (meaning, any instance where the result is not ‘SLO’).

In this case, there were 144 attempts (16 rows multiplied by 9 columns) and only 1 error.
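The same counts can be reproduced with a few lines of Python. The grid below is rebuilt by hand from the illustration (16 rows of 9 attempts, with the typo in the third position of the eleventh row); the variable names are only illustrative.

```python
# Rebuild the illustration: 16 rows of 9 attempts each.
rows = [["SLO"] * 9 for _ in range(16)]
rows[10][2] = "SLX"  # the single typo, row 11, column 3

# Flatten the grid and derive the two SLIs.
attempts = [word for row in rows for word in row]
total_attempts = len(attempts)
incorrect_attempts = sum(1 for word in attempts if word != "SLO")

print(total_attempts)      # 144
print(incorrect_attempts)  # 1
```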

Service Level Objectives (SLOs)

Earlier, we defined our service level somewhat vaguely as “one approach to satisfy you would be to correctly write ‘SLO’ throughout the article, well, for the most part.”

We can make this more precise using the SLIs we just defined: 99% of our attempts to write ‘SLO’ should be successful.

Availability

Now that we have defined our SLO, let’s try to understand if our example is breaching the SLO we just defined.

First, let’s compute the availability, which is the ratio between the successful attempts and the total number of attempts.

In our example, there were 144 attempts to write ‘SLO’ and only 1 of them was incorrect (written as ‘SLX’). So, the availability is calculated as follows:

Successful attempts = 144 - 1 = 143

Total attempts = 144

Availability = Successful attempts / Total attempts

= 143 / 144

= 99.3056%

Since the availability (99.3056%) is greater than the objective (99%), we have not breached our SLO.
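The availability check above can be expressed as a short Python sketch. The function name is an assumption made for this example; the numbers are the ones from the illustration.

```python
def availability(successful, total):
    """Ratio of successful attempts to total attempts."""
    if total == 0:
        raise ValueError("availability is undefined without attempts")
    return successful / total


total_attempts = 144
incorrect_attempts = 1
slo_target = 0.99  # 99% of attempts should be successful

value = availability(total_attempts - incorrect_attempts, total_attempts)
print(f"{value:.4%}")       # 99.3056%
print(value >= slo_target)  # True, the SLO is not breached
```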

Error Budget

Now, let’s try to compute the error budget. The error budget represents the number of acceptable errors for a service. The acceptable number of errors can be calculated as:

Acceptable Errors = Total Attempts * (100% - SLO)

= 144 * (100% - 99%)

= 144 * 1%

= 1.44

So, the acceptable number of errors for our example is 1.44, which means we can afford to have 1 incorrect attempt (since we can’t have a fraction of an attempt) before breaching our SLO.

We can define the “remaining errors” as the difference between the acceptable errors and the actual errors.

Remaining Errors = Acceptable Errors - Actual Errors

= 1.44 - 1

= 0.44

The error budget is then computed as the ratio between the remaining errors and the acceptable errors:

Error Budget = Remaining Errors / Acceptable Errors

= 0.44 / 1.44

= 30.56%

The error budget is another way to check that our SLO hasn’t been breached; if it is greater than 0, then the SLO hasn’t been breached.

In our case, since the remaining error budget is approximately 30.56%, which is greater than 0, we can conclude that our SLO has not been breached.

Note that this way of calculating the error budget is useful for illustrating the concept, but in practice the error budget can be calculated more simply, using just the availability and the SLO:

Error Budget = (Availability - SLO) / (100% - SLO)

= (99.3056% - 99%) / (100% - 99%)

= 30.56%
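Both ways of computing the error budget can be compared in a few lines of Python. The function names are hypothetical; the sketch simply checks that the count-based and availability-based formulas agree for the example, up to floating-point rounding.

```python
import math


def error_budget_from_counts(total, errors, slo):
    """Remaining budget as (acceptable - actual) / acceptable errors."""
    acceptable = total * (1 - slo)
    return (acceptable - errors) / acceptable


def error_budget_from_availability(availability, slo):
    """Remaining budget as (availability - SLO) / (100% - SLO)."""
    return (availability - slo) / (1 - slo)


by_counts = error_budget_from_counts(total=144, errors=1, slo=0.99)
by_availability = error_budget_from_availability(143 / 144, slo=0.99)

print(f"{by_counts:.2%}")        # 30.56%
print(f"{by_availability:.2%}")  # 30.56%
print(math.isclose(by_counts, by_availability, rel_tol=1e-9))  # True
```

A second typo would bring the count-based result to (1.44 - 2) / 1.44, a negative value, which signals a breached SLO.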

The pervasive impact of SLOs

How different parts of an organization can benefit from implementing SLOs

Service Level Objectives (SLOs) have become an essential tool in the world of software engineering and operations. They provide a framework for understanding and measuring the quality of services provided. However, their impact goes far beyond just technical aspects.

Implementing SLOs affects various roles within an organization, from stakeholders and product managers to DevOps engineers. In this section, we will explore how these different personas are influenced by SLOs and how they, in turn, benefit from adopting this approach.

By aligning objectives across these roles, SLOs help the organization achieve its goals more efficiently and effectively.

Personas

While researching SLOs, we identified three personas within our organization that are most impacted by them. These personas are Stakeholders, Product Managers, and DevOps Engineers. Let’s take a closer look at each one of them.

Stakeholder

Represents individuals or groups with a vested interest in the product’s success, including executives, investors, or customers. They are primarily concerned with overall business goals, return on investment, and the product’s impact on the organization’s success.

Product Manager

Responsible for defining the product vision, strategy, and roadmap. They bridge the gap between customer needs, business goals, and technical implementation, working closely with cross-functional teams to prioritize features, enhancements, and improvements.

DevOps Engineer

Develops and implements software solutions according to project requirements and specifications. They ensure the quality, performance, reliability, and scalability of the product, monitor system performance, troubleshoot issues, and ensure the availability of the product.

The multifaceted impacts of SLOs

The implementation of SLOs can significantly impact an organization, particularly the personas we have introduced earlier.

Customer-Centric Approach

SLOs enable a customer-centric approach by setting measurable objectives tied to key user experience metrics.

This focus helps Product Managers ensure the product aligns with customer expectations and gives stakeholders confidence in the product’s ability to deliver a reliable, satisfactory user experience.

Data-Driven Decision Making

SLOs provide quantifiable performance goals that facilitate data-driven decision-making.

Product Managers can prioritize improvements based on user experience metrics, while DevOps Engineers can optimize critical areas impacting the attainment of SLOs, resulting in a more efficient and responsive product.

Incident Management & Risk Mitigation

SLOs act as a baseline for incident management, classifying critical incidents based on their potential impact on SLO attainment.

DevOps Engineers can then automate, prioritize, and respond to incidents accordingly, while stakeholders benefit from this proactive approach, which enables effective risk mitigation, reduces downtime, and safeguards the product’s reputation.

Resource Optimization

SLOs guide resource allocation decisions, enabling product teams to invest resources where they can have the most significant impact on meeting service objectives.

DevOps Engineers and Product Managers have metrics that help them understand where optimization can happen and how optimizations impact a system. This gives stakeholders confidence that resources are efficiently allocated, ensuring a reliable product without unnecessary overspending.

Takeaways

In conclusion, establishing clear and meaningful Service Level Objectives (SLOs) is pivotal for maintaining high service quality, optimizing resources, managing incidents, and ensuring customer satisfaction.

Accurate identification of Service Level Indicators (SLIs), setting achievable SLOs, and monitoring error budgets are essential. This not only ensures that the different roles within the organization are aligned but also aids in meeting the organizational goals more effectively.

Take time to understand the specific needs of your service and customers and use this understanding to set your SLOs. A well-constructed SLO framework is not just about setting targets but is a critical tool for achieving operational excellence and customer satisfaction.


In the next article of this series, we will delve deeper into how to choose the right Service Level Indicators, define meaningful Service Level Objectives, and discuss the tools available to implement SLOs efficiently. Stay tuned!

This article was written by Adrien Bestel, Principal Ops Engineer @ tb.lx, the digital product studio for Daimler Truck 💚


