Making SLOs Fast

How we created a tool that makes practicing Site Reliability Engineering (SRE) easier.

Jim van der Waal
OneFootball Tech

This article was written collaboratively by Dominik Garcia (Software Engineer) & Jim van der Waal (Product Manager), who were both part of the team that developed the SLO Creation Tool.

In recent years the SRE movement has gained a lot of traction. After the publication of Google’s Site Reliability Engineering book, the practices spread in no time, and with very good reason! We don’t want to get into the theory of SRE here, but rather tell you how we put it into practice at Onefootball.

We noticed that it is easy to talk about SRE, but actually practicing it and embedding it in your way of working is much harder. In this article we share why and how we created a tool that makes creating SLOs simple for all product teams. It makes it easy for every team to adopt the most important practice of SRE, defining Service Level Objectives (SLOs), and so helps to drive the SRE mindset.

Onefootball & the Core team

At Onefootball, performance plays an important role. With over 11 million monthly users all across the world, there is a significant load, and we need to build products and services that scale. This created an increasing urgency to drive SRE practices within Onefootball.

At Onefootball we work with autonomous product teams, each responsible for building and running their applications. This brings a lot of benefits, but it also has its challenges: each team has a wide set of responsibilities, with reliability being (just) one of them, and not every team is (or has) an SRE expert.

This is where the Core team comes in. It has always been hard to label the Core team, but when we started the team we formalized our mission as: “We enable teams to build a reliable and scalable Onefootball platform.” So how do you do that?

We believed that providing the proper tools was just a small component. An important part of the team’s mission was sharing knowledge about SRE and creating awareness of the importance of building reliable applications. You can have the best tools, but without the culture they will not have any effect on your customers.

Tools vs. Culture, a diagram we used to illustrate that it is not all about having the best tools.

Origin of the idea

We started by setting a good example: defining and implementing SLOs for the applications we owned as the Core team. Practice what you preach. What we found out is that the theory of Service Level Objectives is relatively simple and easy to understand. Implementing it yourself is another story, where the devil is definitely in the details.

We had many (good) discussions within the team about the meaning of percentiles, the use of error budgets, rolling versus fixed time windows, and so on. Then came the practical discussions on how to translate this into something we could actually start monitoring with our tools. Although these were good discussions, we knew that if we really wanted to change the mindset, we needed to make it easy.

That is why we started to think about how we could make the creation of SLOs easier. We still believed this was the best starting point for teams to get started with SRE practices and to build awareness of the reliability of their applications. And if it were easy and only took a couple of minutes, what would be stopping them?
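To make the percentile part of those discussions concrete, here is a minimal Python sketch. The latency samples are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank method; it is not taken from the tool itself.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Made-up response times in milliseconds; one request is very slow.
latencies_ms = [80, 90, 95, 100, 105, 110, 120, 130, 150, 2000]

mean_ms = sum(latencies_ms) / len(latencies_ms)  # pulled up by the outlier
p50_ms = percentile(latencies_ms, 50)            # the "typical" request
p90_ms = percentile(latencies_ms, 90)
p99_ms = percentile(latencies_ms, 99)            # surfaces the slow request
```

With these samples the mean is dominated by a single outlier while the median hides it completely, which is one reason the choice of percentile deserves a team discussion.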

SLO Creation Tool

We tried to think of a way to use the existing theory but make it more accessible, drawing on our own experience to make the creation of SLOs a walk in the park. Introducing: the SLO Creation Tool.

The Requirements

Once we started to discuss the actual implementation we decided that the final solution needed to address a few key points:

  • Short and sweet. It should be simple enough that anyone in Product & Engineering can create an SLO. It was clear to us that if only product managers or only engineers could use the tool, it would not work out, because everyone should be involved in the process. It should also not take a lot of time to create an SLO. We knew that if people felt productive with it, they would be more likely to use it again.
  • Integration with our tech stack. It was really important to us that the SLO Creation Tool would be able to use tools that we already had to gather the required information and create the SLOs. This way we were not imposing a new tool for the team to adopt, making it easier to integrate into existing ways of working.
  • Review process. As we mentioned previously, we wanted everyone to be involved in the process. Therefore, we decided that there should be a review process for SLOs so that the team that created the SLO could discuss it and make changes until they were happy with the result.
  • SLI Menu. We got this idea from Google’s workshop, The Art of SLOs. For us, it meant making the selection of an SLI as simple as choosing a dish from a restaurant menu.

The Solution

We are proud of the solution we created, and we feel that it addresses the key points mentioned above. In general, it took less than 5 minutes to create an SLO (excluding the review and discussion). Something we believe was key to this is that the tool is divided into multiple steps. This allowed us to give users a clear explanation of what they should do at each point, and made sure they would not feel overwhelmed by all the information in front of them.

Introduction. In the next image, you can see the introduction to the tool. In this step, we could give users a good overview of the tool and they could see there were only 5 steps between them and a brand new SLO.

Basic info. In the first step, shown in the next image, we just ask for the basics: application name and user flow name. The cool thing about this step is that we retrieve a list of all the applications that have data on New Relic (our performance tracking solution) so you can easily pick the application you want to create an SLO for.

SLI Menu. As mentioned before, the SLI Menu was a key focus point of the solution. First, we needed users to understand what an SLI is and how to choose one. It was also very important for us not to be too ambitious by adding too many SLIs at first, which would probably have decreased the quality of the application. Therefore, we went with Latency and Availability as our first SLIs, since we believe these are the easiest to understand for anyone who is just getting started.

As you can see in the next picture, we also added a short but precise high-level description, which together with the UI makes it feel like an SLI Menu.

Events. After the SLI Menu comes the last setup step before actually creating the SLO. In the Events step, users select the HTTP methods and paths that the SLO is measured against.
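As a rough illustration of how the selected events feed the two SLIs, consider the following Python sketch. Everything in it is an assumption made for the example: the `Event` shape, the sample paths, the prefix matching, and the definition of a "good" request (no 5xx status for availability, a response within a threshold for latency). The tool itself reads this data from New Relic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One HTTP request (hypothetical shape, for illustration only)."""
    method: str
    path: str
    status: int
    duration_ms: float


def select_events(events, methods, path_prefixes):
    """Keep only events whose HTTP method and path were picked in the Events step."""
    wanted = {m.upper() for m in methods}
    prefixes = tuple(path_prefixes)
    return [e for e in events
            if e.method.upper() in wanted and e.path.startswith(prefixes)]


def availability_percent(events):
    """Share of requests without a server error (assumption: status < 500 is good)."""
    if not events:
        raise ValueError("no events selected")
    good = sum(1 for e in events if e.status < 500)
    return 100 * good / len(events)


def latency_percent(events, threshold_ms):
    """Share of requests answered within threshold_ms."""
    if not events:
        raise ValueError("no events selected")
    fast = sum(1 for e in events if e.duration_ms <= threshold_ms)
    return 100 * fast / len(events)


events = [
    Event("GET", "/v1/matches", 200, 120),
    Event("GET", "/v1/matches/42", 200, 480),
    Event("GET", "/v1/matches", 500, 30),
    Event("POST", "/v1/matches", 201, 90),
    Event("GET", "/health", 200, 5),
    Event("GET", "/v1/matches", 404, 60),
]

selected = select_events(events, methods=["GET"], path_prefixes=["/v1/matches"])
availability = availability_percent(selected)
latency = latency_percent(selected, threshold_ms=300)
```

Whether a 404 counts as a good request is exactly the kind of detail a team has to agree on; the sketch simply treats every non-5xx response as good.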

Choosing the SLO. This was by far the step that took us the most time to get right. It needed to encapsulate the previous information, follow the SLO specification that we defined, and ask for the information required for each SLI. To achieve this, we decided to split this step into substeps when more than one SLI was selected. This decision allowed us to add more meaningful information for each SLO.

As you can see in the image above, below the input field there is a sentence that contains the whole SLO, putting everything together in a very visual way.
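The idea of one sentence that contains the whole SLO can be sketched in a few lines of Python. The field names, the wording of the sentence, and the sample values below are hypothetical; the article does not show the SLO specification the tool actually uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO specification, collected over the previous steps."""
    application: str
    user_flow: str
    sli: str                      # "availability" or "latency"
    target_percent: float
    window_days: int
    threshold_ms: Optional[int] = None  # only used for latency


def slo_sentence(slo):
    """Render the whole SLO as one human-readable sentence."""
    if not 0 < slo.target_percent < 100:
        raise ValueError("target must be above 0% and below 100%")
    if slo.sli == "latency":
        if slo.threshold_ms is None:
            raise ValueError("a latency SLO needs a threshold")
        condition = f"are served within {slo.threshold_ms} ms"
    elif slo.sli == "availability":
        condition = "are served successfully"
    else:
        raise ValueError(f"unknown SLI: {slo.sli}")
    return (
        f"{slo.target_percent:g}% of requests in the '{slo.user_flow}' flow "
        f"of {slo.application} {condition}, "
        f"measured over a rolling {slo.window_days}-day window."
    )


example = Slo("news-api", "Open match page", "latency", 99.0, 28, threshold_ms=300)
sentence = slo_sentence(example)
```

Reading the sentence back is a quick sanity check for the team: if it sounds wrong out loud, the SLO probably is.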

The next image shows something crucial about SLOs: 100% is never a good choice. We wanted to share this statement at the moment it mattered most, linking to a section of Google’s SRE Workbook about Reliability targets and Error budgets.
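The arithmetic behind that statement is short enough to sketch in Python. The function names and sample numbers are ours, chosen for illustration; `Fraction` is used only to avoid floating-point rounding in the example.

```python
from fractions import Fraction


def error_budget_fraction(target_percent):
    """Fraction of events that may be bad while still meeting the target."""
    target = Fraction(str(target_percent))
    if not 0 < target <= 100:
        raise ValueError("target must be in (0, 100]")
    return (100 - target) / 100


def allowed_bad_requests(target_percent, total_requests):
    """Number of requests that may fail in the window (rounded down)."""
    return int(error_budget_fraction(target_percent) * total_requests)


def allowed_downtime_minutes(target_percent, window_days):
    """Minutes of full outage the window can absorb."""
    return float(error_budget_fraction(target_percent) * window_days * 24 * 60)


budget_999 = allowed_bad_requests(99.9, 1_000_000)   # 1000 requests
downtime_999 = allowed_downtime_minutes(99.9, 30)    # 43.2 minutes
budget_100 = allowed_bad_requests(100, 1_000_000)    # 0: no room for any failure
```

A 100% target leaves an error budget of zero, so a single failed request breaks the SLO and there is no room left for deployments, experiments, or failing dependencies.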

Review. The only thing remaining was showing the user a preview of their SLO, before they submitted it for review. We decided that the simplest solution for reviewing was to use GitHub. It was already integrated with the workflow of Product & Engineering and it has strong review capabilities.

Having an SLO on GitHub allowed the team to discuss it and improve it together until everyone was comfortable with it. This process ensures that the whole team is involved, which matters because there can be many different ways of measuring an SLI and, above all, because it is important to be in sync when setting an error budget.

Push to Dashboard. Last but definitely not least, once the SLO was merged, the tool allowed users to create a dashboard for it on New Relic with just a single push of a button. Having a way to visualize an SLO is what really made the teams feel that they achieved something. They could monitor how the application was actually performing against their SLO and tweak the values if necessary.

This was also the most important step for making sure the SLO kept providing value and was not just a one-time thing. As a dashboard that can be easily reviewed and displayed in all kinds of places, it really earned a place in the teams’ ways of working.
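We do not show the tool’s actual queries here, but a dashboard widget for a latency SLO could be backed by an NRQL query along the lines of the Python sketch below. It is an assumption-heavy illustration: the function name and sample values are hypothetical, and it assumes the application reports `Transaction` events carrying `appName`, `duration` (in seconds), `request.method`, and `request.uri`, which depends on the agent and its configuration.

```python
def latency_slo_nrql(app_name, methods, path_prefix, threshold_seconds, window_days):
    """Build an NRQL query string for the share of fast requests (illustrative sketch)."""
    # Keep the sketch safe and simple: refuse values that would break the quoting.
    for value in [app_name, path_prefix, *methods]:
        if "'" in value:
            raise ValueError("single quotes are not supported in this sketch")
    method_list = ", ".join(f"'{m.upper()}'" for m in methods)
    return (
        f"SELECT percentage(count(*), WHERE duration <= {threshold_seconds}) "
        f"FROM Transaction "
        f"WHERE appName = '{app_name}' "
        f"AND request.method IN ({method_list}) "
        f"AND request.uri LIKE '{path_prefix}%' "
        f"SINCE {window_days} days ago"
    )


query = latency_slo_nrql("news-api", ["get"], "/v1/matches", 0.3, 28)
```

The resulting string can be pasted into a New Relic query builder to check that it returns a single percentage before it is wired into a dashboard.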

Results

Looking at what we wanted to achieve, we definitely succeeded. But as with any software that you develop, there are some things you are proud of and less proud of.

Proud of…

Within just one month, we were able to create a tool that enabled teams to create SLOs and generate corresponding dashboards in a matter of minutes.

This also paid off in the number of SLOs that were created and the dashboards you would see around the office floor. Because every team had been given a TV for monitoring purposes, SLOs and performance graphs started popping up on the screens, which helped to create awareness about reliability.

An example of the SLO dashboard of the AdTech team.

Now the team reviews the dashboard at least once a week and sees whether any action needs to be taken. This is easy: if it is red (so below the SLO), it needs attention. There is no discussion, because all the bickering about whether something is (too) slow, or whether it is really that bad for the user, already happened when the team defined the SLO.
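The weekly review rule is simple enough to write down as a Python sketch. The measured values and targets below are invented for the example and do not come from a real dashboard.

```python
def slo_status(measured_percent, target_percent):
    """Red means the measured SLI is below the agreed SLO target."""
    return "green" if measured_percent >= target_percent else "red"


# Hypothetical dashboard rows: SLI name -> (measured %, target %).
dashboard = {
    "availability": (99.95, 99.9),
    "latency": (98.7, 99.0),
}

needs_attention = [
    name
    for name, (measured, target) in dashboard.items()
    if slo_status(measured, target) == "red"
]
```

All the judgment sits in the targets that were agreed on beforehand, which is why the review itself needs no debate.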

Less proud of…

While we put a lot of effort into advocating SRE, it took quite some time, and as a team we still think we should improve. In hindsight, our efforts fell short because we took a pretty bottom-up approach, advocating towards our peers (the other teams) without much pressure from “above”. Only after we created the tool and the creation of SLOs was acknowledged in our department goals did it really take off.

Another thing is that the tool is highly scalable, but the number of teams and services that use it is limited. Onefootball is not the biggest company, so such a sophisticated tool might be overkill. But we do believe that the investment was small enough to justify it. And who knows, maybe other companies can take some learnings from it and expand on the idea, or maybe in the future we can even think about open-sourcing it.

Conclusion

We hope this article gave you some insight into, and inspiration from, how we improved the SRE mindset at Onefootball and how we use SLOs to ensure a reliable experience for our users. To wrap up, we would like to end with the following key takeaways:

  • SLOs can be complex, but it is possible to make them understandable for everybody. We recommend starting SLOw: begin with a couple of SLIs and maybe the “easier” applications. That is why, for example, we started with the HTTP-focused SLIs and the backend services that already had some instrumentation implemented.
  • If you want to change a mindset, you need to make things easy. Before we introduced the tool we had quite a few discussions with teams, and although they fully understood the concept, it was still too complex for them to actually “implement” it and make it part of their way of working. By using existing tools and integrations we were able to connect the dots for them.
  • For any change you need “management” support. Bottom-up change is admirable, but also very hard. You need to address it from both sides. In the end, a (top-down) goal was set for P&E to define SLOs for all critical services. This created at least some urgency and showed that management supported it. All in all, we believe it was still (too) limited to really move towards an SRE mindset, but it is perhaps fair to say that for Onefootball it was not the biggest problem we had.

Always share what you think, and let us know if you are interested in knowing more about the tool!

Jim van der Waal
OneFootball Tech

Head of Product at Polarsteps | Always up for using creativity and making complex problems simple(r) and fun!