SLOs at BlaBlaCar
Discover how we set up SLOs at BlaBlaCar and the steps we followed from design all the way to technical implementation.
BlaBlaCar operates a web-based application serving over 90 million members and aims to become the go-to marketplace for shared road mobility.
Being able to monitor the platform from a quality standpoint is key to supporting our user experience and informing business decisions. At BlaBlaCar, the SRE team is a cross-functional team designed to provide tooling and expertise to empower service teams on the “you built it, you run it” path. As part of its mission, the SRE team is responsible for the observability stack and monitoring frameworks.
For years, the engineering teams simply tracked infrastructure and application metrics, raising alerts when errors occurred… but as we moved from an on-premise monolithic architecture to a cloud-hosted microservice architecture, the context changed and called for a new pattern.
With a service-oriented architecture, we needed each of the distributed services to define a commitment towards the rest of the platform. By doing so, we wanted to help service teams balance delivery velocity against reliability: that is exactly what Service Level Objectives (SLOs) are intended for, as they define a threshold describing the expected “quality” of a service.
To evaluate how SLOs could improve our reliability and how to implement them at BlaBlaCar, the SRE team bootstrapped a working group. In this article, I’ll describe how we set up SLOs at BlaBlaCar, step by step, from the initial discussions to the technical implementation.
Before going further, let’s start with some definitions:
- A Service Level Indicator (SLI) is what you actually measure. Examples: availability, latency.
- A Service Level Objective (SLO) is the target you set internally. It is defined from an SLI, with a target over a period of time. Example: 99.9% availability over 1 month (the so-called “three nines”).
- A Service Level Agreement (SLA) is an agreement that includes penalties if the objectives are not met. We don’t use them at BlaBlaCar.
- An Error Budget is a clear, objective metric that determines how unreliable a service is allowed to be within a single SLO window. For example, a 99.9% availability objective over a 30-day window leaves a budget of roughly 43 minutes of unavailability.
BlaBlaCar’s service teams mainly consume HTTP backend services, so we started with two easy SLOs: HTTP availability and latency. The initial objective was to select a few services, implement SLOs, build dashboards and let the system run for a couple of months… So how did we get started?
It started with a chat
These two questions were our starting point:
- Who are your users? Is your service consumed by another service or by an end user?
- What are your users facing? What are their expectations about this particular service?
The key here is to talk with product managers, users, and developers to understand how the service is built and consumed. Thanks to these discussions, you will be able to identify the service boundaries and select a couple of SLIs to start with. Don’t hesitate to write out a literal sentence explaining what the SLI is supposed to measure: it helps to check that the SLI actually means something to the users, which is key.
There is no need to be exhaustive at this early stage: start small and iterate over time to improve the initial coverage.
We implemented fast
We had identified a couple of SLIs, but how were we going to measure them?
Speaking of HTTP SLIs, we had several options: capture the metrics from the load balancer, from the service mesh, or from the application itself. Since the SLOs were going to be tracked at a company level, we needed a consistent way to measure HTTP requests across all services. In this context, application metrics were not consistent enough, leaving us with two options:
- The load balancers. They are a great option, but in our context they only provide logs, and we didn’t want to manage a whole log-processing pipeline just to create metrics.
- The service mesh, Istio. It provides rich and consistent observability metrics, and it is already deployed and supported by the Foundations team… let’s keep it simple and use it!
Using Istio metrics, we built a few PrometheusRules (generated with jsonnet) to compute our SLIs, along with a bunch of useful labels that allow us to graph and combine our SLOs.
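As an illustration (not our exact rule), such a recording rule could look like the sketch below, based on Istio’s standard istio_requests_total metric — the record name, the 90% threshold and the extra labels are only examples:

```yaml
# Illustrative sketch only — names, threshold and labels are examples.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-http-availability
spec:
  groups:
    - name: slo.http.availability
      interval: 1m  # one evaluation per one-minute time bucket
      rules:
        - record: sli:http_availability:good_minute
          expr: |
            (
              sum by (destination_service_name) (
                rate(istio_requests_total{reporter="destination", response_code!~"5.."}[1m])
              )
              /
              sum by (destination_service_name) (
                rate(istio_requests_total{reporter="destination"}[1m])
              )
            ) >= bool 0.90
          labels:
            sli: availability
            team: example-team  # extra labels help graph and combine SLOs
```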
The above rule is evaluated each minute and computes a time series that indicates whether the service is available or not.
Depending on the service, you will need to carefully select which traffic counts: 4xx and 3xx HTTP codes can be either legitimate or erroneous traffic.
From a user perspective, 4xx codes are errors, but are they errors the provider should be held responsible for? If, for instance, a request is rejected because of an expired token, a 4xx is a normal response that can’t be considered a failure.
As for 3xx codes: those responses are usually very fast, so including them could misrepresent the latency users actually experience.
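To make this concrete, here are two illustrative selectors (the status-code choices and the 5 ms bucket are examples only; the right choice depends on each service):

```yaml
# Illustrative only — which codes count as "good" is a per-service decision.
# Example: expired-token rejections (401) stay in the "good" traffic…
- record: sli:http_availability:good_requests_rate1m
  expr: |
    sum by (destination_service_name) (
      rate(istio_requests_total{reporter="destination", response_code=~"2..|3..|401"}[1m])
    )
# …while 3xx responses are excluded from the latency SLI so that fast redirects
# don't skew the user-perceived latency (le="5" matches a <5ms objective).
- record: sli:http_latency:good_requests_rate1m
  expr: |
    sum by (destination_service_name) (
      rate(istio_request_duration_milliseconds_bucket{reporter="destination", response_code!~"3..", le="5"}[1m])
    )
```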
Last but not least, it is sometimes useful to combine real user traffic with a probe… or even to rely solely on a probe to capture metrics.
For example, after implementing the HTTP SLIs/SLOs, we decided to work on MariaDB: there was some friction about the database response time that needed to be clarified. To avoid subjective feelings and rely only on facts and figures, the Database Reliability Engineering team wrote a probe to track query response time. Thanks to this probe, the SLI represents the response time of a known request. Since the complexity on the input side (the query) never changes, we are able to define a meaningful objective for the measured output (the response time).
Having a meaningful indicator helped a lot: if query performance is poor while the SLI shows the expected values… then it is the query’s complexity that needs to be reviewed.
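As an illustration only, assuming the probe exports a latency histogram under a hypothetical name such as mariadb_probe_query_duration_seconds, the corresponding SLI could be recorded along these lines (the 50 ms threshold is made up):

```yaml
# Hypothetical metric name and threshold — a sketch of a probe-based SLI:
# the share of probe runs whose known query completed in under 50 ms.
- record: sli:mariadb_probe_latency:ratio_rate5m
  expr: |
    sum(rate(mariadb_probe_query_duration_seconds_bucket{le="0.05"}[5m]))
    /
    sum(rate(mariadb_probe_query_duration_seconds_count[5m]))
```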
At this stage we had meaningful indicators, clearly defined with all stakeholders, and we were able to measure them. That’s a good start but there is still some work!
We struggled to define SLOs
Coming back to our two HTTP SLOs (latency, availability): when we started the working group, we honestly had no real clue about which targets should be set and how we would measure them. There are two classic options: volume-based or time-window-based. Picking one has a lot of implications and needs to be considered carefully.
The first option would be to define a “success ratio”. Such an SLO is basically a target percentage of requests matching the success criteria.
Doing that, we would end up with SLO statements like these:
We want 99% of requests to be good (e.g. <5 ms response time) over the month
Our monthly error budget is 1% of requests. Depending on the service’s traffic, that could be as few as 500 or as many as 3 million requests per month.
Another option is to use an “incident ratio”.
Representing the month as a series of small time buckets (one minute, for instance), we evaluate the success ratio of each bucket and count how many buckets have a success ratio above the objective.
In this case, we would end up with SLO statements like these:
99% of the time, the service must have 90% or more successful requests (e.g. <5 ms response time)
Our monthly budget allows roughly 7 hours of downtime per month
Each method has its own advantages and drawbacks. With the success-ratio method, the SLO score varies in real time and is directly correlated with user activity. A huge spike of well-handled requests yesterday will minimize the impact of an outage during a low-activity period today. That’s perfectly fair. However, it is harder to define a proper error budget policy, and thus to make decisions easily understandable by all stakeholders.
With window-based SLOs, the main drawback is that they are not proportional to the end-user impact. Using our previous example, a time bucket with a 100% failure rate weighs as much as one with a 15% failure rate. Similarly, the weight is not proportional to the number of affected users: downtime at night has exactly the same weight as downtime during a peak period. As a consequence, such SLOs can be noisy for low-traffic workloads.
After a lot of discussions, we decided to adopt time-based SLOs, as they are more meaningful for our users and stakeholders.
We turned these drawbacks into an advantage: this method is more demanding for BlaBlaCar’s engineering teams than the volume-based one. Agreeing on a statement such as “99% of the time, the service must have 90% or more successful requests” means that whenever the service drops below a 90% success rate, we consider it down for one minute, consuming the error budget. This is a healthy requirement for our system’s reliability and our end users’ experience. However, to avoid noisy alerts (and ensure smooth on-call shifts), we introduced a minimum number of queries per second that a time bucket must receive to be eligible for SLO evaluation. As a consequence, if an outage occurs during the night with only a very few bad requests, it neither violates our SLO nor wakes up an engineer.
This approach combines the best of both worlds and fits our needs.
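To give an idea of how this can be wired up, here is a sketch building on the per-minute rule shown earlier, with an illustrative minimum of 1 request per second (not our actual threshold):

```yaml
# A minute is eligible only if it carries enough traffic; the per-minute SLI
# sample (1 = meets the 90% success target, 0 = does not) is kept only then.
- record: slo:http_availability:eligible_minute
  expr: |
    sli:http_availability:good_minute
    and on (destination_service_name)
    (
      sum by (destination_service_name) (
        rate(istio_requests_total{reporter="destination"}[1m])
      ) > 1
    )
# The SLO compliance is then the share of eligible minutes that were good.
- record: slo:http_availability:ratio_30d
  expr: |
    avg_over_time(slo:http_availability:eligible_minute[30d])
```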
Finally, regarding the availability and latency thresholds, we decided to keep it simple and set the SLOs in line with the ratios actually observed on the platform. We needed to get used to this new framework without putting extra pressure on the service teams. From the beginning, we decided to set up a control loop to adjust the SLOs cooperatively and transparently.
At this stage, we had measured indicators and defined objectives, which we agreed to review regularly. But stopping here would not have changed anything in the day-to-day management of the platform.
We set alerts meaningful for service users
Nowadays, engineering teams run hundreds or even thousands of servers, building and using software designed to handle server (or even network) failures. As a consequence, it no longer makes sense to wake up the on-call folks as soon as a pod (or a server) crashes. The new paradigm is to wake up on-call engineers when the user experience is degraded: alert on symptoms (your error budget burning) rather than on a cause that might have no consequence.
The Google SRE book introduces several alerting techniques, each with its pros and cons. At BlaBlaCar, we adopted multi-burn-rate, multi-window alerting because we found it the most accurate and efficient at avoiding useless alerts while still protecting the user experience.
Actually… multi-burn-rate, multi-window alerting is simpler to set up than to advocate for 😅. I have to admit I practiced a lot before feeling really comfortable with the figures and being able to explain them fluently. To help the service teams, we explicitly wrote next to every alerting graph how the alerting is computed.
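For reference, building on the per-minute SLI sketched earlier, such an alert could look like the rule below; the windows and burn-rate thresholds follow the pattern described in Google’s SRE books for a 99% monthly objective and are illustrative, not our actual values:

```yaml
# Illustrative alert: page when the error budget of the time-bucket SLO burns
# much faster than a 99% monthly objective allows, over both a long and a short
# window (the short window avoids paging for an issue that is already over).
- alert: HTTPAvailabilityErrorBudgetBurnTooFast
  expr: |
    (
      (1 - avg_over_time(slo:http_availability:eligible_minute[1h])) / (1 - 0.99) > 14.4
      and
      (1 - avg_over_time(slo:http_availability:eligible_minute[5m])) / (1 - 0.99) > 14.4
    )
    or
    (
      (1 - avg_over_time(slo:http_availability:eligible_minute[6h])) / (1 - 0.99) > 6
      and
      (1 - avg_over_time(slo:http_availability:eligible_minute[30m])) / (1 - 0.99) > 6
    )
  labels:
    severity: page
  annotations:
    summary: "{{ $labels.destination_service_name }} is burning its availability error budget too fast"
```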
However, even if the initial presentation was difficult, a team quickly made my day by switching from their legacy alerting to the new SLO-based one!
To sum up, remember that an SLO alerting system is not intended to help you find an outage’s root cause. It is designed to alert you that a user (end user, internal service…) is having a poor experience with your service and that further investigation is required (logs, CPU & memory consumption, pod rollouts…).
Spread the culture … and go further!
As we added more and more SLOs, enriched with labels, we were able to build a nice status map of every SLO we manage.
That’s easy to do with the Flant Grafana plugin, and it gives a fancy result:
This kind of dashboard spreads the SLO culture across the company, sparks questions, and gives everyone insight into what matters for our users.
Setting up SLOs was a nice step, but we want them to be actionable, otherwise they are just like any other technical KPI. As a consequence, the SRE team is currently implementing a policy template to help business, product owners and developers have relevant conversations when SLOs are not met.
Breaking an SLO can happen. SLOs are not designed for blame or finger-pointing. Using them should help you take a step back and analyze the user experience based on facts and figures. Thanks to them, you might find out what could be done to improve that experience and have richer discussions when prioritizing the backlog.
The policy itself is actually quite simple; it is divided into two main parts:
- An agreement between the dev, business and product teams on when to slow down feature releases in favor of reliability work. Depending on your organization, it could be evaluated on a monthly or quarterly basis.
- An SLO control loop to ensure the objectives are still relevant: are the SLIs still representative? Are the SLOs ambitious enough?
Never-ending story
SLOs are a useful toolbox that you will always need to adapt; they can’t be applied as a magic recipe. They must be adjusted to your context, your business, your organization, your technologies, your culture… All of these have an impact on the implementation, and as your business, organization and technologies evolve, your SLOs will have to evolve as well.
Setting up SLOs was an awesome experience. From an SRE perspective, it allowed me to sharpen my global vision of our architecture, and it made a lot of sense to help business and engineering teams rely on facts and figures when addressing user-perceived quality.
It takes time to spread the SLO culture at every level of the company. But it is worth it.
Special thanks (in no special order) to Jean Baptiste Favre, Maxime Fouilleul, Nicolas Salvy and Blake Faulds for their sharp and kind comments in reviewing this article.