How HappyFresh Implements SLO (SLO Series — Part 1)

Riky Lutfi Hamzah
Published in HappyTech
3 min read · Sep 18, 2020

Customer satisfaction is one of the keys to success, so we prioritize it. We need to make sure every customer transaction works as expected, in terms of both service availability and performance. Our approach to measuring customer satisfaction and the reliability of our services is to implement Service Level Objectives (SLOs) for every critical user journey.

Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are well-known reliability measurements popularized by Google in their SRE books. At HappyFresh, we adopted availability and latency SLIs to track the error rate and performance of our services. The SLIs are event-based rather than time-based: each one is a ratio of good events to total events, not a ratio of good time to total time.

To determine the availability of a service, we classify each request as a successful or failed event based on its response code. For latency, we use the Apdex standard to measure the service’s performance: after setting a performance threshold, we classify each request as “satisfied”, “tolerating”, or “frustrated”.
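
As a rough illustration of the two classifications above, here is a minimal Python sketch. It is not HappyFresh's actual implementation: the `Request` type, the rule that only 5xx responses count as failures, and the sample data are all assumptions made for the example. The Apdex formula and the 4x tolerating limit come from the Apdex standard.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record type; field names mirror the access-log attributes.
    response_code: int
    duration_ms: float


def availability_sli(requests):
    """Event-based availability: successful requests / all requests."""
    # Assumption: only 5xx responses count as failed events.
    good = sum(1 for r in requests if r.response_code < 500)
    return good / len(requests)


def apdex(requests, threshold_ms):
    """Apdex score: (satisfied + tolerating / 2) / total."""
    # Satisfied: at or under the threshold T.
    satisfied = sum(1 for r in requests if r.duration_ms <= threshold_ms)
    # Tolerating: above T but at most 4T. Anything slower is frustrated.
    tolerating = sum(
        1 for r in requests
        if threshold_ms < r.duration_ms <= 4 * threshold_ms
    )
    return (satisfied + tolerating / 2) / len(requests)


# Made-up sample data for illustration only.
sample = [
    Request(200, 120.0),
    Request(200, 900.0),
    Request(500, 3000.0),
    Request(404, 80.0),
]

print(availability_sli(sample))  # 0.75
print(apdex(sample, 500))        # 0.625
```

With a 500 ms threshold, two requests are satisfied, one is tolerating, and one is frustrated, which gives (2 + 0.5) / 4 = 0.625.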

Monitoring SLI, SLO, and Error Budget

We use Scalyr as our log management platform. By forwarding each service’s access log to Scalyr, we can analyze all the transactions of our services. The access logs include attributes such as response_code and duration for each transaction, so we can calculate the ratio of successful (or well-performing) requests to all requests and use it as the availability (or latency) SLI. Scalyr PowerQueries make this calculation straightforward.

Since we implemented composite SLOs (measuring not only critical user journeys but also every backend service endpoint), we have hundreds of SLIs to monitor. We also want to see how current availability and performance compare with the SLO targets we have set. Because Scalyr integrates with Grafana, we use Grafana to visualize the results of the Scalyr PowerQueries. We chose Grafana because it has powerful data visualization capabilities and supports many data sources.

The following Grafana dashboard shows each service’s status in the current time window (7 days, starting on Sunday). It helps our teams track how much of the availability error budget has been consumed, so that our engineers can act to prevent an SLO breach by the end of the week. When we have enough error budget left, we usually spend it on running experiments, A/B tests, releasing new features, and so on. See Part 2: Alerting on error budget.

We use a 7-day time window for our SLOs

Figure: HappyFresh SLO dashboard in Grafana
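
To make the error budget arithmetic behind such a dashboard concrete, here is a small Python sketch. It assumes an event-based budget (allowed bad events = (1 - SLO target) x total events so far in the window) and a window that starts on the most recent Sunday. The function names and numbers are hypothetical, not taken from HappyFresh's dashboards.

```python
from datetime import date, timedelta


def window_start(today):
    """Most recent Sunday on or before `today` (start of the 7-day window)."""
    # date.weekday(): Monday == 0 ... Sunday == 6
    days_since_sunday = (today.weekday() + 1) % 7
    return today - timedelta(days=days_since_sunday)


def error_budget_remaining(good_events, total_events, slo_target):
    """Fraction of the error budget still unspent; negative means breached."""
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1 - actual_bad / allowed_bad


# Made-up numbers: 50 failed requests out of 100,000 against a 99.9% target.
print(window_start(date(2020, 9, 18)))                 # 2020-09-13
print(error_budget_remaining(99_950, 100_000, 0.999))  # roughly 0.5
```

In this example the target allows about 100 failed requests, and 50 have occurred, so about half of the budget is left for the rest of the week.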

Monitoring SLO Compliance

We also built a dashboard to monitor SLO compliance in previous weeks. It shows the history of each service’s availability and performance, and we use it for reporting to upper management.

Because our data retention in Scalyr is short (less than 30 days), we created a small service, which we call SLO Calculator, that fetches SLI data from Scalyr PowerQueries and ingests it into a dedicated Postgres database. The service is scheduled to run at the end of every week.
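
The shape of such a weekly job might look like the sketch below. This is an illustration under stated assumptions, not the actual SLO Calculator: `fetch_weekly_sli` is a hypothetical stub standing in for a Scalyr PowerQuery call, the table layout and service name are invented, and an in-memory SQLite database is swapped in for Postgres so the example is self-contained.

```python
import sqlite3
from datetime import date


def fetch_weekly_sli(service, week_start):
    """Hypothetical stub standing in for a Scalyr PowerQuery call.

    Returns (good_events, total_events); the values here are made up.
    """
    return 99_950, 100_000


def store_compliance(conn, service, week_start, good, total, slo_target):
    """Persist one week's SLI and whether it met the SLO target."""
    sli = good / total
    conn.execute(
        "INSERT INTO slo_compliance"
        " (service, week_start, sli, slo_target, met)"
        " VALUES (?, ?, ?, ?, ?)",
        (service, week_start.isoformat(), sli, slo_target,
         int(sli >= slo_target)),
    )
    conn.commit()
    return sli


# SQLite stands in for Postgres here; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE slo_compliance ("
    " service TEXT, week_start TEXT, sli REAL, slo_target REAL, met INTEGER,"
    " PRIMARY KEY (service, week_start))"
)

week = date(2020, 9, 13)
good, total = fetch_weekly_sli("checkout-api", week)
store_compliance(conn, "checkout-api", week, good, total, slo_target=0.999)
```

Storing one row per service per week keeps the history independent of the log platform's retention, and a dashboard can then query the table directly.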

Figure: SLO compliance report dashboard in Grafana

There is no silver bullet for implementing SLOs

There are many different approaches to implementing SLOs. Every organization has its own way of instrumenting logs, constructing SLIs, setting SLO targets, and so on. The challenges we faced were how to define and measure the SLIs, which tools could support them, and how to monitor and operationalize the SLOs. Fortunately, the hard work and commitment of our engineering teams made the SLO implementation at HappyFresh go smoothly.

This is Part 1 of the SLO Implementation Series at HappyFresh. Leave a 👏 if you enjoyed reading it. We’re also hiring engineers to join us in helping households around Southeast Asia get their groceries easily. If you want to know more, visit HappyFresh Tech Career.

Riky Lutfi Hamzah
HappyTech

Engineering Manager — Reliability & Security at HappyFresh. Writing some thoughts at rilutham.com.