Operationalize SLO (SLO Series — Part 3)

Riky Lutfi Hamzah
Published in HappyTech
Nov 16, 2020 · 3 min read

This is Part 3 of the HappyFresh SLO series. Read Part 1 for an overview of how HappyFresh implements Service Level Objectives (SLOs) and Part 2 for how we stay alert on the error budget.

The final stage of implementing Service Level Objectives is to build a long-term process that gets value out of them.

Having a bi-weekly meeting with cross-functional tech leads and the CTO to review SLOs

We’re not going to get it right the first time, and that’s okay. We need an iterative mindset to arrive at the right SLOs, thresholds, and teams. Patience and persistence are important.

We already have internal operational review meetings where we look at our key reliability metrics such as the number of incidents, incident retrospective completion, and customer-reported issues. After implementing SLOs, we also have a bi-weekly session for reviewing our objectives through the SLO dashboard.

Photo by Chris Montgomery on Unsplash

Capturing commentary and discussion around SLO violations or trends

Tech leads go through their service’s SLO dashboard and describe what happened in the past two weeks, including what has already been done and what they will do to stay compliant with their SLOs.

By capturing commentary when SLO violations occur, we can correlate each violation with the events, changes, and incidents around it. Discussing SLO trends also helps us prevent future breaches.

Capturing, assigning, and tracking follow-up action items from SLO violations

When we violate an SLO, we are affecting our users and customers. SLO breaches must be treated as incidents, with the right severity levels defined for them.
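One way to make severity levels concrete is to tie them to the burn rate, which is the observed error ratio divided by the error ratio the SLO allows. The sketch below is only an illustration: the function names, the severity labels, and the thresholds are assumptions, not the values HappyFresh uses.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target  # allowed error ratio, e.g. 0.001 for a 99.9% SLO
    if not 0.0 < budget < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / budget


# Hypothetical thresholds and labels, for illustration only.
SEVERITY_THRESHOLDS = [(10.0, "SEV1"), (2.0, "SEV2"), (1.0, "SEV3")]


def severity_for(rate: float) -> str:
    """Map a burn rate to the first severity whose threshold it exceeds."""
    for threshold, label in SEVERITY_THRESHOLDS:
        if rate > threshold:
            return label
    return "within budget"


if __name__ == "__main__":
    rate = burn_rate(error_ratio=0.005, slo_target=0.999)
    print(f"burn rate {rate:.1f} -> {severity_for(rate)}")
```

Keeping the thresholds in one table makes them easy to revisit in the bi-weekly review as the team learns which breaches actually hurt users.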

We conduct a postmortem meeting for every major incident, not only to find its root cause but also to decide on the follow-up action items that will prevent the incident from happening again.

Negotiating Technical Work Versus Development Velocity

With visibility into the error budget and SLO trends, development teams and SREs can negotiate to prioritize technical work over releasing new features. That work can be improving application performance, paying down technical debt, or reviewing the current architecture.
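To ground that negotiation in numbers, the remaining error budget for a window can be computed from request counts. The following is a minimal sketch under stated assumptions: `SloWindow`, `suggest_focus`, the 25% reserve, and the wording of the suggestions are all hypothetical, and a real policy would be agreed between the teams.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    slo_target: float       # e.g. 0.999 for a 99.9% SLO
    total_requests: int     # requests served in the window
    failed_requests: int    # requests that violated the SLI

    def budget_remaining(self) -> float:
        """Fraction of the error budget left; negative when overspent."""
        allowed_failures = (1.0 - self.slo_target) * self.total_requests
        if allowed_failures <= 0:
            raise ValueError("need slo_target < 1 and total_requests > 0")
        return 1.0 - self.failed_requests / allowed_failures


def suggest_focus(window: SloWindow, reserve: float = 0.25) -> str:
    """Hypothetical policy: the reserve threshold is an assumption."""
    remaining = window.budget_remaining()
    if remaining <= 0:
        return "budget exhausted: reliability work first"
    if remaining < reserve:
        return "budget low: negotiate more technical work"
    return "budget healthy: feature work can continue"


if __name__ == "__main__":
    window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=900)
    print(f"{window.budget_remaining():.0%} left -> {suggest_focus(window)}")
```

The output of a helper like this is a starting point for the conversation, not a replacement for it.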

After an incident, working on the follow-up action items is the highest priority for the teams and SREs. Since this should be agreed upon with the product managers, it is better to invite them when setting priorities.

Reviewing critical upcoming initiatives collaboratively

Determine whether any planned updates or deployments are likely to exceed our error budget, and plan sprints accordingly to prevent this. Attendees of this meeting should be the product managers, SREs, the core component/service engineering teams, and other stakeholders (if needed).

As an example, we are currently migrating some critical services to a containerized environment. We should make sure these infrastructure changes do not burn too much of our error budget. To mitigate the risks, we use practices such as load testing, comprehensive capacity planning, and canary releases.
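A rough way to sanity-check a plan like this is to express an availability SLO as minutes of allowed downtime and compare it with the downtime each initiative is expected to cost. This is a sketch, not our actual tooling: the function names, the 30-day window, and the per-initiative estimates are assumptions that the meeting attendees would supply.

```python
from typing import Mapping, Tuple


def budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def fits_budget(
    slo_target: float,
    minutes_already_spent: float,
    planned: Mapping[str, float],  # initiative name -> estimated downtime (min)
    window_days: int = 30,
) -> Tuple[bool, float]:
    """Return (fits, headroom): headroom is what is left after all plans."""
    remaining = budget_minutes(slo_target, window_days) - minutes_already_spent
    headroom = remaining - sum(planned.values())
    return headroom >= 0, headroom


if __name__ == "__main__":
    # Hypothetical estimates, for illustration only.
    plans = {"container migration": 15.0, "database upgrade": 5.0}
    ok, headroom = fits_budget(0.999, minutes_already_spent=20.0, planned=plans)
    print(f"fits: {ok}, headroom: {headroom:.1f} minutes")
```

If the headroom comes out negative, the team can spread the initiatives across windows or invest more in mitigations such as canary releases before shipping.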

This is Part 3 of the SLO Implementation Series at HappyFresh. Leave a 👏 if you enjoyed reading it. We’re also hiring engineers to join us in helping households around Southeast Asia get their groceries easily. If you want to know more, visit HappyFresh Tech Career.

Riky Lutfi Hamzah
HappyTech

Engineering Manager — Reliability & Security at HappyFresh. Writing some thoughts at rilutham.com.