How Engineering Managers Should Balance System Reliability and New Features

Using SLIs, SLOs, and tracking your error budget, you can determine how to balance reliability and new features.

Published in CodeX · 4 min read · Sep 10, 2022

Photo by Joshua Miranda: https://www.pexels.com/photo/tower-of-wooden-blocks-on-table-4399366/

I was about to join a team that had focused exclusively on building new features for years. Unfortunately, they invested very little in reliability, maintainability, or supporting the application. As a result, the system’s quality left something to be desired.

In a situation like this, going full steam ahead on features has consequences. At the same time, putting a stop to all new features hurts customers and the business. So what’s the right approach?

You can use an error budget to decide how much to invest in software maintenance.

How do you measure reliability?

Reliability combines many factors, including availability, responsiveness, dependability, and quality.

For each service that you run, there are many metrics that you could use to understand reliability. We call each such metric a Service Level Indicator, or SLI: a measurement of some property of the service.

For a web application, we may measure the number of requests made, the response times, return codes, and simultaneous requests.
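To make this concrete, here is a minimal sketch of how two SLIs might be computed from request data. The `Request` record, the function names, the 300 ms threshold, and the choice to count only 5xx return codes as failures are all assumptions I made for illustration; your own definitions of a "good" request may differ.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request to the web application."""
    status: int        # HTTP return code
    latency_ms: float  # response time in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests answered within threshold_ms."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully responsive
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


# Made-up sample data, not measurements from a real system.
sample = [
    Request(status=200, latency_ms=120.0),
    Request(status=200, latency_ms=340.0),
    Request(status=500, latency_ms=95.0),
    Request(status=404, latency_ms=80.0),
]

print(availability_sli(sample))    # 0.75 (three of four requests were not 5xx)
print(latency_sli(sample, 300.0))  # 0.75 (three of four finished within 300 ms)
```

Expressing each SLI as a ratio of good events to total events keeps every indicator on the same 0-to-1 scale, which makes it easier to compare against a target later.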

But how do we know how we’re doing? To begin, we have to accept that nothing is perfect, especially in a software system. There will be slow responses, wrong return codes, dropped network connections, and heavy traffic. Things will break.

How do you know if you’re successful?

A Service Level Objective, or SLO, is the target level of reliability for a particular SLI within a service. How to set your SLOs goes beyond what I plan to discuss today. However, an SLO is a target that you, your team, and your management can develop together and agree to manage.

Once you have your SLOs defined, you can measure whether you are doing better or worse than expected.
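As a rough sketch, checking a measured SLI against its SLO can be as simple as a comparison. The `SLO` class, the function names, and the 99.9% target below are hypothetical values chosen for illustration, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical objective for one SLI, expressed as a ratio between 0 and 1."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should be good


def meets_slo(sli: float, slo: SLO) -> bool:
    """True when the measured SLI is at or above the objective."""
    return sli >= slo.target


def margin(sli: float, slo: SLO) -> float:
    """Positive when doing better than expected, negative when doing worse."""
    return sli - slo.target


availability = SLO(name="availability", target=0.999)

print(meets_slo(0.9995, availability))  # True: better than the objective
print(meets_slo(0.9980, availability))  # False: worse than the objective
```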

You can balance system reliability and new feature development!

An error budget lets you track how the service performs against its SLOs over time. You will have a budget surplus if the service regularly performs better than your objectives. Conversely, if the service’s metrics fall below your objectives, you will be in a budget deficit.
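One common way to put numbers on this is to treat the gap between 100% and the SLO as the number of failures you can afford over a period. The sketch below assumes a request-based availability SLO; the `ErrorBudget` class, the `error_budget` function, and the traffic figures are hypothetical. For example, a 99.9% target over 1,000,000 requests allows roughly 1,000 failures.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_failures: float  # failures the SLO tolerates in the period
    actual_failures: int     # failures actually observed
    remaining: float         # positive = surplus, negative = deficit

    @property
    def in_surplus(self) -> bool:
        return self.remaining >= 0


def error_budget(total_requests: int, failed_requests: int,
                 slo_target: float) -> ErrorBudget:
    """Compare observed failures with what the SLO allows for the period."""
    allowed = (1.0 - slo_target) * total_requests
    return ErrorBudget(
        allowed_failures=allowed,
        actual_failures=failed_requests,
        remaining=allowed - failed_requests,
    )


# Made-up numbers: 1,000,000 requests in the period against a 99.9% SLO.
good_month = error_budget(1_000_000, 400, slo_target=0.999)
bad_month = error_budget(1_000_000, 1_500, slo_target=0.999)

print(good_month.in_surplus)  # True: about 600 failures of budget left
print(bad_month.in_surplus)   # False: about 500 failures over budget
```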

Determining the right SLO for your service is an exercise for the reader.

What should you work on?

Now that you’ve defined your SLIs and chosen your SLOs, you can use them to determine what to work on next.

When you are in a budget surplus, your service is performing better than expected. That surplus makes it safe, even desirable, to introduce new features into the system.

If you have a budget deficit, your service is not performing as well as expected. Therefore, you should slow down with new features and spend some time improving the reliability of your services.
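This rule of thumb is simple enough to write down as code. The sketch below is only an illustration: the `next_focus` function and the service names are hypothetical, and treating an exactly exhausted budget as a reason to work on reliability is my assumption, since a real team would likely agree on its own policy.

```python
def next_focus(budget_remaining: float) -> str:
    """Suggest where to spend the next iteration, given the remaining error budget.

    A positive value is a surplus and a negative value is a deficit.
    Assumption: a budget of exactly zero is treated like a deficit.
    """
    if budget_remaining > 0:
        return "features"
    return "reliability"


# Made-up remaining budgets (in allowed failures) for two hypothetical services.
budgets = {"web-app": 600.0, "database": -500.0}

for service, remaining in budgets.items():
    print(service, next_focus(remaining))
# web-app features
# database reliability
```

In practice the decision is rarely all-or-nothing, so a team might use the size of the surplus or deficit to decide how much of the next iteration goes to each kind of work.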

It begins and ends with users.

At its core, you must look at reliability through your users’ eyes.

When you are operating software, the proper level of reliability is your most critical operational requirement. However, before we dig into how to understand reliability from the user’s point of view, a few definitions will help.

Users are anyone or anything that relies on your service. In the same fashion, a service is anything that has a user. Finally, a system is a set of services working together.

These definitions may seem circular, so let’s expand with a simple example.

Imagine a dynamic website used to publish online articles. A writer would come to their browser, type in the address, and see a web application where they could publish articles. The web application, in turn, would persist their writing so the writer could come back and edit their work.

The writer would be the user in this case, and the web application would be the service. Similarly, the web application could be the user, with a database server acting as the service. Finally, we could go even deeper: the database server is itself a user of an infrastructure service, such as EC2.

It’s users and services — all the way up and down.

Put it all to work.

SLIs and SLOs are simply data.

Using SLOs is a process, not a destination. You cannot create a project to fix your reliability and then move on. The reliability work will never be “done.” System reliability is an ongoing journey. The world will constantly change, and so will the expectations around your services.

Always view reliability through your users’ eyes.

Keep your humans front and center, track the essential properties of your system, and keep iterating.

You can balance system reliability and new feature development!

📋 About Milo

I am a tech executive, writer, speaker, entrepreneur, and inventor. I’ve been developing software since 1995 and developing teams for the past decade. 🚀

I write articles about software, engineering, management, and leadership.

You can also follow me on Twitter. 🐦