Balance of Reliability and Innovation

Alex Potter Dixon
Glasswall Engineering
Nov 17, 2019
SRE Balance of Reliability and New Features

The Balance

At Glasswall, one of our new products is our SaaS FileTrust email solution, and two questions regularly come up:

  1. Is our service stable enough so that our users are having a good experience?
  2. When can we get this new feature out?

We want our users to feel that the service is mature, stable and safe to use; otherwise, why would they use it? We also want the service to be seen as exciting and engaging, which means evolving it with new features on a regular basis.

The problem is that the more new features and innovations you bring to an online service, the more instability you risk introducing.

So how does Glasswall solve this problem? SRE.

SRE

Glasswall has recently established an in-house SRE team. SRE stands for Site Reliability Engineering, and our main purpose is:

“Make the site as reliable as possible while keeping the pace of innovation high.”

Our goal is to hold the balance: allow developers to create all the exciting new features for our customers while making sure the customer doesn’t experience a degradation of service.

There are three simple tools that we use to make sure this happens: SLO, SLI and Error Budget.

SLO

What is an SLO? It is a Service Level Objective. A bit like an SLA, it is a target that SRE negotiates with the business, defining what counts as an acceptable level of service.

An example of an SLO might be:

99% of requests to our site are successful.

This simply means that out of every 100 user requests to the site, at least 99 must be successful and at most 1 can be unsuccessful, for example by returning a 500 error.

SLI

We need to be able to measure whether we are meeting an SLO, and this is where SLIs come into play. Service Level Indicators are the measurements we use to track our SLO.

Following on from the SLO example, our SLI might be:

Successful requests (2xx, 3xx) / Total number of requests.

This generates a percentage that we compare against our SLO to see whether we are meeting it.
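
As a rough illustration, the SLI above could be computed from a list of HTTP status codes. This is only a sketch: the function names and the sample data are made up for the example, and it assumes that any status code from 200 to 399 counts as successful.

```python
def is_successful(status_code: int) -> bool:
    """Assumption for this sketch: 2xx and 3xx responses (200-399) are successful."""
    return 200 <= status_code < 400


def success_rate(status_codes: list[int]) -> float:
    """Return the SLI as a percentage: successful requests / total requests."""
    if not status_codes:
        raise ValueError("no requests to measure")
    successful = sum(1 for code in status_codes if is_successful(code))
    return 100.0 * successful / len(status_codes)


# Hypothetical sample: 99 successful responses and one 500 error.
sample = [200] * 95 + [301] * 4 + [500]
slo_target = 99.0  # the example SLO from this article

sli = success_rate(sample)
print(f"SLI: {sli:.1f}%, meets SLO: {sli >= slo_target}")
```

In practice the status codes would come from load balancer logs or a metrics system rather than an in-memory list.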

Error Budget

Error Budget is the final piece of the puzzle. It is simply the gap between 100% and your defined SLO. The Error Budget gives you the breathing room to deploy changes to your site; it’s the control mechanism for the balance we’ve been seeking.

Following on from the SLO example, our Error Budget is:

100% - 99% = 1%

This means that up to 1% of our requests can be unsuccessful before we breach the SLO.

Error Budgets are also designed to cover a specific time period, and within Glasswall we currently run our Error Budgets on a rolling four-week period.
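
One way to picture the rolling window is to count only the requests from the last four weeks when working out how much of the budget has been spent. The sketch below is not our production tooling; the `(timestamp, status_code)` record shape, the function name and the sample data are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(weeks=4)  # rolling four-week period
SLO_PERCENT = 99.0           # example SLO from this article


def budget_used(requests, now):
    """Return the fraction of the Error Budget consumed inside the rolling window.

    `requests` is a list of (timestamp, status_code) pairs (assumed shape).
    A result of 1.0 means the budget is exactly spent; above 1.0 it is blown.
    """
    recent = [code for ts, code in requests if now - WINDOW <= ts <= now]
    if not recent:
        return 0.0
    failures = sum(1 for code in recent if not 200 <= code < 400)
    allowed = len(recent) * (100.0 - SLO_PERCENT) / 100.0
    return failures / allowed


# Hypothetical traffic: 200 requests inside the window (one of them failed),
# plus one failure from five weeks ago that has already rolled out of the window.
now = datetime(2019, 11, 17)
requests = [(now - timedelta(days=1), 200)] * 199
requests += [(now - timedelta(days=2), 500), (now - timedelta(weeks=5), 500)]

print(f"Error Budget used: {budget_used(requests, now):.0%}")
```

Because the window rolls, an old outage eventually stops counting against the budget without anyone having to reset it.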

How does this all fit together?

Now that we have all the tools we need, how do they fit together to answer our two earlier questions?

  1. Is our service stable enough so our users are having a good experience?
  2. When can we get this new feature out?

Let's reuse the examples from above, with the added detail of four weeks of traffic:

Four weeks of traffic: 10,000 requests

SLO: 99% of requests to our site are successful.

SLI: Successful requests (2xx, 3xx) / Total number of requests.

Error Budget: 100% - 99% = 1%, or 100 unsuccessful requests

Now that we know our Error Budget is 100 unsuccessful requests, we know what an acceptable level of service is.
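
The arithmetic in this example can be written out as a small sketch. The function name and the `failed_so_far` figure are hypothetical; only the 10,000 requests and the 99% SLO come from the example above.

```python
def error_budget_requests(total_requests: int, slo_percent: float) -> int:
    """Number of requests that may fail in the period without breaching the SLO."""
    # Rounding to a whole request is a choice made for this sketch.
    return round(total_requests * (100.0 - slo_percent) / 100.0)


total = 10_000   # four weeks of traffic, from the example
slo = 99.0       # example SLO, in percent

budget = error_budget_requests(total, slo)

failed_so_far = 37  # hypothetical count of unsuccessful requests
remaining = budget - failed_so_far

print(f"budget={budget}, remaining={remaining}, in budget={remaining >= 0}")
```

While `remaining` stays at zero or above, the service is in budget; once it goes negative, the budget is blown.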

In budget or out of budget

If the service is in budget, we keep developing new features and deploying them in new releases. If we are out of budget, then we need to STOP.

If the Error Budget is blown at Glasswall, we will halt all changes and releases to our production systems, other than P1 issues, security fixes and bug fixes, until the following have been completed:

  • Postmortem has been completed.
  • Bugs from the postmortem have been fixed in production.
  • Other actions from the postmortem such as process change have been enacted.
  • Service is back within its SLO.
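
The release policy above could be expressed as a simple gate. This is a sketch of the rules as described in this article, not our actual tooling; the change-type labels, the class and the function names are all invented for the example.

```python
from dataclasses import dataclass

# Change types that may still ship during a freeze (labels are hypothetical).
EXEMPT_CHANGES = {"p1", "security", "bugfix"}


@dataclass
class RecoveryStatus:
    """The four conditions that must all hold before a freeze is lifted."""
    postmortem_completed: bool
    postmortem_bugs_fixed: bool
    other_actions_enacted: bool
    within_slo: bool

    def all_done(self) -> bool:
        return all((self.postmortem_completed, self.postmortem_bugs_fixed,
                    self.other_actions_enacted, self.within_slo))


def may_release(change_type: str, freeze_declared: bool, status: RecoveryStatus) -> bool:
    """Decide whether a change may go to production.

    `freeze_declared` is True once the Error Budget has been blown.
    """
    if not freeze_declared or change_type in EXEMPT_CHANGES:
        return True
    return status.all_done()


# Hypothetical state: the postmortem is done, but its bugs are not yet fixed.
pending = RecoveryStatus(True, False, False, False)
print(may_release("feature", True, pending))   # blocked by the freeze
print(may_release("security", True, pending))  # exempt change type
```

A gate like this could run as a check in a deployment pipeline, with the recovery status maintained by whoever owns the postmortem.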

Using this mechanism, we can control the flow of new features going to the platform while making sure the customer is experiencing the level of service they expect.

Alex Potter Dixon
Glasswall Engineering

SRE Manager at Glasswall. Always looking to innovate and push in cloud infrastructure and SRE.