SRE — Dissecting failure in reliability engineering
This is #2 in a series of posts about thoughts, experiments and other kinds of what-ifs and whatnots. Nothing here is bulletproof or carved in stone: just simple topics and tips to help everyone walk the walk.
Ready for #2? Have you successfully influenced and convinced your organization of the need for Reliability Engineering? Hope you have. Not an easy task, though.
So, reliability, reliable…
Let’s start our approach to this by focusing on the antonym of reliable — unreliable — and build our line of thought from there.
Here’s the definition of unreliable, at least according to Google (and if that’s what Google says, then it must be true!): not able to be relied upon.
If we browse through similar adjectives in order to dig a bit deeper into this “not able to be relied upon”, we’ll find erratic and fallible.
We’re here now: error, failure.
One thing that most startups don’t get (and most established companies usually get late in the game) is that error and failure go hand in hand with product development. Even more curious is the fact that common sense tells you this, but yep, people simply tend to ignore it until it’s too late.
Therefore, here’s a simple “if A=B and B=C then A=C” for you:
If there’s no business without a product, and if there’s no product when some or all of its services are unavailable (using an "if" here would be a complete oxymoron), then we can conclude that there’s no business unless fixing errors and preventing failures gets the same level of attention as product development does.
Or, using a ternary approach for the more geeky:
got reliability engineering ? business OK : business KO;
But let’s continue on failure — time for definitions:
Before anything else, let’s think about a mundane and (at first sight) totally irrelevant concept that is, nonetheless, an obvious marker of a reliability engineering implementation in place versus the clear lack thereof:
- Have you already defined, very objectively (yes, this really means beyond any possible interpretation or opinion), what categories are used to describe a failure in your product ecosystem? Always define those categories in a measurable and quantifiable way. Always!
- Here’s an example for service availability: up, degraded or down. What is the definition of degraded? A latency of +500ms instead of 50ms? For how long? On how many requests? (See the sketch after this list.)
- Here’s another example for incident management: the impact and urgency priority matrix (some ITIL here). Clearly define what constitutes high / medium / low with tangible quantities;
- And here’s an example for problem management: let’s use a more mature model of priority categorization for the impact definition, spanning operational and / or financial and / or reputational impact. There’s a lot to define here, so don’t leave it up to interpretation or you’ll find yourself in a perpetual downward spiral of endless hours of inconsequential discussions;
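To make the service availability example concrete, here’s a minimal sketch in Python of a status classifier built only on measurable criteria. The thresholds (500ms p95 latency, 5 minutes sustained, 10% of requests, 50% error ratio for "down") are purely illustrative assumptions; plug in whatever numbers your own definitions specify.

```python
from enum import Enum

class ServiceStatus(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"

# Illustrative thresholds only: replace them with the objective numbers
# agreed for your own product ecosystem.
DEGRADED_P95_LATENCY_MS = 500      # p95 latency above this counts as degraded
DEGRADED_MIN_DURATION_S = 300      # ...sustained for at least 5 minutes
DEGRADED_MIN_REQUEST_RATIO = 0.10  # ...affecting at least 10% of requests
DOWN_ERROR_RATIO = 0.50            # half or more of requests failing = down

def classify(p95_latency_ms: float, duration_s: float,
             affected_request_ratio: float, error_ratio: float) -> ServiceStatus:
    """Classify availability using quantifiable criteria, not opinions."""
    if error_ratio >= DOWN_ERROR_RATIO:
        return ServiceStatus.DOWN
    if (p95_latency_ms > DEGRADED_P95_LATENCY_MS
            and duration_s >= DEGRADED_MIN_DURATION_S
            and affected_request_ratio >= DEGRADED_MIN_REQUEST_RATIO):
        return ServiceStatus.DEGRADED
    return ServiceStatus.UP
```

The point isn’t the specific numbers: it’s that two people looking at the same metrics will always land on the same label.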
One must keep the "everything’s broken" stigma from hovering over our heads, along with the analysis paralysis that usually comes attached to it.
A streamlined definition is halfway to properly prioritizing your focus, especially for failures that aren’t immediate or simple to solve. And adding Service Level Objectives and error budgets to the equation will definitely give you a clear path based on what you must target first.
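As a rough illustration of how an error budget turns an SLO into a prioritization signal, here’s a small sketch assuming a 99.9% availability SLO over a 30-day window; both figures are made up for the example.

```python
# A minimal error-budget check, assuming a 99.9% availability SLO over a
# 30-day window (both figures are illustrative, not a recommendation).
SLO_TARGET = 0.999
WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

error_budget_minutes = WINDOW_MINUTES * (1 - SLO_TARGET)  # ~43.2 minutes

def budget_remaining(downtime_minutes: float) -> float:
    """Fraction of the error budget still available (negative means blown)."""
    return 1 - downtime_minutes / error_budget_minutes

# e.g. 20 minutes of downtime so far leaves roughly 54% of the budget
print(f"{budget_remaining(20):.0%} of the error budget left")
```

Once the budget is close to exhausted, reliability work outranks feature work; while there’s plenty of budget left, you can keep shipping.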
Still on failure — measurement and reliability trend status:
There’s more than one way to skin a cat, but if there’s one thing I consider a must-have on any cockpit board I’ve ever used, it’s the MTBF (Mean Time Between Failures).
This is a very simple yet extremely useful indicator: it measures the average time elapsed between failures.
To calculate MTBF, divide the total number of operational hours in a period by the number of failures that occurred in that period.
MTBF = T / R, where T = total operational time in the period and R = number of failures in that period
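In code the calculation is a one-liner; the sketch below just makes the failure-free edge case explicit (the 720 hours and 4 failures are only example figures).

```python
def mtbf(total_operational_hours: float, failure_count: int) -> float:
    """MTBF = T / R: total operational time divided by the number of failures."""
    if failure_count == 0:
        return float("inf")  # no failures observed in the period
    return total_operational_hours / failure_count

# e.g. a 30-day month (720 h of operation) with 4 failures:
print(mtbf(720, 4))  # 180.0 hours between failures, on average
```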
Quick hints from MTBF:
- The smaller the MTBF, the less reliable your system is: you need to act;
- The more segmented MTBF indicators you can get (MTBF per application, MTBF per service, MTBF per database, etc.), the better; with these you can compare your different MTBFs and decide where you need to act first;
- A rolling-window MTBF (quarterly, monthly, whatever period suits you) will surface the reliability trend: upon action (and if the action has been effective), you should start seeing the MTBF increase; in the same way, if the trend stays flat or turns negative after acting, it means you essentially screwed up somewhere in the process (introduced new bugs, the fix wasn’t effective, etc.). A rough sketch follows this list.
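Here’s one way such a rolling view could be computed, assuming you can export failure start timestamps from your incident tracker; the per-calendar-month window and the 730-hour approximation are assumptions made for the sake of the example.

```python
from collections import Counter
from datetime import datetime

HOURS_PER_MONTH = 730  # rough average; use the real length of each window

def monthly_mtbf(failure_times: list[datetime]) -> dict[str, float]:
    """Return MTBF (in hours) per calendar month, to expose the trend."""
    failures_per_month = Counter(t.strftime("%Y-%m") for t in failure_times)
    return {month: HOURS_PER_MONTH / count
            for month, count in sorted(failures_per_month.items())}

# A month-over-month increase means your actions are paying off; a flat or
# decreasing series means the fixes were not effective.
```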
Hope this helped someone, somehow. Stay tuned for #3.