SRE — Morphing into site || systems reliability engineering

Ricardo Cosme
Nov 23, 2019

This is the first in a series of posts about thoughts, experiments and any other kind of what-ifs and whatnots. Nothing here is bulletproof or carved in stone; these are just simple topics and tips to help everyone walk the walk.

So you think you can SRE?

A couple of questions, then:

  • Have you and your organisation embraced failure as a matter not of “if” it happens but of “when” it happens?
  • Do you believe that your co-workers always try to do the right thing, have (or are developing) a reasonable dose of common sense about the work at hand, and are never, under any circumstance, responsible for anything other than honest mistakes when mistakes occur?
  • Can you state that your organisation and your co-workers have started working towards, or have already adopted, a blameless approach to failure analysis?

If you answered “no” to any of the questions above, then this is for you.

Let’s dive into failure and the need to embrace it.

I fail, you fail, we all fail. The exact same goes for systems, software, hardware. It’s not a question of “if”. By the way, if you strongly believe that you don’t or won’t fail, you had better start looking for a different line of work.

Everything is prone to fail and, as such, it will fail eventually. That’s why we have SLAs in the first place. Have you ever thought about it? Again, have you ever thought about it? Unfortunately, all too often I hear people using buzzwords or industry jargon like “SLAs”, yet simple questions like “why did you have to come forward with an SLA in the first place?” don’t get a proper answer, when they get an answer at all. I usually restrain myself from hopping to the next question, “was that SLA negotiated or imposed?”, in sheer fear of what the answer may be…

The SLA is the primal acceptance contract between two entities (the provider and the consumer). Since everything is prone to fail (a service, a process, etc.), the provider commits to doing everything they possibly can to keep whatever they provide at or above the agreed level, and the consumer commits to accepting a worst-case scenario in which whatever they are being provided with drops to the agreed level.
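
To make that worst-case scenario tangible, here is a minimal sketch of the arithmetic. It assumes the agreed level is expressed as an availability percentage over a 30-day period, which is only one of many ways an SLA can be written; the function name and the sample levels are made up for illustration.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime the consumer agrees to tolerate over the period.

    Assumption: the SLA is a plain availability percentage measured over
    `period_days`. Real contracts may define the level differently.
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Hypothetical agreed levels, purely to show how fast the budget shrinks.
for level in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(level)
    print(f"{level}% over 30 days -> {minutes:.1f} minutes of accepted downtime")
```

Under those assumptions, a 99.9% level leaves roughly 43 minutes of accepted downtime per 30 days. Seen that way, the SLA is less a buzzword and more a number both sides have agreed to live with.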

So, here’s a tip: design for failure tolerance. Prepare yourself, your teams, your co-workers, and do everything you can inside your organisation to be failure tolerant by embracing failure in the first place. Don’t think “if”, think “when”.

Be it software or systems, public cloud or on-prem, embedded or distributed, start asking your engineers that hard additional question: “how does your app or system behave when something, or everything, around it is down or unavailable?” This is what reliability is all about: how do you fail? Keep pushing it, over and over, until you don’t have to ask the question anymore.
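
One possible answer to that question, sketched under heavy assumptions: the app keeps the last known good value and serves it, flagged as degraded, when its dependency is down. Everything here (`DependencyDown`, `fetch_profile`, the in-memory cache, the placeholder profile) is hypothetical and only illustrates the idea; it is not a recommended design for any particular system.

```python
class DependencyDown(Exception):
    """Raised by a hypothetical client when its backing service is unavailable."""


# Last known good values, keyed by user id (in-memory, for illustration only).
_cache = {}


def fetch_profile(user_id, remote_call):
    """Return (profile, degraded).

    Fresh data when the dependency answers; otherwise the last known good
    value, or a placeholder if we never saw one. The caller always gets
    something it can work with and knows whether it is degraded.
    """
    try:
        profile = remote_call(user_id)
        _cache[user_id] = profile
        return profile, False
    except DependencyDown:
        if user_id in _cache:
            return _cache[user_id], True
        return {"id": user_id, "name": "unknown"}, True


# Two stand-ins for the remote dependency: one healthy, one down.
def healthy(user_id):
    return {"id": user_id, "name": "Ada"}


def broken(user_id):
    raise DependencyDown("profile service unreachable")


print(fetch_profile(1, healthy))  # fresh data, not degraded
print(fetch_profile(1, broken))   # dependency down: served from cache, degraded
print(fetch_profile(2, broken))   # nothing cached: placeholder, degraded
```

The point is not the cache itself but that the failure path was decided in advance, so the answer to “how do you fail?” is written down in the code rather than discovered during an outage.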

Here’s another tip: start every operational meeting or conversation with “I fail, you fail, we will all fail” and prepare yourselves to do it gracefully, while recovery is already working its magic. Don’t set yourself apart from that ethos; get your co-workers to live by it and to spread it all around your organisation.

The more people you bring on board to embrace failure, the closer you get to morphing into reliability engineering.

--

Ricardo Cosme

20+ years of experience in complex tech environments and large, high availability systems. But, most importantly, proud father of a girl and two boys.