Make running systems at scale fun & embarrassingly easy.

In an earlier post, I talked about how SLOs can be misleading; the Service Level Indicator under consideration there was Uptime. There is another SLI that is almost impossible to measure accurately: Latency.

Just as Uptime is measured as a percentage and aggregated over a chosen time window (a week, a month, or a year), Latency is measured in units of time (ms or s), and the preferred aggregate is a percentile.
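To make the difference between aggregates concrete, here is a minimal sketch of a percentile computed over latency samples. It assumes the nearest-rank definition of a percentile (one of several in common use); the `percentile` helper and the sample latencies are hypothetical and for illustration only.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at
    least p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 12, 16, 15, 14, 950]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p90_ms = percentile(latencies_ms, 90)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p90={p90_ms} p99={p99_ms}")
```

With these samples the single slow request pulls the mean far above what a typical request experienced, while the median stays close to it and only the highest percentile exposes the outlier.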

The purpose of this post is to walk through the common mistakes I made while dealing with percentiles.

Why is it important to understand percentiles in depth? Because one of the critical Indicators of software…


An illustrated summary of Developers -> DevOps -> SRE

1. Developers wanted to ship their produce

To the other side


2. Production never matches the development environment. It resembles it, but cannot match it.

So they deployed people on the other side


We ran a poll on Twitter

“Do you care about the quality of your infrastructure code?”


And on Reddit



Wikipedia defines Root Cause Analysis (RCA) as “a method of problem-solving used for identifying the root causes of faults or problems.”

Essentially, root cause analysis means digging deeper into an issue to find what caused a non-conformance. What's important to understand here is that Root Cause Analysis does not mean just looking at the superficial causes of a problem. Rather, it means finding the highest-level cause: the thing that started a chain of cause-and-effect reactions and ultimately led to the issue at hand.

Root cause analysis methodology is widely used in IT operations, telecommunications, healthcare industry, etc. …



SLO is an acronym for Service Level Objective. But before I explain SLO, you need one more acronym: SLI (Service Level Indicator).

An SLI is a quantitative measurement of a (and not the) quality of a Service. It may be unique to each use-case, but there are certain standard qualities of services that practitioners tend to follow.

  • Availability: The amount of time that a service was available to respond to requests. Referred to as Uptime.
  • Speed: How fast a service responds to a request. Referred to as Latency.
  • Correctness: Response alone isn’t good enough. It also matters whether…
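
As a rough illustration of the first two indicators in the list above, the sketch below derives an availability SLI and a speed SLI from a handful of probe results. Everything here is an assumption made for illustration: the `Probe` record, the idea of approximating availability by the share of successful probes, and the 200 ms threshold are hypothetical and not taken from the post.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    ok: bool           # did the service respond at all?
    latency_ms: float  # how long the response took


def availability_pct(probes):
    """Share of probes that got a response, as a percentage."""
    return 100 * sum(p.ok for p in probes) / len(probes)


def fast_enough_pct(probes, threshold_ms):
    """Share of answered probes that responded within threshold_ms."""
    answered = [p for p in probes if p.ok]
    if not answered:
        return 0.0
    fast = sum(p.latency_ms <= threshold_ms for p in answered)
    return 100 * fast / len(answered)


# Hypothetical probe results: one failure, one slow response.
probes = [Probe(True, 40.0) for _ in range(8)]
probes.append(Probe(True, 750.0))
probes.append(Probe(False, 0.0))

availability = availability_pct(probes)
speed = fast_enough_pct(probes, threshold_ms=200)

print(f"availability={availability:.2f}% within_200ms={speed:.2f}%")
```

Each function measures one quality of the service, which matches the point that an SLI captures a quality rather than the quality of a service.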
