An illustrated summary of Developers -> DevOps -> SRE

1. Developers wanted to ship their product

To the other side

2. Production never matches the development environment. It can resemble it, but it never matches it.

So they deployed people on the other side

3. But this process was slow, and they wanted to deploy faster

So they deployed Continuous Integration and Continuous Deployment (CI/CD)

4. To improve reliability, we got SREs to do this

The SREs’ first job was to hold this ship steady, but that is where they got stuck

5. What Site Reliability Engineers should’ve built is

SREs should’ve been *engineering* and *observing* the bridge, but instead they became the bridge.


We ran a poll on Twitter

“Do you care about the quality of your infrastructure code?”

And on Reddit

That’s a staggering split of roughly 60–30–10.

What do you think the response would be if the poll asked, “Do you care about the quality of your product code?”

Reasons

We asked a follow-up question to understand why ~30% fall in the “Somewhat, but mostly no” category, and gleaned these reasons from Twitter and Reddit:

  1. Someone manually created the legacy infrastructure. No one questioned the practice or broke the tradition.
  2. Organizations at a small enough scale might feel that it is faster to deploy infrastructure from the cloud provider console than to codify it. …


Wikipedia defines Root Cause Analysis (RCA) as “a method of problem-solving used for identifying the root causes of faults or problems.”

Essentially, root cause analysis means diving deeper into an issue to find what caused a non-conformance. What’s important to understand here is that Root Cause Analysis does not mean just looking at the superficial causes of a problem. Rather, it means finding the highest-level cause: the thing that started a chain of cause-and-effect reactions and ultimately led to the issue at hand.

Root cause analysis is widely used in IT operations, telecommunications, healthcare, etc. …


In an earlier post, I talked about how SLOs can be misleading; the Service Level Indicator under consideration there was Uptime. There is another SLI that is almost impossible to be accurate about: Latency.

Just as Uptime is measured as a percentage and aggregated over a chosen time window (a week, a month, or a year), Latency is measured in units of time (ms or s), and the preferred aggregate is the percentile.

The purpose of this post is to call out common mistakes that I made while dealing with percentiles.

Why is it important to understand percentiles in depth? Because one of the critical indicators of software performance, Latency, is measured using percentiles. One cannot deal with something as clinical as performance without understanding the behavior of its yardstick.


SLO is an acronym for Service Level Objective. But before I explain SLO, you need one more acronym: SLI (Service Level Indicator).

An SLI is a quantitative measurement of a (and not the) quality of a Service. It may be unique to each use-case, but there are certain standard qualities of services that practitioners tend to follow.

  • Availability: the amount of time that a service was available to respond to a request. Referred to as Uptime.
  • Speed: how fast a service responds to a request. Referred to as Latency.
  • Correctness: a response alone isn’t good enough. It also matters whether it was the right one. …


Engineering teams like to move fast. With multiple products and a variety of projects, deployments happen almost every day. Ensuring that chaos does not take over and lead to multiple failures requires carefully thought-out processes. And while most teams have processes in place, right from the dev stage to the production stage, ensuring that those processes are followed becomes a challenge.

There are checklists in place before a new update is pushed to the production stage. …


Site Reliability Engineering is the new fad. It’s not the Docker that you don’t need, it’s not the Kubernetes that you don’t need. It’s also not the Blockchain that you don’t need. Or well, maybe it is.

Questions about running operations at scale, such as the manpower or cost involved, or the cost associated with each 9 in the Five Nines, are being talked about more often. What does it take to run things smoothly?

Here are a few pillars, or keys to success, of running things reliably:

  1. Have no silos of information
  2. Measure everything
  3. Culture, not tools
  4. Accidents…


I was recently writing an application in Golang which required some database interaction. The db library I was using had built-in pooling, so I didn’t have to bother about connection recycling and reuse, as long as I could initialise a DbPool and continue to call DbPool.new(). Having a module-level Singleton object of DbPool would do the trick. However, the problem with Singletons is that, in a multi-threaded environment, the initialisation must be protected to prevent re-initialisation. I will discuss a few common ways to achieve this, along with the shortcomings of each approach.

Module init()

The most common approach I have come across is to define an init() function in module files. These module-level constructors perform operations such as initialising DB pools or caches. This code is guaranteed to run once, and only once, at the startup of your program. …


Preface

Recently, I have been trying to bring up virtual machines in Microsoft Azure, but ran into an interesting and annoying problem: I could not upload SSH keys via the Terraform DSL. There is a provision to provide an ssh_key_thumbprint, but sadly no way to upload what you would call a KeyPair in AWS jargon.

While Terraform does not support this operation via its DSL, it is possible to achieve it using some less-explored features of Terraform.

Solution

I am using OS X, so my code samples might include some OS X specific commands. …


Terraform is a pretty nifty tool to lay out complex infrastructure across cloud providers. It is an expressway past the otherwise mundane and tedious task of going through an insane amount of API documentation.

The output of a Terraform run is a JSON document that carries an awesome lot of information that the cloud platform provides about a resource, such as instance_id, public_ip, local_ip, tags, dns, security groups, etc. It has often left me wondering if I could search and access these JSON documents from configuration management recipes, playbooks, or modules.

Example: While provisioning a zookeeper instance, I want the local_ip of all the peer nodes. I could run a query that would fetch me the local_ips of all the nodes in this VPC that have the same security group. Or, while applying a security patch to all the Redis nodes, I need the public_ip of all nodes that carry the tag `node_type: redis`.
I hope you get the idea of the use cases by now. It definitely sounds like something that a document DB should be able to handle with relative ease. …

About

Piyush Verma

CTO/Founder @last9inc | Startup magnate (2x fail, 1x exit) | English Breakfast Tea, Hot
