Scaling Reliability: From Operations to Engineering

Piyush Verma
Site Reliability Engineering
Oct 6, 2018 · 7 min read

Site Reliability Engineering is the new fad. It’s not the Docker that you don’t need, it’s not the Kubernetes that you don’t need. It’s also not the Blockchain that you don’t need. Or well, maybe it is.

Questions about running operations at scale, like the manpower or cost involved, or the cost associated with each 9 in the Five Nines, are being asked more often. What does it take to run things smoothly?

Here are a few pillars, or keys to success, of running things reliably:

  1. Have no silos of information
  2. Measure everything
  3. Culture, not tools
  4. Accidents happen
  5. Changes should be gradual

There are multiple interpretations of these pillars.

In this series of posts, we’ll talk about how we interpret these guidelines and go about building reliable systems at TrustingSocial.com.

While defining Site Reliability Engineering is probably beyond the scope of this post, we will take the liberty of quoting Google’s definition (https://landing.google.com/sre/) of SRE here:

SRE is what you get when you treat operations as if it’s a software problem. Our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services — Google Search, Ads, Gmail, Android, YouTube, and App Engine, to name just a few — with an ever-watchful eye on their availability, latency, performance, and capacity.

A friend, Nishant, usually says this with a lot of pun and sarcasm:

If nothing ever stopped working, was it never supposed to break? Or did the reliability team do a wonderful job?

The current technology and product evolution can seem a little counter-intuitive. Shipping fast and breaking things is the accepted norm, and networks are expected to fail. And yet, at the same time, Site Reliability Engineering is picking up great momentum.

Why couldn’t the dev teams just be a little more careful about their services?

In hindsight, it’s quite logical actually. Responsibility and focus.

Let me elaborate. Having a separate team worry about uptime and reliability allows the product teams to focus completely on business logic and feature building. This razor-sharp focus grants product teams the ability to release every minute, while the reliability team wants not even a minute of downtime. The business as a whole moves really fast but does not break things (well, mostly).

This division of responsibility and ownership also brings a different dynamic into play. Opinions collide and sometimes erupt into arguments, as they would between any passionate teams.

Most people think it’s DevOps++, but here’s the difference. While traditionally operations teams run the code for product teams, we are talking about providing product teams with tooling that allows them to take code to production and run it reliably. Badly written code without UAT won’t automagically fix itself, but the reliability toolchain can certainly ping the developer every time their deployed application fails.

Don’t give them a fish, give them the tools to fetch a fish, reliably.

In a way, silos are good. Silos of responsibility and authority, not information. And as long as each of the teams can own the authority and responsibility of running their service with a mutually agreed upon commitment of reliability, it works out great!

Just because people don’t complain, doesn’t mean parachutes always work.

Speaking of Reliability, these follow-up questions come to mind:

  • Why should a service be reliable to begin with?
  • How do you measure Reliability?
  • How Reliable is Reliable?

These questions are usually met with a raised eyebrow.

And here I quote my friend, Nishant again:

The Best code ever is the one that I never wrote.

There is a lot of wisdom in that statement. The Best code is also probably the most reliable code, because it never breaks.

History and statistics show that every time there is a change, something breaks. The required uptime of a service (a product, when speaking from the perspective of the business) is, and should only be, defined by the product’s owner. This is the real essence of a Service Level Objective (SLO).

We let the product owner define the uptime for their service and they usually answer: Always, obviously.

What kind of service should always be running?

A pacemaker is a service that should have 100% uptime, but you don’t ship updates to it. I did a quick Google search to find that out, and here’s what I got.

[Image: The Irony]

So, 100% uptime is a myth. What, then, is an acceptable uptime?

The only real way to answer this is through a quick calculation: if your service were down for x hours a day, how much would it cost?

Let’s say you are running a stock exchange. The trading hours range from 9 am to 5 pm. You simply can’t afford downtime during trading hours, and a few hours before and after. But a few minutes of downtime at night might not affect your business. So, more than the uptime, the window of uptime matters here.
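To make that concrete, here is a back-of-the-envelope sketch. The revenue figures and the trading window are made-up assumptions, not real numbers; the point is only that the same outage costs very different amounts depending on when it lands.

```python
# Back-of-the-envelope cost of downtime. All numbers are made up for illustration.
TRADING_START, TRADING_END = 9, 17          # 9 am to 5 pm trading window (assumption)
REVENUE_PER_TRADING_HOUR = 250_000          # hypothetical figure
REVENUE_PER_OFF_HOUR = 0                    # nothing trades at night

def downtime_cost(start_hour: int, hours_down: int) -> int:
    """Cost of an outage starting at start_hour and lasting hours_down whole hours."""
    cost = 0
    for h in range(start_hour, start_hour + hours_down):
        in_window = TRADING_START <= (h % 24) < TRADING_END
        cost += REVENUE_PER_TRADING_HOUR if in_window else REVENUE_PER_OFF_HOUR
    return cost

print(downtime_cost(10, 2))   # two hours down mid-session: 500000
print(downtime_cost(22, 2))   # two hours down overnight: 0
```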

A sample Service Level objective could read something like:

100% uptime during 8 AM — 6 PM

But again, 100% uptime, even during a window of time, forever, is not completely possible. Hence, we introduce the notion of an Error Budget: the uptime agreed upon keeping the operations, on-call people, testing, and development costs in mind. So if the agreed-upon uptime is 99%, that gives us an Error Budget of 1%. This is your window of not being responsible. You could choose to do anything else: clear technical debt, binge on pizzas, or simply choose not to react, as long as issues remain under 1%.
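To put a number on that budget, here is a minimal sketch that converts an uptime target into minutes of allowed downtime. The 30-day window is an assumption; use whatever window your SLO is agreed over.

```python
# Convert an SLO uptime target into an error budget for a window.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for a given SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} uptime -> {error_budget_minutes(slo):.1f} minutes of budget per 30 days")
```

A 99% target buys you roughly 432 minutes, about 7.2 hours, of breakage per 30 days; each extra 9 shrinks that budget by a factor of ten.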

This brings us to the next questions: How reliable is Reliable? And how do you measure Reliability?

You know how they say, it’s not about being first but being the best, but you also don’t want to come fifth and win a toaster instead.

A server that takes forever to respond, but eventually does, is reliable; it’s probably just not very useful. Reliability is not just about doing a thing reliably, but also doing it in an acceptable time frame.

Measuring Reliability is a painful ask and brings up the topic of Service Level Indicators (SLIs).

Service Level Indicators could be one or many of the following (a small measurement sketch follows the list):

  • Latency
    Was a given request served within the agreed-upon time? Given you have no control over the network infrastructure used by the requester, this is usually the time between a request being received and its response being returned.
  • Error Rate
    The number of failed requests. The definition of failure, however, is subjective; for an HTTP service this could be all requests with a status code of 500 and above.
  • Throughput
    The number of valid requests that can be served within a time frame.
  • Availability
    Overall uptime of the system within an aggregated window.
  • Correctness
    A system could be doing all of the above while the data being served is incorrect. For example, a service that computes Pi to some obscure number of places: the value is served within the SLA, but the precision is off.
  • Freshness
    Think of another system that does all of the above, but its cache-busting logic is malfunctioning, so it continues to serve stale data.
  • Elasticity
    The system behaves perfectly when there are few concurrent requests but breaks apart the moment concurrent requests exceed a certain value.
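Here is the small measurement sketch promised above: given a made-up request log, it derives an error rate and a latency SLI. The record layout and the 100 ms threshold are illustrative assumptions, not our actual schema.

```python
# Derive two of the indicators above from a (made-up) request log.
requests = [
    # (status_code, latency_ms) — synthetic data for illustration
    (200, 45), (200, 80), (500, 120), (200, 30), (503, 250), (200, 95),
]

LATENCY_SLO_MS = 100  # assumed threshold

total = len(requests)
errors = sum(1 for status, _ in requests if status >= 500)           # Error Rate
within_slo = sum(1 for _, ms in requests if ms <= LATENCY_SLO_MS)    # Latency

print(f"error rate:  {errors / total:.1%}")
print(f"latency SLI: {within_slo / total:.1%} of requests served within {LATENCY_SLO_MS} ms")
```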

Service Level Indicators have two aspects to them: Specification and Implementation.

Specification is provided by Product teams while Implementation is driven by the Reliability Team.

A key thing to note is that Indicators cannot be measured absolutely. For example, you cannot have an error rate since inception. You need an agreed quantum of time over which the metrics are aggregated. The choice of aggregation function is unique to the product: say, one could take a simple average (mean) of response times. Google and many other companies prefer percentiles.

99% of requests were served within 100ms whereas 98% were served within 10ms.
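A quick sketch of how a percentile statement like that could be computed from raw response times; the synthetic latencies and the nearest-rank definition are illustrative assumptions.

```python
# Compute latency percentiles from raw response times (milliseconds).
import math
import random

random.seed(42)
# Synthetic, exponentially distributed latencies standing in for real measurements.
latencies_ms = sorted(random.expovariate(1 / 20) for _ in range(10_000))

def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of observations at or below it."""
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[max(0, rank - 1)]

print(f"p98: {percentile(latencies_ms, 98):.1f} ms")
print(f"p99: {percentile(latencies_ms, 99):.1f} ms")
```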

Also, the larger the window, the more complex the storage gets. An aggregate over the past two years would require two years’ worth of data points to be reliably retained. Recursive reliability, anyone?

Interval is another important point of consideration. The interval at which these Indicators are captured has to be just right. Going too low would cause artificial load on the service, and going too high would miss important system heartbeats. Here’s an example: an I/O wait check interval of 5 minutes would only get the state every 5 minutes, and probably fail to capture any intermittent disk subsystem wait. But you do not want to query top every 5 seconds either, which may result in top suggesting 100% CPU utilization because of top itself, busy doing a stat of /proc/<pids>.
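As a toy illustration of that trade-off, here is a sampler that polls a cheap metric at a fixed interval. The 30-second interval and the choice of /proc/loadavg are arbitrary assumptions, not a recommendation: too small an interval and the sampler itself becomes the load, too large and short-lived spikes slip between samples.

```python
# Toy sampler: poll a cheap system metric at a fixed interval (Linux only).
import time

INTERVAL_SECONDS = 30  # arbitrary choice for illustration

def read_loadavg() -> float:
    """1-minute load average from /proc/loadavg."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

while True:
    print(time.strftime("%H:%M:%S"), read_loadavg())
    time.sleep(INTERVAL_SECONDS)
```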

One of the final considerations is that Indicators behave differently under different degrees of load. So one of the agreed-upon criteria should be the load, or Elasticity, of the system. Not all indicators degrade linearly as load increases, so it is important to measure Degradation as well.

In the upcoming posts, we will cover the technology and code (released as open source by Trusting Social Engineering) that help us achieve these Indicators and Objectives.

Here’s a sample form that we ask Product Owners to fill out for services that qualify to be reliable.
