Site Reliability Engineering — An Introduction

Ash Powell
Glasswall Engineering
Sep 6, 2020

At the surface level, Site Reliability Engineers (SREs) apply software engineering concepts to infrastructure and operations challenges, with highly scalable and efficient solutions as the north star.

What is an SRE?

As the world has moved online, website usability, cloud software, and cloud computing have become vital business imperatives for everyone from e-commerce businesses to multinational banks and search engines.

This has changed how we handle applications and their workloads. Today, we rarely think of expensive, high-touch, powerful servers, but instead of rack upon rack of machines in data centres, pooled together by virtualisation, with distributed software architecture preventing the failure of any one server from causing downtime.

The emphasis has changed from hardware to software-defined architecture and from manual processes that are unreliable and prone to error to stable, efficient, and repeatable automated tasks.

Site Reliability Engineering is the discipline of managing this programmable infrastructure and optimising the performance of the workloads running on it. The Site Reliability Engineer job title emerged at Google shortly after the turn of the millennium, where the aim was to reshape the partnership between software developers and operations staff and help them work together to create stable, scalable systems, with quality improvements and automation as key principles.

The history of SRE at Google

Tracing SRE concepts back to their Google roots in the early 2000s offers a useful lesson in what the discipline is about.

“When I came to Google, I was lucky enough to be part of a team that was partly made up of people who were software developers and who were willing to use the software as a way to solve problems that had been solved by hand previously. So, it was normal to take the ‘all can be handled as a software issue’ approach and go with it when it was time to build a structured team to do this operational work” Ben Treynor said in a discussion on Google’s official blog.

“So SRE is essentially doing work that has previously been performed by an operations team but using software specialist engineers and depending on the fact that these engineers are both naturally predisposed to and capable of replacing human labour with automation,” Treynor adds.

Google is still very strict about how an SRE team is put together. All Google site reliability engineers must either be Google software engineers or “applicants who are very near to the credentials of Google Software Engineering.” They must also have expertise in infrastructure management, most probably “Unix system internals and networking (Layer 1 to Layer 3).”

SRE credentials tend to differ from enterprise to enterprise, but Google’s methodology is a good starting point as far as basic concepts go. The specifics will depend on the organisation’s business needs, existing processes, and the tech stack already in place.

Core job responsibilities of an SRE

Any good SRE is going to get obsessed with one specific thing: automation.

As Jason Qualman, an SRE at monitoring software provider New Relic, says in a blog article: “Much of this job is thinking about unnecessary and time-taking stuff that people do and putting an end to them as fast as possible. You’re thinking, ‘I’m going to take the time to automate this right now and save someone else from having to do this unpleasant thing’ instead of kicking a can down the road on manual work.”

Another key component of the SRE position is “release engineering”, which means identifying best practices to ensure effective and repeatable software releases.

Above all, SREs must know how to monitor systems and respond when things go wrong, constantly writing and rewriting playbooks to minimise the time required to repair any failure that might occur. At Google, this includes documenting each incident, identifying the root causes that contributed to it, and putting preventive measures in place.

“Writing a post-mortem isn’t retribution — it’s an opportunity for the whole organisation to learn,” Googlers John Lunney and Sue Lueder write in a contributed chapter of the Site Reliability Engineering book.

The difference between SREs and DevOps engineers?

I know what you’re thinking: that all sounds a lot like DevOps. As far as terminology is concerned, though, the SRE job description pre-dates the DevOps engineer by around five years.

Both are founded on similar concepts, and the difference is subtle but significant. Both ways of working involve breaking down barriers between developers and operations personnel, and both seek to improve the efficiency of developer teams while preserving the stability of the services they run.

The main distinction is that DevOps engineers tend to concentrate on promoting continuous delivery and developer velocity, while SREs take responsibility for stability and automation across the software lifecycle, with a focus on successfully delivering and monitoring releases and keeping software-defined architecture humming.

Why is SRE Important?

Is your company still not sure whether it should adopt SRE? Let’s look at some aspects that distinguish the role of site reliability engineer from other functions.

· Working with other engineers, product teams, and customers to establish targets and measures. That helps to ensure the availability of the system. You know very clearly when action should be taken because you have agreed on the uptime and availability expected of a service. This is achieved by employing Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

· Implementing error budgets to help you quantify risk, and thus balance reliability against feature development. Having an error budget ensures that failure is recognised as normal and that 100% availability is not the goal. With no unrealistic expectations set for reliability, a team has the flexibility to ship system upgrades and improvements.

· SRE believes in cutting back on toil. Thus, it aims to automate activities that require a human operator to work on a system manually. Google, for example, caps operational work at 50 per cent of a reliability engineer’s time, so that at least the other 50 per cent goes to coding and other engineering work rather than the care and feeding of existing applications.

· A site reliability engineer should have a holistic understanding of both the infrastructure and system connections.

· Ensuring that any site issues are diagnosed early to lower the cost of failure.

· Since SRE aims to solve problems between teams, the expectation is that the SRE teams and the development teams alike will have a holistic view of repositories, front end, back end, servers and other elements. Mutual ownership also means that no single team can jealously guard individual components.
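
The SLI, SLO and error budget ideas in the list above come down to a little arithmetic. The following Python snippet is a minimal sketch, not Google’s implementation: the function names, the 99.9% target and the request counts are all assumptions chosen for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the fraction of requests that were served successfully."""
    if total_requests == 0:
        # Assumption for this sketch: no traffic counts as fully available.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo  # e.g. about 0.001 for a 99.9% SLO
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    # Hypothetical numbers: one million requests, 500 of them failed.
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    slo = 0.999  # hypothetical target agreed with product teams
    print(f"SLI: {sli:.4%}")
    print(f"Error budget remaining: {error_budget_remaining(sli, slo):.1%}")
```

With these made-up numbers, roughly half of the budget is still available, which is the kind of signal a team could use to decide between shipping features and working on reliability.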

Should You Implement SRE for your Startup?

Well, yes, but note that every company has to tackle all kinds of different issues.

A company may not have the resources to employ a dedicated reliability team, or the time to reflect on every SRE practice. What is worth considering is that you don’t have to fully implement all of SRE’s methods and principles to get at least some of the advantages.

One important practice that I would suggest implementing immediately, though, is SLOs. This single number will prompt a great many team discussions. Start thinking about the numbers you can look at now; you can then find out whether you are spending more time than you need on building the ideal CI/CD pipeline, or whether you are doing too many manual tasks each day.
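
One number that is easy to start with is the share of the week that goes to manual work. Here is a rough sketch in Python, assuming you keep a simple log of tasks with the hours spent and whether each was done by hand. The `Task` type, the sample entries and the 50% threshold (borrowed from the Google figure mentioned earlier) are illustrative assumptions, not a standard tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Task:
    name: str
    hours: float
    manual: bool  # True if a human had to do it by hand


def toil_fraction(tasks: Sequence[Task]) -> float:
    """Share of logged hours that went to manual tasks."""
    total = sum(t.hours for t in tasks)
    if total == 0:
        return 0.0
    manual = sum(t.hours for t in tasks if t.manual)
    return manual / total


if __name__ == "__main__":
    # Hypothetical week for one engineer.
    week = [
        Task("restart stuck service by hand", 6, manual=True),
        Task("manual deployment", 10, manual=True),
        Task("feature development", 20, manual=False),
        Task("writing deployment automation", 4, manual=False),
    ]
    share = toil_fraction(week)
    print(f"Manual work this week: {share:.0%}")
    if share > 0.5:  # assumed threshold
        print("More than half the week is toil: automate something first.")
```

Even a log this crude gives the team a trend to discuss from one week to the next.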

You may have bigger problems, but you need to start measuring. It’s important to get feedback after each implementation or release, so concentrate on that.
