Why SRE is important

Adservio, IT quality experts.
Published in ADSERVIO
9 min read · May 31, 2021

What is SRE?

Site Reliability Engineering (SRE) has become an important pillar in narrowing the gap between developers and IT operations. It closes the distance between what actually happens in the software and what we want to happen, in order to guarantee an excellent user experience. It is a discipline that covers a set of principles and practices for solving infrastructure and operational problems.

The concept was created so that complex operational problems could be solved with sound engineering solutions. SRE at Google centers on securing, sharing, and creating programs and frameworks for all of Google's open services, which in turn demonstrates a commitment to high availability, greater reliability, and better performance of those services.

Ben Treynor, who founded SRE at Google, defines it in this interview as:

“Fundamentally, it’s what happens when you ask a software engineer to design an operations function… So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.”

SRE goals are:

  • Monitor and improve the reliability of systems
  • Anticipate failures and solve them
  • Automate operations tasks
  • Improve speed and performance

Historical Background:

According to Google's SRE book, before SRE existed Google ran its operations with a system administrator approach. System administrators handled the operations side, whereas engineers worked on the development side. However, because the two groups had different backgrounds, skills, and perspectives, this approach created division and conflict between them. Developers were mainly concerned with creating new features, while operations staff were concerned with the efficiency and reliability of the software. This is how SRE was born.

The story of SRE started in 2003 when Benjamin Treynor, who coined the term, was put in charge of a small group of seven engineers. The group's job was to make sure that Google's websites and services were accessible, dependable, and highly reliable.

As Benjamin was a software engineer, he managed the team so that developers and IT operations were part of one team. He did this by having team members spend part of their time on operations tasks, so that they better understood how things worked on the operations side and could see the production process. As he said:

SRE is “what happens when a software engineer is tasked with what used to be called operations.”

As a result, this team became the first SRE team, and SRE is now considered the bridge between developers and IT operations.

When they started, Google was much smaller and SRE was a group of only a few people. Today it counts more than 2,500 employees worldwide.

The SRE team was created to keep pace with the growth of development. As Andrew Widdowson (SRE at Google) said:

Our work is like being a part of the world’s most intense pit crew. We change the tires of a race car as it’s going 100mph

Importance of SRE :

Monitoring and improving the reliability of systems:

Monitoring is central to the growth of a business. It helps you save money and time, which is why it is so important to choose your software wisely. An SRE team's job is to anticipate, and alert on, situations that require more attention. Their ultimate goal is to build tooling that helps fix the weaknesses that were diagnosed.

This is a bit different from what DevOps does, which is basically a responsive approach to infrastructure that enables faster development of new products. To clear up any confusion between the two: DevOps and SRE complement each other. DevOps is about the theory and philosophy, while SRE is a prescriptive way of achieving that philosophy.

Troubleshooting escalation issues :

SRE teams are responsible for providing good customer service. They are also expected to deal with technical incidents, handle escalations, and cooperate directly with clients. Yet the goal of an SRE team is to reduce critical incidents and manual work. This allows IT support and development teams to worry less about escalations and to concentrate on implementing new services and building new features.

Case Study: Adopting SRE Principles at StackOverflow

In this section we cover the experience of adopting SRE principles at StackExchange.com/StackOverflow.com, as shared by Tom Limoncelli, a former SRE at Google who now works at Stack Exchange (home of ServerFault.com and StackOverflow.com).

For those who do not know Stack Exchange, it is a question-and-answer website like Quora. The difference between the two is that Quora is a single website, while Stack Exchange is a network of many websites, each dedicated to one field of knowledge. It is considered to be the biggest, most trusted online community for engineers and developers to build their knowledge, share information, and advance their careers.

In 2012, rather than aiming simply to be the biggest, Stack Exchange set up an SRE team to make its websites faster and to guarantee a better user experience. The question now is: how did they do that?

The SRE team's work was divided into two categories, which Mr. Limoncelli called the easy ones and the hard ones. We will tackle the easy ones first, then the hard ones.

As they were a small team with a low budget, the firm treated developers and SREs as members of one group: there was no separate budget, and SREs and developers shared a common staffing pool.

When handling outages and incidents, they aimed for a maximum of two events per on-call shift. The purpose is to fully understand the origin of each incident in order to anticipate and prevent further issues. Every outage is an opportunity to learn and to improve the system.

They succeeded in this by holding a post-mortem after every incident and by sharing the results with the members of the team and with the developers as well. Tom Limoncelli stressed how important this was for the team.

Now we will cover the hard category:

One of the hardest challenges they encountered was SLA-driven operations and monitoring. Defining SLAs was difficult because:

  • They are complex.
  • They require communication with many departments: engineering, product management…

They handled monitoring by building their own monitoring tool, Bosun (www.bosun.org). It was presented as the first monitoring system with an IDE (integrated development environment) for designing and developing complex alerts, and it lets users write formulas for those alerts. In addition, the system uses OpenTSDB for storage, has an agent rewritten in Go that works on Linux and Windows, and can use Graphite as a back end.

This tool helped in several ways:

  • The IDE made it easier for developers to create their own alerts on performance behavior they did not understand, for which SREs would eventually find a solution.
  • The library made it easy for developers to know what data must be collected.
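
To make the idea of an alert defined by a formula more concrete, here is a minimal Python sketch. It is not Bosun's expression language and does not reflect how Bosun is implemented; the names `Alert` and `window_avg`, the threshold, and the sample values are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class Alert:
    """A named alert whose condition is a formula over recent samples."""
    name: str
    formula: Callable[[Sequence[float]], float]  # reduces samples to one value
    threshold: float

    def fires(self, samples: Sequence[float]) -> bool:
        # The alert fires when the formula's result exceeds the threshold.
        return self.formula(samples) > self.threshold


def window_avg(n: int) -> Callable[[Sequence[float]], float]:
    """Build a formula that averages the last n samples."""
    return lambda samples: mean(samples[-n:])


# Hypothetical alert: average CPU usage (percent) over the last 3 samples above 80.
high_cpu = Alert(name="high_cpu", formula=window_avg(3), threshold=80.0)

cpu_samples = [35.0, 40.0, 85.0, 90.0, 95.0]
print(high_cpu.fires(cpu_samples))      # last 3 samples average 90.0
print(high_cpu.fires(cpu_samples[:3]))  # last 3 samples average about 53.3
```

Keeping the formula separate from the threshold is one possible design: a developer can swap in a different formula (a maximum, a rate of change) without touching the alerting logic.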

SLAs made it easier to know the goal behind an action or issue, to see what the gaps are, what to prioritize, and when to stop. As Tom Limoncelli puts it:

It isn’t a service if it isn’t monitored against an SLA. If there is no SLA-Based monitoring then you are just running software.

This step is indispensable: SLAs are considered the foundation of managing a service.
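
As a rough illustration of what monitoring a service against an SLA can mean in practice, the Python sketch below compares measured availability with a target. The 99.9% target, the request counts, and the function names are assumptions made for this example; they are not figures from Stack Exchange.

```python
def availability(successful: int, total: int) -> float:
    """Fraction of requests served successfully in the measurement window."""
    if total == 0:
        # No traffic in the window: treated here as meeting the target (an assumption).
        return 1.0
    return successful / total


def meets_sla(successful: int, total: int, target: float) -> bool:
    """True if measured availability is at or above the SLA target."""
    return availability(successful, total) >= target


SLA_TARGET = 0.999  # hypothetical target: 99.9% of requests succeed

# Two hypothetical measurement windows of one million requests each.
good_window = meets_sla(successful=999_500, total=1_000_000, target=SLA_TARGET)
bad_window = meets_sla(successful=998_000, total=1_000_000, target=SLA_TARGET)
print(good_window, bad_window)
```

With an explicit target like this, the gap between measured and expected availability becomes a number the team can prioritize against, which is the point of the quote above.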

The last point was controlling operations overload, that is, how Stack Exchange handled requests. They managed the flow of requests by:

  1. Building systems rather than handling each request by hand, to save time. For example, they didn’t create monitoring rules for devs; they created a system that drives devs to be self-sufficient. As a result, devs can develop their own monitoring rules and can select what needs to be collected by SREs.
  2. Hiring only coders, capping SRE operational load at 50%, and sharing 5% of ops work with dev teams.
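
The 50% cap in the list above comes down to simple arithmetic. The Python sketch below shows one way a team might check it; the weekly hours and function names are hypothetical, and only the 50% figure comes from the text.

```python
OPS_LOAD_CAP = 0.50  # from the text: cap SRE operational load at 50%


def ops_load(ops_hours: float, total_hours: float) -> float:
    """Share of working time spent on operational work (tickets, on-call, toil)."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return ops_hours / total_hours


def over_cap(ops_hours: float, total_hours: float, cap: float = OPS_LOAD_CAP) -> bool:
    """True if operational work exceeds the agreed cap."""
    return ops_load(ops_hours, total_hours) > cap


# Hypothetical weeks for one SRE working 40 hours.
print(over_cap(ops_hours=18, total_hours=40))  # 45% ops load
print(over_cap(ops_hours=26, total_hours=40))  # 65% ops load
```

When the check comes back true, the cap gives the team a concrete reason to push work back to development teams or to automate it.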

Finally, thanks to this SRE strategy, they improved the speed and efficiency of the websites and were eventually able to eliminate the ticket system. Instead, they assigned one SRE to each development team as a sort of representative. The rollout of these steps was a success, and it freed up time for the SRE team to focus on service development, optimization, and other projects.

SRE Teams :

SRE teams are made up of software engineers who use tools (Git, Jira, Maven…) to improve the reliability of their systems. SREs are in demand today, but the job is not easy, as it requires the following hard and soft skills:

  • Passion for technology and an excellent background in software development and IT operations
  • Excellent verbal and written communication skills
  • The ability to direct and solve problems
  • Belief in teamwork and mutual trust
  • Capacity for self-teaching

For those who are considering becoming an SRE, here is an SRE compatibility quiz:

  1. Do you like communicating and sharing what you know with others?
  2. Do you enjoy thinking outside the box and performing in-depth analysis of the possible risks that a service may encounter?
  3. Do you like watching services in production and analyzing their performance?
  4. Can you work under pressure and stay calm when needed?
  5. Do you dislike manual work and enjoy automating everything?
  6. Can you spend time rotating through on-call responsibilities?
  7. Can you manage incidents by using monitoring and metrics?
  8. Are you passionate about engineering and software, and eager to learn new things?
  9. Do you enjoy teamwork?
  10. Are you someone who is always looking to improve their knowledge?

If you are a software engineer thinking of becoming an SRE, then I encourage you to pursue it. SRE is still a new discipline and is actively evolving. It is the ideal combination of skills for tightening the relationship between IT and developers, resulting in shorter feedback loops, improved communication, and more dependable applications.

Conclusion:

Site Reliability Engineering is today at the core of successful businesses, and it is expected that in the coming years SRE teams will become one of the most common kinds of software team used to maintain the efficiency and reliability of systems.

It has grown into a worldwide community, and its future, as Benjamin Treynor, the founder of SRE, said, is unpredictable:

“I have a lot of ideas, but the truth is I don’t really know for certain. Now that SRE has grown to be a global community and not just a Google thing, the future of the profession is determined by the progress and innovations of everyone in SRE. And, if you had asked me in 2004 what would have become of the SRE team, I would not have predicted this. I thought it was going to be a very small niche part of computer science. So what I do predict is that the community of smart and motivated people who now make up the global SRE will continue to come up with new ideas, they will continue to innovate and to advance the state of the art and advance what is possible.”



An international company based in Paris. We guide and help companies build efficient and reliable IT architectures. www.adservio.fr/.