YAA — Yet Another Article on SRE
SRE stands for “Site Reliability Engineering”, and its motto is
“Hope is not a strategy”
I have embraced these methods and practices in my professional roles in recent years. Convinced of their efficiency, I naturally wanted to put them in place when I arrived at my new company. The best definition I can offer can be summed up as
SRE = human + processes
which I detail later in this article. But first of all, what is SRE? Is it like DevOps? Where does it come from, and what value does it bring to a company?
SRE originated with Ben Treynor, a VP of Engineering at Google, who wanted to end the age-old battles between development and operations teams, and to withstand traffic loads that grew exponentially every day. SRE therefore promotes product reliability, people empowerment and innovation. SREs write code like developers, but they are also system administrators with a software engineering focus. Typically, 30% of an SRE's work goes to operational issues or surprises, and the rest of the time is spent on projects and continuous improvement.
A team can work on several projects, and several teams can work on the same project. We encourage teamwork by forming squads or task forces that bring together different profiles, such as developers, SREs and product managers, to focus on a challenging problem for a fixed period of time. We use the same recruitment process for everyone, which usually leads us to hire people with a development background as DevOps and people with an operations background as SREs.
SREs are embedded in the product and development teams. This lets them mingle with the team, build mutual trust, understand the team's problems and share common methods and tools. We speak the same language, so a true bridge can form between the teams, especially when someone has to challenge the reliability of an application or correct development errors.
I firmly believe that, for an organization the size of our company, the SRE model is an effective way to move an engineering organization towards scalable, cross-functional operational efficiency.
SRE or DevOps? What is the SRE role?
SRE follows the DevOps philosophy, but DevOps focuses more on automation, while SRE adds the reliability of applications and systems. SREs are more concerned with finding problems and solving some of them themselves.
DevOps is a philosophy designed to build a healthy working relationship between the operations and development teams.
SRE aims to do something that neither system administrators nor developers were doing a few years ago:
incorporate scalability, reliability, and high availability directly into application development.
Human and SRE
And what about the human in all this?
The term “HumanOps” covers a set of principles that focus on the human aspects of keeping the infrastructure, and everything around it, running properly.
The health of the infrastructure concerns not only the hardware and the software but also, and especially, the humans who operate it. The goals are many: reducing the number of alerts, on-call fatigue and the stress caused by incidents, improving cross-team communication, and sharing both the good and the bad. With this in mind, when I arrived I set up one-on-one meetings every two weeks with each member of the teams. I asked everyone to communicate widely, tried to get rid of the heroic mindset (the famous “Hero”), set up a monthly team meeting, and worked to reduce alerts and on-call pages through daily follow-up and actions. In short: simplify, clarify and automate as early as possible.
Processes and SRE
If there is a guiding philosophy for SRE, it could be summarized as
“We are not firefighters or superheroes”.
We seek a global view of our platforms and control over them, and we bring in automation and simplicity as early as possible. We want to encourage good practices and common tools, and to improve reliability through SLAs and SLOs. Perfection is difficult to achieve (in fact, it does not exist), but every day we strive to do better than the day before. This is certainly one of the good ways to get close to it. We also always start from the premise that nothing is fully reliable: a server can go down, so can a service, and hardware failures are part of our daily life.
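To make the SLO idea concrete, here is a minimal sketch of how an availability target translates into a tolerated amount of failure. The notion of an “error budget”, the 99.9% target, the 30-day window and the function names are illustrative assumptions, not figures or tooling from our platform.

```python
def allowed_downtime_minutes(slo: float, period_days: int = 30) -> float:
    """Minutes of unavailability tolerated by an availability SLO over a period.

    `slo` is a fraction (0.999 means 99.9%). Both the target and the
    30-day default window are hypothetical values for illustration.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = period_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent, based on request counts.

    Returns 1.0 when nothing has been consumed and a negative value
    when the SLO has been breached.
    """
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been consumed
    allowed_failures = (1.0 - slo) * total_requests
    actual_failures = total_requests - good_requests
    if allowed_failures == 0:
        # A 100% target leaves no room for any failure.
        return 1.0 if actual_failures == 0 else float("-inf")
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(0.999):.1f} minutes of downtime allowed")
    print(f"{budget_remaining(0.999, 999_500, 1_000_000):.0%} of the budget left")
```

A number like this gives the monthly reporting something objective to track: as long as some budget remains, changes can go out calmly, and once it is spent, reliability work takes priority.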
This is where processes come in. Without them, there is no direction and no clear way forward: the teams feel lost, and both velocity and enjoyment collapse. We started by presenting a nine-month vision to the teams, which we then broke down into projects using the OKR methodology (Objectives and Key Results). These OKRs helped us build the kickoffs that launched each project, along with its associated tasks (“Epics” in the Agile world). These methods provide a framework while leaving freedom of action and room for change at any time.
We could call all of these methods “continuous correction”.
I was given the chance to promote these SRE concepts, and the teams really embraced them. Nothing is perfect, but the momentum we have today speaks for itself. The silos are shrinking, failure no longer causes panic, changes are made with greater serenity, automation is on everyone’s lips, and we now present clear numbers in a monthly report.