How did Google come up with the Site Reliability Engineering (SRE) role?
Historically, companies have hired sysadmins to run complex systems.
This systems administrator approach involves assembling existing software components and deploying them to work together to produce a service. The sysadmins are then tasked with running the service and responding to events and updates as they occur. As the system grows in complexity and traffic volume, events and updates become correspondingly more frequent, and the sysadmin team has to grow to absorb the additional work.
The sysadmin model of service management has several advantages. For companies deciding how to run and staff a service, this approach is relatively easy to implement: because it is a familiar industry paradigm, there are many examples to learn from and imitate. A relevant talent pool is already widely available. An array of existing tools, software components (off-the-shelf or otherwise), and integration companies is available to help run these assembled systems, so a novice sysadmin team does not need to reinvent the wheel and design a system from scratch.
Direct and indirect costs of the sysadmin approach
The sysadmin approach and the accompanying development/ops split has a number of disadvantages and pitfalls. These fall broadly into two categories: direct costs and indirect costs.
Direct costs are neither subtle nor ambiguous. As services and/or service traffic grow, it becomes expensive to run services with teams that rely on manual intervention for change management and incident handling, because the size of the team will inevitably expand with the load generated by the system.
The indirect costs of the development/operations split can be subtle, but they are often more expensive to the organization than the direct costs. These costs arise from the large differences in the background, skill set, and incentives of the two teams. They use different vocabulary to describe situations; they carry different assumptions about the risks and possibilities of technical solutions; they have different assumptions about the target level of product stability. The split between the groups can easily become one of not just incentives, but also communication, goals, and eventually trust and respect.
Therefore, traditional operations teams and their counterparts in product development often end up in conflict, most visibly over how quickly software can be released to production. At its core, the development team wants to launch new features and see them adopted by users. At its core, the operations team wants to make sure the service does not break while they are holding the pager. Because most outages are caused by some kind of change, such as a new configuration, a new feature launch, or a new type of user traffic, the goals of the two teams are fundamentally in tension.
Google’s solution: Site Reliability Engineers
In 2003, Benjamin Treynor Sloss joined Google. One of his first tasks was to run a “Production Team” of seven engineers. Until then, his entire career had been in software engineering, so he designed and managed the group the way he would want it to work if he were an SRE himself. That group has since matured into Google’s present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.
SRE is what happens when you ask a software engineer to design an operations team
SRE Building blocks
The main building block of Google’s approach to service management is the composition of each SRE team. Overall, SREs fall into two main categories.
About 50–60% are Google software engineers, hired through the standard procedure for Google software engineers. The other 40–50% are candidates who came very close to the Google software engineering qualifications and who also have a set of technical skills that is useful to SRE but rare among software engineers. By far, UNIX system internals and networking expertise are the two most common alternative technical skills Google looks for.
What all SREs have in common is the belief in, and aptitude for, developing software systems to solve complex problems. Within SRE, Google has closely tracked the career progress of both groups and, so far, has found no practical difference in performance between engineers from the two tracks. In fact, the diverse background of an SRE team frequently results in clever, high-quality systems that are clearly the product of several skill sets combined.
The result of their approach to hiring for SRE is that they end up with a team of people who:
- will quickly become bored by performing tasks by hand
- have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated
As a result, SRE also ends up sharing an academic and intellectual background with the rest of the development organization. SRE is therefore fundamentally doing the work that an operations team has historically done, but with engineers who have software expertise and who are both inclined and able to design and implement automation in software to replace human labor.
How is an SRE team managed?
By design, it is essential that SRE teams focus on engineering. Without continuous engineering, the operational burden increases, and the team needs more people just to keep up with the workload. Eventually, a traditional operations-centric team grows linearly with the size of the service: if the product supported by the service succeeds, the operational burden grows with the traffic. This means hiring more people to perform the same tasks again and again.
To avoid this fate, the team responsible for managing a service needs to write code or it will drown. Therefore, Google places a 50% cap on the total “operations” work of all SREs: tickets, on-call, manual tasks, and so on. This cap ensures that the SRE team has enough time in its schedule to make the service stable and operable. The cap is an upper bound, not a target; over time, left to its own devices, the SRE team should end up with very little operational load and spend almost all of its time on development tasks, because the service basically runs and repairs itself.
Google’s rule of thumb for SRE
Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development. So how do they enforce this threshold?
First, they must measure how SRE time is spent. With that measurement in hand, they can ensure that teams consistently spending less than 50% of their time on development work change their practices. Usually, this means shifting some of the operational burden back to the development team, or adding people to the team without assigning it additional operational responsibilities. Consciously maintaining this balance between operations and development work allows them to ensure that SREs have the bandwidth to engage in creative, autonomous engineering, while still retaining the wisdom gathered from the operational side of running a service.
Google SRE’s approach to running large systems has many advantages. Because SREs directly modify code in pursuit of making Google’s systems run themselves, SRE teams are characterized by rapid innovation and a broad acceptance of change. Such teams are also relatively inexpensive: supporting the same service with an ops-oriented team would require significantly more people. Instead, the number of SREs needed to run, maintain, and improve a system scales sublinearly with the size of the system. Finally, SRE not only avoids the dysfunction of the dev/ops split, but the structure also improves the product development teams: easy transfers between product development and SRE teams cross-train the entire group and improve the skills of developers.
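The measurement step described above can be sketched in a few lines of Python. This is only an illustration, not Google's tooling: the `TimeEntry` record, the category names, and the sample week are all hypothetical, and the only figure taken from the article is the 50% cap on operational work.

```python
from dataclasses import dataclass

# Hypothetical category names; the article only separates operational work
# (tickets, on-call, manual tasks) from development work.
OPS_CATEGORIES = {"ticket", "on_call", "manual_task"}
OPS_CAP = 0.5  # the 50% upper limit on operational work


@dataclass
class TimeEntry:
    engineer: str
    category: str  # e.g. "ticket", "on_call", "manual_task", "development"
    hours: float


def ops_fraction(entries):
    """Return the share of logged hours spent on operational work."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    ops = sum(e.hours for e in entries if e.category in OPS_CATEGORIES)
    return ops / total


def exceeds_ops_cap(entries, cap=OPS_CAP):
    """True when operational work takes more than `cap` of the logged time."""
    return ops_fraction(entries) > cap


# Made-up sample data for one week of a two-person team.
week = [
    TimeEntry("alice", "on_call", 12),
    TimeEntry("alice", "development", 28),
    TimeEntry("bob", "ticket", 20),
    TimeEntry("bob", "manual_task", 6),
    TimeEntry("bob", "development", 14),
]

if exceeds_ops_cap(week):
    print("Over the cap: shift operational load or add people")
else:
    print(f"Within the cap: {ops_fraction(week):.1%} operational work")
```

A team that stays over the cap week after week is the signal to act on, for example by handing some operational work back to the development team.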
Despite these net gains, the SRE model is characterized by its own distinct set of challenges. One continual challenge Google faces is hiring SREs: not only does SRE compete for the same candidates as the product development hiring pipeline, but the fact that they set the hiring bar so high in terms of both coding and system engineering skills means that their hiring pool is necessarily small.
The 3 main questions for becoming a Google SRE
According to an interview with Benjamin Treynor Sloss, Google mainly looks at the following 3 points in a candidate.
- Do they have a natural tendency toward automation?
- Can they handle complexity? (scalable system design)
- Do they have a curiosity about how things work?
Site reliability engineering represents a significant break from existing industry best practices for managing large, complex services. Originally motivated by familiarity (“as a software engineer, this is how I would want to invest my time to accomplish a set of repetitive tasks”), it has become much more:
- a set of principles
- a set of practices
- a set of incentives
- a field of endeavor within the larger software engineering discipline