Evolution: From DevOps to the SRE paradigm

Published in Globant, Apr 28, 2023

This article is intended for readers who want a clearer path and set of competencies when moving from a classic DevOps role into a Site Reliability Engineer (SRE) role. We will cover general guidelines, challenges, and considerations for achieving a successful transformation.

According to Gartner research from 2022, by the end of 2027 about 75% of companies will be using organization-wide SRE practices to optimize product design, costs, and operations. The figure below shows the current progress toward SRE adoption:

Source: Global SRE Pulse 2022 — DevOps Institute — devopsinstitute.com/global-sre-pulse

DevOps vs SRE

DevOps is a culture that promotes cohesion between development and operations, breaking down the silos created when these areas work in isolation. The team is responsible for developing, assuring quality, maintaining security, deploying, and monitoring.

DevOps as a Service gives the ability to get DevOps frameworks and tools fast and on-demand for each project in private, hybrid, and public multi-cloud environments.

DevOps principles

DevOps is an approach to software development that emphasizes teamwork and effective communication between development and operations teams. To achieve this goal, there are several key principles that are crucial to follow as shown in the picture.

SRE principles

Site Reliability Engineering (SRE) is a collection of principles and practices that blend software engineering concepts with infrastructure and operations expertise to address operational challenges. The primary objectives of SRE are to design and implement software systems that are scalable and exceptionally reliable.

SRE leverages operations data and software engineering to automate IT operations tasks and accelerate software delivery while minimizing IT risk.

The SRE practice is based on the following key principles:

The above principles tie in with the mindset in the picture below:

To keep operations running as efficiently as possible, it is essential to build a culture of trust and blamelessness across the organization.

SRE is not only about technical knowledge; everyone, from engineers to stakeholders, needs to embrace the cultural change and motivate others to do the same.

Automating everything is driven by deploying every resource through infrastructure as code (IaC) and having the right Continuous Integration and Continuous Delivery tools.

SRE aims to minimize the impact and cost of failures by implementing engineering principles, automation, and monitoring to quickly detect and resolve issues. Some strategies to achieve this goal include designing fault-tolerant systems, automating recovery processes, conducting blameless post-mortems, using monitoring and alerting mechanisms, embracing testing and deployment best practices, and implementing cost optimization measures. These approaches help to reduce the likelihood and impact of failures while improving system reliability and cost efficiency.

The table below shows the most relevant differences between a DevOps engineer and a Site Reliability Engineer:

SRE Benefits

Site Reliability Engineering (SRE) practices offer numerous advantages for organizations aiming to enhance the reliability, availability, and performance of their software systems. SRE promotes proactive monitoring, incident response automation, and continuous improvement by incorporating software engineering practices into operations. These practices can result in improved system uptime, faster incident resolution, increased scalability and efficiency, better collaboration between teams, risk mitigation, and a culture of continuous improvement. These benefits can lead to higher customer satisfaction, reduced downtime, increased operational efficiency, and improved overall business outcomes:

Reduced Mean Time to Recovery (MTTR): The SRE team is tasked with maintaining the availability and stability of production systems. In case of a bug or production failure, SRE teams can initiate rollbacks to a previous stable version of the product to minimize the Mean Time to Recovery (MTTR) and restore normal operations promptly.
Reduced Mean Time to Detect (MTTD): The SRE team also endeavors to minimize the Mean Time to Detect (MTTD) by implementing canary rollouts, which enable early detection of issues with a limited number of impacted users. This approach allows the SRE team to swiftly identify and address problems in their initial stages, reducing the overall MTTD and improving system reliability.
Automated Functional and Non-Functional Testing in Production: The Core Development team can automate functional and non-functional testing in test and stage environments, but not in production. Site Reliability Engineers (SREs) can assist in implementing automated testing in production environments without disrupting end users, so that testing and monitoring can be conducted safely and reliably.
Automated Everything: Automation is one of the greatest challenges the SRE team will encounter. It is common to see rollouts and supporting tasks being done manually, which leads to inconsistency and increases the probability of human error. One of the best practices for managing infrastructure is IaC (Infrastructure as Code) with tools like Terraform and CloudFormation, plus additional automation tools such as Puppet, Chef, and Ansible.
Automation is widely regarded as a critical element in reducing manual labor, resulting in faster and more efficient internal processes.
Toil can be detrimental to the productivity and well-being of SREs, as it can lead to burnout and hinder their ability to focus on more strategic and impactful work.
Some best practices for reducing toil are:
- Automate repetitive tasks
- Use self-healing mechanisms
- Invest in monitoring and alerting
- Implement Infrastructure-as-Code (IaC)
- Continuously improve processes
On-Calls and Incident Documentation: Part of being a reliability engineer is taking on-call duty and managing unexpected incidents. This includes documenting each incident with detailed troubleshooting steps, which gives others on call a valuable source of information. Over time, the SRE team builds a repository of incident-related knowledge that makes troubleshooting more efficient.
Shared Knowledge: Creating and maintaining a comprehensive knowledge base encompassing the entire product development ecosystem, including development, testing, staging, and production, can greatly benefit reliability engineers in anticipating potential issues in the production environment. Regular updates to the knowledge base by SRE together with DevOps can bridge the knowledge gap between the teams, leading to improved reliability and smoother operations.
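
To make the MTTD and MTTR benefits above measurable, both can be derived from incident timestamps. The following is a minimal sketch, not a standard implementation: the `IncidentRecord` type and its field names are hypothetical, and it assumes both metrics are measured from the moment the fault started (some teams measure MTTR from detection instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentRecord:
    """Hypothetical record of one incident's key timestamps."""
    started: datetime   # when the fault began affecting users
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when normal operation was restored


def _mean_delta(deltas):
    """Average a non-empty list of timedeltas."""
    return timedelta(seconds=mean([d.total_seconds() for d in deltas]))


def mttd(incidents):
    """Mean Time to Detect: average of (detected - started)."""
    return _mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents):
    """Mean Time to Recovery: average of (resolved - started)."""
    return _mean_delta([i.resolved - i.started for i in incidents])


# Illustrative data only.
t0 = datetime(2023, 4, 1, 12, 0)
history = [
    IncidentRecord(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=60)),
    IncidentRecord(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=40)),
]
print("MTTD:", mttd(history))
print("MTTR:", mttr(history))
```

Tracking these two numbers over time is one way to check whether canary rollouts and automated rollbacks are actually paying off.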

SRE Operational model

The SRE operational model differs considerably from the traditional approach.

In a traditional model, developers create an application and, once their code has been committed, typically view their work as done. Sysadmins then take charge of deploying the build artifacts (which could be just the code, in the case of an interpreted language) to production servers. Their primary responsibility is to ensure the application runs seamlessly and to oversee the production environment as a whole.

The SRE model has a different approach, having the following characteristics:

Integrated DevOps Construct: Integrating Dev and Ops results in no handovers and a common focus.
Proactive Assurance of Stability and Continuous Improvements: Teams focus on proactively identifying improvement areas and assuring stability.
Highly Skilled and Multi-Skilled Team: Having a team of highly skilled Site Reliability Engineers allows increased automation and system improvements.
Faster Resolutions and Collaborative Working Model: Automation and collaboration become cultural traits, increasing teams' operating efficiency and enabling faster resolution through automation and collaboration tools.
Transparent Operations & Decisions based on Service-Level Objectives: Increased transparency on incident summaries, self-service reporting, and fact-based joint decision-making driven by SLOs.

The image below shows a simplified SRE model.

Starting the Journey

When organizations embark on the journey of implementing Site Reliability Engineering (SRE) practices, they open the door to a transformational approach for enhancing the reliability and performance of their software systems. By embracing SRE, organizations can improve system stability, availability, and efficiency, leading to enhanced uptime, faster incident resolution, improved scalability, streamlined collaboration, effective risk mitigation, and operational excellence. As organizations begin their SRE journey, they set the stage for elevated customer satisfaction, minimized downtime, increased operational efficiency, and optimized business outcomes.

SRE Practices Adoption

Assess the cultural impact in your organization: The SRE model is a framework that emphasizes the importance of both software development and IT operations teams working together to achieve greater reliability, scalability, and performance in the software systems they support.
Select the appropriate model: To choose the appropriate SRE model for an organization, factors such as its size, complexity, and specific requirements need to be considered. There are several SRE models available, including fully integrated, hybrid, embedded, and platform SRE models.
Automation: Implement automation and minimize manual work so that effort goes to tasks that bring long-term value to the system.
Ownership and knowledge sharing: Share ownership with developers by using the same tools and techniques across the stack.

Additional Concepts

There are some concepts you need to be familiar with in order to be successful when moving into SRE practices.

This section will help the readers to fill the gaps about Service Level definitions before moving into the implementation section. It can also enrich the reader’s comprehension of the subject matter by introducing them to related concepts and terminology.

Service Level Objective (SLO): SLO is a performance target set by a service provider to ensure the desired level of quality and reliability of their service. It is a quantifiable measure that outlines the expected performance or availability of the service over a specified timeframe.
Service Level Agreement (SLA): SLA is a formal agreement between a service provider and its customers that outlines the terms and conditions for delivering the service. It includes specific commitments and guarantees related to service performance, availability, and quality, and serves as a contractual framework for managing and measuring service delivery.
Service Level Indicator (SLI): SLI is a measurable indicator used to evaluate the performance or quality of service. It typically involves tracking and monitoring specific metrics or data points. SLIs are used to assess the actual performance of the service against the defined SLOs and provide insights for managing and improving the service quality.

In essence, SLOs are the performance targets, SLAs are the formal agreements that define service terms, and SLIs are the measurable indicators used to assess service performance. Together, they form a framework for ensuring and managing the quality of services provided by a service provider.
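
To illustrate how an SLI relates to an SLO, here is a minimal sketch that computes an availability SLI as the ratio of good requests to total requests and compares it with a target. The function names, the request counts, and the 99.9% target are illustrative assumptions, not values from any specific service.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        # Assumption: with no traffic there is nothing that violated the target.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Illustrative numbers: 999,500 successful requests out of 1,000,000.
sli = availability_sli(999_500, 1_000_000)
slo_target = 0.999  # hypothetical internal objective of 99.9%

print(f"SLI: {sli:.4f}, SLO target: {slo_target}, met: {meets_slo(sli, slo_target)}")
```

An SLA would typically sit below the SLO (for example, a contractual 99.5%), so that the internal objective is breached, and acted on, before the contractual commitment is.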

Implementing SLOs

Implementing Service-Level Objectives (SLOs) involves a systematic approach to defining, measuring, and managing the performance and reliability of a service or system. Here are some key steps to effectively implement SLOs.

Reliability Targets and Error Budgets Definition
SLI definition
SLI Implementation and measuring
Define starter SLOs based on SLIs
Stakeholders’ agreement
SLO and Error Budget policies (Documentation)
Automating Data Collection
Tools
Dashboards
Reports
Monitoring and alerts

Note: Keep in mind that SLO targets are not static; they are subject to change as you pursue improvements.
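
As a worked example of a reliability target and its error budget: the budget is simply the unreliability the SLO allows. For a 99.9% target over a 30-day window, that is 30 days × (1 − 0.999) = 43.2 minutes of allowed downtime. The sketch below expresses this in code; the function names and sample numbers are assumptions made for illustration.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Time-based error budget: allowed 'bad' time within the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return window * (1 - slo_target)


def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Request-based view: fraction of the error budget still unspent.

    A negative result means the budget has been overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    allowed_bad = (1 - slo_target) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad


print(error_budget(0.999, timedelta(days=30)))       # roughly 43 minutes
print(budget_remaining(0.999, 999_500, 1_000_000))   # about half the budget left
```

An error budget policy can then be stated in terms of this number, for example slowing down releases when the remaining budget approaches zero.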

Monitoring Strategy

Metrics and structured logging are the two data sources that we consider best for SRE monitoring purposes. From an SRE perspective, the main purposes of monitoring are to:

Alert on conditions that require attention.
Investigate and diagnose the identified issues.
Display information about the system visually.
Obtain insight into trends related to resource usage or service health for long-term planning.
Compare a system before and after changes are applied.
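
To show what structured logging can look like in practice, here is a minimal sketch using Python's standard `logging` module to emit one JSON object per log line. The `JsonFormatter` class, the `fields` key, and the sample values are hypothetical choices for this example; real projects often use a dedicated library instead.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "fields" is a hypothetical key passed via logging's `extra` argument.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Key/value fields make the event easy to filter, aggregate, and alert on.
logger.error(
    "payment request failed",
    extra={"fields": {"status_code": 502, "latency_ms": 1840}},
)
```

Because every entry has the same machine-readable shape, the same logs can feed alerts, dashboards, and before/after comparisons without fragile text parsing.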

Key Features of Monitoring Strategy

The key features to be considered when defining a monitoring strategy are:

Speed (freshness of data and the speed of data collection).
Calculations (percentile-based, over long-term windows).
Interfaces (display time series data in graphs, multiple chart styles for data).
Alerts (define different categories of alerts to trigger proportional responses).
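
The "Calculations" and "Alerts" features can be sketched together: compute a latency percentile over a window of samples, then map it to an alert category so the response is proportional. The nearest-rank method, the category names, and the thresholds below are assumptions chosen for illustration, not recommended values.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical thresholds in milliseconds.
PAGE_THRESHOLD_MS = 1000    # wake someone up
TICKET_THRESHOLD_MS = 500   # handle during working hours


def classify(p99_ms):
    """Map a p99 latency to a proportional alert category."""
    if p99_ms >= PAGE_THRESHOLD_MS:
        return "page"
    if p99_ms >= TICKET_THRESHOLD_MS:
        return "ticket"
    return "none"


# Illustrative window of latency samples: 1 ms, 2 ms, ..., 100 ms.
window = list(range(1, 101))
p99 = percentile(window, 99)
print(p99, classify(p99))
```

Percentiles are usually preferred over averages here because a small share of very slow requests can hide behind a healthy-looking mean.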

Incident response model

The main roles in incident response are:

Incident Commander (IC): Leads the incident response.
Communications Lead (CL): Reports to the IC.
Operations or Ops Lead (OL): Reports to the IC.

When a disaster occurs, the individual who initiates the declaration of an incident usually assumes the role of Incident Commander (IC) and oversees the overall state of the incident. If roles are not explicitly assigned, the IC assumes those responsibilities by default. The IC also focuses on effective communication, maintaining control of the incident response, and collaborating with other responders to resolve the incident.

At some point, the IC may choose to pass on their role to another team member and take on the Operations Lead (OL) role themselves, or assign the OL role to a different team member.

The Operations Lead (OL) collaborates with the Incident Commander (IC) to respond to the incident, using operational tools to mitigate or resolve the issue. Meanwhile, the Communications Lead (CL) serves as the public face of the incident response team, providing regular updates to the responders and stakeholders and managing inquiries related to the incident.

Both the CL and OL may lead specialized teams to effectively manage their respective areas of incident response, which can be scaled up or down as necessary. In case the incident size decreases, the CL role can be absorbed back into the IC role.
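
The role rules described above (unassigned roles default to the IC, and the IC can hand over command) can be captured in a small data structure. This is only a sketch: the `IncidentResponse` class, its method names, and the people's names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentResponse:
    """Tracks who holds each incident role at a given moment."""
    commander: str                    # Incident Commander (IC)
    comms_lead: Optional[str] = None  # Communications Lead (CL)
    ops_lead: Optional[str] = None    # Operations Lead (OL)

    def owner(self, role: str) -> str:
        """Return who holds a role; unassigned roles fall back to the IC."""
        assigned = {"IC": self.commander, "CL": self.comms_lead, "OL": self.ops_lead}
        if role not in assigned:
            raise KeyError(f"unknown role: {role}")
        return assigned[role] or self.commander

    def hand_over_command(self, new_commander: str, become_ops_lead: bool = False) -> None:
        """Pass the IC role on; the previous IC may take the OL role."""
        previous = self.commander
        self.commander = new_commander
        if become_ops_lead:
            self.ops_lead = previous


response = IncidentResponse(commander="ana")
print(response.owner("CL"))   # no CL assigned yet, so the IC covers it

response.comms_lead = "luis"
response.hand_over_command("maria", become_ops_lead=True)
print(response.owner("IC"), response.owner("OL"), response.owner("CL"))
```

Setting `comms_lead` back to `None` models the case where the incident shrinks and the CL role is absorbed back into the IC role.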

Incident Lifecycle

The SRE incident life cycle is a framework for managing and resolving incidents that affect the reliability and availability of software systems. It typically involves the following stages:

Incident Postmortem Culture

The purpose of a postmortem report is to describe in detail the steps and actions that were most effective and those that need adjustment. The main goals of the document are to report the obstacles encountered, the actions taken to resolve the issue, and, where possible, how to prevent it from happening again.

The outcome of the postmortem should focus on which processes worked well and which did not, identify what needs adjustment, and provide action items.

Data Pipelines and Analysis

To collect and process data for analysis, we first need to define what has to be done. For that purpose, data pipelines must be defined:

Data transformation/event processing pipelines: The extract, transform, load (ETL) model is a common paradigm in data processing: data is extracted from a source, transformed, and possibly denormalized, and then “reloaded” into a specialized format.

The transformation phase can serve a variety of use cases, such as changing the data format to add or remove a field, computing aggregate functions across data sources, and applying an index to the data so that it has better characteristics for the serving jobs that consume it.
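
The ETL model can be illustrated with a very small pipeline. In the sketch below, the CSV content, the column names, and the in-memory `store` dictionary are all made up for the example: the extract step parses rows, the transform step drops an unneeded field and aggregates amounts per user, and the load step writes the result into a keyed store that acts as a simple index.

```python
import csv
import io
from collections import defaultdict

# Hypothetical source data.
RAW = """user,country,amount,debug_flag
alice,AR,10.5,0
bob,CO,4.0,1
alice,AR,2.5,0
"""


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: ignore the debug_flag field and sum amounts per user."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["user"]] += float(row["amount"])
    return dict(totals)


def load(totals, store):
    """Load: write the aggregates into a store keyed (indexed) by user."""
    store.update(totals)


store = {}
load(transform(extract(RAW)), store)
print(store)
```

Keeping the three stages as separate functions makes it easier to monitor and test each one on its own.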

Machine learning pipelines: Machine learning (ML) applications are used for a variety of purposes, like helping predict cancer, classifying spam, and personalizing product recommendations for users.

As a best practice, you must define the following objectives, based on the SLO, SLI, and error budget concepts:

Data freshness
Data correctness
Data isolation/load balancing
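
As one possible way to turn the data freshness objective into an SLI, the sketch below measures the fraction of datasets whose last update is recent enough and compares it with a target. The one-hour limit, the 90% target, and the timestamps are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def freshness_sli(update_times, now, max_age):
    """Fraction of datasets whose last update is no older than max_age."""
    if not update_times:
        return 1.0  # assumption: nothing to measure counts as fresh
    fresh = sum(1 for t in update_times if now - t <= max_age)
    return fresh / len(update_times)


now = datetime(2023, 4, 28, 12, 0)
last_updates = [
    now - timedelta(minutes=5),
    now - timedelta(minutes=30),
    now - timedelta(hours=3),    # stale
    now - timedelta(minutes=10),
]

sli = freshness_sli(last_updates, now, max_age=timedelta(hours=1))
slo_target = 0.9  # hypothetical: 90% of datasets updated within the last hour
print(f"freshness SLI: {sli:.2f}, SLO met: {sli >= slo_target}")
```

Data correctness can be tracked in the same style, for example as the fraction of records that pass validation.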

Considerations

As part of the process, you must define a Maturity Matrix that will give you a clear picture of where you stand (use the table below as an example):

Basic Maturity Matrix example (image by author)

Based on your current maturity status, you will be able to define the goals and objectives that will guide you toward the mindset change that SRE requires.

A clear definition of the team's roles and responsibilities, plus clear communication with other teams in your organization, is a must; without them, your path into SRE will fail.

Automation of processes is one of the most important considerations that will increase reliability and reduce manual intervention.

As part of the culture, you have to establish collaboration, which is key to evangelizing SRE within your team and across other teams.

Common mistakes

During your journey into SRE, you may run into some common mistakes that stem from not having a clear understanding or definition of the path to follow:

Overlooking the need for a cultural shift: Neglecting the cultural shift that SRE requires, which involves fostering a collaborative and proactive culture that encourages open communication, shared ownership, and a focus on reliability.
Insufficient monitoring and observability: Not investing enough in proper monitoring and observability, which can hinder the ability to detect and resolve incidents promptly and to proactively identify potential issues.
Excessive reliance on automation: Relying too heavily on automation without considering the need for human judgment and expertise, which can result in blind spots and potential issues if human intervention is overlooked.
Neglecting feedback loops: Failing to establish effective feedback loops between development, operations, and SRE teams, which are crucial for identifying and addressing issues promptly and driving continuous improvement.
Lack of clarity in roles and responsibilities: Not clearly defining and communicating the roles and responsibilities of SRE team members, leading to confusion and duplication of efforts, or responsibilities falling through the cracks.
Ignoring failure prevention: Neglecting proactive measures for failure prevention, such as conducting thorough root cause analysis and implementing preventive measures, can result in recurring incidents and missed opportunities for improvement.
Lack of continuous improvement: Failing to regularly review and optimize SRE practices, tools, and processes, which can lead to stagnation and decreased effectiveness over time, as SRE is an ongoing process that requires continuous improvement.

Final thoughts

To conclude, adopting SRE practices may present some challenges in implementation, but the benefits make it worthwhile. One of the key advantages of SRE is that it promotes a culture of collective responsibility, with everyone on the team working together to address failures. This approach avoids singling out individuals as the cause of the problem and instead focuses efforts on resolving the issue and preventing future incidents through proper documentation and implementation of necessary measures.

Moreover, implementing SRE can also foster a new mindset that prioritizes collaboration and communication, leading to better team dynamics and higher morale. Taking a long-term view can also help to reduce technical debt and minimize unplanned incidents, allowing for a greater focus on innovation and experimentation.

In summary, SRE is a valuable framework for modern software development and operations, offering a holistic approach that prioritizes reliability, scalability, and automation. By embracing SRE principles, teams can build more resilient systems and deliver better outcomes for their organizations and customers.