Reliability Engineer Vs DevOps Engineer

Manan B Shah

Published in

An Idea (by Ingenious Piece)

Dec 15, 2020

DevOps and SRE seem like two sides of the same coin.

DevOps Engineer:

The term ‘DevOps Engineer’ aims to blur the divide between Dev and Ops, and suggests that the best approach is to hire engineers who can be excellent coders as well as handle all the Ops functions.

Site Reliability Engineer (SRE):

According to Wikipedia, “Site Reliability Engineering is a discipline that fuses aspects of software engineering and applies that to IT operations problems. The main goals are to create ultra-scalable and highly reliable software systems.”

What is the difference between DevOps and SRE?

DevOps is all about combining development and operations, defining the behavior of the system and seeing what needs to be done to close the gap between the two teams. The theory behind DevOps describes what needs to be done to make the two teams work as one.

And according to Google, that’s where the main difference between DevOps and SRE lies. While DevOps is all about what needs to be done, SRE is about how it can be done. It’s about turning the theory into an efficient workflow, with the right work methods, tools and so on. It’s also about sharing responsibility between everyone, and getting everyone in sync with the same goal and vision.

Many people find it difficult to tell the two roles apart and assume that DevOps and SRE are two sides of the same coin, so below I list a few similarities and differences between them.

#1. Reduce Organizational Silos

Large enterprises usually have a complex organizational structure, with many teams working in parallel. Each team pulls the product in a different direction without communicating with the rest of the company and, as a result, fails to see the big picture. This can lead to frustration, setbacks in deployment and high costs due to delays.

DevOps’ job is to reduce the confusion these silos create, and to make sure there aren’t any teams within teams that are not aligned with the rest of the company. DevOps bridges the teams into one group with a shared vision.

SREs don’t talk about how many silos there are in the company, but about how to get everyone talking to each other. This is done by using the same tools and techniques across the company, which in turn helps share ownership across everyone.

#2. Accepting Failure

Although DevOps is about catching and handling issues before they cause failures, failure is something that we, unfortunately, can’t avoid. DevOps embraces this by accepting failure as something that is bound to happen, and which can help the team learn and grow.

In the world of SREs, this objective is met by having a formula for balancing accidents and failures against new releases. In other words, SREs want to make sure that there aren’t too many errors or failures, even if they are something we can learn from.
This formula is measured with two key indicators:

a. Service Level Indicators (SLIs)

b. Service Level Objectives (SLOs)

SLIs are measurements of how the service behaves over time, such as request latency, throughput in requests per second, or failures per request. SLOs are derived from them: an SLO is a threshold, expressed as a percentage or number, that defines how successful an SLI must be over a certain amount of time.
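To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `RequestWindow` type, the function names and the numbers are hypothetical, and the "error budget" (the share of failures an SLO still permits, a common SRE way of phrasing the balance between failures and new releases) is not a term used in this article.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one measurement window (hypothetical shape)."""
    total: int
    failed: int


def availability_sli(window: RequestWindow) -> float:
    """SLI: fraction of requests in the window that succeeded."""
    if window.total == 0:
        return 1.0  # assumption: a window with no traffic counts as available
    return (window.total - window.failed) / window.total


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: the measured SLI must be at or above the target threshold."""
    return sli >= slo_target


def error_budget_remaining(window: RequestWindow, slo_target: float) -> float:
    """Share of the failures allowed by the SLO that is still unspent.

    1.0 means nothing has been spent; a negative value means the SLO is blown.
    """
    allowed_failures = (1.0 - slo_target) * window.total
    if allowed_failures == 0:
        return 1.0 if window.failed == 0 else 0.0
    return 1.0 - window.failed / allowed_failures
```

For example, with `RequestWindow(total=1_000_000, failed=400)` and a target of `0.999`, the SLI is 0.9996, the SLO is met, and roughly 60% of the allowed failures are still unspent. A team could use that remaining share to decide whether it is safe to keep shipping new releases.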

#3. Implementing Quick Change

Companies want to move faster than before. They want frequent releases, continually updating the product and keeping team members on their toes about new and relevant technology.

DevOps is all for this change, but in a gradual and controlled way. Both DevOps and SREs want to move quickly, and Google points out that SREs emphasize reducing the cost of failure as they do so.

#4. Tooling and Automation

One of the main focal points for both DevOps and SREs is automation. Both roles encourage adding as much automation and tooling as possible, as long as it provides value to developers and operations by removing manual tasks.

#5. Measure Everything

An automated workflow that moves fast is something that needs constant monitoring. DevOps and SRE teams both need to make sure that they’re moving in the right direction, and they do so by measuring everything.

The main difference here is that SRE revolves around the concept that operations is a software problem, which led SREs to define prescriptive ways of measuring availability, uptime, outages, toil, etc.

SREs also ensure that everyone in the company agrees on how to measure reliability, and what to do when availability falls out of specification. This includes contributors at every level, from developers, through team managers and all the way up to VPs and executives.
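As an illustration of what an agreed way to measure availability might look like, the following Python sketch turns an availability target into a downtime allowance and checks recorded outages against it. The function names, the 30-day period and the outage durations are assumptions made for this example, not something prescribed by Google or by this article.

```python
from datetime import timedelta
from typing import Iterable


def allowed_downtime(target: float, period: timedelta) -> timedelta:
    """Downtime permitted by an availability target over a period."""
    return period * (1.0 - target)


def measured_availability(outages: Iterable[timedelta], period: timedelta) -> float:
    """Fraction of the period during which the service was up."""
    downtime = sum(outages, timedelta())
    return 1.0 - downtime / period


def within_specification(
    outages: Iterable[timedelta], target: float, period: timedelta
) -> bool:
    """True while total downtime stays inside the agreed allowance."""
    return sum(outages, timedelta()) <= allowed_downtime(target, period)


if __name__ == "__main__":
    period = timedelta(days=30)          # hypothetical reporting period
    outages = [timedelta(minutes=20), timedelta(minutes=30)]  # made-up data
    print("allowance:", allowed_downtime(0.999, period))
    print("availability:", measured_availability(outages, period))
    print("in spec:", within_specification(outages, 0.999, period))
```

With a 99.9% target over 30 days the allowance is about 43 minutes, so the 50 minutes of outages in the demo fall out of specification. What happens next (for example, pausing releases) is the kind of decision the whole company would need to agree on in advance.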

Skills required for a DevOps Engineer:

Knowledge and proficiency with a variety of Ops and Automation tools
Great at writing scripts
Comfortable dealing with frequent testing and incremental releases
Understanding of Ops challenges and how they can be addressed during design and development
Soft skills for better collaboration across the team

Site Reliability Engineer Job Skills

From entry-level site reliability engineers to senior ones, everyone on board focuses on driving high reliability into systems by working closely with software development and IT operations teams.

Here are some general roles and responsibilities that a site reliability engineer job involves:

Software Engineering

Site reliability engineers incorporate various software engineering aspects to develop and implement services that improve IT and support teams. Services can range from production code changes to alerting and monitoring adjustments.

The site reliability engineer job also includes tasks like building proprietary tools from scratch to mitigate weaknesses in incident management or software delivery.

Troubleshooting Support Escalation

Site reliability engineers may have to spend a considerable amount of time fixing cases that come in through support escalation. They need to understand critical issues well enough to route escalated incidents to the right teams. The number of critical support escalations, however, goes down as site reliability engineering operations mature.

On-Call Process Optimization

In many organizations, the site reliability engineer job will involve the implementation of strategies that increase system reliability and performance through on-call rotation and process optimization.

Site reliability engineers will also have to add automation for better collaborative response in real time, besides updating documentation, runbooks, tools, and modules to prepare teams for incidents.

Documenting Knowledge

As site reliability engineers take part in on-call duties, IT operations, software development, and support, they gain substantial historical knowledge.

To ensure a seamless flow of information between teams, the site reliability engineer job may require documenting the knowledge gained.

Optimizing SDLC (Software Development Life Cycle)

Site reliability engineers must ensure that IT professionals and software developers are reviewing incidents and documenting the findings to enable informed decision-making.

Based on post-incident reviews, site reliability engineers will need to optimize the Software Development Life Cycle (SDLC) to boost service reliability.

Final Thoughts

So, is there a difference between DevOps and SRE?

DevOps and SRE can still be confusing at some level, but it all depends on the company and how it interprets your job profile. The roles and names might vary, but the one thing that stays with you is your skills. At the end of the day, the whole world needs solutions, and with technology becoming more dynamic and richer day by day, experience and learning matter more than anything else.

DevOps and SRE teams are not so different. Both help combine development and operations teams, while sharing similar responsibilities and focusing on enabling automation and reliability. The bottom line is that it’s all about the data. You need information in order to understand how to measure success and failure and how to gain continuous reliability across the application.