Breaking down SRE/DevOps into 5 key areas

HD De Leon Barrios
Published in Globant · 4 min read · Jul 13, 2021


The division of responsibilities between an SRE and a DevOps engineer is very blurred. Many argue that DevOps is not a position but a culture or a way of working, yet we see more and more “DevOps wanted” job postings. The position often ends up being filled by someone coming from a SysAdmin, developer, networking, or even telecommunications implementation background.

Developers were responsible for features and wanted to move faster, while operators were responsible for stability and wanted to move slower. DevOps, the new kid on the block, is taking on a huge and confusing responsibility, so here I break down the basics of SRE/DevOps into 5 key areas:

1. Reduce Organizational Silos: In companies and organizations, “silos” refers to the inability of the areas or business units that make them up to work together efficiently. A lack of good communication, feedback, and guidance from leaders and the team can lead not only to a bad work environment but also to inefficient processes. We can reduce these silos by breaking down barriers between teams and emphasizing collaboration.

We are not competing against each other; we are a team: if things go well for you, they go well for me too. Both the infrastructure team and the developers share in the success of the final product, so we must all have the same vision and approach when working in production.

2. Accept Accidents and Failures as Normal: Computers are unreliable; you cannot expect perfection. And when we introduce humans into the system, we should expect even more imperfection. When there are failures (and there always are), instead of blaming the person who pressed the button, we should review the processes. If the person followed the established procedure, then the fault lies not with the person but with the procedure.

Regardless of what we discover, we understand and truly believe that everyone did the best job possible based on what was known at the time, the resources available, and the given situation.

We can do an RCA (root cause analysis) or postmortem without pointing fingers. Instead, we make sure that accidents or failures don’t happen in the exact same way more than once. Some failures are also treated as normal because we work with an error budget: an agreed amount of unreliability within which a system outage is acceptable.
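To make the error budget idea concrete, here is a minimal sketch of the arithmetic. It assumes the budget is derived from an availability target (an SLO) over a fixed window; the function names, the 30-day window, and the 99.9% figure are illustrative assumptions, not something prescribed by the talk this post is based on.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    `slo` is a fraction, e.g. 0.999 for "three nines" (assumed example value).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Minutes of budget left in the window; a negative value means it is overspent."""
    return error_budget_minutes(slo, window_days) - downtime_minutes
```

With a 99.9% target over 30 days, the budget works out to roughly 43.2 minutes of downtime. While there is budget left, an outage is an expected cost of shipping changes; once it is spent, a team would typically slow down releases and focus on reliability.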

3. Implement Changes Gradually: Small, incremental changes are not only easier to review; if one causes a bug in production, it also takes less time to restore the service with a simple rollback.

Avoid “big bang” deployments and prefer gradual rollouts such as canary or blue-green deployments.
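As a rough illustration of a canary rollout, the sketch below shifts traffic to the new version in steps and rolls back as soon as the observed error rate crosses a threshold. Everything here is hypothetical: `error_rate_at` stands in for whatever monitoring query you would really use, and the step percentages and the 1% threshold are arbitrary example values.

```python
from typing import Callable, Sequence


def canary_rollout(
    error_rate_at: Callable[[int], float],
    steps: Sequence[int] = (1, 5, 25, 50, 100),
    max_error_rate: float = 0.01,
) -> int:
    """Return the traffic percentage left on the new version.

    `error_rate_at(percent)` is a hypothetical hook that reports the error
    rate observed while `percent` of traffic goes to the new version.
    A return value of 0 means the change was rolled back.
    """
    current = 0
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            return 0  # roll back: send all traffic to the old version again
        current = percent  # this step looked healthy, keep it and continue
    return current
```

The point of the design is that a bad release only ever affects the slice of traffic in the current step, and the rollback is the same simple action at every stage.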

4. Leverage Tooling & Automation: We try to eliminate manual and repetitive work as much as possible. We review how much of this work (toil) we have and try to automate those tasks with Bash scripts, Ansible, GitLab CI, Jenkins, or any other tool.

There is no reason to have a group of engineers manually restarting or scaling Docker containers 24/7 when, for example, Kubernetes can do it automatically. Many recurring tasks can be replaced by a script.
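The idea behind that kind of automation can be sketched as a small reconciliation step: compare the desired state with the observed state and decide what to do. This is a toy illustration only, not the Kubernetes API; the `Action` type and the `reconcile` function are hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str   # "start", "stop", or "noop"
    count: int  # how many replicas the action applies to


def reconcile(desired_replicas: int, healthy_replicas: int) -> Action:
    """Decide how to move the observed state toward the desired state."""
    diff = desired_replicas - healthy_replicas
    if diff > 0:
        return Action("start", diff)   # too few healthy replicas: start more
    if diff < 0:
        return Action("stop", -diff)   # too many replicas: stop the extras
    return Action("noop", 0)           # already in the desired state
```

Run in a loop, a step like this replaces the engineer who watches a dashboard and restarts containers by hand: a crashed replica simply shows up as a difference to be corrected.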

5. Measure Everything: Metrics for both the systems and the people are a fundamental indicator of success, but collecting metrics is useless if we do not evaluate them. Without a way to measure the evolution of the four previous pillars, we have no way of knowing how things are going. We must measure time to failure, scalability, team workload, and the health of the systems.
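As one example of turning raw incident records into numbers we can evaluate, the sketch below computes two common reliability metrics: mean time to recovery (MTTR) and mean time between failures (MTBF). It assumes each incident is recorded as a (start, resolved) pair of timestamps; that record format and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (start, resolved), an assumed record format


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average duration from the start of an incident to its resolution."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - start for start, resolved in incidents), timedelta())
    return total / len(incidents)


def mean_time_between_failures(incidents: List[Incident]) -> timedelta:
    """Average gap from one incident's resolution to the next incident's start."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    ordered = sorted(incidents)  # sort chronologically by start time
    gaps = [ordered[i + 1][0] - ordered[i][1] for i in range(len(ordered) - 1)]
    return sum(gaps, timedelta()) / len(gaps)
```

Tracked over time, a falling MTTR suggests that recovery and rollback are getting easier, while a rising MTBF suggests that failures are becoming less frequent.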

Conclusion

If we think of DevOps as a philosophy, the objective of SRE is to carry out that philosophy. SRE implements DevOps, and things are not necessarily implemented in exactly the same way from one company to another.

The common goal is to break down organizational barriers and deliver better software faster.

HD De León Barrios

Source: https://www.youtube.com/watch?v=uTEL8Ff1Zvk
