Debunking the seven most popular Site Reliability Engineering myths

Sometimes, you need to learn and unlearn SRE for better results

Jannah Zulkifli
DBS Tech Blog

Written by Jannah Zulkifli and Mukta Rani Chauhan

Site Reliability Engineering (SRE) is steadily becoming one of the most sought-after practices in tech companies as it allows teams to make improvements based on issues surfaced. Five years ago, we made the decision to prioritise Site Reliability Engineering Transformation in our operations. We’ve since uncovered numerous misconceptions, and are now debunking seven of the most common ones.

Myth 1: SRE is the new and improved version of DevOps
There have been numerous debates about DevOps and SRE: what their similarities and differences are, and which is better.

While DevOps is an overarching concept and culture aimed at ensuring the rapid release of new applications and features, SRE is about injecting software engineering practices, and a new mindset, into IT operations to create highly reliable and highly scalable systems.

DevOps was initially conceptualised to help development and IT operations teams to collaborate more and shift from working in a waterfall to an agile methodology.

However, while the idea of forging team spirit to increase the number of releases sounds great in theory, it may not work out that way in real life. With faster iterations and more features released, the probability of applications experiencing outages also increases, especially as the applications and systems scale.

Although DevOps aims to foster better collaboration, the objectives of the development and IT operations teams remain in conflict. While the development team aims to release as many new features as possible, the IT operations team’s goal is to manage and prioritise releases so that the system stays stable. Often, these conflicting priorities hamper progress and stability, leading both teams to work in silos. This is where SRE comes in to improve the situation and drive DevOps’ success.

SRE looks at how software engineering can solve IT operational needs, using concepts such as the error budget to find a balance between releasing new features and maintaining stability.

SRE also involves automating toil and changing the team’s mindset so that failures are seen as an opportunity to learn and strengthen the system.

Organisations should not look at SRE as the new and improved version of DevOps. Instead, the two should be viewed as two sides of the same coin, as the concepts complement each other.

Myth 2: SREs work to ensure 100% uptime
This is one of the biggest anti-patterns of SRE. When an application team achieves 99.9% reliability, there is a tendency to work towards 100%. However, SRE is not about reaching zero outages but about achieving a sustainable and appropriate level of availability, and of velocity in releasing new features.

In SRE, an error budget sets a threshold for how often a system can fail, based on the tolerance level accepted by the business. It is a tool that helps the engineering team prioritise between releasing new features and maintaining reliability.

While uptime is important, the customer experience also needs to be understood and constantly monitored. The error budget can therefore be used to understand and continuously improve customer experience.
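To make the error budget idea concrete, here is a minimal sketch in Python. It assumes a request-based reliability target; the class name, field names and sample figures are hypothetical and chosen purely for illustration, not taken from any DBS tooling.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Illustrative request-based error budget (all names are hypothetical)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served in the measurement window
    failed_requests: int  # requests that failed in the same window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the business accepts may fail.
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Budget left after subtracting the failures already observed.
        return self.allowed_failures - self.failed_requests

    def can_release(self) -> bool:
        # Simple policy: keep shipping features only while budget remains.
        return self.remaining > 0


budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(round(budget.allowed_failures))  # 1000
print(round(budget.remaining))         # 600
print(budget.can_release())            # True
```

With a 99.9% target over one million requests, the team can tolerate about 1,000 failures. Having used 400, it still has room to release; once the budget is spent, the priority would shift from new features to reliability work.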

Myth 3: Normalising experimentation and innovation is a utopian and unreachable ideal

Enabling teams to experiment and innovate allows for quick iteration, which is essential in SRE. Moreover, with the use of continuous integration and continuous delivery (CI/CD) pipelines, teams can always do a quick rollback when something goes wrong.

In fact, an organisation with a strong culture of experimentation and innovation tends to have a better chance of succeeding when implementing SRE.

The key to getting teams to experiment and innovate is to start small. For example, an innovation team could be set up to experiment and discover solutions and improvements, which could be shared with other teams.

In 2018, we started experimenting with Chaos Engineering at a small scale. Today, we have our in-house engineered tool, Wreckoon, that has been used by over 300 applications to test for reliability.

Myth 4: SRE is only used to treat software problems

Establishing a blameless culture is often neglected or seen as an afterthought while teams focus on problem-solving. However, this may cause an individual to sweep essential information under the rug for fear of being reprimanded when an incident occurs.

Establishing psychological safety is at the heart of cultivating a blameless culture, and to achieve that, teams should conduct Blameless Incident Retrospectives. Team members need to understand that they’re in an environment that allows for failures, which will enable them to experiment and innovate without the fear of getting reprimanded.

It’s also important to remember that when an incident occurs, those operating the system may not have the right information or complete insight to determine what went wrong.

Part of SRE is the mentality of looking at the broken part of an application as a component of a distributed socio-technical system. Since human errors are inevitable, one way to tackle the problem is to ask how the system can give the people operating it insight into what it is doing. This brings us to the topic of observability.

When teams practise observability, they’re inclined to improve the system by developing or leveraging sound instrumentation within their ecosystem that provides insights and alerts when a problem occurs. Think of these monitoring tools as an x-ray machine.

Through the practice of observability, teams have at their fingertips the information they need to take corrective action.

Myth 5: Observability and monitoring are only achievable on cloud-native applications

Contrary to popular belief, SRE can also be applied to legacy apps. For observability and monitoring, teams can start by identifying and measuring the essential metrics that show what is happening inside the system.

Over time, teams develop an eye for uncovering weaknesses within the legacy app, finding their root causes, and solving the problems. This in turn improves the resiliency of the system, which is itself an SRE practice.

That said, legacy apps may not be equipped with monitoring features, making the measurement of important metrics within the system difficult.

When implementing a monitoring tool and measuring data, there is no one-size-fits-all solution, as every application prioritises different metrics. Teams can consider commercial solutions or an open-source framework, choosing whichever best helps their application capture the data that is useful to them.
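As an illustration of measuring essential metrics around code that has no built-in monitoring, here is a minimal Python sketch using only the standard library. The `measured` decorator, the in-memory `metrics` store and the `legacy_lookup` function are all hypothetical; a real setup would export these numbers to whichever commercial or open-source monitoring tool the team chooses.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory store for illustration only; a real system would export these
# values to a monitoring tool rather than keep them in the process.
metrics = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def measured(name):
    """Wrap an existing function to record call count, errors and latency."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                metrics[name]["errors"] += 1
                raise  # the caller still sees the original failure
            finally:
                metrics[name]["calls"] += 1
                metrics[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@measured("legacy_lookup")
def legacy_lookup(account_id):
    # Hypothetical stand-in for an existing legacy routine.
    if account_id < 0:
        raise ValueError("invalid account id")
    return {"account_id": account_id, "status": "active"}


legacy_lookup(42)
try:
    legacy_lookup(-1)
except ValueError:
    pass

stats = metrics["legacy_lookup"]
print(stats["calls"], stats["errors"])  # 2 1
```

Wrapping the function from the outside leaves the legacy logic untouched, which is often the least risky way to start collecting call volume, error rate and latency.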

Myth 6: Full-stack developers and site reliability engineers do the same job

The role of a full-stack developer is to work on the frontend and backend of applications. The ideal, however, especially for large organisations, is to have a full-stack team that consists of people with various skills.

For example, an application team would typically consist of frontend and backend developers, site reliability engineers, tech writers, scrum masters and many others as opposed to a single person operating the application or system. A team working together is more sustainable and truer to life, especially when dealing with a large, complex, distributed system.

While the developer’s job in a DevOps team is to build, the job of the Site Reliability Engineer is to engage the developers, guide them on architecture and implementation, and drive the availability and agility of the application at a sustainable pace.

Myth 7: Everyone can become a Site Reliability Engineer

Site Reliability Engineer is more than a title: the job calls for a person who recognises that reliability is a fundamental feature of an application, and who is able to bring people together to solve a problem collaboratively.

Site Reliability Engineers tend to have a natural inclination towards approaching problems as generalists, with a systems-thinking mindset. They work like detectives who are always seeking improvements and solving problems.

Although they work to improve operations, they also have software development skills: they can write in multiple programming languages, work with automation tools, and have experience as a sysadmin or in an IT operations role.

Conclusion

SRE employs a way of analytical thinking and requires practice. While not everyone may be cut out for it, it’s possible to be a Site Reliability Engineer with the right approach. DBS is always on the lookout for Site Reliability Engineers. If you’re interested in being part of the team, check out our careers page here: https://www.dbs.com/careers/default.page

Nurjannah is an internal communications specialist in the EASRE team at DBS who produces tech-related articles, designs graphics and user interfaces, and organises tech events.

Mukta comes with over four years of experience in internal communications, copywriting, editing, blogging and social media management.

Jannah Zulkifli
DBS Tech Blog

Communications Specialist @ DBS Bank | Graduate Student @ NUS, MSocSci in Communications with Specialisation in Data Analytics