3 takeaways from SRECon2022: What it means to measure reliability, choosing the right metrics to track it, and how Site Reliability Engineering contributes to DevOps

Fabian Tay
Published in DBS Tech Blog
5 min read · Jun 21, 2022
What metrics should one use to measure reliability?

March 2022 marked an important milestone for DBS’ EASRE (Enterprise Architecture Site Reliability Engineering) team. It was the first time we publicly set up a booth and exhibited our SRE capabilities at SRECon22 Americas in San Francisco, among globally recognised tech giants like Meta, Google, and Shopify. It felt surreal, knowing how far we’ve come since we started implementing SRE in DBS five years ago.

The attendees, who were predominantly from the US, were surprised to see an Asian bank making waves during the three-day conference. Through our interactions with other engineers at SRECon, we learned that they, too, faced the same challenges when implementing SRE, such as gaining true observability and finding the right formula for implementing SRE in their organisations.

Our experience at SRECon 2022, Americas

Here are our top three takeaways shared by leading SRE experts.

1) Tracking the Mean Time To Recover (MTTR) is not a reliable measure of performance or reliability when it comes to incident management

Courtney Nash, a researcher from Verica, gave a presentation that focused on system safety and failures in complex socio-technical systems.

Courtney introduced the Verica Open Incident Database (VOID) Report, which collects and analyses publicly available reports of software-related incidents and makes them accessible to everyone. This helps raise awareness and understanding of software-related failures within the industry.

Through this study, Verica found that Mean Time To Recover (MTTR) is not a reliable measure of performance or reliability. Instead, we should collect socio-technical incident data such as:

1) Service Level Objectives (SLOs) and other sources of customer feedback

2) Number of teams, people and tools that were involved in incidents

3) Themes and narratives, which would reveal patterns and similarities across incidents

4) Near misses, that is, incidents that were narrowly avoided, so that they can be used as case studies

And instead of finding the root cause, we should take new approaches that include:

1) Treating incidents as opportunities to learn

2) Favouring in-depth analyses over shallow metrics

3) Treating humans as solutions, not problems

4) Studying what goes right along with what goes wrong
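One way to see why a mean such as MTTR can mislead is to look at how a single long incident moves it. The sketch below is a minimal illustration only: the durations are hypothetical and are not taken from the VOID Report, and `summarise` is a helper name invented for this example.

```python
from statistics import mean, median

# Hypothetical incident durations in minutes (illustrative, not VOID data).
durations_with_outage = [12, 18, 25, 30, 41, 55, 480]
# The same set of incidents, but the longest one lasted 60 minutes instead.
durations_without_outage = [12, 18, 25, 30, 41, 55, 60]


def summarise(durations):
    """Return the mean (the "MTTR") and the median of a list of durations."""
    return {"mttr": round(mean(durations), 1), "median": median(durations)}


with_outage = summarise(durations_with_outage)
without_outage = summarise(durations_without_outage)

print(with_outage)     # {'mttr': 94.4, 'median': 30}
print(without_outage)  # {'mttr': 34.4, 'median': 30}
```

Six of the seven incidents are identical in both sets, yet the mean nearly triples because of one outlier. A single number hides that, which is one reason to pair any duration figure with the narrative detail listed above.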

2) Choose to measure the right metrics to answer the right questions when it comes to reliability

How do we measure reliability? According to the Blameless approach, reliability is a journey for every organisation, and it depends heavily on business needs, which change across the different stages of that journey.

A metric-driven prioritisation approach was created to help SRE leaders assess and prioritise what would have the most impact on their application’s reliability, based on real data at every step of their journey. There are three steps involved:

1) Ask: “What are the business needs?” Then, across the areas of change management, monitoring and detection, incident management, and continuous improvement, ask what matters to you right now

2) Choose the right metrics to answer the question and measure the gap between where you are and where you need to be

3) Create a reliability dashboard
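The gap measurement in step 2 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the Blameless method itself: the `Metric` class, the metric names, and all current and target values are hypothetical, and the gap is assumed to be the relative shortfall against a non-zero target.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    area: str                      # one of the four areas named in step 1
    name: str                      # hypothetical metric name
    current: float                 # hypothetical measured value
    target: float                  # hypothetical target, must be non-zero
    higher_is_better: bool = True

    def gap(self) -> float:
        """Relative shortfall against the target; 0.0 means the target is met."""
        if self.higher_is_better:
            shortfall = self.target - self.current
        else:
            shortfall = self.current - self.target
        return max(shortfall, 0.0) / self.target


def prioritise(metrics):
    """Order metrics so the largest relative gap comes first."""
    return sorted(metrics, key=lambda m: m.gap(), reverse=True)


metrics = [
    Metric("change management", "change success rate (%)", 92.0, 98.0),
    Metric("monitoring and detection", "issues caught by monitoring (%)", 60.0, 90.0),
    Metric("incident management", "incidents with a review (%)", 80.0, 100.0),
    Metric("continuous improvement", "overdue action items", 12.0, 5.0,
           higher_is_better=False),
]

ranked = prioritise(metrics)
for m in ranked:
    print(f"{m.area:26} {m.name:34} gap={m.gap():.0%}")
```

The ranked list is the kind of data a reliability dashboard could surface, so that the area with the largest gap gets attention first.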

3) SRE plays a part when implementing DevOps in an organisation

Dave Stanke, a developer relations engineer at Google, presented how the DevOps Research and Assessment (DORA) group studied, for the first time, the use of SRE across technology teams to evaluate its adoption and effectiveness. The findings were published in Google’s State of DevOps 2021 Report.

According to Dave, in an organisation that has been practising DevOps, the scope of work tends to expand to include not just how software is built, tested, and deployed, but also the initial stages of working with business and product teams. DevOps is the end-to-end process of how organisations approach software development. SRE covers the portion of it that keeps software reliable once it reaches production, while balancing reliability against the release of new features.

DORA, Google’s longest-running academically rigorous research programme, has been ongoing for seven years and counting. It studies the software development and DevOps practices that make teams and organisations successful. The study, which draws on data collected from more than 32,000 professionals worldwide, aims to share data-driven insights into the most effective and efficient ways of developing and delivering technology.

Part of the report includes a Software Delivery and Operational (SDO) Performance metric, which is a benchmark for companies to assess their DevOps performance. Through the assessment, teams or companies can understand if they are elite, high, medium, or low performers.
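To make the benchmark more concrete, here is a minimal sketch of how a team might compute the four software delivery measures DORA reports on (deployment frequency, lead time for changes, change failure rate, and time to restore service) from its own deployment records. The `Deployment` record, the `delivery_metrics` function, and the sample data are assumptions for illustration. The sketch does not reproduce the report's survey method or its performance tiers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it degrade service?
    restored_at: Optional[datetime] = None  # when service was restored


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def delivery_metrics(deployments, period_days):
    """Compute four delivery measures over a reporting period."""
    failures = [d for d in deployments if d.failed]
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deployments]
    restores = [hours(d.restored_at - d.deployed_at)
                for d in failures if d.restored_at is not None]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_restore_hours": median(restores) if restores else None,
    }


# Hypothetical sample: four deployments over four days, one of which failed.
t0 = datetime(2022, 3, 1, 9, 0)
day, hour = timedelta(days=1), timedelta(hours=1)
sample = [
    Deployment(t0, t0 + 4 * hour),
    Deployment(t0 + day, t0 + day + 2 * hour, failed=True,
               restored_at=t0 + day + 3 * hour),
    Deployment(t0 + 2 * day, t0 + 2 * day + 6 * hour),
    Deployment(t0 + 3 * day, t0 + 3 * day + 8 * hour),
]

result = delivery_metrics(sample, period_days=4)
print(result)
```

Medians are used here rather than means so that a single unusually slow change or long outage does not dominate the summary.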

Here are some useful SRE details from the State of DevOps 2021 report:

1) While the scope of DevOps covers the developer’s entire journey, SRE supports DevOps in the areas of testing, deployment, and operation

2) SRE is widely practised, with 52% of respondents reporting the use of SRE practices

3) Elite performers are 2.1x as likely as their counterparts to report the use of SRE practices. However, there is room for growth, as only 10% of elite respondents indicated that their team had fully implemented SRE practices

4) SRE mitigates burnout and brings balance between coding and operations work

5) SRE ensures “shared responsibility” for operations and predicts higher reliability

6) SRE helps predict better business outcomes

Conclusion

SRE promotes a culture of continuous learning and improvement, not just within an organisation but across the wider community. MTTR has proven to be an unreliable measure where incident management is concerned. For reliability outcomes, teams need to choose the right metrics in order to answer the right questions. Lastly, SRE plays a part in DevOps.

Through SRECon, presenters and attendees can share their research and learning with others, igniting new ideas for innovation that could help improve application and system performance.

Fabian has been with DBS for the past seven and a half years. He first worked in Technology Services prior to joining EASRE in 2019. He now drives the Site Reliability Engineering Transformation and Observability Programme in EASRE.
