Double, Double Toil and Trouble

Samuel Wong
Published in DBS Tech Blog
5 min read, Aug 17, 2021

Applying SRE Practices to Alleviate Toil for DevOps Teams

Are you familiar with Site Reliability Engineering (SRE)? It’s a DevOps framework and practice pioneered by Google that we have now adopted at DBS to enhance our customer and employee journeys. In my role with DBS Hong Kong, I help drive SRE practices for our Consumer Banking & Core Engine Technology (C2E) team. SRE helps ensure that our IT applications give customers digital platforms with enhanced availability, performance, resiliency and reliability. One of SRE’s key pillars is eliminating toil for DevOps teams, to the point where at least 50% of an SRE’s time can be spent on long-term engineering projects rather than operations. In this post, I would like to tell you a bit about how we are helping our teams remove toil from their workloads.

So, What Does Toil Refer To?

‘Toil’ in the context of DevOps refers to the operational work of running production services that is:

1. Manual: hands-on time spent performing manual tasks such as running a script

2. Repetitive: work that is performed multiple times over a sustained period

3. Automatable: work that a machine could accomplish just as well as a human

4. Tactical: activities which are interrupt-driven and reactive as opposed to strategy-driven and proactive

5. Devoid of Enduring Value: work that leaves the service in the same state after you have finished the task
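As a rough illustration, the five characteristics above can be turned into a simple checklist for sizing up how much of a team’s week goes to toil. The sketch below is hypothetical: the `Task` fields, the example tasks, the hours and the threshold of three criteria are all assumptions made for illustration, not a DBS or Google standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """One recurring piece of operational work (all fields are illustrative)."""
    name: str
    hours_per_week: float
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool

    def toil_score(self) -> int:
        """Count how many of the five toil characteristics apply (0 to 5)."""
        return sum([self.manual, self.repetitive, self.automatable,
                    self.tactical, self.no_enduring_value])


def is_toil(task: Task, threshold: int = 3) -> bool:
    # Assumption: a task counts as toil once it meets `threshold` criteria.
    return task.toil_score() >= threshold


def toil_fraction(tasks: List[Task], threshold: int = 3) -> float:
    """Share of total weekly hours spent on tasks classified as toil."""
    total = sum(t.hours_per_week for t in tasks)
    if total == 0:
        return 0.0
    toil = sum(t.hours_per_week for t in tasks if is_toil(t, threshold))
    return toil / total


# Made-up example data, not real team figures.
tasks = [
    Task("daily service health check", 10, True, True, True, True, True),
    Task("design new monitoring dashboard", 30, False, False, False, False, False),
]
print(f"Toil share: {toil_fraction(tasks):.0%}")  # compare against the 50% ceiling
```

A figure like this is only as good as the task inventory behind it, but it gives a starting point for deciding which tasks to automate first.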

My Innovation Journey for SRE Toil Reduction

In supporting IT application and project management for DBS Hong Kong, a key focus of mine is collaborating with the Bank’s SRE engineers to identify ways to reduce toil for our applications. In September 2020, I was privileged to participate in DBS Group’s Warriors of WOW programme, an 8-month innovation journey for staff to build a better BAU (business as usual). During the programme, I gained a lot of insight and support from senior management and experts on how I could continue to find new ways to alleviate toil in the Bank.

I spoke to our SRE teams and found that they still had to perform operational work, such as service health checks and log analysis, in secure rooms. Leveraging my knowledge of SRE and toil reduction, I proposed an automation solution that would enable more of our engineers to work from home during the Covid-19 pandemic, as opposed to being on standby to perform operational duties at the office, while also ensuring our production systems remained safe and secure. Through the automation scripts and monitoring setups in the solution, we were able to quantify the benefits, such as time saved, for management review sessions.

Our SRE team, for example, was able to reduce its toil by 100 man-hours per year by minimising manual health check tasks. The team has since been able to refocus its energies on other projects that further improve development processes.

But How Did We Do It?

The SRE team collaborated with our development team to run the project using an agile development methodology. This included practices such as daily standups to understand the user stories, as well as bi-weekly MVP releases. We leveraged a modern technology stack, including Python for programming, Grafana dashboards for performance monitoring and Tivoli for job scheduling.
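To give a flavour of what an automated service health check in Python might look like, here is a minimal sketch. It is not the script our team built: the endpoint URLs, the `HealthResult` structure and the rule that only HTTP 200 counts as healthy are assumptions for illustration. In a setup like ours, a job scheduler would run such a script on a timer and the results would feed a monitoring dashboard; that wiring is left out here.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class HealthResult:
    service: str
    healthy: bool
    detail: str
    latency_ms: float


def http_probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for `url`; raises if the request fails."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_service(name: str, url: str,
                  probe: Callable[[str], int] = http_probe) -> HealthResult:
    """Probe one service and record whether it responded with HTTP 200."""
    start = time.perf_counter()
    try:
        status = probe(url)
        healthy = status == 200  # assumption: only 200 means healthy
        detail = f"HTTP {status}"
    except Exception as exc:  # timeouts, refused connections, HTTP errors
        healthy = False
        detail = f"error: {exc}"
    latency_ms = (time.perf_counter() - start) * 1000
    return HealthResult(name, healthy, detail, latency_ms)


def run_checks(endpoints: Dict[str, str],
               probe: Callable[[str], int] = http_probe) -> List[HealthResult]:
    """Check every endpoint in order and return one result per service."""
    return [check_service(name, url, probe) for name, url in endpoints.items()]


if __name__ == "__main__":
    # Hypothetical endpoints; replace with real health-check URLs.
    endpoints = {
        "payments-api": "https://payments.example.internal/health",
        "accounts-api": "https://accounts.example.internal/health",
    }
    for result in run_checks(endpoints):
        state = "OK" if result.healthy else "FAIL"
        print(f"{result.service}: {state} ({result.detail}, {result.latency_ms:.0f} ms)")
```

Passing the probe in as a parameter keeps the check logic testable without network access, which matters when the real services sit behind restricted networks.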

What Have I Learned During My Innovation Journey?

It has been an incredible experience to apply my knowledge of SRE in a way that has led to tangible benefits for our DBS HK teams. I’d therefore like to share four key insights from my SRE innovation journey, so that you may embark on your own.

1. Define your problem statement: Look to understand the BAU of the area you want to improve, why the problem happens and why it matters. In the innovation journey described above, the problem was the manual service health check tasks of our SRE team and the need to enable more staff to work from home during the Covid-19 pandemic.

2. Engage your stakeholders: Identify your stakeholders, conduct interviews with them and share your vision on how you would like to enhance their BAU. In the context of my innovation journey, the stakeholders were the SRE team leads, our engineers, application managers and platform leads.

3. Socialise your proposed solution: Discuss your proposed solution for a better BAU with the key stakeholders and state how you plan to deliver it, whether through automation, configuration, monitoring setup or other means. If the solution you are implementing is large in scale, be sure to start with a minimum viable product (MVP) experiment to serve as a proof of concept. If required, secure additional resources from senior stakeholders to support development and implementation. Where possible, keep resource requests to a minimum to demonstrate a cost-effective solution that leverages resources already available.

Figure: A typical flow from identifying toil to executing the solution

4. Evaluate the project’s timeline, cost, risks and benefits: Estimate the amount of time and money required for solution ideation. Then, identify who should test the MVP and when it can be tested. Be sure to also define any potential risks and state how they can be managed or mitigated. Finally, a key to successful innovation is explaining your solution’s benefits using SMART (Specific, Measurable, Attainable, Relevant and Time-based) goals, which can be validated with data.
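To make the “Measurable” part of a SMART goal concrete, the time saved by an automation can be estimated from how long a task takes before and after, and how often it runs. The sketch below uses made-up numbers (25 minutes per manual check, 1 minute of review after automation, 250 runs a year); they are assumptions chosen for illustration and not our team’s measured figures.

```python
def hours_saved_per_year(minutes_before: float, minutes_after: float,
                         runs_per_year: int) -> float:
    """Annual hours saved when one run drops from minutes_before to minutes_after."""
    return (minutes_before - minutes_after) * runs_per_year / 60


def meets_target(saved_hours: float, target_hours: float) -> bool:
    """Check the measured saving against the goal set at the start of the project."""
    return saved_hours >= target_hours


# Illustrative inputs only.
saved = hours_saved_per_year(minutes_before=25, minutes_after=1, runs_per_year=250)
print(f"Estimated saving: {saved:.0f} hours per year")
print("Target met" if meets_target(saved, target_hours=80) else "Target missed")
```

Recording the inputs alongside the result makes it easy for reviewers to challenge the assumptions, rather than just the headline number.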

I hope these tips will be useful to you as you get started on your own innovation journey. These learnings will help you adhere to a data-driven and collaborative approach. I’m grateful to have had great mentors with me on my SRE innovation journey and I look forward to finding more ways to help our teams reduce toil in the months and years ahead.

Samuel Wong
Project Manager, Consumer Banking & Core Engine Technology at DBS Bank