From Chaos to Clarity: How we reduced DevOps context switching & unplanned work while accelerating R&D

Published in Yotpo Engineering

6 min read · Mar 26, 2023

TL;DR

Are you tired of chaos and unplanned work in your DevOps role? Do you want to increase your R&D organization’s velocity? At Yotpo, a data-driven approach helped us dramatically reduce unplanned work and increase development velocity.

We managed to cut the number of requests to DevOps in half! How did we achieve it? By creating visibility and running lots of experiments until we achieved clarity.

Want to know more? Just scroll down…

1st Stage: Chaos

At first, like many organizations, we had a Slack channel where anyone from R&D could ask questions or make requests to DevOps. This was the only channel through which we received requests (no emails or direct messages). Most of the questions and requests were not even in the DevOps domain, and we had no visibility into them. We asked ourselves:

How many requests are we getting per day? Per week?
How many requests do we resolve per day? Per week?
Which areas are people asking about the most?
How long does it take to resolve a request?

We were in the dark. We had so many questions, but no way to answer them…

2nd Stage: Visibility

We started by tackling visibility with a cool tool called “Halp.” It turned every conversation in our DevOps Slack channel into a Jira ticket, allowing us to see how many requests we were getting and in which areas. We used this visibility to experiment with solutions. We gathered our DevOps leaders and looked at the data every week, and this forum generated lots of ideas.
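Once requests exist as tickets, the questions from the Chaos stage become simple aggregations. Here is a minimal sketch in Python, assuming each ticket has been exported with a component, a creation time, and an optional resolution time. The `Ticket` class and `weekly_summary` function are hypothetical names used for illustration, not part of Halp or Jira.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class Ticket:
    component: str                 # area the request belongs to
    created: datetime
    resolved: Optional[datetime]   # None while the ticket is still open


def iso_week(moment: datetime) -> str:
    """Label a timestamp with its ISO year and week, e.g. '2023-W02'."""
    year, week, _ = moment.isocalendar()
    return f"{year}-W{week:02d}"


def weekly_summary(tickets: list[Ticket]) -> dict:
    """Answer: how many requests came in, how many were resolved, where, and how fast."""
    opened = Counter(iso_week(t.created) for t in tickets)
    closed = Counter(iso_week(t.resolved) for t in tickets if t.resolved)
    by_area = Counter(t.component for t in tickets)
    hours = [
        (t.resolved - t.created).total_seconds() / 3600
        for t in tickets
        if t.resolved
    ]
    return {
        "opened_per_week": dict(opened),
        "resolved_per_week": dict(closed),
        "requests_per_area": by_area.most_common(),
        "median_hours_to_resolve": median(hours) if hours else None,
    }
```

Reviewing a summary like this every week is one way to spot which area deserves the next experiment.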

3rd Stage: Solutions

We started to experiment: every week, we defined an experiment and a KPI to see whether the experiment worked or not. Some of them worked incredibly well, and some less so. Here are a few examples of those experiments:

Categorization of Requests

The first thing we wanted visibility into was which areas generated the most requests. In other words: where should we focus?

We started by using the “component” field in Jira and manually went over all our requests to categorize them. This gave us a first clear picture of what was going on!

Data gathered from our Slack channel: the areas we are going to handle

It was great and exciting! But it was not enough. We needed to see trends and the impact of each change we made, and whether that impact was positive or negative.

One of the challenges we faced was finding a mechanism to make sure that every Jira ticket we closed was categorized. At first, we asked the DevOps members to categorize items every week, but it did not work. We thought about asking the developers to do it, but that did not work either. The only thing that worked was making the component field mandatory when closing a Jira ticket. Once that was done, we could easily measure the impact of each change in future experiments.
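The rule itself is small: a request cannot move to closed without a component. The sketch below models that rule in plain Python to show the idea. It is not how Jira enforces it, and `Request`, `close_request`, and `closed_by_component` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


class MissingComponentError(ValueError):
    """Raised when someone tries to close a request without a category."""


@dataclass
class Request:
    key: str
    status: str = "open"
    component: Optional[str] = None


def close_request(request: Request, component: Optional[str] = None) -> Request:
    """Close a request, refusing to do so unless a component is set."""
    chosen = (component or request.component or "").strip()
    if not chosen:
        raise MissingComponentError(f"{request.key}: set a component before closing")
    request.component = chosen
    request.status = "closed"
    return request


def closed_by_component(requests: list[Request]) -> Counter:
    """Count closed requests per component; every closed one is guaranteed a category."""
    return Counter(r.component for r in requests if r.status == "closed")
```

Because the check happens at close time, the person who just handled the request categorizes it while the context is still fresh.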

Auto Answers Experiment

We saw that a lot of the questions repeated themselves, and the answers were also the same. So we composed auto answers based on keywords, and in this way, the response was sent immediately, without our involvement.

The KPI for this experiment was how many questions could be auto-answered in a given timeframe (a week). The impact was smaller than we had imagined: only 5.6% of the questions were auto-answered, and we were looking for something that would make a bigger impact.

Responses on Slack that were ticketed in Jira or auto-answered by Halp
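To make the mechanism and its KPI concrete, here is a rough Python sketch of keyword-based auto answers and the weekly auto-answer rate. The keywords and canned replies are made up for illustration, and the functions are hypothetical. In our setup the matching was configured in the tool, not written as code.

```python
from typing import Optional

# Hypothetical keyword groups and canned answers; real entries would come from
# the questions that actually repeat in the channel.
AUTO_ANSWERS = [
    (("restart my build", "rerun the build"),
     "You can re-run a build from the CI page; no DevOps action is needed."),
    (("vpn",),
     "VPN setup steps are in the onboarding guide."),
]


def find_auto_answer(message: str) -> Optional[str]:
    """Return the first canned answer whose keyword appears in the message."""
    lowered = message.lower()
    for keywords, answer in AUTO_ANSWERS:
        if any(keyword in lowered for keyword in keywords):
            return answer
    return None


def auto_answer_rate(messages: list[str]) -> float:
    """KPI: percentage of messages in the period that received an auto answer."""
    if not messages:
        return 0.0
    answered = sum(1 for m in messages if find_auto_answer(m) is not None)
    return 100 * answered / len(messages)
```

Running `auto_answer_rate` over one week of channel messages gives the kind of percentage we tracked for this experiment.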

Roles & Responsibilities

Many of the requests we got were not in our domain, so we took the time to map our roles and responsibilities, and for every request that was not in our domain, we composed an automated answer directing the requester to the correct owner of the domain. Here are a few examples:

An example that directs the developer to contact IT to gain access to the Snyk application

An example that directs the developer to the Data team to get help with Airflow

This reduced our load and gave R&D faster responses, since developers knew exactly who to contact.

RnR Auto answers according to key phrases
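The routing logic amounts to a map from key phrases to owning teams, with DevOps as the fallback. Here is a minimal sketch, assuming whole-word matching on the message text. The two entries mirror the Snyk and Airflow examples above, while the `Triage` class, the `triage` function, and the reply wording are hypothetical.

```python
import re
from dataclasses import dataclass

# Key phrase -> owning team. Only the two examples from this post are listed;
# a real map would cover every domain that sits outside DevOps.
OWNERS = {"snyk": "IT", "airflow": "Data team"}


@dataclass
class Triage:
    action: str   # "redirect" when another team owns it, otherwise "ticket"
    owner: str
    reply: str


def triage(message: str) -> Triage:
    """Redirect requests owned by another team; ticket everything else for DevOps."""
    for phrase, owner in OWNERS.items():
        # Whole-word, case-insensitive match so "Snyk?" and "AIRFLOW" both hit.
        if re.search(rf"\b{re.escape(phrase)}\b", message, flags=re.IGNORECASE):
            reply = f"This area is owned by {owner}. Please reach out to them directly."
            return Triage("redirect", owner, reply)
    return Triage("ticket", "DevOps", "Thanks, we opened a ticket for the DevOps team.")
```

Keeping the map as plain data means that updating the roles and responsibilities only requires adding or changing an entry.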

Priority Queue

In our Slack channel, requests were handled FIFO or, more accurately, whoever made the most noise was served first. Following the move to Jira, we defined priority levels and changed our queue from FIFO to a priority-based queue. Items were handled first by priority, and then by their age. We also made the queue visible to every requester: any user can open our task queue and see exactly where they stand (see the picture below).

This enabled us to move away from the “everything is always urgent” way of working that our organization was used to. Now R&D developers know where they are in the queue and have an ETA and SLA according to priority.
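The ordering rule (priority first, then age) can be expressed as a two-part sort key. The sketch below assumes four priority levels with hypothetical names. `QueueItem`, `ordered_queue`, and `position_of` are likewise illustrative, not our actual Jira configuration.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed priority levels; lower rank is handled first.
PRIORITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass
class QueueItem:
    key: str
    priority: str
    created: datetime


def ordered_queue(items: list[QueueItem]) -> list[QueueItem]:
    """Sort by priority rank, then oldest first within the same priority."""
    return sorted(items, key=lambda item: (PRIORITY_RANK[item.priority], item.created))


def position_of(key: str, items: list[QueueItem]) -> int:
    """1-based place in line for a given request, as a requester would see it."""
    for position, item in enumerate(ordered_queue(items), start=1):
        if item.key == key:
            return position
    raise KeyError(key)
```

A newer high-priority request jumps ahead of an older low-priority one, but two requests with the same priority are still served in arrival order, so nobody gains anything by making noise.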

Self-Services

In every area where we saw repeating requests, we thought about which self-services we could provide so that developers could handle them on their own. We decided to use Backstage as our hub of self-service generators. Backstage enabled us to create multiple self-services in various areas such as S3, RDS, GitHub Actions’ connection to AWS, and more. With it, we enabled people from our organization to do what they needed to without waiting for us.

4th Stage: Clarity

Our efforts paid off! Over six months, we cut the number of weekly requests in half. This sped up R&D cycle time and lead time by reducing dependency on the DevOps team, reduced context switching, and allowed the DevOps teams to focus more on infrastructure development.

Next Stages

We are not done. Using the data-driven approach, we continuously examine the areas where our developers need help and make sure self-services exist for them.

Furthermore, we are now starting another initiative called “DevOpsClub” to give more freedom to the R&D teams. The DevOpsClub is a team of developers who are trained on DevOps subjects and given DevOps permissions. They are granted a 20% time allocation to do DevOps tasks and further increase their R&D group’s velocity. We will tell you more about it in another blog post…