How we divided engineering toil by two in one year

David Mercklé
Alan Product and Technical Blog
8 min read · May 17, 2023

At Alan, we strive to maximize value for our users. In this fast-paced environment, we often prioritize what provides impact or learning as quickly as possible.

We don’t try to anticipate every possible edge case before shipping a new feature: doing so could be useless (the edge case might never happen) or premature (it won’t happen in significant volumes for a while), and choosing other battles is a better use of our time.

The problem

However, as time goes by, we have more users, our product becomes more complex, and what could be ignored one year ago is now a real problem.

We spend growing amounts of energy manually working around missing pieces in our user-facing product or internal back office. In such cases, the support team escalates requests to engineers. We used to have a rotation (called eng on-call, even though it was only active during business hours) handling those escalations.

When we were smaller, a single engineer could manage the workload for the whole product scope. At some point, this was no longer sufficient, so we began having two engineers on-call simultaneously. When we exceeded 200k users, the system clearly reached its limits:

  • Engineers on shift were overwhelmed by tickets, resulting in long hours and too much stress
  • The full product scope had become too wide for them to feel efficient and empowered in any situation
  • We were consuming much more than two full-time engineers’ worth of capacity, as on-call engineers frequently needed to call on experts in sub-parts of our stack
  • Experts often ended up handling tickets by themselves
  • We estimated that up to 6 full-time engineers were dedicated to on-call activities

We were simply struggling to keep up with toil as defined in Google’s SRE principles:

Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows

A dog sitting in a burning room
The famous This is fine meme by KC Green. We have created a custom emoji for it in Slack. Guess its name? :toil:, of course!

The way we use software engineers’ time is essentially a trade-off between running the business (toil, and more generally maintenance) and transforming the business (providing enduring new impact, e.g. through product building). Shipping faster while introducing sources of toil creates debt. That is neither good nor bad by nature; it’s a trade-off that often makes sense initially. The hard part is identifying when to repay it.

For us, it had become obvious that fighting toil was a priority, for at least 3 reasons:

  • Engineers had started to get weary, and we want to keep a healthy and engaged team.
  • Time spent on toil is time we cannot spend building features and providing new value for our users.
  • Lastly, at that time, Alan had started a plan to reduce our cost to serve, which includes engineering run.

Our strategy

We decided on a strategy relying on 3 levers: distribute ownership, measure, and provide common tooling.

Distribute Ownership

Twin Marmots

Our cornerstone decision was to distribute on-call activities to the building blocks of our organization: the product areas. Unlike product teams, which are ephemeral and get sunset once their mission is complete, areas have a long-term mission on their scope: they each own a vertical of our product.

Initially, areas were only responsible for building, while the engineering community as a whole maintained existing features. We decided to map maintenance responsibilities to product areas: each area created an on-call rotation and populated it with its own engineers.

This came with an important benefit: the trade-off between run and build can now be decided locally instead of globally. If an area wants to repay its debt to make engineers more available for future build work, it can simply prioritize toil reduction initiatives. Conversely, it can decide to postpone them so as not to miss another opportunity.

Our pre-existing generic on-call role was re-scoped. It became responsible for dispatching issues to the area rotations and for handling transversal issues that cannot be mapped to a single area. And to remove low-value work, we ended up automating issue dispatching with a machine learning model.

Measure

a marmot with a tape measure

It is always easier to get something tackled when it is measured, especially when discussing roadmap priorities. Without metrics, the reality of a situation can easily be skewed by personal perceptions that suffer from various biases.

Hence we defined a north star metric: the total on-call load weighing on engineers’ shoulders. And a goal: divide it by two within one year.

We considered various ways to calculate this metric, or to find a proxy for it, like the number of tickets. But our on-call load comes from various sources and tools, such as requests escalated by the support team or technical issues raised by automated monitoring. Also, solving an issue can take anywhere from a few minutes to more than a day, depending on its complexity.

So we implemented a simple, generic solution: at the end of their shift, each engineer estimates the share of their time that was eaten by on-call activities.

Those estimates are not perfect, but they are good enough for what matters: monitoring the trend. Below is the evolution of our overall load after we started this transformation:

The vertical axis is the number of full-time equivalent engineers dedicated to on-call activities. Each color corresponds to one of our rotations.
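As an illustration, the aggregation behind such a chart can be sketched in a few lines of Python. The record shape and rotation names below are hypothetical, not Alan’s actual schema: each engineer reports, at the end of a shift, the fraction of it spent on on-call work, and summing those fractions per rotation yields full-time equivalents.

```python
from collections import defaultdict

# Hypothetical shape: one record per engineer per shift, with the
# rotation name and the self-reported share of the shift (0.0-1.0)
# eaten by on-call activities.
reports = [
    {"rotation": "area-a", "load": 0.8},
    {"rotation": "area-a", "load": 0.5},
    {"rotation": "area-b", "load": 0.25},
]

def fte_by_rotation(reports):
    """Sum self-reported load fractions per rotation.

    Each engineer on shift is at most 1.0 FTE, so the sum of the
    fractions is the number of full-time-equivalent engineers that
    on-call consumed during the period.
    """
    fte = defaultdict(float)
    for report in reports:
        fte[report["rotation"]] += report["load"]
    return dict(fte)

print(fte_by_rotation(reports))
```

Stacking these per-rotation sums over time gives exactly the kind of trend chart shown above.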

Provide common tooling

A marmot handing over a tool box

Having a metric and knowing your baseline and your goal is only the start. The hard part follows: how do you reach that goal?

We started looking for actions that can maximize impact (toil reduction) for a given effort.

And to make it easier, we introduced a toil prioritization framework. The idea is simple:

  • Identify the root causes: we call those parent problems. Each on-call ticket can be attached to a parent problem.
  • Roughly estimate the time spent per ticket. We use 4 buckets to simplify: < 15 min, < 2h, < 1 day, > 1 day.
  • Once in a while, roughly estimate the solving cost of the top parent problems.
  • This lets us calculate a breakeven point for each problem, giving a sense of the return on investment of prioritizing its resolution.

Our prioritization framework fully leverages our ticketing system and its export in our data warehouse. Below is the corresponding view for one of our areas:

If break_even_years == 0.5, it means we’ll have a positive return on investment (more on-call time saved than time invested in solving the parent problem) after half a year
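A minimal sketch of that calculation, assuming made-up bucket midpoints and an 8-hour working day (neither figure comes from the article):

```python
# Rough midpoint cost, in hours, of a ticket in each duration bucket.
# These midpoints are illustrative guesses, not Alan's actual values.
BUCKET_HOURS = {"<15min": 0.25, "<2h": 1.0, "<1day": 4.0, ">1day": 12.0}

def break_even_years(tickets_per_year, bucket, solving_cost_days, hours_per_day=8):
    """Years until fixing a parent problem pays back its solving cost."""
    yearly_toil_hours = tickets_per_year * BUCKET_HOURS[bucket]
    return solving_cost_days * hours_per_day / yearly_toil_hours

# A parent problem causing 100 "<2h" tickets a year, fixable in 5 days:
print(break_even_years(100, "<2h", 5))  # -> 0.4
```

Sorting parent problems by this value surfaces the fixes with the fastest payback first.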

Takeaways

If you find yourself in a similar situation, you may find the following takeaways useful.

Get your hands dirty

The best way to drive a transformation is not only to think and tell, but also to do. You can try to understand pain points by reading feedback and metrics, but without getting your hands dirty to some extent, your understanding might stay shallow. I encourage you to lead by example.

On my side, I found that the following initiatives helped create momentum within the team:

  • Regularly taking a shift in various on-call rotations
  • Initiating a fun hackathon project around our biggest toil source (inability to go back in time in our back office tool)
  • Pushing toil reduction initiatives in area roadmaps, and encouraging others to do so

Limit friction

When introducing processes, keep the additional friction to a minimum. A process will be poorly received if it has limited value, or if that value is not well understood.

For instance, the first version of our bot asking engineers to fill in metric data was a bit too strict. It evilly reopened your closed ticket each time something was missing.

After some (justified) complaints, we figured out that posting a reminder comment was enough. We don’t need 100% of the data; 50 to 70% is enough. So instead of being strict, we settled on an incentive and now monitor the completion rate.

Allow auto-pilot

Once you have cleaned up your home, you can be proud, but you know that it’s not the end of the story. Dust and clutter will come back quite fast.

Toil fighting is similar. New toil sources will appear. The user base will grow, turning today’s edge cases into tomorrow’s pain points.

Once we had reached our yearly objective, we decided not to set a new goal on the absolute on-call load. Instead, we created an evergreen guideline that would remain suitable regardless of future user base and team sizes.

It states that no engineer’s average on-call load should exceed 10% of their total working time. Whenever they are close to or above that threshold, engineers have a natural lever to prioritize toil reduction within their area. It is a kind of toil budget, quite comparable to Google SRE’s error budget.
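The guideline is easy to check mechanically from the same self-reported load estimates. Here is a sketch; the 10% figure comes from the article, while the function and data shapes are hypothetical:

```python
TOIL_BUDGET = 0.10  # guideline: at most 10% of working time spent on-call

def over_budget(load_history, budget=TOIL_BUDGET):
    """True when an engineer's average on-call load exceeds the budget.

    load_history: per-period load fractions (0.0-1.0) for one engineer.
    """
    return sum(load_history) / len(load_history) > budget

print(over_budget([0.05, 0.12, 0.07]))  # average 0.08 -> False
print(over_budget([0.20, 0.15]))        # average 0.175 -> True
```

Like an error budget, crossing the threshold is not a failure; it is the signal to schedule toil reduction work.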

Conclusion

We’ve reached the end of our toil-fighting journey; thanks for following along!

One thing I like about toil is that the underlying trade-off applies to virtually any domain. For instance, it accurately describes personal life situations. One day, a bar of my daughter’s cot fell off. I made a makeshift repair in seconds and thought it would hold. However, it kept falling off every other day, and I kept applying the same fix. I finally decided enough was enough and invested 10 minutes in a proper repair.

This universality is what led us to start propagating what worked well for engineering to other teams at Alan. Various non-engineering on-call rotations have adopted load estimation, and the ops team is planning to adopt the prioritization framework.

Is toil a pain point for you? At work, in your personal life? Do you have a magic sauce? We would love to hear from you in the comments!

Marmot images in this article were AI-generated. Read here to know more!

We’re hiring!
