Most people are annoyed when they get stuck doing boring chores that could be handled more efficiently, and that frustration leads to burnout and exhaustion.
A couple of years ago Google introduced the SRE practice, and one of its core terms is toil. At first glance, it can be misinterpreted as any boring, repetitive task.
However, toil is not just work you don’t like to do. Let me elaborate on that and try to explain why it is important for DevOps and SRE.
What is toil?
In the SRE discipline, toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, and devoid of long-term value.
Thus, each time an operator needs to touch a production system, that time counts as toil.
Why is that important?
Toil tends to scale linearly as the service grows. As such, the SRE discipline strives to reduce toil as much as possible. This approach lets engineers work on real engineering tasks, so they can spend their time making their services better.
Measuring toil helps your team defend its SLOs with maximum efficiency.
Toil and toil budgets closely influence the desire to “measure everything” and “leverage tooling and automation”. By giving operators a quantitative measurement, toil and toil budgets ensure a balance between administering the system and improving it.
Toil budget with Amixr
At Amixr we follow SRE practices ourselves and build our product in line with them.
The vast majority of toil tasks are pager alerts, and since our flagship product is an incident management tool, we know best how much time your team spends on this kind of task. Therefore Amixr can help you manage your toil budget and notify you if you go over it. We are also planning to implement toil logging and surveys in the next release (add our bot to Slack, and it will notify you about the update).
Pager alerts are almost always toil because they are interrupt-driven and reactive. You never know when one is going to happen.
The SRE book suggests spending no more than 50% of your time on toil:
“Our SRE organization has an advertised goal of keeping operational work (i.e., toil) below 50% of each SRE’s time. At least 50% of each SRE’s time should be spent on engineering project work that will either reduce future toil or add service features.”
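To make the 50% rule concrete, here is a minimal sketch of a toil budget check. It is an illustration only, not how Amixr implements it: the function names and the 40-hour working week are assumptions made for the example.

```python
from datetime import timedelta

# The 50% ceiling quoted from the SRE book.
TOIL_BUDGET_RATIO = 0.5


def toil_budget_left(working_time: timedelta,
                     toil_spent: timedelta,
                     ratio: float = TOIL_BUDGET_RATIO) -> timedelta:
    """Return the toil budget still available (negative if overspent)."""
    return working_time * ratio - toil_spent


def is_over_budget(working_time: timedelta,
                   toil_spent: timedelta,
                   ratio: float = TOIL_BUDGET_RATIO) -> bool:
    """True once toil has used up more than its share of working time."""
    return toil_budget_left(working_time, toil_spent, ratio) < timedelta(0)


# Hypothetical example: a 40-hour week with 3 hours already spent on toil.
week = timedelta(hours=40)
print(toil_budget_left(week, timedelta(hours=3)))  # 17:00:00
```

A negative result means the engineer has already spent more than half of the period on toil, which is the signal to stop and invest in automation.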
How we calculate toil
For instance, at 7 am the main website goes down, and one of the monitoring systems fires a critical alert and sends it to Amixr.
Once an engineer starts to work on the incident, they mark it as Acknowledged.
At 10 am the engineer gets the service up and running again and marks the incident as Resolved. The time they spent fixing the problem is considered toil, and Amixr reports how much budget is left for further incidents.
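The example above can be sketched in a few lines of code. This is a simplified illustration under stated assumptions, not Amixr's actual implementation: the `Incident` class, its field names, the 7:10 acknowledgement time, and the 20-hour budget are all hypothetical, and toil is counted here from acknowledgement to resolution.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    alerted_at: datetime       # monitoring system fires the alert
    acknowledged_at: datetime  # engineer marks the incident Acknowledged
    resolved_at: datetime      # engineer marks the incident Resolved

    def toil(self) -> timedelta:
        # Assumption: toil is the hands-on time between ack and resolve.
        return self.resolved_at - self.acknowledged_at


def total_toil(incidents: list[Incident]) -> timedelta:
    """Sum the toil of all incidents in the period."""
    return sum((i.toil() for i in incidents), timedelta(0))


# The outage from the text; the date and the 7:10 ack time are made up.
outage = Incident(
    alerted_at=datetime(2020, 1, 6, 7, 0),
    acknowledged_at=datetime(2020, 1, 6, 7, 10),
    resolved_at=datetime(2020, 1, 6, 10, 0),
)

budget = timedelta(hours=20)  # hypothetical: 50% of a 40-hour week
spent = total_toil([outage])
print(spent)           # 2:50:00
print(budget - spent)  # 17:10:00
```

If you prefer to count the waiting time before acknowledgement as toil too, swap `acknowledged_at` for `alerted_at` in `toil()`.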