What is operations toil?
Operations toil is the set of repetitive tasks that every SRE has to do to make sure servers and the applications running on them are working properly. With big applications, which have thousands of servers and microservices running behind them, it can be very hard for the designated on-call person to manage all alerts and fix them as soon as possible. The way to deal with this is to take measures to reduce alerts, or to resolve some of them automatically; otherwise, the task can seem impossible.
An SRE is supposed to spend as much time as possible automating the application environment and contributing to the application code. All of these operations tasks consume a lot of valuable time, and since they are repetitive, they can easily become toil.
Why operations toil leads to burnout
Operations toil is bad because it may lead to boredom or burnout on your on-call days and, eventually, with your job as a whole. It also creates confusion about what an SRE should be spending their time on. Operations toil creates roadblocks as well, increasing the feature delivery time for any application.
Operations toil can also reduce the learning opportunities for SREs, since you are involved in the same kind of issues and tasks over and over again. Without time to train and learn, SREs can’t get up to speed or write and update code for automation or application features.
The best strategies for reducing operations toil
1. Implement auto-remediation
“A human should not be the first point of contact for any alert.”
A strong strategy for reducing operations toil is to make sure that, for every alert, there is an automated response that can fix the alert right away; this will allow you to dramatically reduce alert fatigue. Most alerts that you encounter with an application have a defined set of steps to resolve them. These steps could be as straightforward as restarting a service, or they can be a little more complicated, like a decision tree in which the next step depends on the output of the previous step.
All of this can be automated. Always keep in mind that “anything a human can do, a machine can also do.”
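The decision-tree idea can be sketched in a few lines of Python. This is a minimal illustration, not a production remediation engine: the Step structure, the step names, and the placeholder actions are all hypothetical, and real actions would call your service manager or cloud APIs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One node in a remediation decision tree (hypothetical structure)."""
    name: str
    action: Callable[[], bool]            # returns True if the step succeeded
    on_success: Optional["Step"] = None   # next step when the action succeeds
    on_failure: Optional["Step"] = None   # next step when the action fails


def run_remediation(step: Optional[Step]) -> List[str]:
    """Walk the tree from `step`, recording each outcome, until a leaf is reached."""
    log = []
    while step is not None:
        ok = step.action()
        log.append(f"{step.name}:{'ok' if ok else 'failed'}")
        step = step.on_success if ok else step.on_failure
    return log


# Placeholder actions (assumptions): each lambda stands in for a real command.
escalate = Step("page_on_call", lambda: True)
retry = Step("restart_again", lambda: True, on_failure=escalate)
free_disk = Step("free_disk_space", lambda: True, on_success=retry, on_failure=escalate)
restart = Step("restart_service", lambda: False, on_failure=free_disk)

print(run_remediation(restart))
```

In this sketch a human is paged only when every automated branch has failed, which matches the idea that a human should not be the first point of contact.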
2. Alert categorization and classification
“Every alert should be handled as per its priority.”
Not every alert is high priority or needs attention the moment it fires. Cut down on purely informational notifications and focus on real alerts. Make sure that alerts are categorized and sent to their respective channels based on priority. For example, an alert like a service failure should be handled as soon as possible, and an SRE should be notified right away. On the other hand, alerts like hardware degradation or swap memory threshold breaches can wait for some time. A ticket should be created for these alerts so that the on-call person can address them during the next business hours.
If you make sure all alerts are properly categorized and classified, you will see fewer alerts arriving as high priority, and you will be able to manage your precious time accordingly.
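As a rough sketch of priority-based routing, the following Python maps alert types to a channel. The alert type names, the two channels, and the choice to treat unknown alert types as high priority are all assumptions made for illustration; your own categories will differ.

```python
from enum import Enum


class Priority(Enum):
    HIGH = "high"
    LOW = "low"


# Hypothetical alert types, based on the examples in the text above.
LOW_PRIORITY_TYPES = {"hardware_degradation", "swap_threshold_breach"}


def classify(alert_type: str) -> Priority:
    """Classify an alert; unknown types default to HIGH so nothing is silently deferred."""
    if alert_type in LOW_PRIORITY_TYPES:
        return Priority.LOW
    return Priority.HIGH


def route(alert_type: str) -> str:
    """Return the channel for an alert: page the on-call SRE or open a ticket."""
    return "page" if classify(alert_type) is Priority.HIGH else "ticket"


print(route("service_failure"))        # page
print(route("swap_threshold_breach"))  # ticket
```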
3. Add self-service tools
“SREs don’t need to do every task, so start offloading.”
Every SRE gets requests that require them to gather information, which could be related to security, diagnostics, or something else entirely, and hand that information over to another team. You can reduce the number of these requests by creating a self-service portal, where requesters can collect the data they need whenever they want, all without disturbing SREs.
This self-service portal not only helps reduce these repetitive request tasks, but also shortens the time needed to complete a request. This strategy is particularly effective for both reducing operations toil and improving customer satisfaction.
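One way to picture the back end of such a portal is a registry of read-only reports that requesters can run by name. The sketch below is only an assumption about how this might look; the report name, the host argument, and the placeholder report body are hypothetical, and a real portal would add authentication and query actual systems.

```python
from typing import Callable, Dict

# Hypothetical registry of read-only reports that requesters may run themselves.
REPORTS: Dict[str, Callable[[str], str]] = {}


def report(name: str):
    """Decorator that registers a function as a self-service report."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        REPORTS[name] = func
        return func
    return register


@report("disk_usage")
def disk_usage(host: str) -> str:
    # Placeholder: a real implementation would query your monitoring system.
    return f"disk usage report for {host}"


def handle_request(report_name: str, host: str) -> str:
    """Serve a request without involving an SRE; reject anything unregistered."""
    if report_name not in REPORTS:
        raise KeyError(f"unknown report: {report_name}")
    return REPORTS[report_name](host)


print(handle_request("disk_usage", "web-01"))
```

Restricting the portal to registered, read-only reports keeps requesters from needing broader access to production systems.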
4. Create proactive alerts
“Catch anomalies to prevent future alerts.”
A very good example of a proactive alert that reduces future operations toil is anomaly detection. Anomaly detection, also known as outlier detection, is the process of identifying data points that deviate from a normal or expected pattern. In many cases, if you are able to catch a problem in its early stage this way, then you will be able to prevent multiple alerts or issues from occurring in the future.
SREs should focus as much as possible on proactive alerts, rather than reactive alerts. This will help in identifying any issues before they occur; the end result is less downtime and the avoidance of multiple future alerts and issues.
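To make outlier detection concrete, here is a minimal sketch that flags points far from the mean of a trailing window, measured in standard deviations. This is one simple technique among many, not necessarily what your monitoring stack uses; the window size, the threshold, and the sample latency values are assumptions.

```python
from statistics import mean, stdev
from typing import List


def find_anomalies(values: List[float], window: int = 5, threshold: float = 3.0) -> List[int]:
    """Return indices of values that deviate from the trailing-window mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all is a deviation from the pattern.
            if values[i] != mu:
                anomalies.append(i)
        elif abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds, with one obvious spike.
latency_ms = [10, 11, 10, 12, 11, 10, 50, 11]
print(find_anomalies(latency_ms))  # [6], the spike to 50
```

Note that a spike stays in the trailing window for a while and inflates the standard deviation, so a real detector would need to handle that, for example by excluding flagged points from the history.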
5. Analyze your on-call traffic and act accordingly
“Analyze alerts and update monitoring.”
SREs should create a model for regular analysis of monitoring data. This allows you to see which kinds of alerts occur most frequently, and to determine how many high or low priority issues arrive in a given period of time.
Based on this data, SRE teams can update monitoring practices, add new auto-remediation, and/or fix the root cause of frequently occurring issues. This will help in reducing repetitive alerts and, over time, will reduce operations toil.
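A first pass at this kind of analysis can be as simple as counting alerts by name and by priority. The sketch below assumes each alert is a dictionary with "name" and "priority" keys; that record shape and the sample data are hypothetical, so adapt them to whatever your alerting system exports.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_alerts(
    alerts: List[Dict[str, str]], top_n: int = 3
) -> Tuple[List[Tuple[str, int]], Dict[str, int]]:
    """Return the `top_n` most frequent alert names and a count per priority."""
    by_name = Counter(alert["name"] for alert in alerts)
    by_priority = Counter(alert["priority"] for alert in alerts)
    return by_name.most_common(top_n), dict(by_priority)


# Hypothetical alerts collected over one on-call period.
sample_alerts = [
    {"name": "disk_full", "priority": "low"},
    {"name": "disk_full", "priority": "low"},
    {"name": "disk_full", "priority": "low"},
    {"name": "api_down", "priority": "high"},
    {"name": "api_down", "priority": "high"},
    {"name": "swap_high", "priority": "low"},
]

top_alerts, priority_counts = summarize_alerts(sample_alerts, top_n=2)
print(top_alerts)       # [('disk_full', 3), ('api_down', 2)]
print(priority_counts)  # {'low': 4, 'high': 2}
```

The most frequent names in a summary like this are natural candidates for new auto-remediation or a root-cause fix.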
Fewer alerts, happier SREs
If you spend time reducing alert fatigue and operations toil, it will have resounding positive effects, both in terms of career satisfaction and performance. With less operations toil, we’ll be able to make our applications more resilient and more available, and SREs will be able to focus more on enhancing their skills.