Eliminating Alert Fatigue: 9 Ways One Team Reduced Alerts by 80% in a Month

Michael Hedgpeth · Splunk Engineering · Feb 23, 2022 · 5 min read

We have all been there at one point or another. It might be at a standup when someone joins with a groggy voice after staying up all night because they were paged every hour on the hour. It might be when the team is slacking each other during dinner time because the alerts are keeping them from getting any real work done during normal working hours. Or it might be that a key team member has to take extended time off because they’re just burned out. It’s Alert Fatigue, and it’s a very real problem.

Recently, an internal observability platform team for Splunk Cloud found itself in this very situation and, in about six weeks, reduced alerts by an average of 80%. Alert volume is now very manageable for the team. How did we do it? We’re glad you asked! Here are nine things we did that led to success:

  1. We recognized there was a problem. It would have been easy to continue down the roadmap and build more exciting things that would delight Splunk customers. However, we recognized that ignoring this problem would lead to more burnout, a lower quality of life, and ultimately a worse customer experience. This problem was more important than expanding our platform with more features.
  2. We viewed this as primarily an engineering problem. It would have been easy to think “let’s hire more people” or to blame external factors for our issues. However, we are engineers first, and we were going to solve this as engineers before resorting to expanding the team. And as engineers, we didn’t treat only the symptoms of the problem; we went after the true root cause. This way, we end up with a system that is intrinsically reliable and does not require human intervention. And, from that place, we can hire people to do valuable work instead of repeatedly applying minor manual remediations to the same issues.
  3. We had alignment with leadership. As Jeremy Rishel writes in his post, our first priority is to our customers. Our leadership at Splunk supports dropping everything to ensure customers have the best experience possible. So when we told them we needed to delay other priorities to make sure our alerts actually served system reliability, we got support, not opposition.
  4. We limited work in progress to focus the team on this outcome. We finished up a couple of projects. We delayed a few more. And thus the priority was clear: let’s focus our efforts on reducing alerts the right way. It was no longer an afterthought because we didn’t have five other items currently in progress to distract us.
  5. We measured success. We created a dashboard that showed our alert volume over time. We also had to do some extra work on our alerting configuration to make sure that all alerts were properly counted on that dashboard. We then set aspirational goals to reduce that number by 20% in 30 days and by 50% in 60 days. We knew, from the system itself, what success meant. (A minimal sketch of how such a weekly count could be tracked follows the dashboard below.)
[Figure: Our dashboard of alert volume over time]
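
To make the measurement in point 5 concrete, here is a minimal sketch of how a weekly alert count and a reduction-versus-baseline number could be computed, assuming alert events can be exported as plain Unix timestamps. The function name, the export format, and the baseline figure are illustrative assumptions, not our actual dashboard query.

```python
from collections import Counter
from datetime import datetime, timezone

def weekly_alert_reduction(alert_timestamps, baseline_weekly_count):
    """Bucket alert events by ISO week and compare each week to a baseline.

    Hypothetical helper: `alert_timestamps` is a list of Unix timestamps
    exported from the alerting system; `baseline_weekly_count` is the
    pre-project weekly alert volume we measure against.
    """
    weeks = Counter()
    for ts in alert_timestamps:
        year, week, _ = datetime.fromtimestamp(ts, tz=timezone.utc).isocalendar()
        weeks[(year, week)] += 1
    return {
        week: {
            "alerts": count,
            "reduction_vs_baseline": 1 - count / baseline_weekly_count,
        }
        for week, count in sorted(weeks.items())
    }

# Example: against a baseline of 200 alerts per week, a week with only
# 40 alerts shows up as an 80% reduction.
report = weekly_alert_reduction([1645574400.0] * 40, baseline_weekly_count=200)
```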

6. We ensured that alerts were properly defined and actionable. We defined an alert as evidence of an imminent reduction of service (or incident) with a clear proposed action, even if that action is to create an incident and explore. Before, an alert might have existed for informational purposes only, unrelated to any incident. We rerouted those into an informational Slack channel that we agreed would only be consulted during investigations. We also decided that problems on staging environments outside of working hours didn’t warrant waking people up. This allowed us to focus on the remaining low-severity incidents that were driving the rest of the alert fatigue. (A minimal sketch of this kind of routing rule follows below.)
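
As an illustration of point 6, below is a minimal sketch of the kind of routing decision we converged on. The field names, the channel name, and the working-hours window are hypothetical stand-ins; real alerting tools expose similar attributes through their own configuration.

```python
from datetime import datetime

WORKING_HOURS = range(9, 18)  # hypothetical 09:00-18:00 working-hours window

def route_alert(alert: dict, now: datetime) -> str:
    """Decide whether an alert pages a human, goes to Slack, or waits.

    `alert` is a hypothetical dict with `actionable` and `environment`
    fields; the return value names a destination rather than calling a
    real notification API.
    """
    # Informational-only alerts never page; they go to an
    # investigation-only Slack channel.
    if not alert.get("actionable", False):
        return "slack:#observability-info"
    # Problems on staging outside working hours can wait until morning.
    if alert.get("environment") == "staging" and now.hour not in WORKING_HOURS:
        return "ticket:next-business-day"
    # Everything left is evidence of an imminent reduction of service:
    # page the on-call engineer with a clear proposed action.
    return "page:on-call"

# Example: a staging problem at 02:00 becomes a ticket, not a page.
assert route_alert({"actionable": True, "environment": "staging"},
                   datetime(2022, 2, 23, 2, 0)) == "ticket:next-business-day"
```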

7. We continuously applied our incident process to the numerous low-severity alerts. It’s easy to think of incident analysis and after-action reviews as applying only to the “big” incidents. But they can apply to anything! We did the hard work of stepping back from an alert and asking: what is the true engineering remediation that ensures this never happens again? This is all part of true Site Reliability Engineering, where you automate yourself out of this year’s job. The key is to give yourself permission to do that analysis on even a seemingly insignificant alert. That exercise uncovered so many improvements.

8. We saw a reliable system with limited alerts as a way to save money. We ran into a couple of cost-related mental barriers on items related to our alerts. The first was a hesitation to work on an alert if remediating it properly would take longer than simply dealing with it each time it fired. Fundamentally, we had to remind ourselves that the whole was greater than the sum of its parts and not get pulled into a cost/benefit analysis on each individual alert: once the alerts were fixed, the system as a whole would be better. The second barrier was a fear that tuning the system to avoid alerts would increase our cloud bill. We found that tuning the system for true elasticity solved this problem. In the end, with the system properly configured, our cloud bill actually went down, and the system was reliable and alert-free even at peak times.

9. We changed our process to reduce the risk that this happens again. After completing the project, we were ecstatic to have reduced alerts by 85%. But we were also worried about ending up right back in this situation a few months later. So we changed our process: during our weekly team meeting we identify alerts that require remediation, and from that exercise we put work on the backlog right away and make room to keep our environment in the right state. With this new process in place we’re confident we won’t get anywhere close to the alert fatigue the team had to endure.

We hope you can learn from us about how to eliminate alert fatigue on your team. Burnout is real, especially during the pandemic. We have found that taking the time to focus on the root causes and remediations of the problems in front of us is a win for everyone: our employees, our shareholders, and most importantly, our customers.

What other tips/tricks have you implemented to help alleviate alert fatigue? Let us know in the comments.
