The Art of Managing Non Critical Production Alerts

If your company has had software in production for some time, you are probably familiar with this story. One of your teams is the keeper of a critical piece of production software. You are a small or medium-sized new-age company that doesn’t have a NOC (those are so 1900s anyway!), and you have set up various alerts to notify you in case something goes wrong in production. Some of these are critical alerts, set up with a combination of Sumo Logic or Splunk plus PagerDuty, or CloudWatch (if you are on AWS) plus PagerDuty. There is a whole other category of alerts that are not always critical, but you still want to know what’s going on, so instead of using PagerDuty you have them go to your email or a Slack channel. This category also includes somewhat general alerts such as ‘errors > 500 per minute’, where a human being decides whether the alert needs action or not. In some cases these errors also fix themselves within a few minutes. For example, if a partner is sending a wrong value to our ad server, it’s possible that they will discover it and fix the issue within a few minutes. In ad tech this is a common scenario, as many companies’ software systems talk to each other in real time using the Real-Time Bidding protocol.

It is this second type of non-critical alert that I am going to talk about. The behaviour I have seen is this: when an alert triggers PagerDuty and hence definitely needs human intervention, the team acts diligently and fixes the issue. But when they find out that an alert is harmless, or that it comes and goes away within 5 minutes, they simply ignore it. And as a result, those alerts keep coming!

Here is a concrete example of one such alert:

An Example Non Critical Alarm Email

This alert says that the DynamoDB write capacity consumed is unusually high. Within 5 minutes another alert arrives, stating that the consumed write capacity is back to normal. In AWS speak, we call these OK alerts. This alert used to trigger (and go away) every day at a specific time. After investigating, we realized that one team was running a job on DynamoDB that consumed write capacity at an unusually high rate at that time every day.

The engineer who investigated the alert realized that it was not a big deal: DynamoDB would automatically scale up to meet the demand, the job would be over in 5 to 10 minutes, and an OK alert would arrive about 5 minutes later. After that realization, the person didn’t do anything! And the alerts kept coming every single day! Every single day at the exact same time I would get two alerts: the alert above and, about 5 minutes later, the OK alert.

Now, what should this engineer have done differently? To begin with, on realizing that this was a non-critical alert that was going to come every day, the person should have simply raised the threshold of the alarm above the job’s daily peak. That alone would have stopped the alert!
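To make "raise the threshold" concrete, here is a minimal sketch in Python. It is not what we actually ran. The peak values, the 20% headroom, and the function and alarm names are all hypothetical. The second function assumes a simple single-metric CloudWatch alarm and the boto3 library, and it copies the existing alarm settings because put_metric_alarm replaces the whole alarm definition.

```python
import math


def suggested_threshold(daily_peaks, headroom_pct=20):
    """Return a threshold just above the highest known-harmless peak.

    daily_peaks: consumed write capacity observed while the harmless
    daily job was running (hypothetical numbers, gathered by hand).
    headroom_pct: extra margin, in percent, so normal jitter stays quiet.
    """
    if not daily_peaks:
        raise ValueError("need at least one observed peak")
    if headroom_pct < 0:
        raise ValueError("headroom_pct must not be negative")
    return math.ceil(max(daily_peaks) * (100 + headroom_pct) / 100)


def raise_alarm_threshold(alarm_name, new_threshold):
    """Raise the threshold of an existing single-metric CloudWatch alarm.

    Assumption: the alarm watches one metric with a plain Statistic
    (no metric math, no anomaly detection band).
    """
    import boto3  # imported lazily so the helper above works without AWS

    cloudwatch = boto3.client("cloudwatch")
    alarm = cloudwatch.describe_alarms(AlarmNames=[alarm_name])["MetricAlarms"][0]
    if new_threshold <= alarm["Threshold"]:
        return  # never lower an alarm by accident

    # put_metric_alarm overwrites the alarm, so carry over what is there.
    keep = (
        "AlarmName", "AlarmDescription", "ActionsEnabled", "OKActions",
        "AlarmActions", "InsufficientDataActions", "MetricName", "Namespace",
        "Statistic", "Dimensions", "Period", "Unit", "EvaluationPeriods",
        "DatapointsToAlarm", "ComparisonOperator", "TreatMissingData",
    )
    params = {key: alarm[key] for key in keep if key in alarm}
    params["Threshold"] = float(new_threshold)
    cloudwatch.put_metric_alarm(**params)


# Hypothetical peaks seen during the daily job on three different days.
print(suggested_threshold([480, 510, 495]))
```

The point of the headroom is that the alarm still fires if the job (or anything else) starts consuming noticeably more than its usual peak, so you silence the noise without going blind.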

But the person didn’t do it, and hence we had two extra emails coming to the entire team’s inbox every single day. Now imagine a scenario where, throughout the day, 10 different useless alerts come to your inbox at different times. Counting the OK alerts, that is 20 useless emails! And if you keep ignoring them, this number keeps growing! Then engineers start ignoring the real alerts too, because they think most of these alerts do not require any action anyway! I have also had an engineer tell me that he missed an important email of mine because it was buried under the alerts! So much for ignoring these alerts!

The moral of the story is that engineers must try their level best to reduce the unnecessary alerts coming to their inbox. While this sounds simple, in practice it’s very difficult to implement. The engineers on call are always working on some important project, and they do not have enough time (at least in their minds) to devote to these problems. The moment they realize that there is nothing critical about an alert, they stop looking and go back to their important project work.

There are many ways to resolve the issue:

  • The manager or the senior leadership in engineering needs to take this issue seriously and follow up with the engineers. At GumGum, as the head of engineering, I get all the ad server alerts in my inbox, and they keep annoying me! As they annoy me, I annoy my managers and engineers and demand that only a minimum number of alerts come in every day and that the ones that do come actually require some intervention from the engineers.
  • Inbox audit: Conduct a quarterly inbox audit to count the useless alert emails, and create JIRA tickets to remove these alerts. Make sure that at least one such JIRA ticket is scheduled for each sprint (or month).
  • Educate your engineers and explain to them why such a step is important. Incentivise them to show the right behaviour. Whenever an engineer proactively takes care of an alert, praise them. Communicate clearly what is expected of them. In fact, in my opinion, if you haven’t communicated this expectation, then you have no right to judge them.
  • Track a scoreboard! Simply start tracking in a spreadsheet how many alerts can be easily eliminated. Make sure this scoreboard is visible to the team. I have done this in the past and found that I can do it using just email; no other systems are needed! It’s a bit of tedious work and you need discipline for it, but you only need to do it once to convince your engineers that the issue is important.
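If counting by hand feels too tedious, the inbox audit and the scoreboard can be roughed out with a few lines of Python. This is only a sketch under stated assumptions: the alert emails have already been exported as (sent time, subject) pairs, the subjects look like `ALARM: "name" in ...` and `OK: "name" in ...`, and an alert that clears within 10 minutes counts as a candidate for elimination. The function name, the alarm names, and the sample data are all made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed subject shape: ALARM: "alarm-name" in <region> / OK: "alarm-name" in <region>
SUBJECT_RE = re.compile(r'^(ALARM|OK): "([^"]+)"')


def self_resolving_counts(emails, window=timedelta(minutes=10)):
    """Count, per alarm, how often an ALARM email was followed by OK within `window`.

    emails: iterable of (sent_at: datetime, subject: str) pairs.
    Non-alert emails are skipped.
    """
    open_since = {}      # alarm name -> time of the ALARM still waiting for its OK
    counts = Counter()
    for sent_at, subject in sorted(emails):
        match = SUBJECT_RE.match(subject)
        if not match:
            continue
        state, name = match.groups()
        if state == "ALARM":
            open_since[name] = sent_at
        elif name in open_since:
            if sent_at - open_since.pop(name) <= window:
                counts[name] += 1
    return counts


# Made-up sample: one alarm that clears itself daily, one that needed real work.
inbox = [
    (datetime(2024, 1, 1, 2, 0), 'ALARM: "dynamodb-write-capacity-high" in US West (Oregon)'),
    (datetime(2024, 1, 1, 2, 5), 'OK: "dynamodb-write-capacity-high" in US West (Oregon)'),
    (datetime(2024, 1, 2, 2, 0), 'ALARM: "dynamodb-write-capacity-high" in US West (Oregon)'),
    (datetime(2024, 1, 2, 2, 5), 'OK: "dynamodb-write-capacity-high" in US West (Oregon)'),
    (datetime(2024, 1, 2, 9, 0), 'ALARM: "ad-server-errors" in US West (Oregon)'),
    (datetime(2024, 1, 2, 11, 0), 'OK: "ad-server-errors" in US West (Oregon)'),
    (datetime(2024, 1, 2, 12, 0), "Lunch on Friday?"),
]

# The scoreboard: noisiest self-resolving alerts first.
for name, times in self_resolving_counts(inbox).most_common():
    print(f"{name}: self-resolved {times} time(s)")
```

The alerts at the top of this list are the ones worth a JIRA ticket first: nobody acted on them, and they went away on their own anyway.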

In case you are wondering what we did, we ended up doing the first and the third (management follow-up and educating the engineers). I am soon going to implement the second, the inbox audit. I am sure there are other ways to tackle this problem too. Please let me know in the comments which ways you use in your organization!