Eliminating False Alarms For Good

And actually regain your precious time and focus

Teofebano Kristo
Xendit Engineering
10 min read · Aug 30, 2021


Introduction

It was Friday evening, the perfect time to lie back and actually take a step back from my week-long work stuff (read: software engineering). A chilled beer in my right hand, Netflix starting to roll out my favorite series, what a way to start the weekend. Then a sudden call hit my phone. A foreign number, and I decided to pick it up lazily. There was a bot talking on the other end of the line, notifying me that something was wrong with our system and customers were being impacted. I literally jumped off my couch, opened my laptop, and signed in to my team's monitoring system, only to find that the problem had resolved by itself and everything was back to its normal state.

On another occasion, I was preparing a very ambitious project involving one of our biggest customers, one famous for giving tight deadlines. I was in the zone already, feeling like I could do and accomplish anything I wanted, I was on top of the world. Then my phone buzzed, blasting one of The Weeknd's songs out loud. Again, a foreign number. I picked it up, and the bot voice started telling me that something bad was happening to the system again. I ditched the project right away and went to verify the truth, only to find that it had solved itself again within minutes.

And I had enough of this nonsense. I deliberately connected the monitoring system to my personal number so I would get notified if anything bad happened and could react to it before customers noticed or even complained. Yet this was not what was happening, since I kept getting bugged about things I had no next steps for. If you don't need to do literally anything to solve the problem, then it is classified as a false alarm.

And this needs to be stopped.

definitely not ok, and definitely needs to stop

As a little bit of context, my team uses Datadog as our monitoring platform and connects it to PagerDuty as our incident response platform. All service data under my team's scope is collected and assessed by Datadog and broken down into multiple metrics (for example: requests per second (RPS), CPU usage, latency, etc.). In some cases, metrics can be assessed together to produce composite metrics (for example, product health can be derived from the error rate of the POST, GET, and PATCH endpoints). If the evaluation of a metric falls below or rises above a certain threshold, it triggers an incident/alert that is then passed to PagerDuty. PagerDuty records it and notifies the team about the triggered incident. The flow seems perfect and ideal for my team's use case, but it turned out there were fundamental problems in how we set it up in the first place, and that is where the false alarms came from.
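To make the flow concrete, here is a minimal sketch of how a composite metric and a threshold check fit together. It is only an illustration in plain Python, with made-up endpoint names, numbers, and thresholds, not our actual Datadog monitor definitions.

```
# Illustration only: made-up endpoints, rates, and threshold, not our real Datadog monitors.
# A composite "product health" signal derived from per-endpoint error rates,
# compared against a threshold to decide whether an incident should fire.

error_rates = {"POST": 0.02, "GET": 0.01, "PATCH": 0.08}  # last evaluation window

# Composite metric: the worst error rate across the endpoints we care about
product_error_rate = max(error_rates.values())

ERROR_RATE_THRESHOLD = 0.05  # alert if more than 5% of requests fail

if product_error_rate > ERROR_RATE_THRESHOLD:
    # In the real setup, Datadog raises the alert and hands it to PagerDuty,
    # which records the incident and pages the on-call engineer.
    print(f"ALERT: product error rate {product_error_rate:.0%} exceeds threshold")
else:
    print("All good, no incident triggered")
```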

Then What's the Problem?

Right intentions with right efforts at the right time will always yield good results

This quote strikes me hard, and we can derive three things from it that we need in order to produce good results:

  1. Right intention
  2. Right efforts
  3. Right time

Let's talk about them one by one to actually understand the root cause of this false alarm problem.

What is your intention in setting up the metrics and incident response?

Almost all of my team members, including me, can answer this question in a similar manner, and I bet you can as well.

To be able to tackle problems as they arise, before customers notice them

Yes, that's correct, but what's next? Do you really, really care about your customers, or do you simply not want to get scolded by your supervisor? Do you really feel the pain that your customers are going through because of the problem your system is causing? Do you understand how much money your customers lose when your system misbehaves or performs under par? These questions sound simple and philosophical, but you can only produce the right efforts if you place yourself in your customers' shoes.

Three magic words to help you set the right intention: care, feel, and understand.

I do care about, feel for, and understand my customers, but false alarms still occur. Why?

Next come the right efforts. A monitoring system is only a tool, and at the end of the day it's the owner who can make it great or turn it into a burden for the team.

Like any regular software, a monitoring system requires maintenance, unless you want it to become obsolete or no longer relevant to your current condition. The fundamental idea is actually the same as for regular software. At the time you create the monitoring system, you can see it as "perfect" because you use everything you know at that point to construct it. However, being human, you grow alongside your system. Your definition of "perfect" might change as well, so several adjustments must be made to your monitoring system. Maybe 10 RPS was a lot two months ago, and setting that number as the maximum RPS before an alert is triggered made a lot of sense. But how about now? Maybe your average RPS has already grown to 20, so the alert triggers almost all the time. Tracking all of your metrics in a sheet can help you get a quick overview and actually read your current system's performance from it. It also gives you the status of each metric, whether it's ACTIVE or simply has NO DATA.

Example of metrics tracker
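Since the screenshot isn't reproduced here, here is a rough sketch of the shape of that tracker. The metric names, types, thresholds, and statuses below are made up for illustration; the real thing is just a spreadsheet with the same columns.

```
# Made-up illustration of the tracker; the real one is a spreadsheet with the same columns:
# metric name, metric type, alert threshold, and status (ACTIVE or NO DATA).
metrics_tracker = [
    {"name": "payment_api.error_rate",  "type": "errors",     "threshold": 0.05, "status": "ACTIVE"},
    {"name": "payment_api.latency_avg", "type": "latency",    "threshold": 300,  "status": "ACTIVE"},
    {"name": "payment_api.rps",         "type": "traffic",    "threshold": 20,   "status": "ACTIVE"},
    {"name": "legacy_worker.cpu_usage", "type": "saturation", "threshold": 0.80, "status": "NO DATA"},
]

# Quick scan for metrics that silently stopped reporting (candidates to fix or delete)
stale = [m["name"] for m in metrics_tracker if m["status"] == "NO DATA"]
print("Metrics to investigate or delete:", stale)
```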

Once you have the sheet, it's easier for you to eliminate the false alarms. There are at least five things you can start doing:

1. Deep dive into your metrics

You are probably wondering why you need to deep dive into your own metrics. Weren't you the one who set them up in the first place? As silly as it may sound, deep diving into the metrics will actually make you know your monitoring tools better. Maybe when you first set them up, you were only following a guide without even knowing whether it was correct or relevant for your service. Or you were creating them in the middle of the night when you should actually have been sleeping instead (you guessed it right, it's me). Or anything else, basically, since setting up monitoring on your services is an activity that is prone to human error.

Try to look deeper at things that should not happen in the first place, like why some metrics have NO DATA. In my case, it was because the metric was pointing to old infrastructure that is no longer used since we migrated to the new one. The next step is simple: accept that those metrics are no longer useful and click that delete button without any hesitation.

Hit it baby!

In other cases, evaluating the formula turns out to be useful as well. Try to do a sense check on whether it's correct or not. In my case, I once used the error count to evaluate the error rate. While the error rate ranges from 0 to 1, the error count simply goes from 0 to infinity, so it was triggering incidents almost every day. I'm still impressed by how careless I can be sometimes. I fixed it (the error rate formula should always be error count divided by request count), and my life has never been this peaceful since that day.
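As a quick illustration of why the fix matters (a sketch only, not the actual monitor query):

```
# Sense check: error *count* is unbounded, error *rate* is between 0 and 1.
error_count = 120
request_count = 10_000

# Wrong: alerting on the raw count grows with traffic and pages you constantly.
# Right: normalize by the request count first.
error_rate = error_count / request_count if request_count else 0.0

print(f"{error_count} errors out of {request_count} requests = {error_rate:.2%} error rate")
```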

2. Add missing metrics that are essential

After you have performed the deep dive, you will basically have a holistic view of what's available in your monitoring system. In my case, I found out that I had forgotten to set up traffic monitoring (please refer to the Golden Metrics explanation), which blocked me from knowing exactly what was currently happening in my services.

In the sheet above, I could easily notice this by analyzing the values available in the metric type column, so I strongly encourage you to classify your metrics first before exercising this step; a small sketch of the idea follows.
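If the metric type column follows the golden metrics categories, spotting the gap can be as simple as a set difference (hypothetical values again):

```
# Which golden metric categories are not covered yet?
GOLDEN_SIGNALS = {"latency", "traffic", "errors", "saturation"}

# Values pulled from the metric type column of the tracker (hypothetical)
tracked_types = {"latency", "errors", "saturation"}

missing = GOLDEN_SIGNALS - tracked_types
print("Missing metric types:", missing)  # -> {'traffic'}
```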

3. Adjust the threshold to reflect your current system’s performance.

Remember the old proverb saying optimism is the key? In this very case, it is probably the key to your darkest nightmare. Being optimistic is important, but being realistic is critical. If you set a very idealistic/optimistic metric threshold, you will almost certainly get bugged all the time, since your monitoring system will always be triggering incidents.

You might say that a good average database latency should be around 50 ms at most. While in theory that is true, your system might perform far from that number, and it can be lower or higher than it.

Being realistic here is the actual key. In one of the busiest services in my system, the average database latency can even reach 500 ms during peak hours. Is it bad? Yes. Am I proud of it? No. That's the hard fact I need to accept for now. So I adjusted my metrics to reflect my current system's performance, while also praying I wouldn't get a warning letter for doing so.
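One way to keep yourself honest here is to derive the threshold from what the system actually does, for example from a high percentile of recently observed latency. The sketch below uses made-up samples and is not how Datadog evaluates monitors internally; it only shows the idea.

```
# Derive a realistic baseline from observed latency instead of an ideal number.
import statistics

# Hypothetical database latency samples (ms) collected around peak hours
latency_samples_ms = [180, 220, 250, 310, 400, 470, 500, 430, 290, 240]

p95 = statistics.quantiles(latency_samples_ms, n=20)[-1]  # 95th percentile
print(f"Observed p95 latency: {p95:.0f} ms")

# The alert threshold should sit above what the system realistically does today,
# not at the 50 ms we wish it did.
```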

4. Improve your system

Now comes the fun part. After accepting your current condition, it's time to improve it. Having a 500 ms average database latency is not good, and it will always be like that if you don't do anything about it. You need to figure out how to improve it. Maybe your current customers aren't complaining about that high latency, but what about six months from now? What if your sales team suddenly drops a bomb in the middle of the day saying a worldwide enterprise wants to use your service? This exercise also trains you to think about how you can scale your service properly and to pinpoint which parts of your system should be prioritized.

In my case, I successfully reduced the average latency from 500 ms to around 200 ms, a 60% reduction from the previous condition. It turned out this high latency was caused by an indexing problem. After analyzing the query and adjusting the indexes on the records in our database, the performance improved quite dramatically. Is it good now? Not yet, we can still improve it, but it's good for now. Am I proud of myself? Of course.

Me Right Now

5. Evaluate your metrics frequently

Now is the perfect time to readjust your metrics. Making sure they always reflect your current system's performance is mandatory. An out-of-sync monitoring system is basically useless and only costs you money without giving any additional value to you and your team.

Now that I've been able to reduce the average database latency to 200 ms, it's time to readjust the threshold to 300 ms. Maybe you are wondering why the threshold is not 200 ms. It comes back again to the purpose of your monitoring system: you want it to trigger an incident if and only if the system performs abnormally. If your system's "normal state" is 200 ms, then the "abnormal state" should be higher than that, and in this case I set 300 ms as the threshold.
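The rule of thumb can be written in one line: take the new normal state and add enough headroom that only abnormal behavior crosses it. The 1.5x factor below is just my judgment call for this particular metric, not a universal constant.

```
# Set the alert threshold above the "normal state" with some headroom.
normal_latency_ms = 200     # the new steady state after the index fix
HEADROOM_FACTOR = 1.5       # my own judgment call for this service

alert_threshold_ms = normal_latency_ms * HEADROOM_FACTOR
print(f"Alert when average latency exceeds {alert_threshold_ms:.0f} ms")  # -> 300 ms
```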

These five things are essential to really making the right efforts: deep dive, add, adjust, improve, and evaluate.

It’s all great advice, but I still got a false alarm. You are a liar!

Well, the last thing you can assess is whether you put your monitoring system in place at the right time or not. The analogy is simple: Instagram wouldn't be this famous if it had launched before 2010, when smartphone penetration was still super low. The same applies to metrics and incident response.

If you are monitoring the error rate of a new product, where your customer is just starting the integration, you will get notified by your incident response platform almost every hour. If you are monitoring the RPS of your payment endpoints, which peaks at around midday, then every midday you will get called by your incident response platform because the RPS will always reach its peak at that particular point in time.

So, to get relevant results from your monitoring system, you first need to understand your product, fully, inside and out. Understanding its behavior, its limitations, and its performance is basically the key to avoiding the false alarm chaos.
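For the midday peak example, one way to encode that understanding is to let the threshold depend on the time of day instead of using one flat number. The sketch below is hypothetical (the hours and RPS numbers are invented); seasonality-aware or anomaly monitors in your monitoring platform can achieve a similar effect.

```
# An RPS threshold that accounts for the known midday peak,
# so the expected daily spike doesn't page anyone.
from datetime import datetime

def rps_threshold(now: datetime) -> int:
    """Return the maximum expected RPS for the current hour (hypothetical numbers)."""
    if 11 <= now.hour <= 14:  # known peak window around midday
        return 60
    return 25                 # normal hours

current_rps = 55  # pretend this value came from the monitoring system
now = datetime.now()

if current_rps > rps_threshold(now):
    print("ALERT: traffic is abnormal for this time of day")
else:
    print("Expected traffic for this time of day, no alert")
```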

What’s next?

All of the advice above is based on my own and my team's experience. It's been six months since the first false alarm was raised, and two months since this false alarm initiative started. The result is simply mesmerizing: we were able to reduce the number of incidents triggered from ~60 to ~20 per month (a ~66% reduction), and the false alarms from ~40 to ~10 (a ~75% reduction). You can always start by taking a step back and understanding your true intention. From the right intention, you will be able to unleash your maximum effort to solve the problems at the right time, thus fulfilling the quote above.

Good luck with eliminating this painful experience from your and your team's life, and let's regain the focus that you and your team deserve.
