Signal vs Noise: a story about alerting

Marc Fricou · Published in inato · Mar 14, 2024

In the complex symphony of software development, observability acts as the conductor, orchestrating the harmony between system performance and user experience. Imagine this symphony as a melody of data, where every note could be a crucial signal prompting the engineering team to take immediate action, or noise that disrupts the performance.

As developers and engineers, we are tasked with composing the symphony, deciphering meaningful insights from the cacophony of information that bombards our monitoring tools. This journey into observability is not just about collecting data; it’s about understanding the melodies of performance metrics, identifying the subtle cues of potential issues, and silencing the distracting noise that can lead us astray.

👀 TL;DR: what is this about?

This article explains the fundamental difference between monitoring and alerting.

At Inato we were overwhelmed by alerts. I've seen the same in several organizations, and it is usually tied to (too) simple configurations of observability tools like Sentry.

We decided to put in place simple rules to protect our focus without forgetting about production, by aligning our alerting with our business. We now manage monitoring and alerting as an iterative process and make it evolve with our business.

🎧 Listening to the sound of our system

Diagram of a modern orchestra

Like an orchestra, observability comes with four kinds of instruments (source):

  1. 🥁 Logs or events: Logs are structured or unstructured text records of discrete events that occurred at a specific time.
  2. 🎺 Metrics: Metrics are aggregated data or calculated counts over a period of time. You usually measure CPU or RAM usage, but a metric can be anything measurable in your system (see the sketch after this list).
  3. 🪈 Distributed tracing: Tracing makes sense when your infrastructure is distributed across several processes. It follows a transaction or request between them and shows how services connect, including code-level details.
  4. 🎻 User experience: User experience data capture the outside-in user perspective of an application: page views, response speed, task success rates, etc.
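To make the first two instruments concrete, here is a tiny TypeScript sketch (all names and values are illustrative, not taken from a real codebase): a log is one discrete, timestamped record of an event, while a metric is a value aggregated over many events.

```typescript
// 🥁 A log: one structured record of a discrete event, at a specific time.
const logEvent = {
  timestamp: new Date().toISOString(),
  level: "error",
  message: "Failed to save trial application", // illustrative message
  userId: "user-123", // hypothetical context field
};
console.log(JSON.stringify(logEvent));

// 🎺 A metric: a value aggregated over many events in a time window.
const requestDurationsMs = [120, 95, 310, 87, 140]; // e.g. collected over one minute
const averageMs =
  requestDurationsMs.reduce((sum, d) => sum + d, 0) / requestDurationsMs.length;
console.log(`average request duration: ${averageMs} ms`);
```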

Some choose in-house solutions like an ELK stack (Elasticsearch, Logstash, Kibana) or, more recently, Grafana, which started as a fork of Kibana. It's open source, so you can install it on your own servers, but managed solutions also exist.

At Inato, we’re still at the beginning of our concert, we observe our system with logs and metrics, principally. We’ve chosen Sentry to aggregate error logs and provide performance metrics, while most of our logs lie on GCP log explorer. I wasn’t part of the journey when the decision was made but I can tell the context thanks to our decision records! We’ve chosen to start observing our system with an on-the-shelf tool: it’s not our core business nor our core expertise. We don’t have a dedicated platform team or system engineers in 2024.

We’ve built small Typescript packages, one for browser errors, and another for server ones. It allows Sentry to aggregate events/logs that are coming from our different applications, both automatically when an error is raised and manually from a voluntary call to Sentry APIs. Our initial configuration was chosen to be simple. Everything that was coming was sent to Slack in dedicated channels, one for each environment

And that’s where the difficulties started.

💥 When the symphony turns into cacophony

Sketch of a noisy orchestra with an angry conductor

Just like in composing music, not every sound is relevant or significant.

Our first strategy was to denylist the noise as it occurred: let every event pop up in Slack, assuming it wouldn't bother us much and that quickly triaging each alert with Sentry's silencing system would be enough. With ~30 alerts per day sent by Sentry, it rapidly became a full-time job. Most of the alerts were irrelevant (or even duplicated) and didn't need any intervention from engineers, pulling them away from what matters most: innovating and resolving issues that impact our clients.
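For context, here is roughly what the "deny when noisy" pattern looks like with the Sentry JavaScript SDK; the error patterns and tag name below are invented for illustration, not our real configuration. Each noisy error gets silenced one by one, after it has already interrupted someone.

```typescript
import * as Sentry from "@sentry/node";

Sentry.init({
  dsn: process.env.SENTRY_DSN, // assumed environment variable
  // Denylist known-noisy errors by message pattern as they are discovered.
  ignoreErrors: [/request aborted/i, "Non-Error promise rejection captured"],
  // Or drop events programmatically before they leave the process.
  beforeSend(event) {
    if (event.tags?.["expected-failure"] === "true") {
      return null; // dropped events never reach Sentry, so no Slack alert either
    }
    return event;
  },
});
```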

🚨 Alerting vs Monitoring

At Inato, we first considered Slack as a tool to help us monitor our system: everything that happened fell into a dedicated channel. An "on-call" engineer (during working hours only) was designated every week in a round-robin fashion. This engineer alone was in charge of triaging these alerts in real time and dispatching them if needed. In reality, as several engineers confessed, many of them kept watching the notifications, ready to act on them. Everyone received these notifications, and even though it was clearly not their responsibility, it was tempting to look.

We were wrong about using Slack as a monitoring tool.

Monitoring is not continuous work that you keep looking at without interruption. It is a representation of your system at a given instant (with historical data). The primary purpose of monitoring is to establish a baseline understanding of your system's normal operating conditions, help you understand an issue, and resolve bugs. It provides a comprehensive view of the data points that matter to your software. Sentry is full of views and tools to operate this monitoring efficiently.

Alerts, on the other hand, serve as proactive warnings that something in the system requires attention, enabling engineers to respond promptly. They are alarms that are triggered by specific events and that contain relevant actionable information to help diagnose and resolve issues efficiently.

Slack is, by nature, a medium for alerts: it sends notifications to the relevant individuals or teams (almost) immediately. Not every Sentry issue is an alert, and we even discovered that most of the ones we got weren't.

We needed to review our decisions.

📶 Listening to the true signals

With so much noise, we decided to move from "allow all and deny when noisy" to "deny all and allow some when relevant", keeping the aggregated issues in Sentry for a dedicated time slot each week, during which we monitor them and decide how to deal with each one (creating a task to push for a fix, or keeping it ignored). It was a real success for our focus, and it reduced noise: we moved from ~30 notifications per day to ~30 notifications… per month! Sentry now aggregates the relevant monitoring information about our system and sends alerts to Slack only when it's relevant for us and our business.
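One possible way to implement "allow some when relevant" (a sketch under assumptions, not our exact configuration) is to tag business-critical errors in code and let a Sentry alert rule, configured separately, forward only issues carrying that tag to Slack; everything else stays in Sentry for the weekly monitoring session.

```typescript
import * as Sentry from "@sentry/node";

// Hypothetical helper: only errors reported through it can trigger a Slack alert.
export function reportBusinessCriticalError(error: unknown, feature: string): void {
  Sentry.withScope((scope) => {
    scope.setTag("business-critical", "true"); // the Sentry alert rule matches this tag
    scope.setTag("feature", feature); // e.g. "trial-matching" (illustrative)
    scope.setLevel("error");
    Sentry.captureException(error);
  });
}
```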

Now, anytime we receive a notification from Sentry in Slack, we know there is an action to take.

The direct consequence of our decision is obvious, though: it can take us up to a week to notice things that might have alerted us sooner. Silencing noise might create a deceptive sense of tranquility. A system that appears calm on the surface may mask underlying issues or anomalies. This false sense of security can delay the detection of genuine problems, allowing them to escalate unnoticed. Beware of what alerts you're silencing!

⚖️ What you should keep from this story


Be conscious of what is relevant to your business: what needs to be monitored, and what needs to raise an alert. Alert on things that have an impact on your business. Identify your critical path and the metrics tied to your business, and understand how those measurements can signal that something is wrong.

This monitoring/alerting rule doesn't only apply to Sentry. Another example: flaky end-to-end tests. You'll want to monitor all of them in every environment, but you won't send an alert for development branches, as it would be a false alarm most of the time.
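As a sketch of that idea (assuming a Slack incoming webhook and Node 18+; the variable and function names are hypothetical), the test pipeline can record every run but only notify Slack for the branches that matter:

```typescript
// Branches whose end-to-end failures deserve an alert (illustrative list).
const ALERT_BRANCHES = new Set(["main", "production"]);

export async function notifyIfRelevant(branch: string, failedTests: string[]): Promise<void> {
  // Every run is still monitored: test reports are stored for all branches elsewhere.
  if (failedTests.length === 0 || !ALERT_BRANCHES.has(branch)) {
    return; // a flaky failure on a development branch stays out of Slack
  }
  await fetch(process.env.SLACK_WEBHOOK_URL!, { // hypothetical webhook secret
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      text: `E2E failures on ${branch}: ${failedTests.join(", ")}`,
    }),
  });
}
```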

Observability is not a one-time act; it's an ongoing process of refinement. A successful strategy harmonizes clarity and richness, precision and comprehensiveness. At Inato, we already know we'll need to review our strategy: we recently moved from one engineering team to several squads, and the next step is to send the right notification to the right squad instead of the whole engineering team.

What about you? How do you manage your monitoring, how do you decide whether or not you should raise an alert?
