Observability: Reducing noise in Elixir applications

WTTJ Tech · Published in Welcome Tech · Sep 18, 2023

Observability refers to the ability to monitor and analyze the health of an application’s system. Instead of focusing on individual services, it provides insight into how a system performs as a whole. Three types of data are aggregated to achieve observability: logs, metrics, and traces (the so-called “three pillars”). Logs are time-stamped records of events that occur in a system. Metrics are quantifiable or countable software characteristics that help us understand a system’s performance. And traces follow requests sent to an application, consisting of multiple spans that depict their paths across a distributed system and the time taken to process them.

To gain a deeper understanding of the concept of observability, we recommend you read Anita Ihuman’s article “An Introduction to Observability and Its Pillars”.

At Welcome to the Jungle, we used service-oriented monitoring for years, mainly because most of our codebase was located in our monolith. So when we switched to a distributed system architecture, we had to consider observability instead of just monitoring services.

This article is aimed at experienced Elixir developers looking to improve their observability practices and build stronger expertise in handling logs and exceptions.

Scalability challenges at Welcome to the Jungle

Our company has been operating web services since we launched in 2015 and has seen its traffic increase steadily over the years. A growing number of companies are using our website to showcase themselves and more job hunters are coming to our platform to find their dream company to work for.

To support this growth, we have been hiring more engineers and creating new teams. The transition from one team working on the historical monolith to five teams building their own features in dedicated microservices has brought new challenges regarding observability.

Earlier this year (2023), we were seeing 100,000 error logs a day, with Sentry reporting 1,000 exceptions. This resulted in poor observability, unhappy customers, and engineers spending a huge amount of time fixing production instead of building new features.

One of our senior back-end engineers decided to take matters into his own hands: he gathered developers from other teams to discuss our scalability issue, and the “observability initiative” was born. Its goal was to gradually build expertise on the topic and deliver incremental improvements for scalability.

This article details what we learned about decreasing the number of errors and exceptions in an Elixir application.

Reducing noise as a prerequisite

If you’re seeing more and more unhappy customers while the engineers who work on your products claim they’re not dealing with many incidents, your monitoring stack is likely suffering from too much noise.

Engineers are only human, so they tend to overlook monitoring data if there is too much information. This means, of course, that the chances of missing a major error or a regression are going to be higher.

What is noise?

Let’s define noise.

In signal processing, “noise” is a general term for unwanted (and, in general, unknown) modifications that a signal may suffer during capture, storage, transmission, processing, or conversion (Vyacheslav Tuzlukov, Signal Processing Noise, the Electrical Engineering and Applied Signal Processing Series, 2010).

For observability, noise is unwanted and unnecessary information that makes monitoring services harder and more painful to deal with.

It can be:

  • Logs emitted at the error level when the events should be labeled as informative or as warnings.
  • Exceptions that are reported even though they have already been dealt with or are not actionable.
  • An overload of notifications due to irrelevant alerting thresholds.

Why is it necessary to reduce noise?

Having too much noise can lead to the following problems for engineering teams:

  • A lack of confidence in taking any action, starting with deploying to production.
  • Longer times to detect and resolve incidents.
  • A poor developer experience, with increased mental load.

Having thousands of errors a day gives the impression that the service is hitting issues every time something occurs. It is therefore very difficult to read the monitoring data to determine whether something serious has happened when stakeholders (e.g. customers, support teams) complain about an incident or bug.

Actionable steps to reduce noise in Elixir applications

At Welcome to the Jungle, we write and deploy Elixir services. To monitor these in production, we use Sentry and Datadog.

The following recommendations mostly relate to these tools, but the general principles can be used with any other logging framework or monitoring platform.

Pick the right log level

Noise mostly comes from unnecessary error logs in monitoring applications.

Here at Welcome to the Jungle, we follow these rules of thumb:

  • An error level is set whenever the application hits an issue that has an impact and is actionable.
  • A warning is set to indicate that something unexpected has occurred but does not prevent the application from running correctly.
  • An info is set for purely informational data that we don’t need to look at on a daily basis.
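
As a rough sketch of how these rules translate into code (the messages and metadata below are hypothetical), the same Logger call site can use a different level depending on impact and actionability:

    require Logger

    # Actionable issue with user impact: worth an alert.
    Logger.error("payment capture failed", order_id: 42, reason: :provider_timeout)

    # Unexpected, but the application keeps running correctly.
    Logger.warning("missing locale for user, falling back to en", user_id: 42)

    # Purely informational; no need to look at it on a daily basis.
    Logger.info("cache refreshed", entries: 1_250)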

Interested in mastering log-level hierarchy? Please have a look at Rafal Kuć’s article “Understanding Logging Levels: What They Are & How to Use Them”.

Meet multi-line log aggregation

As seen in our tech blog article “A brief history of Erlang and Elixir”, Elixir runs on the Erlang virtual machine with the OTP framework.

At startup, multiple processes are spawned and, as errors can happen, processes can crash along the way. When a crash occurs, a multi-line report is printed, as per the below example:
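A typical (simplified) GenServer crash report looks something like this (the module, file, and process names below are hypothetical):

    [error] GenServer MyApp.Worker terminating
    ** (RuntimeError) something went wrong
        (my_app 0.1.0) lib/my_app/worker.ex:17: MyApp.Worker.handle_call/3
        (stdlib 4.3.1) gen_server.erl:1149: :gen_server.try_handle_call/4
    Last message (from #PID<0.345.0>): :do_work
    State: %{jobs: []}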

By default, loggers do not play nicely with multi-line aggregation and will emit one log entry per line, resulting in poor log collection, as per the below example:
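
With naive line-by-line collection, each line of the report above becomes its own entry, and the error loses its stack trace and context (an illustrative rendering, not the actual output of any specific tool):

    entry 1: GenServer MyApp.Worker terminating
    entry 2: ** (RuntimeError) something went wrong
    entry 3: (my_app 0.1.0) lib/my_app/worker.ex:17: MyApp.Worker.handle_call/3
    entry 4: (stdlib 4.3.1) gen_server.erl:1149: :gen_server.try_handle_call/4
    entry 5: Last message (from #PID<0.345.0>): :do_work
    entry 6: State: %{jobs: []}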

We need to instruct loggers or log collection tools to aggregate all lines relating to a crash report.
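
On the application side, one option is a custom Logger formatter that escapes newlines so each report stays on a single line. The sketch below assumes the standard :console backend and a hypothetical MyApp.SingleLineFormatter module; agent-side multi-line rules, like the Datadog feature mentioned below, achieve the same goal without touching application code.

    # config/prod.exs
    config :logger, :console, format: {MyApp.SingleLineFormatter, :format}

    # lib/my_app/single_line_formatter.ex
    defmodule MyApp.SingleLineFormatter do
      # Collapses multi-line reports into a single line so the collector
      # ingests one entry per event.
      def format(level, message, _timestamp, metadata) do
        flat =
          message
          |> IO.chardata_to_string()
          |> String.replace("\n", "\\n")

        "[#{level}] #{flat} #{inspect(metadata)}\n"
      rescue
        # A Logger formatter should never raise.
        _ -> "could not format log message\n"
      end
    end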

If you’re using Datadog to aggregate your service logs, we recommend looking at its documentation for automatically aggregating multi-line logs.

Filter out exceptions sent to Sentry

The purpose of Sentry is to report any uncaught exceptions in your application. As Elixir embraces a “let it crash” philosophy, we tend to get a lot of exceptions.

Although the Phoenix framework handles a lot of them (thanks to the Plug.Exception protocol), exceptions are still reported to Sentry because these are, ultimately, uncaught by the application service.

The good news is that Sentry can be configured to filter out any exceptions we consider safe. Tuning this in your application service helps to reduce overall noise, especially when your service receives a lot of garbage requests during bug bounty sessions, security audits, or simply from attackers trying to breach your system.

The official Sentry client for Elixir already supports a list of well-known exceptions to filter out. In June (2023), we submitted a contribution via a pull request that was merged to expand this list based on our experience with noise. You can find it here if you are curious.

To use a Sentry event filter module, you must configure it in your configuration file, as below:
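A minimal sketch, assuming a sentry-elixir version that supports the Sentry.EventFilter behaviour and its :filter configuration option:

    # config/config.exs
    config :sentry, filter: MyAppWeb.SentryEventFilter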

The module MyAppWeb.SentryEventFilter must implement the behaviour Sentry.EventFilter.
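
As an illustration, such a module could look like the following sketch (the excluded exceptions are examples, not the exact list we use):

    defmodule MyAppWeb.SentryEventFilter do
      @behaviour Sentry.EventFilter

      # Routing errors caused by garbage requests are safe to drop.
      def exclude_exception?(%Phoenix.Router.NoRouteError{}, :plug), do: true

      # Missing records already translate to a 404 via the Plug.Exception protocol.
      def exclude_exception?(%Ecto.NoResultsError{}, _source), do: true

      # Everything else is still reported to Sentry.
      def exclude_exception?(_exception, _source), do: false
    end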

We recommend having a look at the Sentry Elixir client documentation if you’re interested in crafting your own event filter.

Define a process for Sentry exceptions

To prevent engineers from being worn down by the overload of Sentry exceptions, we drafted a process for handling new Sentry issues.

Once an exception is raised, the engineers responsible for monitoring create a task on our issue-tracking tool. This process is semi-automated thanks to the Jira integration in Sentry.

The issue can either be fixed right away (bugfix), in which case the ticket is closed, or not, in which case developers are encouraged to catch the exception and log an error. Once the exception is caught, the Sentry issue can be marked as resolved. This allows engineering teams to tackle incoming Sentry issues continuously while maintaining a manageable and healthy backlog.

The most important thing is to make sure that the Sentry issue is marked as resolved as soon as possible, so that the team is informed and specific steps are taken toward its resolution.

Wrap-up

We hope you enjoyed reading about our initial experience of observability at Welcome to the Jungle. Sanitizing our logs helped us better understand how our production systems work and enabled us to go further.

In the teams where we have implemented noise reduction, we have seen:

  • An increase in deployment quality.
  • A greater sense of ownership of errors and alerts.
  • Better visibility of services’ health.
  • A faster response time to incidents.

One unexpected outcome is that we have been able to reduce our overall costs relating to log ingestion, because every bit of noise we killed reduced our usage of monitoring tools, including Sentry and Datadog.

And all this preliminary work makes it simpler to address other challenges present in our observability roadmap, like proactive monitoring, incident management, and quality of service.

We assess the progress of this initiative once a month within a dedicated task force, with one of our senior engineers leading and sharing the incremental results.

Written by Guillaume Pouilloux, Backend developer @ WTTJ

Edited by Anne-Laure Civeyrac

Illustration by WTTJ

Join our team!
