4 stages to working effectively with an error monitoring system

José Alberto Zimermann · heycar · Jul 1, 2021
Photo by David Pupaza on Unsplash

For some time now at heycar we have been using a tool that monitors, in real time, the errors and exceptions that happen in our systems and components.

In an engineering organisation made up of several autonomous agile teams, the team I work on went through a learning journey to find the most effective way to deal with these errors, which affect our users daily and, many times, go unnoticed before we deliver our product to production.

The idea of this article is to share the main lessons we learned on the way to being productive with the tool, organised into the four stages we use to address the problems we encounter: definition, monitoring, screening, and treatment.

A word about the difference between an error monitoring system and a logging tool

Before we introduce the four lessons we learned, it's important to distinguish an error monitoring tool from a log aggregation tool.

While a log aggregation tool captures every event that happens in an application, whether that event is an error or not, an error monitoring system focuses only on exceptional and unhandled application events.

It is important to stress this distinction because one tool does not preclude the need for the other to exist. They do, however, serve different purposes in the software development cycle.

# 1 Definition

Define your logging policy and make it clear to your engineering team.

An important step took place before we effectively used our current tool, which is based on logs submitted by our applications: clearly defining which logging levels we use and in which situations each level applies.

This may seem trivial at first, but it is extremely important that only genuine problems detected in your application are sent to the tool. This directly affects the workflow around the tool and the team's ability to analyse problems in a timely manner.

We adopted the following approach in our team: using the Simple Logging Facade for Java (SLF4J), we decided to use only two log levels in our applications:

a) info: used for traceability of what is happening in the application and to understand how the services behave in the various scenarios they are exercised in. We log at this level as often as necessary to build that understanding;

b) error: used when an operation fails in a non-reversible way (for example, when we try to save something in the database and the application cannot acquire a lock on it).
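The two levels above can be sketched as follows. This is an illustrative example, not our production code: the operation and its names are hypothetical, and it uses java.util.logging to stay self-contained, whereas in practice the calls would go through SLF4J (`log.info` / `log.error`).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class OrderService {
    private static final Logger log = Logger.getLogger(OrderService.class.getName());

    // Hypothetical operation illustrating the two-level policy.
    static boolean saveOrder(String orderId, boolean lockAcquired) {
        // info: traceability of the normal flow of the application
        log.info("Saving order " + orderId);
        if (!lockAcquired) {
            // error: the operation failed in a non-reversible way
            // (the application could not acquire the database lock)
            log.log(Level.SEVERE, "Could not acquire lock for order " + orderId);
            return false;
        }
        return true;
    }
}
```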

Finally, we configured the integration between our applications and the tool so that only error logs would be sent to it, giving us confidence that the events arriving in the tool were indeed errors according to our definition.
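As a sketch of this kind of filtering, with Logback it can be expressed as a ThresholdFilter on the appender that ships logs to the monitoring tool; the appender class below is a placeholder for whichever integration you use.

```xml
<!-- logback.xml (fragment): only ERROR-level events reach the monitoring tool -->
<appender name="MONITORING" class="com.example.MonitoringAppender"> <!-- placeholder class -->
  <filter class="ch.qos.logback.classic.filter.ThresholdFilter">
    <level>ERROR</level>
  </filter>
</appender>
```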

# 2 Monitoring

Monitor the events coming into the tool.

Screenshot of alerts sent to a team's Slack channel

Establishing a continuous flow of monitoring is key to addressing error events quickly and responding effectively to detected problems in the applications.

In our team, we mainly adopt two approaches to stay disciplined and keep a frequent eye on the error events in the applications we are responsible for:

a) a weekly meeting where, as a team, we discuss the top events displayed in the monitoring system that have not yet been addressed. We also take the opportunity to assign a team member to each of these top issues;

b) an alert message sent to the team's Slack channel for each new issue that appears in the monitoring system.

As part of our ways of working, the team is responsible for continuously monitoring the channel and raising a hand whenever an issue demands quick intervention, such as a released version that breaks an important flow or an unexpected spike in the volume of issues.

If an issue doesn't demand immediate attention, we add it to our backlog (more on this in # 4 Treatment).

Creating these monitoring mechanisms, especially automating the alerting of new issues to the team, has helped us react faster to the issues that pop up in the different applications.
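Most monitoring tools provide a Slack integration out of the box, but as an illustration of what the automated alerting amounts to, here is a hypothetical sketch of posting a new-issue alert to a Slack incoming webhook. The webhook URL and issue fields are placeholders, not values from our setup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NewIssueAlert {
    // Build the JSON payload for a Slack incoming-webhook message.
    static String payload(String project, String issueTitle, String issueUrl) {
        return String.format("{\"text\":\"New issue in %s: %s (%s)\"}",
                project, issueTitle, issueUrl);
    }

    // Send the alert; the webhook URL would come from the team's Slack app config.
    static void send(String webhookUrl, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(webhookUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```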

# 3 Screening

Treat each detected event as a bug and prioritise it according to its relevance.

Screenshot of a Jira ticket created with an issue caught on the monitoring system

Like any other error reported in our systems and/or components, the events detected by the tool go through the following flow:

a) analysis: the error is analysed by the development team, which gathers information about where the problem is happening and what causes it, how many users are impacted, how complex the problem is, and what its impact is on the business;

b) tracking: the error is tracked in our backlog and all the analysis details are stored;

c) prioritisation: the team (with support from the Product Manager and Engineering Manager) prioritises the resolution of the bug according to need and the scope of current activities.

Discussing the prioritisation of the bugs found is essential for several reasons, whether it is the technical debt that accumulates over the team's life cycle or the necessary balance between delivering business value and fixing existing product features. Several factors must be weighed and discussed in the decision-making process.

# 4 Treatment

Make sure to reduce the volume of false-positive errors for the benefit of your team.

Screenshot of the monitoring system displaying the errors for one of the codebases

Without question, one of the greatest lessons we've learned on our journey to working effectively with an error monitoring system is that our effectiveness as an agile product team that continuously delivers business value depends on consistently addressing the exceptions in our systems.

We have learned to be diligent about what we consider an error and/or exception. Below are a few cases where false positives were removed from the monitoring system, making it more reliable:

Case 1: when an exception is gracefully handled by the code, there is no need to send it to the monitoring tool. Instead, a simple info log is informative enough for investigative purposes.

Case 2: a sub-dependency of a library in a React web application generates an error the team cannot act on. We decided to simply filter the event out and avoid polluting the monitoring system with an error that could hardly be handled by the team.

Case 3: an event is incorrectly sent to the monitoring system due to a misunderstanding of the logging policy (info vs warning vs error).
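Case 1 can be sketched as follows, using a hypothetical parsing helper and java.util.logging to keep the snippet self-contained: the exception is recovered from in code, so it is logged at info level and no error event ever reaches the monitoring tool.

```java
import java.util.logging.Logger;

public class PageSizeParser {
    private static final Logger log = Logger.getLogger(PageSizeParser.class.getName());
    private static final int DEFAULT_PAGE_SIZE = 20;

    // The exception is handled gracefully: we fall back to a default value,
    // so an info log is enough -- no error is sent to the monitoring tool.
    static int parsePageSize(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            log.info("Invalid page size '" + raw + "', using default " + DEFAULT_PAGE_SIZE);
            return DEFAULT_PAGE_SIZE;
        }
    }
}
```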

Is finding a needle in a haystack an easy task? Probably not. Detecting and reacting to a new error when you already have thousands of other problems is equally difficult, which is why it's important to remove false-positive issues from the monitoring system, allowing people to focus on solving the real-world problems that will benefit the end customer.

Conclusion

Finding the right balance between agile delivery in a product's lifecycle and the efficient resolution of errors and exceptional behaviour is part of maturing as an agile team, and it should be discussed and improved at the same speed as delivering business value.

Our learning journey showed us that involving the whole team is fundamental, especially in autonomous, cross-functional teams: designers, product managers and engineers all share the goal of improving the quality of the final product and raising the reliability of the features our customers use.

In our case, the mentioned balance was found through the four stages explained in this article: definition, monitoring, screening, and treatment.

Special thanks to the Martech team (Guillermo Campelo; peeyush singla; Gustavo Tonietto; Mahmoud Elawadi) for always pushing the boundaries and adopting the best Engineering practices.
