Monitoring Best Practices That You Need To Adopt

Ilana Cherny
Published in Duda
7 min read · Jan 2, 2023

In the first stages of any startup, the goal is usually to deliver the product as fast as possible. As success grows, the product evolves, the number of moving parts increases, and new features are constantly developed. That means you need to be sure the product (now used by many happy users) is stable.

One part of this is DevOps level monitoring, i.e., making sure the servers and databases are alive, etc. In this blog post, I’ll focus on another part: application-level monitoring. As a developer monitoring my own features, I want to know if something unusual or unexpected is happening.

I’ve been working at Duda for almost 5 years, and I’ve had the opportunity to go through this process in a few teams. Starting from close to no application-level monitoring, we understood that adding it to our work process would give us more confidence in the features we released to our clients.

QA is an important part of the process, but it’s mainly done during feature development, not over time. Automated regression tests are also a very valuable tool, but it’s impossible and inefficient to test everything all the time. They also often run in a sterile environment, which doesn’t necessarily reflect all real-world scenarios and states.

So what do we do? We add monitoring to our application to act as an additional “never-asleep” set of eyes.

Logs, logs for everyone

In our day-to-day work, logs are very important to understand how our application behaves in real life. As developers, we try to foresee which outcomes in the code are “bad” and produce warning logs about them.

Sometimes we add logs that we don’t expect any flow to reach. If a flow does reach one, we know that we missed something. Another option is that we added a log for an edge case we didn’t expect to happen often; if the volume is higher than we anticipated, we understand that it requires attention.
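To make the idea concrete, here is a minimal, hypothetical sketch of such a "should never happen" log (the event name and states are made up; our actual code is on the JVM):

```python
import logging

logger = logging.getLogger("site_publish")

def on_publish_event(state: str) -> None:
    """Handle a publish event. States other than the ones we know
    about should never occur; the warning log is our tripwire."""
    if state in ("DRAFT", "PUBLISHED"):
        pass  # expected states, handled normally
    else:
        # We never expect to reach this branch; if this log ever
        # fires, a flow we didn't anticipate exists in production.
        logger.warning("Unexpected site state during publish: %s", state)
```

The log itself costs nothing in the happy path, but the moment it appears in our log aggregation, we know exactly where to look.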

The trickier part is making yourself aware of unexpected things that could happen.

At Duda, since we work in squads with well-defined areas of responsibility, we created generic per-area alerts that catch any irregular volume of warning and error logs over a certain period of time. That way we are notified about anything out of the ordinary in our area of responsibility, and we’re less likely to miss even the things we didn’t expect.

We use Logz.io’s integration with Slack to get notified of logs matching a query, with varying levels of severity defined by thresholds. The Slack messages are sent to dedicated channels that all team members know to treat as high priority.
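The actual rule lives in a Logz.io query, but the logic behind it is roughly a count-over-window check against severity thresholds. A toy sketch, with entirely illustrative names and numbers:

```python
from typing import Optional

def classify_alert(log_count: int) -> Optional[str]:
    """Map the number of warning/error logs seen in the alert's time
    window to a severity, or None if the volume looks normal.
    All thresholds here are illustrative."""
    thresholds = [
        ("high", 200),    # needs immediate attention
        ("medium", 50),   # post to the team's Slack alert channel
        ("low", 20),      # worth a look when someone is free
    ]
    for severity, limit in thresholds:
        if log_count >= limit:
            return severity
    return None
```

The point is not the exact numbers but that each area owns its thresholds and tunes them over time.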

Automation is your friend

Beyond the manual work of log investigation, logs can also help us automate some processes. For example, we use them to identify issues with newly deployed versions.

At Duda, we’ve been on a journey in recent years, moving our monolith from monthly deploys to weekly, then daily, and currently multiple deployments every day. When deploys were monthly or weekly (and the company and product were much smaller), we relied on manual monitoring during deployment to identify abnormalities. As we moved to daily, or even multiple deploys a day, and the product grew, manual monitoring became unreasonable.

So what do we have today? We automate it!

During deploys, warning and error logs are automatically compared to those of the previous version. If a certain threshold is passed, we are notified via Slack. It’s then up to us to decide whether it’s acceptable and the deployment can continue, or whether we need to abort and roll back to the previous version (you can read more about our deploy process here).
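The comparison itself can be sketched as a diff of per-log-type counts between versions. This is a simplified illustration, not our actual implementation, and the threshold is made up:

```python
def needs_review(previous: dict, current: dict, max_increase: float = 0.5) -> list:
    """Return log types whose volume grew by more than max_increase
    (50% by default) relative to the previous version, or that are
    brand new in this version."""
    flagged = []
    for log_type, new_count in current.items():
        old_count = previous.get(log_type, 0)
        if old_count == 0 and new_count > 0:
            # A log type we've never seen before deserves a look.
            flagged.append(log_type)
        elif old_count > 0 and (new_count - old_count) / old_count > max_increase:
            flagged.append(log_type)
    return flagged
```

If the returned list is non-empty, a Slack message goes out and a human decides whether to continue or roll back.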

One graph is worth a thousand words

Another aspect of application-level monitoring is understanding how the application performs.

Though it’s less helpful for identifying faulty flows or bugs, it can give us insights we can act on, which ultimately affects the experience of our platform users (for the better, of course!).

We use Micrometer and Prometheus to gather a large variety of metrics (as defined by the developer) and visualize them in Grafana. Those visualizations can help us understand if the application is behaving as expected, if some optimizations might be needed, etc.

Visualize to make better decisions

We recently developed our new Site List, which is the entry point in the platform that displays a list of all the user’s sites. We added a new site filtering option that we assumed would be ok in terms of performance, but knew that there was a possibility we’d need to optimize it. We didn’t want to invest time in it prematurely, so we added a metric of how much data is returned each time that this filter is applied and created a histogram in Grafana to display this data.

Seeing the distribution of the actual results made us feel much more comfortable with postponing any optimizations for now.
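On our JVM stack this is done with Micrometer, but the pattern translates directly; a rough Python equivalent using the official prometheus_client library looks like this (metric names and buckets are illustrative, not our real ones):

```python
from prometheus_client import CollectorRegistry, Counter, Histogram

registry = CollectorRegistry()

# Count how often the new filter is actually used.
filter_uses = Counter(
    "site_filter_uses",
    "Number of times the site filter was applied",
    registry=registry,
)

# Record the result size per filter application, so Grafana can show
# the real-world distribution and tell us whether optimization is needed.
filter_result_size = Histogram(
    "site_filter_result_sites",
    "Number of sites returned per filter application",
    buckets=(10, 50, 100, 500, 1000),
    registry=registry,
)

def record_filter_result(site_count: int) -> None:
    filter_uses.inc()
    filter_result_size.observe(site_count)
```

Prometheus scrapes these values, and the Grafana histogram over the buckets is what let us postpone the optimization with confidence.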

Detect abnormalities as they happen

Another way to utilize these metrics can be by identifying abnormalities in trends. Collecting metrics over time gives us a good perspective of what “normal” looks like, so if we suddenly start to see a change in that trend without us doing anything intentional to affect it, we can do something about it.

On our platform, we have a Job that regularly performs cache-cleaning on sites based on certain application events. We had a metric reporting how many sites were processed in each run of the Job. This data is visualized in a graph in Grafana and usually looks pretty much like this:

As you can see, the numbers fall within a fairly well-defined normal range.

At some point, we noticed an increase in the number of sites processed in each run, which made us suspect that something was not right. We investigated and found an issue that we were able to fix. Finding this type of issue purely through logs would have been pretty hard, because the Job wasn’t producing any unusual logs. But it affected a known trend, which we could see quite easily in these metric reports.
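In practice this kind of detection is a Grafana alert over the metric, but the underlying idea is simple enough to sketch: compare each new data point against the recent baseline and flag large deviations. The tolerance below is illustrative:

```python
from statistics import mean

def is_abnormal(history: list, latest: float, tolerance: float = 0.5) -> bool:
    """Flag the latest value if it deviates from the average of the
    recent history by more than the tolerance (50% by default)."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(history)
    return abs(latest - baseline) > tolerance * baseline
```

Collecting the metric over time is what gives the baseline its meaning; without the history, "abnormal" has no definition.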

Change is not an event, it’s a process

One thing that I really want to emphasize is that adopting any of the monitoring options mentioned above (and probably others as well) will most likely be a process.

From my experience, every single time we set up alerts to notify the team via Slack, intending them to help us, we suddenly had an influx of messages telling us that something required our attention. Full of good intentions and wanting to fix everything that needed fixing, we gave full attention to each alert, only to learn that most of them were fine and required no action.

After a relatively short period of time with a few alerts every hour, they became the “boy who cried wolf” and were easier to ignore.

Unfortunately, as part of this, when something happened that actually required attention, it was missed and was only noticed much later than it should’ve been. The lesson learned is that it’s important to keep in mind that it’s a process, and not a one-time thing. We need to maintain and refine these alerts over time.

It could be something as small as making the alert message more informative or adjusting the alert thresholds, or something bigger, such as modifying certain logs to provide more information. All of these help the alerts work for us, and not the other way around.

As for metrics, it takes time to build the habit of thinking, “Which metrics should I add to help me understand something?” while working. For me, it took several PRs from another member of my team with comments like “You should add a metric about this” or “We should measure that” to adopt this thought process on my own.

Having said that, a big part of this process is making sure you don’t overdo it. Metrics that don’t teach or help you in any way are not just unhelpful; they can be harmful. It’s discouraging to put effort into creating them without getting anything out, to the point where you can’t see the forest for the trees. It’s okay to think something is helpful at first and then find out it doesn’t serve its purpose: just remove it, and learn from it for next time.

And now your journey begins…

In my team, now that we have all of this set up, we all feel much more confident in our work process. Now it’s your turn! There are countless tools and approaches to help you; the key is starting to incorporate them into your dev process. Keep in mind that despite possible growing pains, you will eventually find what works for you. It’ll be totally worth it!
