How monitoring can identify your next production issue

Klaidas Lekavicius
Revel Systems Engineering Blog
Apr 5, 2022

Maintaining product uptime and resolving issues quickly are as important as they have ever been. Modern practices encourage shorter SDLC iterations, with the intention of receiving feedback faster and gathering data for future improvements. Reliable tests are a must if you want to have confidence in your next release, but the ability to quickly spot and react to a production issue is just as important.
Self-organized development teams must be involved in the monitoring process and be ready to tackle any defect that disrupts the usual workflow.

Software development teams in Revel Systems are highly encouraged to choose and implement their own monitoring strategies from a variety of available tools, as long as the strategy ensures a way to actively monitor a service that is being developed. In this article, I’ll talk about a few general implementation details that our teams go through when defining monitoring and alerting resources for an actively maintained project. I’ll also mention the importance of monitoring the right things, encouraging developers to follow good practices, and keeping their software healthy.

What should be monitored? 🤔

Everything that might affect the health of the software should be monitored for early symptoms of a production incident.

Of course, this is quite a broad statement — there’s so much involved in modern software architecture these days that it’s hard to understand the whole scope. But there are definitely some areas that should be taken into consideration. Here are a few of them:

Infrastructure 💻

The heart of every running piece of software. Without healthy infrastructure there is no healthy software, which easily makes this area the most vital part of any monitoring stack. Depending on your application, you might have one or more actively running resources hosted either on-site or in the cloud. Either way, information about these resources should be extracted and actively tracked with the tools of your choice. Depending on the system being supported, you might want to track random access memory (RAM) and CPU usage, the number of requests currently being handled, latency, throughput, the amount of free storage, etc.
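As a rough illustration of tracking such metrics, the sketch below compares a few readings against warning and critical limits. The metric names, the limits in `THRESHOLDS`, and the helper functions are assumptions made up for this example, not values from any particular tool. Only the disk reading comes from a real source (Python's standard `shutil.disk_usage`); CPU and memory figures would have to be supplied by whatever agent or cloud API you use.

```python
import shutil
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warn: float
    critical: float


# Hypothetical limits, expressed as "percent used". Tune them per system.
THRESHOLDS = {
    "cpu_percent": Threshold(warn=70.0, critical=90.0),
    "memory_percent": Threshold(warn=75.0, critical=90.0),
    "disk_percent": Threshold(warn=80.0, critical=95.0),
}


def disk_percent_used(path: str = "/") -> float:
    """Return the percentage of storage used on the volume containing path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def evaluate(metrics: dict) -> dict:
    """Map each known metric to 'ok', 'warn' or 'critical'."""
    statuses = {}
    for name, value in metrics.items():
        limit = THRESHOLDS.get(name)
        if limit is None:
            continue  # no threshold defined, so the metric is not judged here
        if value >= limit.critical:
            statuses[name] = "critical"
        elif value >= limit.warn:
            statuses[name] = "warn"
        else:
            statuses[name] = "ok"
    return statuses
```

A caller could pass something like `evaluate({"cpu_percent": 95.0, "disk_percent": disk_percent_used()})` and forward any non-`ok` result to the alerting channel of its choice.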

You can sometimes hear people say that infrastructure hosted in the cloud doesn’t need to be actively monitored because the cloud provider is responsible for keeping the system stable. This is simply not true. Usually, the cloud provider is responsible for maintaining the hardware, but the owner of the resource should make sure to extract any information that is useful for keeping the software in good health. Cloud providers like AWS usually do a great job of providing the tools needed to extract and track the metrics that matter.

Logs 📔

Logging the actions taken by the software is important, but those logs are of little value if they're not monitored for unexpected or unusual behavior.

The most common way to organize logs is to track their severity. Usually, logs should be marked to represent either informational (INFO), warning (WARN), or error (ERROR) levels. Obviously, the latter two should be monitored and constantly reviewed. There’s a bunch of information on how to track these logs and most of the logging libraries already support severity levels out of the box.
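To make the severity idea concrete, here is a minimal sketch using Python's standard `logging` module. The `SeverityCounter` handler and the `orders` logger name are hypothetical, introduced only for this example. The handler ignores INFO records and counts WARNING and ERROR ones, which is the kind of number a monitor could later be attached to.

```python
import logging
from collections import Counter


class SeverityCounter(logging.Handler):
    """Count log records at WARNING level and above, grouped by level name."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.counts = Counter()

    def emit(self, record: logging.LogRecord) -> None:
        self.counts[record.levelname] += 1


logger = logging.getLogger("orders")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example output-free
counter = SeverityCounter()
logger.addHandler(counter)

logger.info("order received")                        # not counted
logger.warning("payment provider responded slowly")  # counted as WARNING
logger.error("payment failed")                       # counted as ERROR
```

In a real setup the counts would be shipped to the monitoring tool rather than kept in memory, but the split between levels worth reading and levels worth alerting on stays the same.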

Monitoring and alerting pipeline with Datadog 🐾

There are a lot of good monitoring tools that consolidate information from different services or providers, giving developers a unified dashboard that reflects the status of a product. The monitoring tool of choice in Revel Systems is Datadog. Since most of our services are hosted on AWS, we have a monitoring pipeline that gets this information into our Datadog dashboards. More information about integrating Datadog with the majority of AWS services with the help of a lightweight forwarder Lambda function can be found here. Let us know if you want to see a more detailed monitoring pipeline preview and dive deeper into its architecture! 🙌

A simple monitoring pipeline architecture connected to Datadog

Once you have metrics and logs in Datadog you can start defining areas that could be monitored and start consolidating them into dashboards. This is a great way to create a dedicated page where all metrics about a certain product could live. Dashboards provide a lot of flexibility because you can choose which metrics matter and emphasize them in your view.

Finally, you can configure Datadog monitors to trigger on a certain threshold and inform you about it. Our go-to option is to have a dedicated Slack channel for monitoring alerts, which would be pinged when something happens. Information about configuring the Datadog integration in Slack can be found here. PagerDuty is another nifty tool that we use for occasions when the service needs to be monitored with an on-call schedule. It has many useful integrations, including Datadog! 🤙
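The idea of a monitor that triggers on a threshold and notifies a channel can be sketched in a few lines. This is not the Datadog API: the `Monitor` class is a hypothetical stand-in, and `notify` is any callable, which in practice would wrap a Slack or PagerDuty integration. The sketch averages a window of recent values so that a single spike does not necessarily fire the alert.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class Monitor:
    name: str
    threshold: float
    notify: Callable[[str], None]  # stand-in for a chat or paging integration

    def check(self, recent_values: Sequence[float]) -> bool:
        """Notify and return True when the window average exceeds the threshold."""
        if not recent_values:
            return False
        average = mean(recent_values)
        if average > self.threshold:
            self.notify(
                f"[{self.name}] average {average:.1f} is above {self.threshold:.1f}"
            )
            return True
        return False


sent = []  # collects messages instead of posting them anywhere
monitor = Monitor(name="api latency (ms)", threshold=500.0, notify=sent.append)
monitor.check([120.0, 180.0, 150.0])  # quiet window, nothing is sent
monitor.check([480.0, 620.0, 700.0])  # window average is 600.0, one message is sent
```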

Also, communication is very important once a new alert is received. If we’re using a Slack monitoring channel to track alerts, we like to use emojis to show that we’re looking into the issue and to explain its cause. A good old ✅ emoji is used to mark the incident as identified and taken care of. You could use Datadog tools to manage incidents as well or configure PagerDuty for an on-call schedule, but a simple messaging approach works just as well if your team is not very big and is properly aligned! 🤷‍♂️

Motivation to monitor 💪

Active monitoring won’t mean anything if nobody is aware that the status of the system needs to be checked when an alert pops up. Therefore, here are some of the common things we make sure to do when configuring alerts for services.

Alert loudly on critical issues 🤫

An easy mistake to make when you start actively monitoring your services is to set up alert notifications for every single piece of infrastructure or system logic you have. Because monitoring channels are designed to push loud notifications to engineers, make sure you only send information that needs a quick reaction and could have a major impact on the service’s health. Otherwise important alerts become easier to miss, as people tend to react less to very frequent notifications that are not meaningful. Less critical monitoring data still matters for the longer-term health of the service, so treat it as a separate use case and have a way to review it at least once or twice a sprint.
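One way to picture this split is a small routing table: critical alerts go to the loud channel, everything else lands in a list that gets reviewed during the sprint. The severity levels and the destination names `pager` and `sprint-review` are assumptions for the sake of the example.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Hypothetical destinations: only critical alerts are allowed to be loud.
ROUTES = {
    Severity.CRITICAL: "pager",
    Severity.WARNING: "sprint-review",
    Severity.INFO: "sprint-review",
}


def route(alerts):
    """Group (severity, message) pairs by the destination they belong to."""
    buckets = {"pager": [], "sprint-review": []}
    for severity, message in alerts:
        buckets[ROUTES[severity]].append(message)
    return buckets


buckets = route([
    (Severity.CRITICAL, "checkout service is returning 5xx"),
    (Severity.WARNING, "disk usage above 80%"),
    (Severity.INFO, "nightly job took longer than usual"),
])
```

Here only the checkout message would interrupt someone. The other two wait in `buckets["sprint-review"]` until the team looks at them.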

Make alerts 👉 explicit 👈

Ensuring that alert messages contain some explicit explanation and possible root causes will help to diagnose the problem faster within the engineering organization. Make sure to spend some more time analyzing the cause of the possible alert and write that information down in your monitoring channels or have links to documentation.

An explicit alert with a possible underlying issue to investigate
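An alert message along those lines could be assembled from a small template. The `AlertTemplate` class, its field names, and the example runbook URL are all hypothetical. The point is only that the possible causes and a documentation link are written down once, ahead of time, and travel with every notification.

```python
from dataclasses import dataclass, field


@dataclass
class AlertTemplate:
    title: str
    what_happened: str
    possible_causes: list = field(default_factory=list)
    runbook_url: str = ""  # optional link to documentation

    def render(self) -> str:
        """Build the plain-text message that is sent to the alert channel."""
        lines = [f"ALERT: {self.title}", self.what_happened]
        if self.possible_causes:
            lines.append("Possible causes:")
            lines.extend(f"- {cause}" for cause in self.possible_causes)
        if self.runbook_url:
            lines.append(f"Runbook: {self.runbook_url}")
        return "\n".join(lines)


alert = AlertTemplate(
    title="Order sync failures",
    what_happened="More than 5% of sync requests failed in the last 10 minutes.",
    possible_causes=["Expired API credentials", "Upstream service outage"],
    runbook_url="https://wiki.example.com/runbooks/order-sync",  # placeholder URL
)
message = alert.render()
```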

Review monitors periodically 🧐

It’s also very important to keep your monitoring system up to date. Spend some time looking into dashboards and checking whether all metrics still make sense. Maybe there’s an alert configured to trigger on a certain type of error message, but the underlying logic was updated months ago and the alert is doomed to never trigger. Maybe there’s a metric that tracks memory usage against a fixed amount, but the system has adopted an autoscaling solution and your monitoring might throw false positives. Keeping things updated helps engineers focus on what is important instead of being distracted by monitors that might not even be relevant anymore.
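A periodic review can start from a simple list of candidates. The sketch below flags monitors that have never triggered, or have been quiet for longer than a chosen period. The `MonitorRecord` type and the 180-day default are assumptions for illustration, and the data would have to come from an export of your monitoring tool. A quiet monitor is not necessarily a broken one, so the result is a list to look at, not a list to delete.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MonitorRecord:
    name: str
    last_triggered: Optional[datetime]  # None means it has never triggered


def stale_monitors(records, now, max_quiet=timedelta(days=180)):
    """Return names of monitors that never fired or have been quiet too long."""
    stale = []
    for record in records:
        if record.last_triggered is None or now - record.last_triggered > max_quiet:
            stale.append(record.name)
    return stale


now = datetime(2022, 4, 5)
records = [
    MonitorRecord("legacy error message match", None),
    MonitorRecord("fixed memory limit", datetime(2021, 6, 1)),
    MonitorRecord("api latency", datetime(2022, 3, 28)),
]
to_review = stale_monitors(records, now)
```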

Have a monitoring strategy now, thank yourself later! 👏

In Revel Systems, we take ownership of the development and release process for any product we’re working on. Apart from thorough testing, we also implement a good monitoring system that helps us react quickly in case of an emergency! 🚨

Stay tuned for some more content as we’re planning to continue sharing our best development practices here 🙌
