How we implemented RED and USE metrics for monitoring

Putting Prometheus, Grafana, RED and USE metrics all together to improve monitoring

In a previous article we described the importance of monitoring from the end-user's perspective for customer-centric companies. In this article we want to describe in more detail the technology stack we chose for internal monitoring, the one our engineers use to ensure all systems are working now and will keep working for the foreseeable future.

What is Monitoring?

Monitoring is the art of collecting, processing, aggregating, and displaying real-time quantitative data about a system. You could monitor the number and types of queries, errors, processing times, server uptimes and so on. Monitoring is a crucial and essential part of every piece of software because it helps you keep your systems under control, react quickly and proactively to unexpected problems, and ultimately prevent or reduce downtime.

Amon/PRTG and why we chose to replace them

Before we refactored our monitoring, we had two systems used by different teams; both had limitations and didn't completely cover our needs.

Amon

Amon is an open-source server monitoring platform that runs directly on the server. While it was useful, it had some limits that became important:

PRTG

PRTG Network Monitor is an agent-less network monitoring software from Paessler AG. It can monitor and classify system conditions like bandwidth usage or uptime and collect statistics from miscellaneous hosts such as switches, routers, servers and other devices and applications, but it was not the perfect solution either:

Different approaches to monitoring

Before focusing on software selection we spent some time separating the goals of the different approaches:

Our monitoring goals

"You can't improve what you don't measure" (Peter Drucker), so monitoring is the most important starting point for improving your product (performance, reliability, and much more). We wanted to evolve our existing monitoring architecture to improve our ability to reach the following goals:

Prometheus or “How to fire up your monitoring”

Prometheus is an open-source ecosystem for monitoring and alerting, with a focus on reliability and simplicity. Since its inception many companies and organisations have adopted Prometheus, and the project has a very active community of both users and developers.

We chose to adopt Prometheus for its many features that allow us to satisfy our different needs in different parts of our software and infrastructure:

We also chose it because it was designed with a microservice infrastructure like ours in mind; its support for multi-dimensional data collection and querying is one of its most relevant strengths.

Prometheus is designed for reliability: it is meant to be the system you turn to during an outage to quickly diagnose problems. Each Prometheus server is standalone and does not depend on network storage or other remote services, so we can rely on it even when other parts of the infrastructure are not responding.
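To make this concrete, here is a minimal sketch of how a service can expose metrics for Prometheus to pull. It uses the official prometheus_client Python library; the metric name, label and port are purely illustrative rather than taken from our actual services.

```python
# Minimal sketch: exposing metrics from a Python service so Prometheus can
# scrape them. Metric names, labels and the port are illustrative only.
import random
import time

from prometheus_client import Counter, start_http_server

# A counter Prometheus will scrape and store as a time series.
JOBS_PROCESSED = Counter(
    "jobs_processed_total",
    "Total number of background jobs processed",
    ["outcome"],  # label value: "success" or "failure"
)

def process_job():
    # Placeholder for real work.
    time.sleep(0.1)
    return random.random() > 0.05  # ~5% simulated failures

if __name__ == "__main__":
    # Expose /metrics on port 8000; Prometheus pulls from this endpoint
    # according to its scrape configuration.
    start_http_server(8000)
    while True:
        ok = process_job()
        JOBS_PROCESSED.labels(outcome="success" if ok else "failure").inc()
```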

Grafana - The colourful way of reading your data

Grafana is open-source software used to display time-series analytics. It allows us to query, visualise and generate alerts from our metrics. Grafana's big plus is its native integrations with a wide range of data sources: if in the future we need to change or add data sources besides Prometheus we can do so with little effort, and we can aggregate graphs and data from different sources in the same dashboard.

Grafana also allows us to create and configure alerts very quickly and easily while we're viewing the data: we can define thresholds and get automatically notified via Slack if problems arise.

The Four Golden Signals

The Four Golden Signals are a series of metrics defined by Google Site Reliability Engineering that are considered the most important when monitoring a user-centric system: latency (the time it takes to serve a request), traffic (how much demand is being placed on the system), errors (the rate of requests that fail), and saturation (how close the system is to its capacity limits).

We don't use exactly those four metrics; instead we chose to work with two different methods, each using a subset of metrics derived from them, depending on what we are monitoring: for HTTP metrics we use the RED method, while for infrastructure we use the USE method.

From the Four Golden Signals to the RED way of creating metrics

The RED method is a subset of "The Four Golden Signals" that is focused on micro-service architectures and includes these metrics: Rate (the number of requests per second a service is handling), Errors (the number of those requests that fail), and Duration (the amount of time each request takes).

Measuring these metrics is pretty straightforward, especially with tools like Prometheus, and using the same metrics for every service helps us create a standard and easy-to-read format for dashboards that have to show the results.
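As a rough illustration (not our actual instrumentation; metric and label names are invented), a single counter and a single histogram from the Prometheus Python client are enough to derive all three RED metrics for an HTTP service:

```python
# Sketch of RED instrumentation for an HTTP handler (illustrative names).
# In Prometheus query language the three metrics then come out as, e.g.:
#   Rate:     rate(http_requests_total[5m])
#   Errors:   rate(http_requests_total{status=~"5.."}[5m])
#   Duration: histogram_quantile(0.95,
#               rate(http_request_duration_seconds_bucket[5m]))
import time

from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "path", "status"],
)
DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency in seconds",
    ["method", "path"],
)

def handle(request, do_work):
    """Wrap a request handler so every call feeds the RED metrics."""
    start = time.time()
    status = 500  # assume failure unless the handler returns normally
    try:
        response = do_work(request)
        status = response.status_code
        return response
    finally:
        elapsed = time.time() - start
        REQUESTS.labels(request.method, request.path, str(status)).inc()
        DURATION.labels(request.method, request.path).observe(elapsed)
```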

Using the same metrics for every service and treating them the same way, from a monitoring perspective, helps scalability in the operations teams, reduces the amount of service-specific training the team needs, and reduces the service-specific special cases the on-call engineers need to remember during high-pressure incident response, lowering what is referred to as "cognitive load."

Infrastructure and the USE method

The USE Method is more focused on infrastructure monitoring, where you have to keep physical resources under control, and is based on just three parameters: Utilization (the percentage of time the resource is busy servicing work), Saturation (the amount of extra work the resource cannot service, often queued), and Errors (the count of error events).

While this method initially helped us identify which specific metrics to use for each resource (CPU, memory, disks, ...), our next task was to interpret their values, and that is not always obvious.

For example, while 100% utilization is usually a sign of a bottleneck and measures must be taken to lower it, a constant 70% utilization can also be a sign of a problem, because it may hide short bursts of 100% utilization that were not captured because the metric was averaged over a period longer than the bursts.
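A toy example with made-up numbers shows how this can happen:

```python
# Toy illustration (made-up numbers): per-second CPU utilization samples
# where the CPU is pinned at 100% in short bursts and moderately busy otherwise.
samples = [100, 100, 100, 55, 55, 55, 55, 55, 100, 100]  # percent

average = sum(samples) / len(samples)
print(f"average utilization: {average:.1f}%")        # 77.5%
print(f"seconds at 100%:     {samples.count(100)}")  # 5 out of 10

# A dashboard showing only the averaged figure would report a "healthy"
# ~70-80%, while the resource was actually saturated half of the time.
```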

The USE Method helped us identify problems that could be system bottlenecks and take appropriate countermeasures, but it requires careful investigation because systems are complex; when you see a performance problem:

it could be a problem but not the problem.

Each discovery must be investigated with adequate methodologies before proceeding to check parameters on other resources.

Problems encountered during development

While implementing our new monitoring we encountered two challenges that we had to overcome.

The first challenge was having a monitoring system fully deployed on containers, which posed a big question: storage management. Containers do not natively offer persistent storage, so if a container becomes unavailable for any reason we lose the data stored in it.

As a solution to this problem we found REX-Ray, a project focused on creating enterprise-grade storage plugins for the Container Storage Interface (CSI). REX-Ray provides a vendor-agnostic storage orchestration engine whose primary design goal is to provide persistent storage for Docker, Kubernetes and Mesos. Since we use Docker it was a good fit for us.

At first we tried its Amazon EBS integration, but a problem arose: an EBS volume lives in a single Availability Zone, so when a container is moved to another Availability Zone it loses its connection to the storage. We then switched to Amazon EFS, which is available across the whole AWS Region; this allowed us to never lose the link to the storage even when containers moved between Availability Zones.
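As a sketch of what this looks like in practice (container, volume and image names are illustrative, and the rexray/efs plugin is assumed to be already installed on the host), the Docker SDK for Python can create a REX-Ray-backed volume and mount it into the Prometheus container:

```python
# Sketch using the Docker SDK for Python: create a volume backed by the
# REX-Ray EFS plugin and mount it into a Prometheus container, so the TSDB
# survives container restarts and moves across Availability Zones.
# All names here are illustrative; the "rexray/efs" plugin must already be
# installed on the Docker host.
import docker

client = docker.from_env()

# Volume provisioned by REX-Ray on Amazon EFS instead of local disk.
client.volumes.create(name="prometheus-data", driver="rexray/efs")

# Run Prometheus with its data directory on the persistent volume.
client.containers.run(
    "prom/prometheus",
    detach=True,
    name="prometheus",
    ports={"9090/tcp": 9090},
    volumes={"prometheus-data": {"bind": "/prometheus", "mode": "rw"}},
)
```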

The second challenge was finding a way to generate dashboards automatically, easily and programmatically. Grafana doesn't offer much in the way of APIs for this, and we found ourselves with the problem of versioning the configuration and having to manually repeat patterns to create new dashboards, alerts and so on.

To solve this and reduce the amount of manual work we found GrafanaLib, a Python library from Weaveworks. It allows us to generate dashboards from simple Python scripts that are easily managed and source-controlled.
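A minimal sketch of the kind of dashboard definition GrafanaLib enables follows; the data source, metric names and panels are illustrative, not our production dashboards:

```python
# service.dashboard.py -- a minimal GrafanaLib sketch (illustrative names).
# It can be rendered to Grafana JSON with the bundled CLI, e.g.:
#   generate-dashboard -o service.json service.dashboard.py
from grafanalib.core import Dashboard, Graph, Row, Target

dashboard = Dashboard(
    title="Example service - RED overview",
    rows=[
        Row(panels=[
            Graph(
                title="Request rate",
                dataSource="Prometheus",
                targets=[
                    Target(
                        expr='sum(rate(http_requests_total{service="example"}[5m]))',
                        legendFormat="req/s",
                        refId="A",
                    ),
                ],
            ),
            Graph(
                title="95th percentile latency",
                dataSource="Prometheus",
                targets=[
                    Target(
                        expr='histogram_quantile(0.95, sum(rate('
                             'http_request_duration_seconds_bucket'
                             '{service="example"}[5m])) by (le))',
                        legendFormat="p95",
                        refId="A",
                    ),
                ],
            ),
        ]),
    ],
).auto_panel_ids()
```

Because the dashboard is just Python, repeated patterns (one RED row per service, shared thresholds, and so on) become ordinary functions instead of copy-pasted JSON.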

Future development: are we happy with our new monitoring?

We are happy with how our new architecture turned out: it works and it's starting to really help us keep our software under control, providing more information, faster, unified into easy-to-manage dashboards. In a recent scenario, thanks to the new dashboards for HTTP services, we noticed an unusual response time from a search engine service when called by a specific client; further investigation led us to discover that a particular series of parameters was slowing down the response. We were then able to quickly address the case and make it return results in a reasonable time again.

We are planning to integrate it with our continuous integration systems: this way, when we create a new service, a dashboard definition (JSON) will be generated automatically and picked up by Grafana, so the dashboard is updated without any manual work.
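One possible shape for that automation step, assuming the dashboard object from the previous sketch and using GrafanaLib's JSON encoder (an internal module, so treat the import path as an assumption):

```python
# Sketch of a CI step: serialize a GrafanaLib dashboard to JSON so it can be
# provisioned into Grafana (via the HTTP API or file-based provisioning).
# The DashboardEncoder import path and file names are illustrative.
import json

from grafanalib._gen import DashboardEncoder

def write_dashboard_json(dashboard, path):
    # Dump the dashboard model to the JSON format Grafana expects.
    with open(path, "w") as fh:
        json.dump(dashboard.to_json_data(), fh,
                  indent=2, sort_keys=True, cls=DashboardEncoder)
```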

Our next monitoring-related improvement will be about application monitoring, especially for legacy code.

Would you have done things differently? Let us know in the comments below.

THRON tech blog

THRON's tech blog. THRON is a Digital Asset Management and Product Information Management SaaS that features automatic content classification (ML), real-time content rendition and real-time data analysis to perform content recommendation. www.thron.com

Written by Domenico Stragliotto, Backend developer @ THRON
