How we implemented RED and USE metrics for monitoring
Putting Prometheus, Grafana, RED and USE metrics all together to improve monitoring
In a previous article we described the importance of monitoring from the end-user perspective for customer-centric companies. In this article we want to describe in more detail the technology stack we chose for internal monitoring, the one used by our engineers to ensure all systems keep working, now and for the foreseeable future.
What is Monitoring?
Monitoring is the art of collecting, processing, aggregating, and displaying real-time quantitative data about a system. You could monitor the number and types of queries, errors, processing times, server uptimes and so on. Monitoring is a crucial and essential part of every software system because it helps you keep your systems under control, react quickly and proactively to unexpected problems, and ultimately prevent or reduce downtime.
Amon/PRTG and why we chose to replace them
Before refactoring our monitoring we had two systems used by different teams; both had limitations and neither completely covered our needs.
Amon is an open-source server monitoring platform that runs directly on the server. While it was useful, it had some limits that became important:
- the monitoring was limited to just EC2 servers;
- we couldn’t use it for application-level monitoring;
PRTG Network Monitor is agent-less network monitoring software from Paessler AG. It can monitor and classify system conditions like bandwidth usage or uptime and collect statistics from miscellaneous hosts such as switches, routers, servers and other devices and applications. But it also was not the perfect solution:
- it runs on Windows machines, which increased our operational and cloud infrastructure costs;
- it didn’t work well with autoscaling instances;
- it only sent email alerts, and integrating it with Slack was laborious.
Different approaches to monitoring
Before focusing on software selection we spent some time separating the goals of the different approaches:
- White-box monitoring: this approach is based on metrics exposed by the internals of the system. In the white-box category we have the complete map of the system: we know every detail about processes and components, and nothing is hidden or closed to us. It includes logs and interfaces like the JVM Profiling Interface or HTTP handlers emitting internal statistics. The success of this monitoring type depends on the ability to inspect the innards of the system with the correct instrumentation. White-box monitoring allows the detection of imminent problems, failures masked by retries and, of course, plain failures :-)
- Black-box monitoring: this approach is based on testing externally visible behaviour as a user would see it. We don’t know how the system works internally, but we can quantitatively report when metrics are exceeded or have changed significantly (a 10% variation is a good reference point, for better or for worse) and under different conditions (time, geographical location, connection method, different computers and/or operating systems, etc.). This methodology gives a perception of how users experience your service, but it won’t usually help you prevent issues from arising.
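The white-box idea above can be sketched in a few lines of Python: the application itself maintains and exposes internal counters, the way an HTTP handler emitting internal statistics would. This is an illustrative toy, not a real client library; all names are hypothetical.

```python
# Minimal white-box instrumentation sketch (hypothetical names):
# the service increments its own counters and can render them
# as plain text, one "name value" line per counter.
class StatsRegistry:
    def __init__(self):
        self.counters = {}

    def inc(self, name, value=1):
        self.counters[name] = self.counters.get(name, 0) + value

    def render(self):
        # Plain-text exposition of the internal state.
        return "\n".join(f"{k} {v}" for k, v in sorted(self.counters.items()))

stats = StatsRegistry()

def handle_request(ok=True):
    # The handler instruments itself: this is what makes it white-box.
    stats.inc("requests_total")
    if not ok:
        stats.inc("request_errors_total")

handle_request()
handle_request(ok=False)
print(stats.render())
```

A black-box probe, by contrast, would live outside the process and only see what a user sees: response codes and response times.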
Our monitoring goals
You can’t improve what you don’t measure (Peter Drucker), so monitoring is the most important starting point for improving your product (performance, reliability, and much more). We wanted to evolve our existing monitoring architecture to improve our ability to reach the following goals:
- Analyzing long-term trends: how big is my database and how fast is it growing? How quickly is my daily-active user count growing?
- Alerting: something is broken, and somebody needs to fix it right now! Or, something might break soon, so somebody should check soon.
- Building dashboards: dashboards should answer basic questions about your service, and normally include some form of the HTTP metrics.
- Conducting ad hoc retrospective analysis (i.e., debugging): our latency just shot up; what else happened around the same time?
Prometheus or “How to fire up your monitoring”
Prometheus is an open-source ecosystem for monitoring and alerting, with a focus on reliability and simplicity. Since its inception many companies and organisations have adopted Prometheus, and the project has a very active community of both users and developers.
We chose to adopt Prometheus for its many features that allow us to satisfy our different needs in different parts of our software and infrastructure:
- a data model based on time series data identified by metric name and key/value pairs
- a really flexible and powerful query language that helps aggregate data; the results can be aggregated in real time and shown directly, or consumed via an HTTP API to allow external systems to display the data
- no reliance on distributed storage; each node is a single, autonomous server
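The data model in the first point can be illustrated with a toy sketch (this is not Prometheus's actual storage, just the identification scheme): each time series is identified by its metric name plus a set of key/value label pairs, and holds timestamped samples. Two different label sets under the same metric name are two distinct series.

```python
# Toy model of Prometheus's time-series identity:
# a series key is (metric name, frozen set of label pairs),
# and each series holds (timestamp, value) samples.
series = {}

def record(name, labels, timestamp, value):
    key = (name, frozenset(labels.items()))
    series.setdefault(key, []).append((timestamp, value))

record("http_requests_total", {"method": "GET", "status": "200"}, 1000, 42)
record("http_requests_total", {"method": "GET", "status": "500"}, 1000, 3)
record("http_requests_total", {"method": "GET", "status": "200"}, 1015, 45)

# One metric name, two label sets: two distinct series.
print(len(series))  # → 2
```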
We also chose it because it was designed with micro-service infrastructures like ours in mind; its support for multi-dimensional data collection and querying is a very relevant strength.
Prometheus is designed for reliability: it is meant to be the system you refer to during an outage to quickly diagnose problems. Since each server is standalone, depending on no network storage or other remote services, we can rely on it even when other parts of the infrastructure are not responding.
Grafana - The colourful way of reading your data
Grafana is open source software used to display time-series analytics. It allows us to query, visualise and generate alerts from our metrics. The big plus of Grafana is its native integrations with a lot of data sources: if in the future we need to add or change data sources besides Prometheus, we can do so with little effort, and we can even aggregate graphs and data from different sources in the same dashboard.
Grafana also allows us to create and configure alerts very quickly and easily while viewing the data: we can define thresholds and get automatically notified via Slack if problems arise.
The Four Golden Signals
The Four Golden Signals are a series of metrics defined by Google Site Reliability Engineering that are considered the most important when monitoring a user-centric system:
- Latency: the time it takes to service a request;
- Traffic: a measure of how much demand is being placed on the system;
- Errors: the rate of requests that fail;
- Saturation: how “full” our service is, basically how close we are to exhausting system resources.
We don’t use exactly those four metrics; instead we chose to work with two different methods, each using a subset derived from these four, depending on what we are monitoring: the RED Method for HTTP metrics and the USE Method for infrastructure.
From the Four Golden signals to the RED way of creating Metrics
The RED Method is a subset of the Four Golden Signals focused on micro-service architectures; it includes these metrics:
- Rate: the number of requests our service serves per second;
- Errors: the number of failed requests per second;
- Duration: the amount of time it takes to process a request.
Measuring these metrics is pretty straightforward, especially with tools like Prometheus, and using the same metrics for every service helps us create a standard and easy-to-read format for dashboards that have to show the results.
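As a simplified illustration of how these three numbers fall out of raw request data, here is a sketch that derives Rate, Errors and Duration from a small batch of observed requests. In production Prometheus does this with counters and histograms queried via its functions, but the arithmetic is the same idea; the field names here are illustrative.

```python
# Hypothetical batch of completed requests observed over a 10-second window:
# each entry records the HTTP status and the request duration in seconds.
requests = [
    {"status": 200, "duration": 0.12},
    {"status": 200, "duration": 0.30},
    {"status": 500, "duration": 0.05},
    {"status": 200, "duration": 0.18},
]
window_seconds = 10

# Rate: requests per second over the window.
rate = len(requests) / window_seconds
# Errors: failed (5xx) requests per second over the window.
errors = sum(r["status"] >= 500 for r in requests) / window_seconds
# Duration: a percentile is more informative than an average; take the median.
durations = sorted(r["duration"] for r in requests)
p50 = durations[len(durations) // 2]

print(rate, errors, p50)  # → 0.4 0.1 0.18
```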
Using the same metrics for every service and treating them the same way, from a monitoring perspective, helps operations teams scale, reduces the amount of service-specific training the team needs, and reduces the service-specific special cases on-call engineers need to remember in high-pressure incident response scenarios — what is referred to as “cognitive load”.
Infrastructure and the USE method
The USE Method is more focused on infrastructure monitoring, where you have to keep physical resources under control, and it is based on just three parameters:
- Utilization: the proportion of the resource that is used, so 100% utilization means no more work can be accepted;
- Saturation: the degree to which the resource has extra work which it can’t service, often queued;
- Errors: the count of error events;
While this method initially helped us identify which specific metrics to use for each resource (CPU, memory, disks, …), our next task was to interpret their values, and that’s not always obvious.
For example, while 100% utilization is usually a sign of a bottleneck that must be addressed, even a constant 70% utilization can be a sign of a problem, because it can hide short bursts of 100% utilization that were not caught when the metric was averaged over a period longer than the bursts.
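This averaging pitfall is easy to demonstrate with a toy example: in the sketch below, per-second utilization samples that include several full seconds of saturation still average out to exactly 70%, which is what a dashboard averaging over the window would show.

```python
# Hypothetical per-second CPU utilization samples (percent).
# Several seconds sit at full saturation, yet the average looks tame.
samples = [100, 100, 40, 40, 40, 100, 100, 40, 70, 70]

average = sum(samples) / len(samples)          # what an averaged metric reports
peak = max(samples)                            # what actually happened
saturated_seconds = sum(s == 100 for s in samples)

print(average, peak, saturated_seconds)  # → 70.0 100 4
```

The averaged value alone would never reveal the four seconds of saturation, which is why we also look at maxima and shorter windows.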
The USE Method helped us identify potential system bottlenecks and take appropriate countermeasures, but it requires cautious investigation, since systems are complex; when you see a performance problem:
it could be a problem but not the problem.
Each discovery must be investigated with adequate methodologies before proceeding to check parameters on other resources.
Problems encountered during development
While implementing our new monitoring we encountered two challenges that we had to overcome.
The first challenge was having a monitoring system fully deployed on containers, which posed a big question: storage management. Containers do not natively offer persistent storage, so if a container goes away for any reason we lose the data stored in it.
As a solution for this problem we found REX-Ray, a project focused on creating enterprise-grade storage plugins for the Container Storage Interface (CSI). REX-Ray provides a vendor agnostic storage orchestration engine. The primary design goal is to provide persistent storage for Docker, Kubernetes and Mesos. Since we use Docker it was a good solution for us.
At first we tried its Amazon EBS integration, but a problem arose: an EBS volume lives in a single availability zone, so when a container is moved to another availability zone it loses its connection to the storage. We then switched to Amazon EFS, which is available across the whole AWS Region; this means we never lose the link to the storage, even when a container moves between availability zones.
The second challenge was finding a way to generate dashboards automatically, easily and programmatically. Grafana’s API is limited, and we found ourselves with the problem of versioning the configuration, as well as having to manually repeat patterns to create new dashboards, alerts and so on.
To solve this and reduce the amount of manual work we found GrafanaLib, a Python library from Weaveworks. It allows us to generate dashboards from simple Python scripts that are easily managed and source-controlled.
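The “dashboards as code” idea behind GrafanaLib can be sketched even without the library itself: a plain Python function emits a simplified subset of Grafana’s dashboard JSON model, so the same RED panels can be stamped out for every service and kept under version control. The metric names, panel fields and service name below are illustrative, not our actual configuration.

```python
import json

# Sketch of dashboards-as-code: one function produces the same three
# RED panels for any service, as (simplified) Grafana dashboard JSON.
def red_dashboard(service):
    metric = service.replace("-", "_")  # hypothetical naming convention
    return {
        "title": f"{service} - RED metrics",
        "panels": [
            {"title": "Rate", "type": "graph",
             "targets": [{"expr": f"rate({metric}_requests_total[5m])"}]},
            {"title": "Errors", "type": "graph",
             "targets": [{"expr": f"rate({metric}_errors_total[5m])"}]},
            {"title": "Duration", "type": "graph",
             "targets": [{"expr": f"{metric}_request_duration_seconds"}]},
        ],
    }

dashboard = red_dashboard("search-engine")
print(json.dumps(dashboard, indent=2))
```

Because the output is plain JSON built by code, adding a service means one function call instead of clicking through the UI, and the scripts diff cleanly in source control.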
Future development, are we happy with our new monitoring?
We are happy with how our new architecture turned out: it works, and it’s starting to really help us keep our software under control. It provides more information, more quickly, unified into easy-to-manage dashboards. In a recent case, thanks to the new dashboards for HTTP services, we noticed an unusual response time from a search-engine service when called by a specific client; further investigation revealed that a particular series of parameters slowed down the response. We were then able to quickly address the case and make it return results in reasonable time again.
We are planning to integrate it with our continuous integration systems, so that when we create a new service a JSON definition is generated automatically, picked up by Grafana, and the dashboard is updated without any manual work.
Our next monitoring-related improvement will be application-level monitoring, especially for legacy code.
Would you have done things differently? Let us know in the comments below.