What is monitoring and why should we care

Ivan Borshukov
Feb 15, 2018


This post covers (part of) the material that I presented at Cloud Foundry Day, held in Sofia, Bulgaria.

Let’s start from the beginning and see what monitoring actually is. If you ask an unfamiliar person, they’ll most likely imagine this.

How people imagine monitoring.

And if we look up the formal definition, it is not far off.

Monitoring is collecting, processing, aggregating and displaying real-time quantitative data about a system.

But what does quantitative data really mean? We can think of numbers. And those numbers are called metrics. And when we have numbers employed, we usually need some abstractions that help us deal with them. In monitoring this is the case as well. There are three basic abstractions — counter, gauge and histogram (also called a distribution).

Counters

A counter represents a single numerical value that only ever goes up. An example of a counter would be the number of processed HTTP requests or the number of executed database queries. A useful thing about counters is that they let you calculate rates of change, e.g. there are 50 requests per second at this moment.

Gauges

A gauge represents a numerical value that can arbitrarily go up and down. A typical example is the request latency or the time needed to execute a database query.

Histogram (or Distribution)

A histogram samples observations and counts them in different buckets. It could also provide the sum of all observed values. This data could be used for computing statistical functions such as the average, median, percentiles, standard deviation and so on. Typical sampled observations are request durations or response sizes.
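A minimal histogram sketch in Go, assuming fixed bucket bounds chosen up front (the type and field names are illustrative):

```go
package main

import "fmt"

// Histogram counts observations into predefined buckets and keeps the sum,
// mirroring the abstraction described above.
type Histogram struct {
	Bounds  []float64 // upper bounds, sorted ascending
	Buckets []uint64  // len(Bounds)+1 counts; the last one is the overflow bucket
	Sum     float64
	Count   uint64
}

func NewHistogram(bounds []float64) *Histogram {
	return &Histogram{Bounds: bounds, Buckets: make([]uint64, len(bounds)+1)}
}

// Observe records one sampled value, e.g. a request duration.
func (h *Histogram) Observe(v float64) {
	i := 0
	for i < len(h.Bounds) && v > h.Bounds[i] {
		i++
	}
	h.Buckets[i]++
	h.Sum += v
	h.Count++
}

// Average is one statistical function derivable from the aggregated data
// without ever storing the individual events.
func (h *Histogram) Average() float64 { return h.Sum / float64(h.Count) }

func main() {
	lat := NewHistogram([]float64{10, 50, 100}) // duration buckets in ms
	for _, d := range []float64{4, 8, 30, 70, 250} {
		lat.Observe(d)
	}
	fmt.Println(lat.Buckets)   // [2 1 1 1]
	fmt.Println(lat.Average()) // 72.4
}
```

Percentiles can likewise be estimated from the bucket counts, which is how monitoring systems answer questions such as "what is the 99th-percentile latency".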

Now that we’ve covered what metrics are, let’s see where and how we can collect them. When talking about where, there are two approaches — metric collection can be either external or internal to the application.

External metric collection

The first and the most widely adopted approach is to collect metrics from outside of the applications. In this case, collection is done by an external application often called an agent.

Agents collect some predefined set of metrics for our applications by querying the underlying operating system, or in some cases the application itself. Typical examples of collection from the OS include metrics about the CPU, RAM and file descriptors consumed by the application. However, in some cases the application itself is capable of answering questions about its own state — for example, if we’re running a Redis instance, the agent could collect metrics about it by running the INFO command.

Collecting metrics from outside of applications using a collecting agent.

If you’re looking for an agent that could help you with monitoring, you should first consider checking whether the platform your application is running on comes with out-of-the-box support for any metrics, or if there is an open-source project that implements such functionality already. For example, if you’re running on Kubernetes, you could check the metrics API and the kube-state-metrics project. If you’re running on top of Cloud Foundry, you should check mozzle.

Internal metric collection

The second approach that we’re going to cover is collecting metrics from inside our applications. Such metrics are called business metrics. While they provide more benefits, they come at the price of code instrumentation — you have to write the code that measures what you’re interested in. And measuring is not enough: you have to use some mechanism that allows you to either expose the metrics or transport them to a monitoring system for later processing.

Collecting metrics from inside of application using code instrumentation.

Whether you expose the metrics from the application or transport them to an external system depends mostly on the type of monitoring system that you’ll be using. There are two main types of monitoring systems — event-based and aggregation-based.

Systems and transportation types

The main characteristic of event-based systems is that they operate on individual events, such as log messages or instant measurements. All events are stored by the system and can be aggregated into metrics on demand. An example you could think of is using logs produced by a reverse-proxy component (e.g. nginx) to compute summaries about request durations or response sizes. One benefit of having all events is that we can see the exact values for outliers or for subsets of the events we’re interested in. However, recording and storing all events can require a lot of resources, such as storage or network bandwidth, which in some cases might be undesirable. In such cases we could consider an aggregation-based system.

In contrast to event-based systems, aggregation-based systems do not operate on individual events but rather on metrics aggregated from those events. This can save us both network bandwidth and persistent storage. The most straightforward implementation is to aggregate individual events into metrics at the application level and only expose (or transport) the aggregated data. While this approach consumes fewer resources, it also reduces metric precision.
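To make the trade-off concrete, here is a sketch of application-level aggregation in Go: individual duration events are folded into a small summary, and only the summary would ever leave the process. The `Summary` type is illustrative:

```go
package main

import "fmt"

// Summary aggregates individual request-duration events inside the
// application; only these three numbers are exposed, not the raw events.
// The trade-off from the text: less bandwidth and storage, but we can no
// longer inspect individual outliers.
type Summary struct {
	Count uint64
	Sum   float64
	Max   float64
}

func (s *Summary) Observe(v float64) {
	s.Count++
	s.Sum += v
	if v > s.Max {
		s.Max = v
	}
}

func main() {
	var durations Summary
	for _, ms := range []float64{12, 8, 95, 20} {
		durations.Observe(ms)
	}
	// Instead of shipping four events, we expose one aggregate.
	fmt.Printf("count=%d sum=%.0f max=%.0f\n", durations.Count, durations.Sum, durations.Max)
}
```

An event-based system would instead ship all four values and compute the same (and richer) summaries on demand.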

We’ve mentioned the terms “expose” and “transport” several times without giving details. They describe how events or aggregated metric data are ingested by the monitoring system. There are two approaches — we can either push data into the monitoring system, or have the system pull data out of our application instances.

Pushing vs pulling monitoring data.

In the push approach, each application instance is responsible for reporting its own set of measurements to the monitoring system. This brings in the requirement that all applications know where the system is located. In the pull approach, applications are only responsible for exposing their measurements using a well-defined interface (e.g. via HTTP endpoints such as /varz, /healthz or /debug/vars). The burden of collecting data from each application is left to the system itself. While this reduces the application’s responsibilities, it requires that each application instance is addressable, which might not be the case (e.g. if you’re using an application platform such as Cloud Foundry).

Types of monitoring

Now that we’ve covered most of the technical details about monitoring, I’d like to focus on the incentives for actually doing it. I think we can identify three main types of monitoring, based on those incentives.

The first, most basic, and unfortunately most widely used approach is manual monitoring. Manual monitoring is employed by individuals whenever they’re debugging a particular issue. The main drawback is that manual monitoring helps only the person that is currently troubleshooting, and it takes time until they find the proper things to monitor. Examples of manual monitoring are running htop on a remote server, counting the number of open file descriptors of a process, or interacting directly with the coordinator node of your distributed storage system.

Manual monitoring is bad. Do not rely on it.

The next type of monitoring that I’ve seen widely adopted is the reactive monitoring approach. In reactive monitoring, we learn from our own mistakes and set up monitors for things that have gone wrong in the past. E.g. if our application had a bug and leaked file descriptors, we set up a monitor that constantly watches the number of used file descriptors. But this approach also has a critical flaw — such monitors are very unlikely to be helpful in isolation, and most often they hide the cause and show only the effects of bugs. Let’s go back to the file-descriptor example and assume that the metrics show very high usage. What’s going on? If we’re only monitoring the number of used FDs, we cannot tell. Our application might still be leaking them, or all of those descriptors might in fact be in use serving user requests, or we might have opened too many connections to the database. We do not know. Avoid monitoring things in isolation.

Can we do better?

Smart people learn from their mistakes. Wise people learn from other people’s mistakes.

We can. By being wise. But what does being wise even mean? It means being proactive about monitoring — starting to think about it from day 0, when you actually start developing your application. It means investing effort in instrumenting your own application code and exposing business metrics about your system. Display those metrics, observe and study the behaviour of the system, and change whatever is necessary, including the metrics themselves. Designing and integrating the monitoring of a system is an iterative process, just like its development. And the metrics should be crucial to managing your system and even your business. If you don’t believe me, just think about who benefits.

The first, and more obvious, group of people that benefits from proper monitoring is the development and operations teams. Having monitoring in place allows you to easily answer questions like “Is the application leaking or consuming too many resources?”, “Is the new feature that we’ve just launched performing well?”, “What is the system’s overall health?”. You can even perform tasks like capacity planning based on the collected metrics.

The second group of people is the business itself. If you’re measuring the correct business metrics, they can aid in answering questions like “Are customers able to use what they’re paying for?”. Apart from answering questions about the current state of the system, monitoring data can guide our planning process and answer questions about the future, such as “Should we invest in feature A or feature B?”. Last, but not least, we can use those metrics to pinpoint parts of the system that might be optimised in order to reduce infrastructure cost.

If you currently have a service or two, you won’t see the benefit in the beginning. But as the number of services grows, the benefits of having metrics will definitely outgrow the costs. And if you haven’t started early, most likely you won’t have time to start now either.

If you take one thing from this post, it should be this. Be wise. Be proactive. You cannot plug proper monitoring into an already built system, just like you cannot plug security into it.

