Building Metrics into your Infrastructure is not good enough!

Nir Alfasi · Israeli Tech Radar · Jun 8, 2020

[Image: Example of a Prometheus dashboard]

Metrics can be divided into two categories: infrastructure-metrics and business-metrics.

Infrastructure-metrics include signals about the system/OS such as memory, disk, and CPU. Google defines four golden signals for HTTP services: latency, traffic, errors, and saturation; these may also be considered infrastructure-metrics.

Business-metrics provide visibility into the business/application.

A few examples:

  • Number of visitors ≠ number of HTTP calls.
  • Conversion-rate is not always trivial to measure using infrastructure-metrics.
  • A/B testing is very difficult, if not impossible, to achieve using infrastructure-metrics alone.

To publish business-metrics, we have to give our engineers an easy-to-use mechanism to emit them from the application; there’s just no way around it!

The good news is that it’s not very difficult!

One of the smart things that my friend, Shahar Solomianik, did in one of his recent projects at Hippo was to create a generic HTTP-client (built on top of axios) and replace all the HTTP calls in the project with calls to this client.

The obvious benefit of doing so was consolidation: before that, many different libraries were used for the same purpose, including request, request-retry, http/s, request-promise, fetch, and possibly others.

As a bonus, we got the metrics that the HTTP-client publishes (traffic, error rate, and latency). Every new and existing HTTP call now publishes these (infra) metrics “out of the box”!
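To give a feel for what such a wrapper involves, here’s a minimal sketch of an axios-based client that records request latency and status via prom-client. The names and structure are my own assumptions for illustration, not Hippo’s actual client:

const axios = require('axios');
const client = require('prom-client');

// one histogram covers traffic (count), error rate (status label) and latency (buckets)
const outgoingRequestDuration = new client.Histogram({
  name: 'outgoing_http_request_duration_seconds',
  help: 'Duration of outgoing HTTP requests in seconds',
  labelNames: ['method', 'status'],
});

const httpClient = axios.create();

// stamp each outgoing request with a start time
httpClient.interceptors.request.use((config) => {
  config.metadata = { start: Date.now() };
  return config;
});

// record duration and status for both successful and failed responses
httpClient.interceptors.response.use(
  (response) => {
    record(response.config, response.status);
    return response;
  },
  (error) => {
    if (error.config) {
      record(error.config, error.response ? error.response.status : 'network_error');
    }
    return Promise.reject(error);
  }
);

function record(config, status) {
  const seconds = (Date.now() - config.metadata.start) / 1000;
  outgoingRequestDuration.observe({ method: config.method, status }, seconds);
}

module.exports = httpClient;

Any code that imports httpClient instead of a raw HTTP library gets these metrics for free, which is exactly the “out of the box” effect described above.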

Publishing metrics from the application should not be difficult if it’s well thought out and planned. For example, Prometheus works great for micro-services, which can expose a /metrics endpoint on their server (Prometheus scrapes that endpoint to collect metrics). But when you want to publish metrics from a batch-job or any offline process, you have to go out of your way to set up and deploy a Pushgateway, which is required due to the offline nature of the task.
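A rough sketch of the two cases, assuming a recent version of prom-client (where these calls return promises) and a placeholder Pushgateway URL:

const express = require('express');
const client = require('prom-client');

// a long-running service simply exposes /metrics for Prometheus to scrape
const app = express();
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(3000);

// an offline batch-job has nothing to scrape, so it pushes its metrics instead
const gateway = new client.Pushgateway('http://pushgateway.example.com:9091');

async function runBatchJob() {
  // ... do the offline work ...
  await gateway.pushAdd({ jobName: 'nightly_batch_job' });
}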

I’m not saying that having metrics embedded into your infrastructure is necessarily a bad thing; the main claim here is that it’s not sufficient. Infrastructure-metrics are complementary to application-metrics, and they cannot substitute for one another: while infrastructure-metrics may signal abnormal behavior, they won’t point you to the source of the issue. Furthermore, it could be that there was no issue at all and what you’re seeing is a transient network glitch!

The biggest advantage of business-metrics is that they provide context which is invaluable when you’re investigating a production issue.

At Hippo we use Kubernetes and Prometheus to publish metrics (displayed in Grafana). There are many plugins that automatically publish infra-metrics, and we do use them!

In fact, we encourage engineers to consider and publish application-metrics for any new feature they build: we’d like new counters / gauges / histograms that provide better visibility into the core of the business. At the same time, we provide infrastructure-metrics “out-of-the-box”, so our engineers don’t carry the mental burden of publishing them and can focus on publishing good business-metrics.

To show how easy it is to publish metrics with Prometheus in Node, here’s a small example that runs on one of our workers:

const dailyProcessCounter = generateCounter({
  name: 'daily_process',
  desc: "Counts the number of daily processes that started/ended/errored",
  options: { labelNames: ['status'] }, // (labels = dimensions)
});

// ...
try {
  // ...
} catch (e) {
  dailyProcessCounter.inc({ status: ReportMetrics.error });
  // ...
}
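For reference, a counter like this can also be created with prom-client directly. The snippet below is only a sketch of what a helper like generateCounter might hide (it is an in-house wrapper; this is not its actual implementation):

const client = require('prom-client');

const dailyProcessCounter = new client.Counter({
  name: 'daily_process',
  help: 'Counts the number of daily processes that started/ended/errored',
  labelNames: ['status'],
});

// increment the "error" dimension of the counter
dailyProcessCounter.inc({ status: 'error' });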

If you found this article interesting and you’d like to help us solve problems such as this one: Hippo is hiring!
