Capturing Application Metrics

There are many solutions available for recording and displaying application performance metrics. These typically take the form of technical monitoring tools sending data to a time-series database (such as graphite). Visualisation of these metrics has been dominated in recent years by the open-source Grafana dashboard tool.

Monitoring custom metrics

These standard tools shine when monitoring standard technical metrics: requests per second to an endpoint, the number of 20X responses, the number of 40X responses and so on.

The facility to monitor application-level metrics (what the actual business logic is doing) is less readily served by these default tools. Exposing such metrics usually required writing scripts that would make requests to the underlying service (possibly using a bespoke RESTful endpoint) and then push the results to graphite via netcat.

This led to the development of good client libraries that made capturing and exposing custom metrics easier, reducing the burden on developers. The Java space has been well served in this area over the years with CodaHale/Dropwizard Metrics, javasimon and Netflix's Servo (now Spectator), to name just three.

In the THG Warehouse software team we use Spring Boot as our de facto application framework, and it also has good support for monitoring and technical metrics via the Actuator framework.

Micrometer

The new kid on the block, Micrometer, is now the default metrics library for Spring Boot 2.0. Developers already familiar with Dropwizard Metrics should find the transition to Micrometer fairly straightforward.

Application level metrics

Beyond the technical health of a service, there is often a need to record metrics that are used in other contexts: are we meeting an SLA? Can we use a metric as input into an automated decision-making process?

For these use cases we can use the same libraries and time-series infrastructure that we use for our technical health metrics.

Spring Boot starter & autoconfiguration

A Spring Boot facility for sharing common configuration and code is autoconfiguration. This feature allows us to define the common libraries that need to be present to record metrics using Micrometer. When these libraries are present on the classpath, the autoconfiguration makes the MeterRegistry available.

In this autoconfiguration code we define a common set of tags that will be automatically applied to every metric recorded by the applications. These common tags will allow us to filter and slice the data in the aggregation tier.
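The autoconfiguration class itself is not reproduced in this post; a minimal sketch of how such a class might register common tags could look like the following. The class name, property names and tag keys are illustrative assumptions, not our production code.

```java
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.boot.autoconfigure.condition.ConditionalOnClass;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Illustrative sketch of a shared metrics autoconfiguration class.
@Configuration
@ConditionalOnClass(MeterRegistry.class)
public class CommonMetricsAutoConfiguration {

    @Bean
    public MeterRegistryCustomizer<MeterRegistry> commonTags(
            @Value("${spring.application.name:unknown}") String application,
            @Value("${metrics.environment:dev}") String environment) {
        // Tags added here are applied to every metric the application records,
        // so the data can be filtered and sliced in the aggregation tier.
        return registry -> registry.config().commonTags(
                "application", application,
                "environment", environment);
    }
}
```

For Spring Boot to treat this as autoconfiguration rather than ordinary configuration, the class would also need to be listed under org.springframework.boot.autoconfigure.EnableAutoConfiguration in META-INF/spring.factories.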

With the MeterRegistry configured via Spring autoconfiguration, it is straightforward to use it in an application, for example:
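The embedded example from the original post is not shown here; the following is a minimal sketch assuming a warehouse-style service, where the class, metric and tag names are all invented for illustration.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

// Illustrative service: class, metric and tag names are invented for this sketch.
@Service
public class DespatchService {

    private final Counter despatchedParcels;
    private final Timer pickTimer;

    public DespatchService(MeterRegistry registry) {
        // A counter for a business-level event, tagged so it can be sliced later.
        this.despatchedParcels = registry.counter("warehouse.parcels.despatched", "site", "manchester");
        // A timer recording how long the pick step takes.
        this.pickTimer = registry.timer("warehouse.pick.duration");
    }

    public void despatch(String parcelId) {
        pickTimer.record(() -> pick(parcelId));  // time the business operation
        despatchedParcels.increment();           // count the business event
    }

    private void pick(String parcelId) {
        // business logic elided
    }
}
```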

“Models” of metrics

Spend any time investigating the application monitoring space and, amongst the more Linux sysadmin-focused tooling (Nagios, Cacti, etc.), you will find some more esoteric products. However, most tend to fall into one of three broad categories:

  • Push-based: each server pushes data directly to a central point. This is the most popular model within a LAN and fits neatly with the original open-source tooling of carbon/graphite.
  • Pull-based: a central server polls endpoints, or uses JMX or some other management API, to collect data. This model was made significantly more popular by Prometheus and works well in cloud-based environments.
  • Streaming: each server emits a constant stream of events that an aggregator subscribes to and samples. This is more of a niche model, used more frequently in analytics processing; Riemann embodies this approach for system monitoring.

After evaluating each of the models, as a team we decided that the pull-based model would work well for our components.

With the decision made, we added our autoconfiguration, and with Micrometer wired into the components we now had the tools to capture metrics from inside the application and make these application-level metrics available over HTTP, ready to be scraped by our Prometheus cluster.
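Exposing the scrape endpoint is then largely a matter of Actuator configuration; a sketch is shown below. With micrometer-registry-prometheus on the classpath, Spring Boot 2 serves the endpoint at /actuator/prometheus by default, so serving it at /prometheus as described in this post implies a customised management path, which is an assumption on my part.

```yaml
# application.yml sketch: expose the Prometheus scrape endpoint over HTTP.
# Default path in Spring Boot 2 is /actuator/prometheus; a /prometheus path
# would require customising the management base path (assumption).
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus
```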

Prometheus

Prometheus has become the de facto cloud-native standard for recording and aggregating system metrics via HTTP scraping. This 'pull model' of gathering system metrics from running services has benefits over the standard push model.

With each component or microservice now publishing application metrics to a /prometheus endpoint, the next step is to install and configure a Prometheus cluster. As we use AWS, we will deploy Prometheus via Cloudformation.
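On the Prometheus side, the corresponding scrape configuration is only a few lines; a sketch might look like the following. The job name, port and static target list are placeholders, since in AWS the targets would normally come from service discovery rather than a static list.

```yaml
# prometheus.yml sketch: scrape the application metrics endpoints.
scrape_configs:
  - job_name: 'warehouse-services'   # illustrative job name
    metrics_path: /prometheus        # the endpoint exposed by the applications
    scrape_interval: 15s
    static_configs:
      - targets: ['service-a:8080', 'service-b:8080']   # placeholders
```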

AWS Cloudformation

not that cloud formation

Cloudformation allows us to define infrastructure in a yaml file — infrastructure as code that is stored in git alongside any application code that needs to be deployed.

Quis custodiet ipsos custodes?
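The full template is too long to reproduce here; a heavily elided sketch of its overall shape, with illustrative resource and parameter names and most properties omitted, looks something like this:

```yaml
# Elided sketch of the Prometheus Cloudformation template; names are illustrative.
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
  InstanceType:
    Type: String
    Default: t3.medium

Resources:
  AccessRole:                          # IAM role so instances can read S3 and push logs
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      # Policies for S3 read access and Cloudwatch Logs write access elided ...

  InstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles:
        - !Ref AccessRole

  # Loadbalancer, listener/target group and Route53 RecordSet resources elided ...

  LaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      InstanceType: !Ref InstanceType
      IamInstanceProfile: !Ref InstanceProfile
      # ImageId, SecurityGroups and UserData (the cfn-init bootstrap) elided ...

  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      LaunchConfigurationName: !Ref LaunchConfig
      MinSize: '2'
      MaxSize: '4'
      # VPCZoneIdentifier, load balancer attachment and health checks elided ...
```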

Let's dissect this template a little. We start with some parameters that will be used later in the Cloudformation file. The first resources define the AccessRole and the InstanceProfile that our Prometheus service will need to be able to access resources and also to push logs to Cloudwatch. And yes, the irony of pushing logs from Prometheus (a monitoring tool) to Cloudwatch (another monitoring tool) is not lost on me… something about turtles all the way down…

The following three resources specify the Loadbalancer configuration and DNS record so that the Prometheus server can be accessed externally via HTTP.

The next section defines the LaunchConfig so that we can run Prometheus as a cluster in an Autoscaling Group (ASG). This configuration is important to allow us to have the reliability we want for our metrics and monitoring system.

The ASG is set to have a minimum size of 2 instances and can grow automatically up to 4 instances. The automatic scaling is a "nice to have"; the key property of an ASG that we want to leverage is that it will automatically detect and replace a failing node in the cluster without intervention.

Provisioning

Each node in the Prometheus cluster is provisioned using an AWS::CloudFormation::Init block. In this section we describe where to fetch the provisioning scripts (in our example we download them from AWS S3) and then pass parameters to the scripts when they are executed. The scripts themselves are defined in Ansible:
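The playbook is not reproduced in full; a condensed sketch of the kind of tasks involved might be as follows, where the version variable, download URL and paths are assumptions rather than the real playbook.

```yaml
# Sketch of the Prometheus provisioning tasks; versions, URLs and paths are illustrative.
- name: Fetch the Prometheus release tarball
  get_url:
    url: "https://github.com/prometheus/prometheus/releases/download/v{{ prometheus_version }}/prometheus-{{ prometheus_version }}.linux-amd64.tar.gz"
    dest: /tmp/prometheus.tar.gz

- name: Unpack Prometheus under /opt
  unarchive:
    src: /tmp/prometheus.tar.gz
    dest: /opt
    remote_src: yes

- name: Template the Prometheus configuration (scrape targets, remote storage)
  template:
    src: prometheus.yml.j2
    dest: /opt/prometheus/prometheus.yml

- name: Template the systemd unit file
  template:
    src: prometheus.service.j2
    dest: /etc/systemd/system/prometheus.service

- name: Start and enable Prometheus
  systemd:
    name: prometheus
    state: started
    enabled: yes
    daemon_reload: yes

# ... plus the tasks that install the "other component" alluded to below ...
```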

The main Ansible file here defines a set of tasks that will be performed by the AWS::CloudFormation::Init step when the ASG creates a new instance. Along with fetching the Prometheus binary and unpacking and installing it on the node, the eagle-eyed amongst you may have noticed that we also install another component…

M3

M3DB

Prometheus can run as a single node with local, disk-based storage of metrics. However, we wish to run a cluster (a load-balanced pair of Prometheus nodes), and as such we need a long-term storage mechanism, as we cannot easily share a single disk across two or more nodes.

Prometheus supports many integrations for this "remote storage". These range from traditional RDBMSs (including the ubiquitous PostgreSQL) to specialised time-series databases such as InfluxDB or OpenTSDB.

After evaluating these different options we decided to use M3DB as the time-series DB for our long-term metrics storage. Provisioning and configuring M3DB is fairly complex, however we thought it was (for our current use-case) the correct solution for the backing store.
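Wiring Prometheus to M3 is then a remote read/write configuration; a sketch of the relevant prometheus.yml fragment is below. The m3coordinator host name is invented, and the port and paths reflect M3's documented defaults, so verify them against your own deployment.

```yaml
# prometheus.yml fragment (sketch): use M3 as long-term remote storage.
# The m3coordinator host name is illustrative; 7201 is its default port.
remote_write:
  - url: "http://m3coordinator.internal:7201/api/v1/prom/remote/write"
remote_read:
  - url: "http://m3coordinator.internal:7201/api/v1/prom/remote/read"
    read_recent: true   # also answer queries for recent ranges from remote storage
```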

In a similar fashion to the Prometheus cluster, we use Cloudformation to provision the nodes for the M3DB cluster (as this is quite a long YAML document, I've elided most of it with … to leave just the most interesting sections):

Here the two largest parts of the configuration are the SecurityGroup definition, which allows the nodes to communicate with each other and with Consul and Prometheus, and the definition of the M3DB seed instances.

Having previously configured a Cassandra cluster and a Druid cluster, I've noticed a common theme amongst these modern, horizontally-scalable databases: the number of network ports they require to communicate and keep their state in sync (regardless of where they actually fall on the continuum of Brewer's CAP theorem). If you have any experience of etcd you may see some similarities too…
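As an illustration, a SecurityGroup fragment for the M3DB ports might look like the sketch below. The port ranges reflect M3DB and m3coordinator defaults (9000-9004 and 7201/7203) and the CIDR is a placeholder, so treat the details as assumptions rather than a copy of the real template.

```yaml
# Cloudformation fragment (sketch): ingress rules for the M3DB nodes.
M3DBSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: M3DB intra-cluster, coordinator and Prometheus access
    VpcId: !Ref VpcId                # assumed parameter
    SecurityGroupIngress:
      - IpProtocol: tcp              # m3dbnode node/cluster/HTTP/debug listeners
        FromPort: 9000
        ToPort: 9004
        CidrIp: 10.0.0.0/16          # placeholder VPC CIDR
      - IpProtocol: tcp              # m3coordinator API and metrics
        FromPort: 7201
        ToPort: 7203
        CidrIp: 10.0.0.0/16
      # Consul gossip ports elided ...
```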

M3DB Seeds

The M3DB cluster requires a fixed set of at least three nodes known as seed nodes. Unlike other nodes that will be looked up via our Consul service discovery, these need to have fixed IP addresses.

We pass in the three fixed IP addresses, one for each node, each in a different availability zone. The seed instance definitions in the Cloudformation template are identical, apart from one, which has the extra responsibility of initialising the M3 database topology and creating a metrics namespace in the M3 database.

Jinja2 template for M3DB topology
The namespace configuration

Similar to the Prometheus provisioning, the M3DB nodes are installed and configured via Ansible:
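Again the playbook is not shown in full; a condensed sketch, with the same caveats that the versions, download URL and paths are invented, might be:

```yaml
# Sketch of the M3DB provisioning tasks; versions, URLs and paths are illustrative.
- name: Fetch the M3 release tarball
  get_url:
    url: "https://github.com/m3db/m3/releases/download/v{{ m3_version }}/m3dbnode.tar.gz"   # invented file name
    dest: /tmp/m3dbnode.tar.gz

- name: Unpack m3dbnode
  unarchive:
    src: /tmp/m3dbnode.tar.gz
    dest: /opt/m3
    remote_src: yes

- name: Template the m3dbnode configuration (seed IPs, service discovery endpoints)
  template:
    src: m3dbnode.yml.j2
    dest: /etc/m3dbnode/m3dbnode.yml

- name: Start and enable m3dbnode
  systemd:
    name: m3dbnode
    state: started
    enabled: yes
```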

Results

Phew! With all the infrastructure provisioned and configured we can now:

  1. Programmatically capture metrics via the Micrometer library
  2. Expose them via a /prometheus endpoint
  3. Scrape the metrics with Prometheus and
  4. Store them durably in a time-series database
