Observability & monitoring — Part 03

Fathima Dilhasha
Published in The DevOps Journey, Dec 24, 2018

This post is a continuation of the posts Observability & monitoring — Part 01 and Observability & monitoring — Part 02.

The previous posts covered the basics of observability and metric monitoring, and defined the life cycle of a metric.

In this post, we will discuss how the tools covering the main stages of the metrics monitoring toolkit were evaluated and selected.

When designing the metrics monitoring toolkit, the goal was a system suitable for dynamic deployments, taking into account modern design paradigms such as microservice and container-based architectures. The toolkit should also be adaptable to any infrastructure, whether a standard on-premises deployment, a deployment hosted in the cloud or a container-based deployment.

Based on these basic requirements, a few tool stacks were shortlisted and evaluated. The stacks are as follows.

  • Prometheus with Grafana and Alertmanager
  • TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor)
  • Icinga2
  • Sensu and Graphite based stack

These tool stacks were compared against several criteria. The following is a summary of the comparison.

https://gist.github.com/Dilhasha/4d191d5cd2f17c89b9c97668fdf89957

Metrics collection usually follows either the push model or the pull model. In the push model, the collection tool waits passively while plugins or agents gather the metrics and send them to it. In the pull model, the collector itself is responsible for scraping the metrics from the defined endpoints. Prometheus, Telegraf and Icinga2 support the pull model, whereas Sensu does not, because it has no built-in collecting capabilities. Telegraf and Prometheus also support the push model, but Prometheus is best used in the pull model and Telegraf in the push model.
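
To make the difference concrete, here is a minimal in-memory sketch of the two models. It does not use any real monitoring API: `PullCollector`, `PushCollector` and `read_node_metrics` are hypothetical names invented for illustration, and the metric values are placeholders. The only point is who initiates the transfer.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]
Record = Tuple[str, str, float]  # (source, metric name, value)


class PullCollector:
    """Pull model: the collector knows its targets and asks them for metrics."""

    def __init__(self) -> None:
        self.targets: Dict[str, Callable[[], Sample]] = {}
        self.store: List[Record] = []

    def register(self, name: str, read_metrics: Callable[[], Sample]) -> None:
        self.targets[name] = read_metrics

    def scrape_all(self) -> None:
        # The collector initiates every transfer.
        for name, read_metrics in self.targets.items():
            for metric, value in read_metrics().items():
                self.store.append((name, metric, value))


class PushCollector:
    """Push model: the collector only receives; agents decide when to send."""

    def __init__(self) -> None:
        self.store: List[Record] = []

    def receive(self, source: str, sample: Sample) -> None:
        for metric, value in sample.items():
            self.store.append((source, metric, value))


def read_node_metrics() -> Sample:
    # Placeholder values; a real agent would read these from the system.
    return {"cpu_load": 0.42, "mem_used_ratio": 0.73}


pull = PullCollector()
pull.register("node-1", read_node_metrics)
pull.scrape_all()                             # collector initiates

push = PushCollector()
push.receive("node-1", read_node_metrics())   # agent initiates
```

Both collectors end up with the same records. What differs is where the target list lives: the pull collector must know every target, which is why node discovery matters for it, while the push collector does not need to.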

When it comes to storing the metrics, Prometheus has a built-in time series database, and it can also write metrics to external storage[1]. The TICK stack uses InfluxDB[2] for metric storage. InfluxDB supports data retention policies per database and also supports querying the stored metrics. Icinga2, on the other hand, does not store raw metrics; it saves only aggregated values as required. Graphite uses a numeric time series database called Whisper[3].
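
The idea behind a retention policy can be sketched in a few lines. This is only an illustration of the concept, not how InfluxDB implements it; `apply_retention` and the sample points are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (unix timestamp in seconds, value)


def apply_retention(points: List[Point], now: float, keep_seconds: float) -> List[Point]:
    """Keep only the points that are younger than the retention duration."""
    cutoff = now - keep_seconds
    return [(ts, value) for ts, value in points if ts >= cutoff]


# Three hypothetical points, evaluated at t=100 with a 60 second retention.
points = [(0.0, 1.0), (50.0, 2.0), (100.0, 3.0)]
kept = apply_retention(points, now=100.0, keep_seconds=60.0)
```

Here the point at t=0 is older than the cutoff (t=40) and is dropped, while the other two are kept.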

Although Prometheus and the TICK stack provide their own dashboarding and alerting solutions, other solutions can also be integrated with these systems. In a production setup, the node discovery capabilities of the solution are vital. Telegraf and Sensu do not require node discovery for the metric collector because they use the push model. Icinga2 can use PuppetDB as an import source, manage hosts with Ansible, or use Foreman[4] for host discovery. Prometheus handles node discovery through its configuration[5], which supports many infrastructures.
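
One of the discovery mechanisms in the Prometheus configuration is file-based discovery, where a scrape job lists target files under `file_sd_configs` and Prometheus picks up changes to them. As a rough sketch, such a target file could be generated from an inventory like this. The inventory, the host names and the `env` label are assumptions made up for the example; port 9100 is the usual node_exporter port.

```python
import json
from typing import Dict, List


def build_file_sd(hosts: Dict[str, List[str]], port: int = 9100) -> List[dict]:
    """Group hosts by environment into file-based discovery target groups."""
    groups = []
    for env, names in sorted(hosts.items()):
        groups.append({
            "targets": [f"{name}:{port}" for name in names],
            "labels": {"env": env},
        })
    return groups


# Hypothetical inventory; in practice it could come from a CMDB or a cloud API.
inventory = {
    "prod": ["app-1.example.com", "app-2.example.com"],
    "staging": ["app-s1.example.com"],
}

# JSON document that a scrape job could reference from file_sd_configs.
file_sd_json = json.dumps(build_file_sd(inventory), indent=2)
```

Writing `file_sd_json` to a file and regenerating it whenever the inventory changes keeps the target list current without editing the main configuration file.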

Prometheus supports a long list of exporters and also allows custom exporters to be written if required. Telegraf plugins can likewise be written when the provided plugins are not sufficient. In Icinga2, checks can be written using Nagios check commands[6].
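
To give a feel for what writing an exporter involves, below is a minimal sketch that serves two gauges in the Prometheus text exposition format using only the Python standard library. The metric names (`myapp_queue_depth`, `myapp_workers_busy`), their values and the port are hypothetical. A real exporter would more likely use one of the official client libraries rather than building the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(queue_depth: int, workers_busy: int) -> str:
    """Render two gauges in the Prometheus text exposition format."""
    lines = [
        "# HELP myapp_queue_depth Number of jobs waiting in the queue.",
        "# TYPE myapp_queue_depth gauge",
        f"myapp_queue_depth {queue_depth}",
        "# HELP myapp_workers_busy Number of workers currently busy.",
        "# TYPE myapp_workers_busy gauge",
        f"myapp_workers_busy {workers_busy}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Placeholder values; a real exporter would query the monitored system.
        body = render_metrics(queue_depth=3, workers_busy=1).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve /metrics until interrupted (port 8000 is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint, and a Prometheus scrape job pointed at that host and port would then collect the two gauges on each scrape.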

Based on the above comparison, it was decided to use the stack below as the metrics toolkit. It combines solutions from across the compared stacks to achieve the best outcome. The PagerDuty trial was chosen as the alerting solution because PagerDuty is already used by many companies for incident management.

  • Prometheus as metric collector
  • InfluxDB as metric storage solution
  • Grafana as metrics visualization solution
  • PagerDuty as alerting solution

The following is the basic architecture of the metric monitoring toolkit.

Figure: Architecture of the metric monitoring toolkit

Prometheus exporters are responsible for exposing existing metrics from third-party systems. They are lightweight processes, typically written in Go. The Prometheus server follows the pull model for metrics collection and is designed for reliability. Prometheus can be easily integrated with many supporting tools.

InfluxDB is a time series database (TSDB) that provides a SQL-like query language. In InfluxDB, a metric corresponds to a measurement in the database. Grafana is a widely used visualization tool that supports a wide range of data sources, including InfluxDB. Grafana also includes a built-in query parser and provides basic alert configuration. PagerDuty allows incident classification and supports multi-channel notifications.
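
To illustrate how a metric maps to a measurement, here is a small sketch that formats one sample in InfluxDB's line protocol: the measurement name, optional tags, fields, and a timestamp. The helper `to_line_protocol`, the measurement `cpu_load` and the tag values are hypothetical. The sketch assumes float field values and names without spaces or commas, so it skips the escaping a real client would need.

```python
from typing import Dict


def to_line_protocol(measurement: str, tags: Dict[str, str],
                     fields: Dict[str, float], timestamp_ns: int) -> str:
    """Format one point as: measurement,tag=value field=value timestamp."""
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={value}" for key, value in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical sample for a metric named cpu_load on one host.
line = to_line_protocol(
    "cpu_load",
    {"host": "node-1", "env": "prod"},
    {"value": 0.64},
    1545609600000000000,
)
```

Once stored, such a measurement can be queried with the SQL-like language, for example `SELECT mean("value") FROM "cpu_load" WHERE time > now() - 1h GROUP BY time(5m)`.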

I will be discussing the implementation in detail in my next post. Stay tuned :)

Update:

Next post at https://medium.com/the-devops-journey/observability-monitoring-part-04-8742a06caff4

References

[1] https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage

[2] https://www.influxdata.com/time-series-platform/influxdb/

[3] https://graphite.readthedocs.io/en/latest/whisper.html

[4] https://www.icinga.com/2017/04/26/automated-monitoring-icinga-meets-foreman/

[5] https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration-file

[6] https://www.nagios.org/projects/nagios-plugins/
