Monitoring in Docker Stacks - It’s that easy with Prometheus!

Soumyadip De
Jun 17, 2017


Until that comes to the rescue… let some open-source tools save you!

You might have heard about Docker and how it is revolutionizing every corner of the software industry, be it the SDLC process or microservices architecture. Having a self-sufficient container for each of your logically independent modules makes your software resilient to traffic spikes and emergency software patches.

But as life has taught us already, everything comes at a cost! Having multiple ephemeral containers, created and destroyed as load demands, makes it very hard to keep track of and monitor applications. Monitoring has been a major concern for containers from the start, which has caused many solutions to come forward in the recent past.

There are a few paid solutions available for container monitoring, one of the most popular being Datadog. But as we at OeCloud.io believe in open source for quick access to innovation, we looked for solutions in the open-source communities and found Prometheus, backed by the Cloud Native Computing Foundation, to be a good fit for our needs.

Now that the Docker daemon also supports exporting metrics to Prometheus in its latest version (link), it is only a matter of time before most of the community adopts Prometheus.
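
As a minimal sketch (the exact file location depends on your platform, and the endpoint is still marked experimental in Docker engines of this era), the daemon metrics can be enabled via /etc/docker/daemon.json, serving on port 9323, which is the same port the scrape configuration later in this article targets:

{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}

After restarting the daemon, curl http://localhost:9323/metrics should return engine metrics.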

Now, what made Docker choose Prometheus over other open-source alternatives comes down to how Prometheus collects, stores, and represents metrics.

Features:

  • a multi-dimensional data model (time series identified by metric name and key/value pairs)
  • a flexible query language to leverage this dimensionality
  • no reliance on distributed storage; single server nodes are autonomous
  • time series collection happens via a pull model over HTTP
  • pushing time series is supported via an intermediary gateway
  • targets are discovered via service discovery or static configuration
  • multiple modes of graphing and dashboarding support
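
To give a feel for that query language and data model, here is a small example, assuming the node-exporter metric node_cpu introduced below (metric names vary between exporter versions). It sums the non-idle CPU rate per server, filtering and aggregating purely on labels:

sum by (instance) (rate(node_cpu{mode!="idle"}[5m]))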

Components:

Prometheus Architecture (Yes, copy-pasted from official docs!)

Now that the introduction to our friend Prometheus is settled, let me guide you through each of its components so you get to know it better.

  1. Exporters: The Prometheus server takes the help of various exporters to collect metrics (a pull mechanism). In the case of a Docker stack/swarm, we need the following kinds of exporters:
    I. Node-exporter: This is used to collect node/server related information such as server CPU/memory/storage/network utilization. It is supported by the Prometheus community. (link)
    II. Container-exporter: Now this is our saviour for monitoring Docker containers. At the time of writing this article, metrics from the Docker engine are not detailed enough, so we use the Google-supported cAdvisor for container metrics. It is well detailed, and each container's CPU/memory/storage/network utilization can be collected from it. (link)
    III. Custom-exporter: You can use custom exporters for the various components in your ecosystem, like HAProxy, Apache, Elasticsearch etc. You can find an exporter for your specific tool if someone has already built one, or build your own following this guide.
  2. Prometheus Server: The heart of the Prometheus monitoring system, which pulls metrics from exporters and stores them in a sophisticated local storage subsystem with powerful querying support. It has a simple UI where you can check the status of each exporter endpoint and test queries. (link)
  3. Grafana: Grafana is a visualization component that supports querying Prometheus and shows various graphs depending on the query. It recently gained an alerting feature, which can also come in handy, but there is a separate component in Prometheus to handle alerts. (link)
  4. Alertmanager: This is the component to which the Prometheus server pushes alerts pre-configured via rule files. Alertmanager handles routing alerts to the intended recipients, groups duplicate alerts, and suppresses alerts based on conditions. (link)

Configuration:

Now comes the part where you'll be guided on how to befriend Prometheus for Docker stack monitoring.

1. Node-exporter: You can run a global Docker service of node-exporter, exposing port 9100 of the container.
docker service create --name node-exporter --mode global -p 9100:9100 prom/node-exporter
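
Once the service is up, a quick sanity check (using one of the placeholder host IPs that appear in the scrape configuration later in this article) is to fetch the metrics endpoint directly:

curl -s http://10.X.X.1:9100/metrics | head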

2. cAdvisor: cAdvisor is available as a Docker image and can be run as a Docker service.

docker service create --name cAdvisor --mode global -p 8080:8080 google/cadvisor
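
For completeness, note that the plain service command above does not mount any host paths, whereas cAdvisor's own quick-start runs the container with several host directories mounted, roughly like this (shown as a plain docker run; adapting the mounts to docker service create is left to you):

docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor:latest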

But when we tried to implement it, we faced a major issue running cAdvisor as a container, which forced us to install cAdvisor as an application on the host machine and run it as a service (guide).
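
A minimal sketch of running it that way, assuming the cadvisor binary has been placed at /usr/local/bin/cadvisor (the path and unit name are our own choices here, not from the linked guide), is a systemd unit such as:

[Unit]
Description=cAdvisor container metrics exporter
After=docker.service

[Service]
ExecStart=/usr/local/bin/cadvisor -port=8080
Restart=always

[Install]
WantedBy=multi-user.target

Save it as /etc/systemd/system/cadvisor.service and start it with systemctl enable --now cadvisor.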

3. Alertmanager: Alertmanager can run as a Docker service in replicated mode, exposing port 9093. Alertmanager needs a configuration file, which can be written following this guide (a minimal sketch is shown below). The commands that follow assume you have alertmanager.conf in the current directory.
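
A minimal sketch of such a file, with a single e-mail receiver (the SMTP host and addresses here are placeholders, not values from our setup):

global:
  smtp_smarthost: 'smtp.example.org:587'
  smtp_from: 'alertmanager@example.org'

route:
  receiver: 'ops-team'
  group_by: ['alertname', 'instance']
  group_wait: 30s
  repeat_interval: 4h

receivers:
- name: 'ops-team'
  email_configs:
  - to: 'ops@example.org'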

First, create an overlay network for communication:

docker network create -d overlay prometheus_network

Then create the service:

docker service create \
--name alertmanager \
-p 9093:9093 \
--network prometheus_network \
--mount type=bind,source=$(pwd)/alertmanager.conf,target=/config/alertmanager.conf \
prom/alertmanager:v0.6.2 -config.file=/config/alertmanager.conf

4. Prometheus server: You need to pass a configuration file to the Prometheus server, like this:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert.rules"

scrape_configs:
  - job_name: docker-engine-metrics
    static_configs:
      - targets: ['10.X.X.1:9323','10.X.X.2:9323']
  - job_name: container-metrics
    static_configs:
      - targets: ['10.X.X.1:8080','10.X.X.2:8080']
  - job_name: node-metrics
    static_configs:
      - targets: ['10.X.X.1:9100','10.X.X.2:9100']

Alerting rules also need to be added in a file, here named alert.rules. A sample alert.rules file can look like this:

# Alert for any instance that is unreachable for >5 minutes.
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }

Now run the Prometheus server, assuming the configuration file is at ./prometheus/prometheus.conf (keep alert.rules in the same ./prometheus directory so it is mounted into the container as well):

docker service create \
--name cep-prometheus-server \
--network prometheus_network \
--mount target=/config,source=$(pwd)/prometheus,type=bind \
prom/prometheus:v1.6.3 \
-config.file=/config/prometheus.conf \
-alertmanager.url=http://alertmanager:9093

Prometheus server targets screen (No, it is not supposed to be down!)

5. Grafana: Grafana also comes as a Docker image and can be run as a Docker service.

docker service create --name grafana-server grafana/grafana:4.2.0
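
A variant we find convenient (a suggestion, not the exact command from our setup): attaching Grafana to the same overlay network and publishing its default port 3000 lets you reach the UI on the host and add Prometheus as a data source by service name, e.g. http://cep-prometheus-server:9090.

docker service create \
--name grafana-server \
--network prometheus_network \
-p 3000:3000 \
grafana/grafana:4.2.0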

A guide to setting up dashboards can be found at this link. Pre-built dashboards can be found at this link.

Grafana dashboard (Yes, we are under-utilizing our servers by miles!)

For this guide, the dashboard being used is this one. To group containers based on stack/service, you can use the Docker labels on the containers. For instance, the network output of containers can be grouped by service names containing “your-app” with the following query:

sort_desc(sum by (container_label_com_docker_swarm_service_name) (rate(container_network_transmit_bytes_total{image!="",container_label_com_docker_swarm_service_name=~".*your-app.*"}[1m] ) ))

Here, ‘Metric lookup’ would be container_network_transmit_bytes_total and ‘Legend format’ would be {{ container_label_com_docker_swarm_service_name }}.

If you want to group by stack namespace instead, that can be done by replacing the service-name filter with container_label_com_docker_stack_namespace.
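
For example, the same query grouped by stack namespace would look like this, with “your-stack” standing in for your actual stack name:

sort_desc(sum by (container_label_com_docker_stack_namespace) (rate(container_network_transmit_bytes_total{image!="",container_label_com_docker_stack_namespace=~".*your-stack.*"}[1m])))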

Conclusion:

Prometheus has been serving us nicely for some time now, and you may need to go through the Prometheus docs to tweak it to your needs, for example the metrics retention period or persistence of data.
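
As one sketch of that (assuming Prometheus 1.x as used above; 720h and the prometheus_data volume name are illustrative values of ours), retention is controlled by the -storage.local.retention flag and persistence by mounting a volume over the image's /prometheus data directory, so the server command could be extended roughly like this:

docker service create \
--name cep-prometheus-server \
--network prometheus_network \
--mount target=/config,source=$(pwd)/prometheus,type=bind \
--mount target=/prometheus,source=prometheus_data,type=volume \
prom/prometheus:v1.6.3 \
-config.file=/config/prometheus.conf \
-alertmanager.url=http://alertmanager:9093 \
-storage.local.retention=720h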

Some may not agree with the heading containing “…that easy!” after following this guide. For them, I would suggest looking at our open-source contribution on GitHub, which automates Docker swarm cluster setup along with the monitoring and logging components. The automation steps for monitoring include:

  1. Generating configuration file(s) from host IPs and installing them
  2. Opening the necessary firewall ports on each of the hosts
  3. Deploying the necessary components for monitoring
  4. Copying pre-built dashboards into Grafana for quick setup
  5. Providing URLs to access Grafana/Prometheus/Alertmanager (like https://grafana.your-domain.com) using an HAProxy router

So you'll be able to get things up and running in no time and focus more on your application rather than on infrastructure setup!

You can connect with me on my LinkedIn profile. Happy Monitoring!
