Observability & monitoring — Part 04

Fathima Dilhasha · Published in The DevOps Journey · Dec 29, 2018

This post is a continuation of my series of posts on Observability & monitoring.

You can read my previous posts, Observability & monitoring — Part 01, Observability & monitoring — Part 02 and Observability & monitoring — Part 03, to keep up with the discussion. The series is broken into multiple parts on the assumption that it eases the reading & understanding process. It sure helps the writing process ;)

As promised, we will discuss the implementation of the metrics monitoring toolkit. Since this is a PoC implementation, I decided to limit the monitoring to a minimal set of metrics. If you need to extend this solution, it is just a matter of adding more exporters to expose additional metrics.

Moreover, this PoC implements a monitoring toolkit that monitors an AWS deployment of Ubuntu servers hosting a WSO2 server cluster. Following is the deployment diagram. However, the implementation is pluggable with any other deployment; you might only need to change a few metrics exporters to collect metrics specific to your system.

Deployment in AWS

Now, we'll discuss each tool, covering the life cycle of a metric. Let's keep the diagram below for reference.

Basic Architecture with tools

Metrics Exposure — Prometheus Exporters

Following is the set of initial metrics this monitoring toolkit covers, along with the Prometheus exporters used. Since these are a set of common metrics, the readily available exporters[1] were able to cater to the requirement.

https://gist.github.com/Dilhasha/e609c441efc634f9e1bf62e71d9ed670

Node Exporter

Metrics covered by node exporter

The node exporter[2] exports metrics for monitoring the status of a node. Metrics for load average, memory usage and the total number of running processes are readily available.

The node exporter also supports a text file collector. The text file collector can export metrics written to files in a pre-configured directory.

In our PoC, the login test is done using the text file collector. There is a component which can perform a login test against the WSO2 server running on a node. In order to append the login metrics to a text file every minute, a cron job is deployed on the Ubuntu servers; it invokes the login component and writes the result to a text file in the node exporter's text file collector directory.

If you are using any service other than WSO2, you might want to design the login test following the approach described above (a minimal sketch follows).
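A minimal sketch of this setup, assuming the text file collector directory is /var/lib/node_exporter/textfile_collector and a hypothetical login_test.sh script that performs the login (the paths, script name and metric name are illustrative assumptions):

# Start the node exporter with a text file collector directory
./node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile_collector

# Crontab entry: run the login test every minute and replace the metrics file atomically
* * * * * /opt/monitoring/login_test.sh > /tmp/login.prom && mv /tmp/login.prom /var/lib/node_exporter/textfile_collector/login.prom

# Example contents of login.prom picked up by the text file collector
# HELP wso2_login_success Result of the scripted login test (1 = success, 0 = failure)
# TYPE wso2_login_success gauge
wso2_login_success 1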

JMX Exporter

Metrics covered by JMX exporter

WSO2 servers are Java-based servers and have JMX enabled by default. The JMX exporter[3] is a Java agent which can be added to the JAVA_OPTS of the server. In the case of WSO2 servers, you can add it to wso2server.sh as follows.

-javaagent:$Jmx_exporter_agent_path=<port_to_expose_metrics>:$Jmx_exporter_config_path \

You can refer to the post[4] by one of my colleagues to learn the specifics of configuring this exporter for WSO2 servers.
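For completeness, the agent also needs the configuration file referenced by $Jmx_exporter_config_path. A minimal sketch that exposes all MBeans with lowercased metric names (in practice you would narrow the rules down to the MBeans you care about):

lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: ".*"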

Blackbox Exporter

Metrics covered by Blackbox exporter

The Blackbox exporter[5] can be used to probe endpoints over protocols such as TCP and HTTP(S) and export the success or failure of the probe as a metric. In this case, we are using the Blackbox exporter to check the accessibility of port 9443, which is the default port on which WSO2 servers run. The Blackbox exporter's own configuration only defines the probe modules; the details of which targets to probe are added in the Prometheus configuration.

Following is the Blackbox exporter configuration required for our use case.

modules:
  tcp_connect:
    prober: tcp
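On the Prometheus side, a corresponding scrape job tells the Blackbox exporter which targets to probe with the tcp_connect module. A hedged sketch, assuming the Blackbox exporter itself runs on localhost:9115 and using placeholder host names for the WSO2 nodes (the relabelling follows the exporter's standard pattern):

- job_name: 'Blackbox Exporter'
  metrics_path: /probe
  params:
    module: [tcp_connect]
  static_configs:
    - targets: ['wso2-node-1:9443', 'wso2-node-2:9443']
  relabel_configs:
    # the scraped address becomes the probe target...
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    # ...while the actual scrape request goes to the Blackbox exporter
    - target_label: __address__
      replacement: localhost:9115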

MySqld Exporter

Metrics covered by Mysqld exporter

Prometheus's mysqld exporter[6] can monitor MySQL server health based on many metrics. In our case, we only use the metric obtained by performing a simple select query against a particular MySQL server. Connecting to your databases requires credentials, which can be passed as environment variables or via a hidden configuration file. Prometheus also supports exporters for many other database solutions[7].
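A minimal sketch of passing those credentials, assuming a dedicated, limited-privilege exporter user already exists on the MySQL server (the user name, password and host are placeholders):

# Option 1: credentials via an environment variable
export DATA_SOURCE_NAME='exporter:<password>@(mysql-host:3306)/'
./mysqld_exporter

# Option 2: credentials via a hidden .my.cnf file containing a [client] section
./mysqld_exporter --config.my-cnf=/home/ubuntu/.my.cnf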

Metrics Collection — Prometheus Server

The Prometheus server requires configuration for discovering the targets that need to be scraped.

Scrape configs

In order to scrape the metrics exposed by the exporters in the previous section, we need configurations specifying different jobs (tasks) for metric collection. Following is a sample job configuration in the Prometheus config file for the node exporter.

- job_name: 'Node Exporter'
  ec2_sd_configs:
    - region: "eu-central-1"
      profile: "DemoISSetup"
      port: 9100

To discover nodes in a given AWS deployment, many properties[8] such as region, availability zone, instance type or instance tag can be used. AWS API keys or a named AWS profile can be used to authenticate when discovering targets.

In our example, the region property and a named profile are used. When this configuration is present in Prometheus, it will scrape metrics from port 9100 of all the nodes in the AWS 'eu-central-1' region.

Prometheus also supports target discovery for many[9] other widely used deployments such as Azure, Kubernetes and OpenStack. For generic cases it also supports file-based and static target discovery.
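For example, a job that scrapes a fixed list of node exporters would look like this (the addresses are placeholders):

- job_name: 'Static Node Exporters'
  static_configs:
    - targets: ['10.0.1.10:9100', '10.0.1.11:9100']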

Tag based configs

Prometheus also supports tag based[10] target selection. That is, based on the instance tags in AWS, you can either keep an instance in or drop it from monitoring. See the examples below.

relabel configs
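A hedged sketch of such relabel configs, assuming the instances carry a 'Monitoring' tag (the tag name and value are assumptions; EC2 instance tags are exposed to Prometheus as __meta_ec2_tag_<tagkey> labels):

relabel_configs:
  # keep only instances whose Monitoring tag is set to "enabled"
  - source_labels: [__meta_ec2_tag_Monitoring]
    regex: enabled
    action: keep
  # use the Name tag as the instance label for readability
  - source_labels: [__meta_ec2_tag_Name]
    target_label: instance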

Metrics Storage — InfluxDB

You need to create a database in InfluxDB[11] to store the metrics collected by Prometheus. Then, the Prometheus server needs to be configured to read from and write to the remote storage.
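A minimal sketch of creating the database using the influx CLI (the database name matches the one used in the URLs below):

influx -execute 'CREATE DATABASE prometheus'

With the database in place, InfluxDB's Prometheus read and write endpoints are configured in the Prometheus config as follows.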

remote_write:
  - url: "http://localhost:8086/api/v1/prom/write?db=prometheus"

remote_read:
  - url: "http://localhost:8086/api/v1/prom/read?db=prometheus"

Metrics Visualization — Grafana

As the first step, we need to add InfluxDB as a data source in Grafana as follows.
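This can be done through the Grafana UI; alternatively, Grafana can provision the data source from a YAML file. A hedged sketch, assuming a default Grafana provisioning directory and the InfluxDB setup from the previous section:

# conf/provisioning/datasources/influxdb.yaml
apiVersion: 1
datasources:
  - name: InfluxDB
    type: influxdb
    access: proxy
    url: http://localhost:8086
    database: prometheus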

Then, based on the metrics from the above data source, we can create dashboards to monitor the various metrics of the monitored system, with a separate panel for each metric we need to monitor.

One of the panels in Grafana dashboard
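With InfluxDB's Prometheus remote write support, each metric is stored as its own measurement with the sample in a value field and labels as tags, so a panel query for, say, the one-minute load average could look roughly like this (treat the exact measurement and field names as assumptions to verify against your data):

SELECT mean("value") FROM "node_load1" WHERE $timeFilter GROUP BY time($__interval), "instance"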

Alerting — Pagerduty

We need to create a service in Pagerduty as follows, so that notifications are sent to the on-call person(s) via multiple channels.

Then, we need an integration key to identify the service.

For the PoC, I have used the alert configurations built into Grafana to check threshold levels and trigger alerts to Pagerduty. So, when alert rules are configured on the panels, Grafana will evaluate the metric values and send notifications to Pagerduty. For this, we need to configure Pagerduty as a notification channel in Grafana, adding the integration key we obtained from the Pagerduty service so that alerts are routed to that service.

Notification channels in Grafana
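If you prefer configuration over clicking through the UI, recent Grafana versions can also provision notification channels from a YAML file. A hedged sketch, assuming that feature is available in your Grafana version (the uid and the integration key are placeholders):

# conf/provisioning/notifiers/pagerduty.yaml
apiVersion: 1
notifiers:
  - name: PagerDuty
    type: pagerduty
    uid: pagerduty-1
    org_id: 1
    is_default: true
    settings:
      integrationKey: <integration-key-from-pagerduty>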

With these configurations in place, we have our basic monitoring toolkit ready. If your organization already uses other tools for any part of the metrics life cycle, this toolkit lets you plug them in without much effort.

I will be discussing more tips and tricks on improving this toolkit, as well as automating the effort, to get the best out of this solution. Stay tuned :)

References

[1] https://prometheus.io/docs/instrumenting/exporters/

[2] https://github.com/prometheus/node_exporter

[3] https://github.com/prometheus/jmx_exporter

[4] https://medium.com/@lashan/monitoring-wso2-products-with-prometheus-4ace34759901

[5] https://github.com/prometheus/blackbox_exporter

[6] https://github.com/prometheus/mysqld_exporter

[7] https://prometheus.io/docs/instrumenting/exporters/#databases

[8] https://prometheus.io/docs/prometheus/latest/configuration/configuration/#ec2_sd_config

[9] https://prometheus.io/docs/prometheus/latest/configuration/configuration/

[10] https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_config

[11] https://docs.influxdata.com/influxdb/v1.6/supported_protocols/prometheus
