Monitoring a server cluster using Grafana and InfluxDB

Six billion requests per month. Does that number sound large to you? It is the total number of requests served each month by a website that every developer knows: Stack Overflow.

Two HAProxy servers and nine web servers are the building blocks of Stack Overflow, implementing a well-known and robust architecture: the high-availability server cluster.

On a smaller scale, many companies implement such architectures to keep a service available even when hardware fails. But when a server shuts down, how can you be notified quickly enough to bring it back?

Today, we are going to build a complete Grafana dashboard monitoring a server cluster using InfluxDB and Telegraf.

The final dashboard we are going to build

I — About High-Availability Clusters

Before jumping into the development of our dashboard, let’s have a quick word on high-availability clusters: what they are used for and why we care about monitoring them. (If you’re an expert in enterprise architecture, you can skip this part.)

A high-availability cluster is a group of servers designed and assembled in a way that provides permanent availability and redundancy of the services provided.

Let’s say that you are building a simple web application. On launch, you get about a thousand page views per day, a load that any decent HTTP server can handle without trouble. Suddenly, your traffic skyrockets to a million page views per day. Your single HTTP server can no longer handle the load by itself and needs additional resources.

A solution to this problem is to implement an HA cluster architecture: several identical servers (the nodes) sitting behind a load balancer.

When an HTTP request is received, the load balancer proxies (i.e. forwards) it to one of the nodes so that the load is spread evenly across the cluster. Load balancers can use different techniques to decide which node receives a request. In our case, we will use an unweighted round-robin configuration: the first request goes to the first server, the next one to the second, and so on, wrapping back to the first server after the last. No node is preferred over another.
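To make the round-robin idea concrete, here is a minimal sketch of the selection logic in JavaScript. It is only an illustration: in this tutorial NGINX does the actual balancing, and the function name and the addresses below are assumptions chosen to match the ports used later in the article.

```javascript
// Minimal sketch of unweighted round-robin selection (illustration only;
// NGINX implements the real thing in this tutorial).
function createRoundRobin(nodes) {
  if (nodes.length === 0) {
    throw new Error("at least one node is required");
  }
  let next = 0; // index of the node that receives the next request
  return function pick() {
    const node = nodes[next];
    next = (next + 1) % nodes.length; // wrap around after the last node
    return node;
  };
}

// Hypothetical node addresses, matching the ports used later in this article.
const pick = createRoundRobin([
  "127.0.0.1:5000",
  "127.0.0.1:5001",
  "127.0.0.1:5002",
]);

// Four consecutive requests: the fourth goes back to the first node.
const order = [pick(), pick(), pick(), pick()];
console.log(order);
```

Each call to pick() returns the next node in turn, so no node receives two requests in a row unless it is the only one in the list.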

Now that we have defined all the technical terms, we can start to implement our monitoring system.

II — Choosing The Right Stack

For this project, I will be using a Xubuntu 18.04 machine with a standard kernel.

In order to monitor our server cluster, we need to choose the right tools. For the real-time visualization, we are going to use Grafana v6.1.1.

For storing the metrics, I have chosen InfluxDB 1.7, a reliable datasource for Grafana. In order to get metrics into InfluxDB, I have chosen Telegraf, the plugin-driven server agent created by InfluxData. This tutorial does not cover the installation of these tools, as their respective documentation explains it well enough.

Note: make sure that you are using the same versions as the ones used in this tutorial. These tools change frequently, and newer releases may invalidate parts of this article.

III — Setting Up A Simple HA Cluster

In order to monitor our HA cluster, we are going to build a simple version of it using NGINX v1.14.0 and three Node HTTP server instances. As shown in the diagram above, NGINX will be configured as a load balancer, proxying the requests to our Node instances.

If you already have an HA cluster set up on your infrastructure, feel free to skip this part.

a — Setting NGINX as a load balancer

NGINX is configured to run on port 80 with the default configuration, proxying requests to services located on ports 5000, 5001 and 5002.

NGINX configured as a simple load balancer

Note the /nginx_status part of the server configuration. It is important not to miss it as it will be used by Telegraf to retrieve NGINX metrics later.
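To give an idea of what Telegraf reads from that endpoint, here is a small sketch that parses the plain-text report produced by NGINX's stub_status module. It assumes that /nginx_status is backed by stub_status; the function name and the sample numbers are illustrative and do not come from the cluster built in this article.

```javascript
// Parses the plain-text body served by NGINX's stub_status module.
// Assumption: /nginx_status is backed by stub_status.
function parseStubStatus(body) {
  const active = /Active connections:\s*(\d+)/.exec(body);
  // The three counters sit on their own line: accepts, handled, requests.
  const totals = /\n\s*(\d+)\s+(\d+)\s+(\d+)/.exec(body);
  const states = /Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)/.exec(body);
  if (!active || !totals || !states) {
    throw new Error("unexpected stub_status format");
  }
  return {
    active: Number(active[1]),
    accepts: Number(totals[1]),
    handled: Number(totals[2]),
    requests: Number(totals[3]),
    reading: Number(states[1]),
    writing: Number(states[2]),
    waiting: Number(states[3]),
  };
}

// Illustrative sample in the stub_status layout (made-up numbers).
const sample = [
  "Active connections: 291 ",
  "server accepts handled requests",
  " 16630948 16630948 31070465 ",
  "Reading: 6 Writing: 179 Waiting: 106 ",
].join("\n");

console.log(parseStubStatus(sample));
```

These are the kinds of values (active, waiting, handled, requests) that will show up later in the nginx measurement.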

b — Setting simple Node HTTP instances.

For this part, I used a very simple Node HTTP instance, built with Node’s native http module and the httpdispatcher library (installed from npm).

A simple HTTP server written in Node

This server does not provide any special capabilities but it will be used as a basic web server for NGINX to proxy requests to.
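As a rough idea of what such a server can look like, here is a minimal sketch that relies only on the built-in http module (it does not use httpdispatcher). Reading the port from a PORT environment variable is an assumption made for this sketch, so that the same file can serve all three instances.

```javascript
const http = require("http");

// Request handler kept separate so it can be exercised without opening a socket.
function handleRequest(req, res) {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("Hello from a Node instance\n");
}

// Assumption for this sketch: the port comes from the PORT environment
// variable, and the server only starts listening when it is set,
// e.g. PORT=5000 node server.js
if (process.env.PORT) {
  const port = Number(process.env.PORT);
  http.createServer(handleRequest).listen(port, () => {
    console.log(`Server listening on port ${port}`);
  });
}
```

Assuming the file is saved as server.js (a hypothetical name), an instance could then be started under pm2 with something like PORT=5000 pm2 start server.js --name node-5000, and likewise for ports 5001 and 5002.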

In order to launch three instances of those web servers, I am using pm2, a process manager utility for Node instances on Linux systems.

Now that NGINX is up and ready, let’s launch our three instances by running:

Repeating this for the two other Node servers, we have a cluster of three Node instances up and ready.

Our three nodes are up and running!

IV — Setting Up Telegraf For Monitoring

Now that our HA cluster is built and running, we need to set up Telegraf to bind to the different components of our architecture.

Telegraf will be monitoring our cluster using two different plugins:

  • NGINX plugin: retrieves metrics from the NGINX load balancer, such as the total number of requests and the number of active, waiting or handled connections.
  • HTTP_Response: periodically retrieves the response time of each node, as well as the HTTP status code of the response. This plugin will be very useful to monitor response-time peaks on our nodes as well as node crashes.

Before starting, make sure that Telegraf is running with the following command: sudo systemctl status telegraf. If your service is marked as Active, you are ready to go!

Head to Telegraf’s default configuration location (/etc/telegraf), edit the telegraf.conf file and add the following input configurations to it.

Configuration for the NGINX plugin of Telegraf
Configuration for each individual node

When you’re done modifying the Telegraf configuration, make sure to restart the service for the changes to be taken into account (sudo systemctl restart telegraf).

Once Telegraf is running, it should start periodically sending metrics to InfluxDB (running on port 8086) in the telegraf database, creating a measurement named after the plugin that produced it (so either nginx or http_response).

If such databases and measurements are not created on your InfluxDB instance, make sure that you don’t have any configuration problems and that the telegraf service is correctly running on the machine.
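One way to check this is to ask InfluxDB directly through its 1.x HTTP /query endpoint. The sketch below only builds the URL for a SHOW MEASUREMENTS query. The helper name is hypothetical, and the host, port and database are the ones assumed in this article (localhost, 8086, telegraf).

```javascript
// Builds a URL for the InfluxDB 1.x /query HTTP endpoint.
// buildQueryUrl is a hypothetical helper written for this article.
function buildQueryUrl(baseUrl, database, influxql) {
  const url = new URL("/query", baseUrl);
  url.searchParams.set("db", database); // database to run the query against
  url.searchParams.set("q", influxql);  // InfluxQL statement, URL-encoded
  return url.toString();
}

const showMeasurements = buildQueryUrl(
  "http://localhost:8086",
  "telegraf",
  "SHOW MEASUREMENTS"
);
console.log(showMeasurements);
```

Opening the printed URL with curl or a browser should return a JSON document. If Telegraf is writing correctly, nginx and http_response should be listed among the measurements.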

Now that our different tools are running, let’s have a look at our final architecture before jumping to Grafana.

Current Architecture Using Telegraf & InfluxDB

V — Setting Up Grafana — Finally!

Now that everything is set up on your machine, it’s time to play with Grafana and build the coolest dashboard to monitor our infrastructure.

Given the dashboard presented above, you have three visualization options:

  • An availability diagram showing connections between nodes, their current statuses and response times.
  • Gauges showing response times, or an error message if the service fails.
  • A graph of each node’s response time, with a configurable alert if the service fails.

You can skip to the parts that you judge relevant for your own dashboard.

0 — Prerequisite Steps

Start from a completely blank dashboard, which you can create with the Create Dashboard button in Grafana’s left menu.

a — Availability Diagram

This visualization option shows the connections between the different parts of our architecture, and colors each node according to its status.

It requires a little bit of configuration, as the diagram plugin is not natively available in Grafana, but it should not be too hard to install.

In your command line, run sudo grafana-cli plugins install jdbranham-diagram-panel and restart Grafana. Once the plugin is installed, it should be available in the visualization bar of Grafana.

The diagram panel uses the Mermaid syntax, which we are going to use to depict our architecture. For convenience, I have assembled the configuration in a single picture.

The mermaid syntax for this plugin should be configured as described below

Important note: there is a small parameter that you need to tweak in Grafana to get the desired result. By default, if you define the query as presented above, Grafana takes the last value relative to the time range of the dashboard. This is an issue: if you set your Grafana dashboard to Last 15 minutes with a Refresh every five seconds option, the last known value stays inside that range for 15 minutes, so you will only be notified that your service is down 15 minutes later.

To counter that, you need to find this line in your configuration:

This way, you will get the alert only 20 seconds after the service went down.
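To see why the size of the window matters, here is a small simulation of "last value in the window". It is not Grafana code: the function name and the data are hypothetical, and it assumes that no new response_time values are written while a node is down.

```javascript
// Returns the most recent point inside (now - windowMs, now], or null if none.
function lastValueInWindow(points, now, windowMs) {
  const inWindow = points.filter((p) => p.time > now - windowMs && p.time <= now);
  if (inWindow.length === 0) return null;
  return inWindow.reduce((latest, p) => (p.time > latest.time ? p : latest));
}

// Hypothetical series: one point every 10 s, the node stopped answering 60 s ago.
const now = 600000; // current time in milliseconds
const points = [];
for (let t = 0; t <= now - 60000; t += 10000) {
  points.push({ time: t, responseTime: 0.002 });
}

// With a 15-minute window, a stale value from before the crash is still returned.
const wide = lastValueInWindow(points, now, 15 * 60 * 1000);
// With a 20-second window, nothing is found: the panel can show the node as down.
const narrow = lastValueInWindow(points, now, 20 * 1000);
console.log(wide, narrow);
```

With the wide window, the panel keeps displaying a healthy-looking value long after the node died. With the narrow one, the absence of data becomes visible almost immediately.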

And we are done with the availability diagram! Let’s have a look at the diagram if we stop one of the services by running pm2 stop server.

Quite what we expected! Unfortunately, the diagram panel does not provide any alerting system, but we will be able to send alerts via the graph panel.

b — Gauge Panel

The gauge panel is a new panel available since Grafana v6.0, and is configured as such:

The gauge panel is bound to the response_time field, and thresholds are configured to notify us when a node is taking too much time to respond. Alerts are not yet configurable on the gauge panel.

c — Graph Panel

The graph panel shows response times and is quite similar to the gauge panel shown above. The major difference is that it lets you configure alerts, for example when a value is no longer available.

All the widgets presented above can be tweaked to adjust to your needs, and they should be! Their configuration reflects the very specific case of this HA cluster, and yours may differ if you are using HAProxy or a different load balancing technique.

VI — Conclusion

In 2019, monitoring is a need that no engineer can ignore anymore. There is a growing demand for monitoring solutions, but also for tools that consolidate monitoring agents, visualization tools and post-analysis tools.

Some great tools are coming for DevOps: InfluxDB is already migrating to version 2.0 with the Flux language, and things are looking very promising.

Nonetheless, InfluxData is not the only time series database provider on the market: Amazon definitely wants its share and announced Amazon Timestream during re:Invent 2018. Only some features have been unveiled so far, but it could represent a very solid alternative to InfluxCloud.

Time will tell!

I hope that you had a great time reading and building this cluster monitoring dashboard with me. If you want specific topics to be covered in the future, feel free to let me know.

Until next time.

Kindly,

Antoine Solnichkin.