Data Center Network Infrastructure Monitoring with TIG Stack

Published in Trendyol Tech, Apr 16, 2020

In modern data center environments, providing a stable and reliable network is the key to successful infrastructure management. But how can you know that your network is steady and that all your devices are running with, ideally, 99.9% uptime? The answer is monitoring. It’s the first thing you build right after completing the configurations.

A successful monitoring system should warn you when something is about to go wrong, long before the problem itself arises.

To create an efficient monitoring system, we’ll look for answers to these questions:

How many devices do we have, and what kinds do we monitor?
Which metrics do we need to monitor?
Which monitoring platform is suitable for our infrastructure?
How do we install and configure the selected platform?

How many devices do we have, and what kinds do we monitor?
Since we’re building the monitoring for the underlay part of the infrastructure, our main focus is on the switches, firewalls, packet brokers and bypass units. The website’s uptime, response time, packet loss and DNS query results are important as well.

Which metrics do we need to monitor?
There is a lot of information in a device, and some of it is essential to keep track of. As you might guess, we start the list with the device’s physical components and states:
- CPU, Memory, Disk utilizations,
- System Uptime,
- Fan,
- Temperature,
- Power Supply and
- ASICs (manufactured by the vendor for specific purposes; they vary from device to device)
All connected ports should be monitored for:
- Throughput (both bps and pps values)
- MTU size
- Packet Drop/Error
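
Devices generally expose throughput as ever-increasing octet and packet counters rather than as ready-made rates, so bps and pps are derived from the difference between two polls. A minimal sketch of that arithmetic, assuming 64-bit counters that do not wrap between polls; the function names and sample values are illustrative and not taken from our setup:

```shell
#!/usr/bin/env bash
# Convert two samples of a cumulative octet counter into bits per second.
# Arguments: previous counter value, current counter value, seconds between polls.
rate_bps() {
  echo $(( ($2 - $1) * 8 / $3 ))
}

# Convert two samples of a cumulative packet counter into packets per second.
# Arguments are the same as for rate_bps.
rate_pps() {
  echo $(( ($2 - $1) / $3 ))
}

# Example: 375,000,000 octets and 300,000 packets seen during a 300-second interval.
rate_bps 1000000000 1375000000 300   # prints 10000000 (10 Mbps)
rate_pps 5000000 5300000 300         # prints 1000
```

When the raw counters are stored in InfluxDB, the same calculation is usually left to the query layer, for instance with InfluxQL's non_negative_derivative() function.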

Which monitoring platform is suitable for our infrastructure?
In the golden age of technology there are numerous choices for a single need, with both paid and open-source options, and selecting the right platform for monitoring is no different. In a sea of countless software products, we’re looking for one that can work with multiple databases and sources simultaneously, is flexible and scalable, and supports both agent-based and agentless monitoring. That steered us to the widely known and used TIG Stack.

Our goal is to centralize all network monitoring on a single platform, which lets us troubleshoot and view the metrics in a much faster and more organized way.

What is TIG Stack?
Telegraf - Plugin driven metric collector
InfluxDB - Time series database
Grafana - Open-source dashboard builder

How do we install and configure the selected platform?
We need a virtual machine running CentOS 8. We install the platform starting with InfluxDB, then the Telegraf agent, and Grafana last. Note that the installation and configuration require root privileges.

InfluxDB
Time series databases are specially designed for time-stamped data, which lets users query time series data such as metrics, logs and throughput faster and more effectively.

By default, InfluxDB listens on TCP port 8086 for HTTP API requests (http://localhost:8086).

First of all, we add the InfluxDB repository so that the packages can be installed.

cat <<EOF | sudo tee /etc/yum.repos.d/influxdb.repo
[influxdb]
name = InfluxDB Repository - RHEL \$releasever
baseurl = https://repos.influxdata.com/rhel/\$releasever/\$basearch/stable
enabled = 1
gpgcheck = 1
gpgkey = https://repos.influxdata.com/influxdb.key
EOF

After that, we can install InfluxDB, start the service, check its status, and make sure that it starts automatically when the system is rebooted.

sudo yum install influxdb
sudo systemctl start influxdb
sudo systemctl status influxdb
sudo systemctl enable influxdb.service
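
Once the service is running, a quick way to confirm that the HTTP API answers is the /ping endpoint, which returns status 204 on a healthy InfluxDB 1.x instance. A small sketch, assuming curl is installed; influx_is_up is a hypothetical helper name, not part of InfluxDB:

```shell
#!/usr/bin/env bash
# Returns success (0) when the InfluxDB HTTP API answers /ping with 204.
# $1: base URL of the API; defaults to the local instance on port 8086.
influx_is_up() {
  url="${1:-http://localhost:8086}"
  # -w prints only the HTTP status code; "000" means no connection was made.
  code="$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url/ping" 2>/dev/null || true)"
  [ "$code" = "204" ]
}

if influx_is_up; then
  echo "InfluxDB is answering on port 8086"
else
  echo "InfluxDB is not reachable yet"
fi
```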

Telegraf
Telegraf is plugin-driven, which lets you shape the monitoring platform around your needs and is flexible enough to change course from time to time.

The Telegraf agent uses the same repository as InfluxDB, and since we’ve already defined it, we can move on to the installation, following similar steps.

sudo yum install telegraf
sudo systemctl start telegraf
sudo systemctl status telegraf
sudo systemctl enable telegraf.service

The default configuration file is located at /etc/telegraf/telegraf.conf, and plugins are configured in its Output Plugins and Input Plugins sections. After editing the file, we can use the test feature, which gathers the metrics once and prints them to stdout.

telegraf --config /etc/telegraf/telegraf.conf --test

Configuring InfluxDB Output for Telegraf

####################################################################
#                            OUTPUT PLUGINS                        #
####################################################################

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
urls = ["http://localhost:8086"]
database = "yourdatabasename"
username = "snmp-test"
password = "Password1"
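
The output above assumes that the database and user already exist. One way to create them is with InfluxQL statements run through the influx CLI that ships with InfluxDB 1.x. This is a sketch rather than the exact procedure we followed: influx_setup_statements is a hypothetical helper, and the names simply mirror the placeholder values in the output plugin.

```shell
#!/usr/bin/env bash
# Print the InfluxQL statements that create a database and a user with
# full access to it. $1: database, $2: username, $3: password.
influx_setup_statements() {
  printf 'CREATE DATABASE "%s"\n' "$1"
  printf "CREATE USER \"%s\" WITH PASSWORD '%s'\n" "$2" "$3"
  printf 'GRANT ALL ON "%s" TO "%s"\n' "$1" "$2"
}

# Review the statements first.
influx_setup_statements yourdatabasename snmp-test Password1

# Then run them one by one on the InfluxDB host, for example:
#   influx_setup_statements yourdatabasename snmp-test Password1 |
#     while IFS= read -r stmt; do influx -execute "$stmt" </dev/null; done
```

Keep in mind that InfluxDB 1.x only enforces these credentials when auth-enabled is set to true in the [http] section of its configuration file.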

Configuring SNMP Input for Telegraf

SNMP is a protocol that allows us to get metrics from the monitored devices using OIDs and tables. The SNMP input uses the net-snmp utilities, which should be installed beforehand. When net-snmp is installed, the default MIBs are automatically placed in /usr/share/snmp/mibs, and any other MIB file you integrate should be copied into the same directory.

yum install net-snmp net-snmp-utils
systemctl enable snmpd

In this example, the Telegraf agent polls 192.168.1.1 and 192.168.1.2 using SNMP v2c with the community string snmp-test for hostname output.

####################################################################
#                            INPUT PLUGINS                         #
####################################################################

# Retrieves SNMP values from remote agents
[[inputs.snmp]]
   agents = [ "192.168.1.1:161", "192.168.1.2:161" ]
   timeout = "5s"   # per-request timeout; keep it well below the polling interval
   interval = "300s"
   version = 2
   retries = 3
   community = "snmp-test"
   max_repetitions = 100
   name = "snmp-test"

Grafana
Grafana is the perfect solution for creating meaningful and effective dashboards from your metrics. It serves its GUI on port 3000.
(http://192.168.1.1:3000)

As with InfluxDB and Telegraf, we start by configuring the repository file.

cat <<EOF | sudo tee /etc/yum.repos.d/grafana.repo
[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF

And continue with the installation.

yum install grafana
systemctl start grafana-server
systemctl status grafana-server
systemctl enable grafana-server.service

After a successful installation, you can log in to the Grafana GUI at your server’s address (http://your-server-ip:3000) with the default credentials (username and password are both admin) and follow the steps on the Welcome page.

Creating a Datasource

Clicking the create datasource button takes you to a page where you can connect to your database using predefined datasource templates. We choose InfluxDB.

We name the datasource, enter the IP address of the InfluxDB server with port 8086, and set the access method to Server. Since no certificates were created, we proceed with basic authentication, which requires the username and password for the database you’ve created.

Clicking the Save & Test button should produce the output “Data source is working.”

Building Dashboards

Now we’ve come to the final step of the configuration: building a dashboard to make the most of the metrics you’ve collected. Clicking the build a dashboard button takes you to a blank dashboard with a single empty panel. You can also click the Add Panel button at the top to add new panels.

After clicking Add Query, you’ll see a query panel that asks for a lot of data, but don’t be discouraged, it’s fairly simple. :) In this example we’ll create a panel for monitoring the 15-minute CPU load average metric from one of our switches. We’ll use the singlestat panel type, which I think is the best choice for real-time data.

From -> This is where you define the source of the data. default is the retention policy, xxx-resources is the measurement (InfluxDB’s equivalent of a table) where I stored the metrics of the physical states, and agent_host is the tag that identifies the switch I wanted to monitor.

Select -> Fields are the types of metrics you’ve collected from the devices.
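
Behind the form, the query editor assembles an InfluxQL statement from these choices. A rough sketch of what such a statement could look like, with load15 as a hypothetical field name (the real one depends on how the SNMP field was named in Telegraf) and build_query as an illustrative helper:

```shell
#!/usr/bin/env bash
# Build an InfluxQL query for the latest value of one field from one device.
# $1: measurement, $2: field (hypothetical here), $3: agent_host tag value.
build_query() {
  printf "SELECT last(\"%s\") FROM \"%s\" WHERE \"agent_host\" = '%s' AND time > now() - 15m\n" "$2" "$1" "$3"
}

build_query xxx-resources load15 192.168.1.1

# The same statement can be sent to the InfluxDB 1.x HTTP API, for example:
#   curl -G http://localhost:8086/query \
#     --data-urlencode "db=yourdatabasename" \
#     --data-urlencode "q=$(build_query xxx-resources load15 192.168.1.1)"
```

In Grafana the time range comes from the dashboard's time picker rather than from a fixed now() - 15m window.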