Monitor your NodeJS microservice app with Grafana, InfluxDB and StatsD

While diving into microservice architecture, I realized how important it is to monitor the health of each service, so I decided to write this complete tutorial on the subject.

In the past few days, I came across an amazing free graph and online dashboard builder: Grafana. Not only are the generated dashboards good looking, but the team behind Grafana has done a tremendous job of making the query editor both simple and powerful. Check out the online demo to form your own opinion.

Then a new database name popped up recently: InfluxDB, a brand new NoSQL database specialized in storing time-series data. So basically, InfluxDB will be the data source for Grafana.

Finally, InfluxDB needs to be populated with your data. There are several options, but if you are looking for minimal instrumentation of your source code, StatsD is a good one. StatsD is a basic protocol created by Etsy and implemented by various servers and clients. I have chosen Telegraf as my StatsD server. Telegraf is a data collector service written by InfluxData, the company behind InfluxDB. Both Telegraf and InfluxDB are written in Go, which makes them very efficient and easy to install.

So to summarize, we have:

High level architecture

Deep dive into the installation on Ubuntu 14.04

Now we will go deeper into how to install everything on Ubuntu 14.04. All the tools mentioned have community editions and are open source on GitHub. (Don’t forget to update the URL of each deb file to the latest version.)

InfluxDB installation

cd /tmp
wget https://s3.amazonaws.com/influxdb/influxdb_0.10.0-1_amd64.deb
sudo dpkg -i influxdb_0.10.0-1_amd64.deb
# Start the service
sudo service influxdb start
# Make sure it starts with the OS
sudo update-rc.d influxdb enable

Now you can open the web interface at http://localhost:8083/, but let’s go back to the shell and create a minimal setup with the Influx CLI.

InfluxDB Basic configuration

We will create 3 users (admin, telegraf and grafana) with different scopes, and 1 database with a 6-month retention policy:

influx
CREATE USER admin WITH PASSWORD '<your admin password>' WITH ALL PRIVILEGES
CREATE DATABASE telegraf_database
CREATE RETENTION POLICY six_month_only ON telegraf_database DURATION 26w REPLICATION 1 DEFAULT
# Define a read/write user
CREATE USER telegraf WITH PASSWORD '< define a password>'
GRANT ALL ON telegraf_database TO telegraf
SHOW GRANTS FOR telegraf
# Define a read only user
CREATE USER grafana WITH PASSWORD '< define a password>'
GRANT READ ON telegraf_database TO grafana
SHOW GRANTS FOR grafana
exit

Activate the authentication:

sudo nano /etc/influxdb/influxdb.conf
# set [http] auth-enabled = true
# Restart the service
sudo service influxdb restart

To connect from now on, you need to pass the user name to the CLI. Leave the password blank to get a prompt:

influx --username 'admin' --password ''
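Once authentication is enabled, the same credentials are needed for InfluxDB's HTTP API on port 8086 (the port used for the Grafana data source later in this tutorial). As a minimal sketch, the helper below only builds a /query URL; it assumes the API accepts the user name and password in the `u` and `p` query parameters. The name `buildQueryUrl` and the password value are placeholders of mine, not part of any InfluxDB client.

```javascript
// Build a URL for InfluxDB's HTTP /query endpoint.
// buildQueryUrl is a hypothetical helper, not part of any InfluxDB client.
function buildQueryUrl(baseUrl, { db, user, password, query }) {
  const url = new URL('/query', baseUrl);
  url.searchParams.set('db', db);
  url.searchParams.set('u', user);     // user name (assumed parameter name)
  url.searchParams.set('p', password); // password (assumed parameter name)
  url.searchParams.set('q', query);    // InfluxQL statement
  return url.toString();
}

const exampleUrl = buildQueryUrl('http://localhost:8086', {
  db: 'telegraf_database',
  user: 'grafana',
  password: 'secret', // placeholder, use the password you defined
  query: 'SHOW MEASUREMENTS',
});
console.log(exampleUrl);
```

Requesting that URL with curl or Node's http module should return JSON when the credentials are valid, and should be rejected when they are missing or wrong.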

StatsD through Telegraf

Install the latest version from https://github.com/influxdata/telegraf

cd /tmp
wget http://get.influxdb.org/telegraf/telegraf_0.10.2-1_amd64.deb
sudo dpkg -i telegraf_0.10.2-1_amd64.deb

Now let’s configure it to use StatsD as an input source:

sudo nano /etc/telegraf/telegraf.conf

Set the database, user and password, then add the following input service. It tells Telegraf to listen for StatsD commands on port 8125:

###################################################################
# SERVICE INPUTS #
###################################################################
# Statsd Server
[[inputs.statsd]]
# Address and port to host UDP listener on
service_address = ":8125"
# Delete gauges every interval (default=false)
delete_gauges = true
# Delete counters every interval (default=false)
delete_counters = true
# Delete sets every interval (default=false)
delete_sets = false
# Delete timings & histograms every interval (default=true)
delete_timings = true
# Percentiles to calculate for timing & histogram stats
percentiles = [90]
# convert measurement names, "." to "_" and "-" to "__"
convert_names = false
templates = [
"* measurement.field"
]
# Number of UDP messages allowed to queue up, once filled,
# the statsd server will start dropping packets
allowed_pending_messages = 10000
# Number of timing/histogram values to track per-measurement in the
# calculation of percentiles. Raising this limit increases the accuracy
# of percentiles but also increases the memory usage and cpu time.
percentile_limit = 1000
# UDP packet size for the server to listen for. This will depend on the size
# of the packets that the client is sending, which is usually 1500 bytes.
udp_packet_size = 1500

delete_counters = true and delete_gauges = true are required if you don’t want Telegraf to keep re-sending previous values at every interval (10 seconds by default). I prefer to see only real measurements in Grafana.
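To make the effect of these two flags concrete, here is a toy model of a counter aggregator that is flushed once per interval. It is only an illustration of the behavior described above, not Telegraf's actual code; the class name `CounterAggregator` and its methods are hypothetical.

```javascript
// Toy model of a StatsD aggregator flushed once per interval.
// Illustration only: this is NOT how Telegraf is implemented.
class CounterAggregator {
  constructor(deleteCounters) {
    this.deleteCounters = deleteCounters;
    this.counters = new Map();
  }

  // Called for each incoming "name:value|c" message.
  increment(name, value) {
    this.counters.set(name, (this.counters.get(name) || 0) + value);
  }

  // Called once per interval; returns the points that would be written.
  flush() {
    const points = Array.from(this.counters, ([name, value]) => ({ name, value }));
    if (this.deleteCounters) {
      this.counters.clear(); // forget values so idle intervals write nothing
    }
    return points;
  }
}

const keeping = new CounterAggregator(false);
keeping.increment('service.job_done', 1);
console.log(keeping.flush().length); // first interval: one point
console.log(keeping.flush().length); // idle interval: the old value is sent again

const deleting = new CounterAggregator(true);
deleting.increment('service.job_done', 1);
console.log(deleting.flush().length); // first interval: one point
console.log(deleting.flush().length); // idle interval: nothing is sent
```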

"* measurement.field" is used to parse StatsD variable names correctly for InfluxDB: it lets counters and gauges be grouped by measurement (several fields in the same measurement).
For instance, the following two variables become the fields free and used of a single memory measurement:

  • memory.free
  • memory.used

Have a look at the InfluxDB key concepts for more details.
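The grouping produced by the "measurement.field" template can be pictured with a few lines of JavaScript. This is only a sketch of the resulting data shape, assuming two-part names such as memory.free; the function names and the sample values are made up, and it does not reproduce Telegraf's template parser.

```javascript
// Split a two-part StatsD name such as "memory.free" into
// a measurement ("memory") and a field ("free").
// Assumption: names have exactly two dot-separated parts.
function splitStatName(statName) {
  const [measurement, field] = statName.split('.');
  return { measurement, field };
}

// Group a flat map of StatsD values by measurement.
function groupByMeasurement(stats) {
  const grouped = {};
  for (const [statName, value] of Object.entries(stats)) {
    const { measurement, field } = splitStatName(statName);
    grouped[measurement] = grouped[measurement] || {};
    grouped[measurement][field] = value;
  }
  return grouped;
}

// Sample values are invented for the example.
console.log(groupByMeasurement({ 'memory.free': 512, 'memory.used': 1536 }));
// One "memory" measurement holding the two fields "free" and "used".
```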

Then start the service and make sure it starts automatically on OS startup:

sudo service telegraf start
sudo update-rc.d telegraf enable

Now make sure port 8125 is open and listening:

netstat -lntu | grep 8125

If not, look for errors in the Telegraf log in /var/log/telegraf/
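If you would rather run this check from Node.js, one possible approach is to try binding a UDP socket to the same port: if the bind fails with EADDRINUSE, something (presumably Telegraf) already holds it. This is a rough sketch using only the built-in dgram module; `isUdpPortInUse` is a hypothetical helper name, and the result only tells you that the port is taken, not by which process.

```javascript
const dgram = require('dgram');

// Resolve to true when the UDP port is already bound by another socket.
// isUdpPortInUse is a hypothetical helper built on Node's dgram module.
function isUdpPortInUse(port) {
  return new Promise((resolve, reject) => {
    const probe = dgram.createSocket('udp4');
    probe.once('error', (err) => {
      probe.close();
      if (err.code === 'EADDRINUSE') {
        resolve(true); // someone is already listening on this port
      } else {
        reject(err);
      }
    });
    probe.once('listening', () => {
      // The bind succeeded, so nothing was listening; release the port.
      probe.close(() => resolve(false));
    });
    probe.bind(port);
  });
}

isUdpPortInUse(8125)
  .then((inUse) => console.log(inUse ? 'port 8125 is in use' : 'nothing listens on port 8125'))
  .catch((err) => console.error('check failed:', err.message));
```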

Grafana installation

cd /tmp
wget https://grafanarel.s3.amazonaws.com/builds/grafana_2.6.0_amd64.deb
sudo apt-get install -y adduser libfontconfig
sudo dpkg -i grafana_2.6.0_amd64.deb
sudo update-rc.d grafana-server defaults 95 10
sudo service grafana-server start

Then you can connect to Grafana from your browser:
http://localhost:3000/login

  • Set a password for the admin account
  • Add a data source:

Name: telegraf_influx
Type: InfluxDB 0.9.x
Url: http://localhost:8086
Database: telegraf_database
User: grafana

That’s pretty much it for the installation. You can go a little further by enabling SSL (HTTPS) on Grafana and InfluxDB.

Basic StatsD commands

  • Increment a counter (decrement is not supported by Telegraf):
echo "meas.field:1|c" | nc -C -w 1 -u localhost 8125

Here |c is the type of the value; it stands for counter.

  • Report a value (gauge |g), which can be absolute or relative (+/-):
echo "meas.temp:62|g" | nc -C -w 1 -u localhost 8125
echo "meas.temp:+2|g" | nc -C -w 1 -u localhost 8125
echo "meas.temp:-3|g" | nc -C -w 1 -u localhost 8125
  • InfluxDB supports tags (contextual values associated with a field), for instance the name of the server that reports the value:
echo "meas.temp,server=myServer:62|g" | nc -C -w 1 -u localhost 8125

See all supported types here.
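The nc commands above can also be reproduced from Node.js without any client library, which shows how little there is to the protocol: a line of the form name:value|type sent in a UDP datagram. The sketch below uses only the built-in dgram module; `formatStat` and `sendStat` are names I made up for the example.

```javascript
const dgram = require('dgram');

// Format one StatsD line: "<name>:<value>|<type>".
// type is "c" for a counter or "g" for a gauge.
function formatStat(name, value, type) {
  return `${name}:${value}|${type}`;
}

// Send one StatsD line over UDP, then close the socket.
// Host and port default to the Telegraf listener configured earlier.
function sendStat(line, host = 'localhost', port = 8125) {
  const socket = dgram.createSocket('udp4');
  socket.send(Buffer.from(line), port, host, () => socket.close());
}

console.log(formatStat('meas.field', 1, 'c'));   // prints "meas.field:1|c"
console.log(formatStat('meas.temp', '+2', 'g')); // prints "meas.temp:+2|g"

// Uncomment to actually send a counter increment to Telegraf:
// sendStat(formatStat('meas.field', 1, 'c'));
```

Relative gauge values are passed as strings so that the leading + sign is preserved.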

NodeJS app instrumentation

StatsD is a very basic protocol. I recommend lynx for its simplicity:

npm install lynx --save

In your Node.js files:

var Lynx = require('lynx');
var metrics = new Lynx('localhost', 8125);
metrics.increment('service.job_done');
metrics.gauge('service.queue_size', 100);
metrics.set('service.request_id', 10);
metrics.timing('service.job_task', 500); // time in ms
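In a real service you rarely know the duration up front, so the value passed to metrics.timing usually comes from measuring a task. Here is one way this could look, as a sketch: `timed` is a hypothetical wrapper that works with any object exposing a timing(name, ms) method, such as the lynx instance created above. The example uses a stub client so that it runs without lynx installed.

```javascript
// Run an async task and report its duration in milliseconds.
// `metrics` is any object with a timing(name, ms) method (for example
// the lynx client created above); `timed` itself is a hypothetical helper.
async function timed(metrics, name, task) {
  const start = process.hrtime.bigint();
  try {
    return await task();
  } finally {
    // Report the duration even when the task throws.
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    metrics.timing(name, Math.round(elapsedMs));
  }
}

// Stub client for the example; in your service, pass the lynx instance instead.
const stubMetrics = {
  timing: (name, ms) => console.log(`${name} took ${ms} ms`),
};

timed(stubMetrics, 'service.job_task', () => new Promise((resolve) => setTimeout(resolve, 50)));
```

Reporting inside finally means failed jobs are timed as well; if you only want successful runs, move the metrics.timing call after the await.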

If you want to use tags, simply write:

metrics.increment('service.job_done,server=\'myserver\'');

Note that Telegraf doesn’t like backslashes in the tag value (it returns a parse error).
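Since a stray backslash only shows up later as a parse error on the Telegraf side, it can help to build tagged names through a small function that rejects such values early. The following is a sketch under that assumption; `withTags` is a hypothetical helper, and it writes tag values unquoted, as in the nc example earlier.

```javascript
// Append tags to a metric name: "name,key1=value1,key2=value2".
// withTags is a hypothetical helper; it refuses backslashes in tag values
// because Telegraf reports a parse error for them.
function withTags(name, tags) {
  const parts = Object.entries(tags).map(([key, value]) => {
    const text = String(value);
    if (text.includes('\\')) {
      throw new Error(`backslash not allowed in tag "${key}"`);
    }
    return `${key}=${text}`;
  });
  return [name, ...parts].join(',');
}

console.log(withTags('service.job_done', { server: 'myserver' }));
// prints "service.job_done,server=myserver"
```

The result can be passed straight to the client, for example metrics.increment(withTags('service.job_done', { server: 'myserver' })).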

That’s it!

Written by Jean-Christophe Baey

Entrepreneur, creator of @screenpresso, Software architect at @Groupe_Renault. Passionate about tech, content, design, software & startups.