Monitoring: StatsD, Telegraf and InfluxDB

Abhinav Rai
Nov 4 · 4 min read

Monitoring is very important for any system, both for system-level variables and for the application itself. How many people are logged in to your site at a given time? What is the status of a request, how long is it taking, and did it succeed or fail? How many people clicked a button? Monitoring answers questions like these.

If you can’t measure it, you can’t manage it

Well, why not go with New Relic? (It costs some money, but why not if you have it!)

  1. We are operating at a scale where we cannot rely on third-party products to scale with us, so we need in-house products.
  2. We are strong supporters of open source software. This new monitoring system uses only open source products.
  3. With a large number of microservices in our architecture, third-party apps don’t offer some of the customisations we’re looking for.

What is StatsD?

The term StatsD refers both to the protocol used by the original daemon (first written as a Node.js background process) and to the collection of software and services that implement this protocol. It’s a simple, text-oriented protocol.

A StatsD datagram, which contains a single metric, is made up of 4 things:

  • Bucket: Identifier for the metric. Ex: customer.v1.booking and customer.v2.booking
  • Type: Type of metric to capture (explained below).
  • Value: Value to capture. Its meaning depends on the type.
  • Sample Rate: The fraction of events that is actually sent. With 1 million users, we might not want to send every event; a sample rate of 0.1 means about 10% of them are sent. (Rarely used in practice.)
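
As a minimal sketch of how these four parts fit together on the wire, the commonly used StatsD line format is bucket:value|type, with an optional |@rate suffix for the sample rate. The helper name format_datagram is hypothetical and only for illustration; the type codes (c, ms, g, s) are the usual StatsD abbreviations for counter, timer, gauge and set.

```python
def format_datagram(bucket: str, value, metric_type: str, sample_rate: float = 1.0) -> str:
    """Build one StatsD line: <bucket>:<value>|<type>[|@<sample_rate>].

    format_datagram is a hypothetical helper, not part of any StatsD client.
    """
    line = f"{bucket}:{value}|{metric_type}"
    # The sample rate is only appended when the client is actually sampling.
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line


if __name__ == "__main__":
    # One counter increment for the v1 booking bucket.
    print(format_datagram("customer.v1.booking", 1, "c"))
    # A 320 ms timing for the v2 bucket, sampled at 10%.
    print(format_datagram("customer.v2.booking", 320, "ms", sample_rate=0.1))
```

The sample rate travels with the datagram so that the server can scale sampled values back up when it aggregates them.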

StatsD allows you to capture different types of metrics depending on your needs. The metric types are:

  • Counters: Count occurrences of an event. Counters are often used to determine how frequently an event happens. Example: authentication failures, response codes (2xx, 3xx, 4xx and 5xx).
  • Timers: Measure the amount of time an action took to complete, in milliseconds. Example: response time of APIs
  • Gauges: Arbitrary, persistent numeric values. Once a gauge is set, the StatsD server reports the same value every flush period until it is changed. The flush period is the interval at which aggregated data is sent to the backend (10 seconds by default). The backend can be InfluxDB, Prometheus, etc. Example: memory usage, number of open connections.
  • Sets: Report the number of unique elements that are received in a flush period. The value of a set is a unique identifier for an element you wish to count. Example: number of active users.
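
To make the difference between the four types concrete, here is a toy in-memory aggregator for a single flush period. It is only a sketch under simplifying assumptions, not the real daemon: the class name FlushWindow is hypothetical, lines are assumed to look like bucket:value|type with an optional |@rate suffix, relative gauge updates are ignored, and timers are reduced to count, mean and max only.

```python
from collections import defaultdict


class FlushWindow:
    """Toy aggregator for one flush period (hypothetical, for illustration)."""

    def __init__(self):
        self.counters = defaultdict(float)
        self.timers = defaultdict(list)
        self.gauges = {}  # gauges persist across flushes
        self.sets = defaultdict(set)

    def record(self, line: str) -> None:
        bucket, rest = line.split(":", 1)
        parts = rest.split("|")
        value, metric_type = parts[0], parts[1]
        # An optional third part looks like "@0.1" (the sample rate).
        rate = float(parts[2][1:]) if len(parts) > 2 else 1.0
        if metric_type == "c":
            # Scale sampled counters back up to estimate the true count.
            self.counters[bucket] += float(value) / rate
        elif metric_type == "ms":
            self.timers[bucket].append(float(value))
        elif metric_type == "g":
            self.gauges[bucket] = float(value)
        elif metric_type == "s":
            # Sets keep the raw identifiers; only their count is reported.
            self.sets[bucket].add(value)
        else:
            raise ValueError(f"unknown metric type: {metric_type}")

    def flush(self) -> dict:
        snapshot = {
            "counters": dict(self.counters),
            "timers": {
                b: {"count": len(v), "mean": sum(v) / len(v), "max": max(v)}
                for b, v in self.timers.items()
            },
            "gauges": dict(self.gauges),
            "sets": {b: len(v) for b, v in self.sets.items()},
        }
        # Everything except gauges starts from scratch in the next period.
        self.counters.clear()
        self.timers.clear()
        self.sets.clear()
        return snapshot
```

Calling flush() twice in a row shows the behaviour described above: counters, timers and sets are empty in the second snapshot, while gauges keep reporting their last value.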

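Depending on the use case, you can send metrics over either UDP or TCP. UDP is fire-and-forget: the client does not wait for an acknowledgement, which makes it much faster than TCP, at the cost of possibly losing some datagrams.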
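
A fire-and-forget UDP send can be sketched with nothing but the standard library. The function name send_metric is hypothetical, and the host and port defaults are assumptions (8125 is the port conventionally used for StatsD); adjust them to wherever your server listens.

```python
import socket


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one StatsD line over UDP without waiting for any reply.

    send_metric is a hypothetical helper; host and port are assumed defaults.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.sendto(line.encode("utf-8"), (host, port))
        except OSError:
            # Losing a metric should never break the application itself.
            pass


if __name__ == "__main__":
    send_metric("customer.v1.booking:1|c")
```

Nothing is read back from the socket, so the call returns whether or not anything is listening on the other side.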

Why does StatsD adopt this client-server-backend model? For two reasons: (a) language independence and (b) reliability. By relying on a simple, text-oriented protocol, StatsD quickly developed an ecosystem of clients for most languages and frameworks in use today. It also ensured strict isolation between the application (and the StatsD client) and the rest of the instrumentation. Should the StatsD server crash, it would have no effect on the performance of the application beyond the loss of instrumentation.

What is Telegraf?

Telegraf is a plugin-driven server agent for collecting and sending metrics and events from databases, systems, and IoT sensors.

Telegraf is written in Go, compiles into a single binary with no external dependencies, and has a very small memory footprint.

What does plugin-driven mean?
Telegraf has plugins for inputs, outputs, aggregators, etc., and we can combine whichever input and output plugins we want. For input, we can use StatsD over UDP or HTTP, for example. For output, we can use InfluxDB, Datadog, a file, Elasticsearch, etc. In our case, we will use StatsD over UDP for input and InfluxDB for output. The Grafana dashboards we create will read from this database to draw the graphs.
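
For the combination described here, a telegraf.conf fragment could look roughly like the following. Treat it as a sketch rather than a complete configuration: the listening port, the InfluxDB URL and the database name are assumptions for a local setup, and a real deployment will need more options.

```toml
[agent]
  ## How often collected metrics are written to the outputs.
  flush_interval = "10s"

## Input plugin: listen for StatsD datagrams over UDP.
[[inputs.statsd]]
  protocol = "udp"
  service_address = ":8125"   # assumed port

## Output plugin: write the aggregated metrics to InfluxDB.
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]   # assumed local InfluxDB
  database = "telegraf"              # assumed database name
```

Swapping the output for a different backend only means replacing the [[outputs.influxdb]] section with another output plugin; the input side stays the same.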

What is InfluxDB?

InfluxDB is a NoSQL database specialised in storing time-series data, which is indexed on time. It is the ‘I’ in the TICK stack. It is also written in Go and designed to handle high write and query loads. Telegraf collects and aggregates the metrics and periodically flushes them to InfluxDB (this period is the flush interval, 10 seconds by default). Grafana then uses this database as its data source.
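
To give a feel for what a time-indexed point looks like, here is a small sketch that formats one point in InfluxDB line protocol (measurement, optional tags, fields, then a nanosecond timestamp). The function name to_line_protocol, the measurement name and the tag are illustrative assumptions, and the sketch does no escaping of spaces or commas, which real line protocol requires.

```python
def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Format one point as: measurement,tag=value field=value timestamp.

    Hypothetical helper; assumes names and values need no escaping.
    """

    def format_field(value) -> str:
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, int):
            return f"{value}i"  # integer fields carry an "i" suffix
        if isinstance(value, float):
            return repr(value)
        return '"' + str(value) + '"'  # anything else is stored as a string

    # Tags are sorted by key, which is the recommended order for writes.
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    print(to_line_protocol(
        "customer_v1_booking",          # illustrative measurement name
        {"metric_type": "counter"},     # illustrative tag
        {"value": 3},
        1700000000000000000,
    ))
```

Every point carries its own timestamp, which is what lets Grafana query the data by time range.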

What is Grafana?

Grafana is used to visualise and analyse the data collected in InfluxDB. It’s an open platform for analytics and monitoring. The generated dashboards are not only good looking; Grafana also does an excellent job of keeping its query editor simple yet powerful. Setting up a new dashboard in Grafana is really smooth. We could also use Datadog, but that would defeat the whole purpose of not relying on third-party tools (otherwise we would have used New Relic all the way!).

Basic Setup

To set it up locally, this is a very good read:

https://medium.com/@nagaraj.kamalashree/how-to-install-tig-stack-telegraf-influx-and-grafana-on-mac-os-b989b2faf9f8

Written by Abhinav Rai
Product Engineer at Go-Jek | Guitarist | Traveller | Entrepreneur
