Ease of monitoring production issues using TIG stack

Ashwitha B G
7 min read · May 2, 2020


Large companies have hundreds of microservices, some on VMs and some on Kubernetes, with new features built and deployed every day. Imagine how difficult it is to find out whether an application or a specific critical API is failing without monitoring.

Let’s consider one of the services in a ride-hailing application: the complaints automation system. It is used to automatically resolve complaints raised by drivers and customers, such as a customer losing their belongings in the cab or a driver not picking up the customer. I was part of the team that worked on automating these issues, which previously had been handled manually. The volume of issues raised was around 10,000 per day. If this application is down, no complaints get resolved. If the system is not reliable, the workload of manual verification increases, and customers and drivers have to wait hours for their complaints to be resolved.

Complaints automation system

We once faced a major incident when we deployed a refactoring that included database migration changes and major code rewrites. Prior to the deployment, we had unfortunately missed testing a few scenarios, which broke the production deployment. As a result, no automation ran for 4 hours and all the tickets were sent for manual verification. Because we hadn’t set up appropriate monitoring, we did not receive any notification of the incident, and the agents had to report the production issue to us.

We also faced other problems, like a Redis instance going down and automation configuration mismatches, which took us a long time to resolve. At one point, automation for one of the issue types was stopped due to a configuration mismatch and we didn’t even realize it for a month. This is what happens when proper monitoring is not in place.

After these frequent production issues, we started capturing all the events that happened in the automation service and keeping track of the issues we were automating. We captured stats for how many complaints were successfully resolved by the automation and how many failed with 4XX or 5XX errors. We also tracked response times for database queries and API requests, monitored critical APIs carefully by adding alerts for even a single failure, and captured system-level metrics like disk space, memory, and CPU usage. Once we had this monitoring in place, we were able to fix production issues much more quickly. We also learned how new features performed in production and how many people actually used them.

Complaints automation system with monitoring setup

Monitoring System

A monitoring system mainly comprises metrics, monitoring, and alerting. Metrics are the raw data you get from your application or system; raw measurements can be low-level data like disk usage or high-level data based on a feature you have deployed. Monitoring is the process of collecting, storing, and analyzing that data. We absolutely need these measurements and metrics to understand the performance of the system and infrastructure in real time. Alerting notifies the user when a threshold is met.

If you want to build your own monitoring system, you need a collector for collecting metrics, a store for keeping them, a visualizer for setting up dashboards, and an alerter for notifying you when something is wrong. There are multiple ways to monitor applications and systems. One of them is the TIG stack.

Monitoring using TIG stack

The TIG stack is an end-to-end open-source solution for monitoring applications. It has three components: Telegraf for collecting metrics, the Influx database (InfluxDB) for storing them, and Grafana for visualization and alerting.

Telegraf

Telegraf

Telegraf is a metrics collection agent. It is optimized to write to InfluxDB. It runs on a VM, or as a pod or a sidecar on a Kubernetes cluster that can output metrics. It is written in Go and compiles into a single binary with no external dependencies. It is plugin driven and supports collecting metrics from 100+ popular services through plugins. It has 4 types of plugins.

Input plugins: used to collect metrics from systems, services, and third-party APIs.
Example: the PostgreSQL plugin is used to get metrics from a Postgres database.

Output plugins: Telegraf writes metrics to various destinations using these plugins. Example: the InfluxDB output plugin sends metrics to InfluxDB.

Aggregator plugins: used to create aggregate metrics. Example: the Merge plugin merges metrics into a single point.

Processor plugins: used to transform, decorate, and filter metrics. Example: the Regex plugin transforms data based on regular expressions.

It’s very easy to add a plugin in Telegraf. Below is the configuration needed to enable the `mem` input plugin, which collects memory usage metrics. This is written in Telegraf’s configuration file.

#Read metrics about memory usage
[[inputs.mem]]
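
To give a fuller picture, here is a minimal telegraf.conf sketch that collects CPU and memory metrics and writes them to a local InfluxDB; the interval, URL, and database name below are illustrative assumptions, not values from our setup.

# Poll input plugins every 10 seconds
[agent]
  interval = "10s"

# Input plugins: collect CPU and memory usage from the host
[[inputs.cpu]]
[[inputs.mem]]

# Output plugin: write collected metrics to a local InfluxDB
[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]  # assumed InfluxDB address
  database = "telegraf"             # assumed database name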

Telegraf works with both pull- and push-based models. In the pull-based model, the monitoring agent pulls metrics from systems periodically: Telegraf pulls data from its targets, formats the metrics into the InfluxDB line protocol, and sends them off to InfluxDB. In the push-based model, metrics are pushed to the monitoring agent. Telegraf has plugins for both pulling and pushing metrics.

To send metrics from a system such as a database running on a VM, we install Telegraf there, and Telegraf pulls data from the VM using plugins like cpu and mem. To get metrics from an application, we use the statsd plugin, which follows a push-based model, as configured below.
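
As a rough sketch, enabling the statsd input plugin in Telegraf’s configuration file looks like this; the UDP port below is the conventional StatsD default and only an assumption for illustration.

# Run a StatsD server inside Telegraf to receive pushed application metrics
[[inputs.statsd]]
  protocol = "udp"            # StatsD clients typically send over UDP
  service_address = ":8125"   # conventional StatsD port (assumed here)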

StatsD

StatsD

StatsD is a simple daemon that collects and aggregates application metrics. The StatsD architecture consists of three main components: client, server, and backend. In our application code, we invoke the StatsD client to send metrics to the StatsD daemon running inside Telegraf. There are language-specific StatsD client libraries; for example, in Ruby there is a StatsD client called statsd-instrument. The StatsD server aggregates metrics (by default over a 10-second window) and flushes them to a backend such as InfluxDB.

The StatsD client communicates with the StatsD server over UDP, which is fire and forget, so our code does not have to wait for a response. Hence it’s fast. The StatsD server then pushes the metrics to the backend we choose. The basic data sent by a StatsD client contains 3 values: metric name, metric value, and metric type.

<metrics_name>:<metrics_value>|<metrics_type>

Example:
ticket.automation.time:100|ms

The metric name, also called a bucket, is the name of the metric. The metric value is the number associated with that name. And the metric type can be one of the following (a short client-side sketch follows the list):

  1. Timers: measure the amount of time an operation takes to complete.
    Example: how long a database query took to execute.
  2. Counters: track how frequently an event happens. We can increment or decrement a counter.
    Example: incrementing a counter for every failure.
  3. Gauges: take an arbitrary value. Example: the number of active database connections.
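
To illustrate, here is a small sketch of how application code could emit these three metric types using the Ruby statsd-instrument client mentioned earlier; the metric names and the run_resolution_query method are made up for this example.

require 'statsd-instrument'

# Timer: measure how long a block (e.g. a database query) takes
StatsD.measure('ticket.automation.query_time') do
  run_resolution_query # hypothetical application method
end

# Counter: increment every time an automation attempt fails
StatsD.increment('ticket.automation.failures')

# Gauge: report an arbitrary current value, e.g. active database connections
StatsD.gauge('ticket.automation.db_connections', 12)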

Influx database

Influx DB integration

A time series is a series of data points indexed in time order, basically data with a timestamp. A time-series database is optimized for storing time-series data and for handling a high query load. InfluxDB is one such time-series database. It has a cool feature called retention policies for automatically deleting old data, and it’s easy to learn because it has a SQL-like query language called InfluxQL.
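
For example, a retention policy that keeps metrics for two weeks and then deletes them could be created with an InfluxQL statement roughly like the one below; the policy and database names are assumptions for illustration.

CREATE RETENTION POLICY "two_weeks" ON "telegraf" DURATION 14d REPLICATION 1 DEFAULT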

A single data record in InfluxDB is called a point. The InfluxDB line protocol is the text-based format for writing points to the database; it specifies a measurement, a tag set, a field set, and a timestamp. In InfluxDB, the table name is referred to as a measurement, indexed data is called the tag set, and non-indexed data is called a field.
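
As a sketch, a point written by our complaints automation service might look like this in line protocol; the measurement, tag, and field names are invented for illustration.

# format: measurement,tag_set field_set timestamp
ticket_automation,issue_type=lost_item,status=resolved duration_ms=100 1588413600000000000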

Grafana

Grafana integration

Grafana is used to visualize metrics in dashboards and to set up alerts. We can create dashboards and graphs for metrics from data sources like InfluxDB, Prometheus, Elasticsearch, etc., and we can set thresholds for alerts. Alerts can be sent to Slack, email, pager phones, etc.

Dashboard in Grafana
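
For instance, an alert panel could be backed by an InfluxQL query that counts recent automation failures, roughly like the sketch below (the measurement, field, and tag names are hypothetical); the Grafana alert rule would then fire when the count crosses the chosen threshold.

SELECT count("duration_ms") FROM "ticket_automation" WHERE "status" = 'failed' AND time > now() - 5m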

Summary

A monitoring system consists of a metrics collector, a data store, a visualizer, and an alerter. The TIG stack is made up of 3 components: Telegraf, InfluxDB, and Grafana. Telegraf collects metrics, InfluxDB is a time-series database that stores time-series data like metrics, and Grafana visualizes them and raises alerts. The beauty of this monitoring setup is that it is plug and play: Grafana can be replaced by Chronograf and Kapacitor, Prometheus can be used instead of InfluxDB, and Telegraf can collect metrics from many different sources. All these components are open source and easy to install.
