Container and System Monitoring with Docker, Telegraf, InfluxDB, and Grafana on AWS
A container monitoring system collects metrics to verify that applications running in containers are functioning correctly. Metrics are tracked and analyzed in real time to determine whether an application is meeting its expected goals. A container monitoring solution combines metric capture, analytics, performance tracking, and visualization. It covers metrics such as memory utilization, CPU usage, CPU limit, memory limit, and many more.
Fig 1: Architecture Diagram for Container and System Monitoring with Docker, Telegraf, InfluxDB and Grafana
- Grafana- Grafana is an open-source metric analytics and visualization suite. It is most commonly used for visualizing time-series data for infrastructure and application analytics, and it is also used in other domains including industrial sensors, home automation, weather, and process control. It can sit on top of a variety of data stores but is most commonly used together with Graphite, InfluxDB, Elasticsearch, or Prometheus. Visualizations in Grafana are called panels, and users can create a dashboard containing panels for different data sources. Grafana ships with a built-in alerting engine that lets users attach conditional rules to dashboard panels; when a rule fires, an alert is sent to a notification endpoint of your choice (e.g. email, Slack, custom webhooks).
- Telegraf- Telegraf is an open-source server agent for collecting metrics, with 200+ plugins already written by subject matter experts in the community. Its InfluxDB output plugin writes the collected metrics to InfluxDB, from where Grafana can visualize them.
- Docker- The docker stats command provides an overview of the metrics we need to collect for basic monitoring of Docker containers. It shows the percentage of CPU utilization for each container, the memory used, and the total memory available to the container. It also shows the total data sent and received over the network by the container.
- InfluxDB- InfluxDB is a time-series database designed to handle high write and query loads. It is known for its homogeneity and ease of use, along with its ability to perform at scale. Here it stores the metrics that Telegraf collects. Grafana also ships with a feature-rich data source plugin for InfluxDB, which provides a query editor and supports annotations and templated queries.
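To make the docker stats metrics described above concrete, here is a minimal sketch that takes one snapshot and turns it into Python dictionaries. It assumes the Docker CLI is installed and that `docker stats --format '{{json .}}'` emits one JSON object per container with keys such as `Name`, `CPUPerc`, `MemPerc`, `MemUsage`, and `NetIO`. The function names are illustrative and not part of the setup described in this article, where Telegraf does the collecting.

```python
import json
import subprocess


def parse_percent(text):
    """Convert a string such as '12.50%' to the float 12.5."""
    return float(text.strip().rstrip("%"))


def parse_stats_line(line):
    """Parse one JSON line emitted by `docker stats --format '{{json .}}'`."""
    raw = json.loads(line)
    # MemUsage looks like "used / limit"; NetIO looks like "received / sent".
    mem_used, mem_limit = [part.strip() for part in raw["MemUsage"].split("/")]
    net_rx, net_tx = [part.strip() for part in raw["NetIO"].split("/")]
    return {
        "name": raw["Name"],
        "cpu_percent": parse_percent(raw["CPUPerc"]),
        "mem_percent": parse_percent(raw["MemPerc"]),
        "mem_used": mem_used,
        "mem_limit": mem_limit,
        "net_rx": net_rx,
        "net_tx": net_tx,
    }


def collect_stats():
    """Take a single snapshot of all running containers (requires Docker)."""
    output = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{json .}}"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [parse_stats_line(line) for line in output.splitlines() if line.strip()]
```

Calling `collect_stats()` on a Docker host returns one dictionary per running container. The sizes are kept as the human-readable strings Docker prints (for example `24.5MiB`) rather than converted to bytes.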
Setup Steps for the Monitoring System on AWS
Create an empty cluster with an ECS-optimized instance to run the monitoring system. The instance has a role attached to it with the AmazonEC2ContainerServiceforEC2Role policy. We will install the required monitoring components on this ECS instance.
An Elastic IP is attached to the instance launched in the cluster so that its public IP remains static. Any number of monitored instances can then connect to InfluxDB at the same address.
We can access the Grafana dashboard at http://<elastic_ip>:3000 with the login credentials:
Then add InfluxDB as the data source.
Fig 2: Grafana Login Page
Fig 3: Datasource InfluxDB
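The data source can also be added without the UI. The sketch below builds (but does not send) a request to Grafana's HTTP API endpoint `/api/datasources`, assuming basic authentication with a Grafana admin account. The data source name, the database name, and all credentials are placeholders, and 8086 is InfluxDB's default port. Depending on the Grafana version, the database password may be expected in `secureJsonData` as shown here or in a top-level `password` field.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, admin_user, admin_password,
                             influx_url, database, db_user, db_password):
    """Build a POST request that registers InfluxDB as a Grafana data source."""
    payload = {
        "name": "InfluxDB",          # display name, chosen for illustration
        "type": "influxdb",
        "access": "proxy",           # Grafana's backend talks to InfluxDB
        "url": influx_url,
        "database": database,
        "user": db_user,
        "secureJsonData": {"password": db_password},
    }
    credentials = f"{admin_user}:{admin_password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping the build step separate makes it easy to inspect the payload before touching a live Grafana instance.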
Telegraf must be connected to InfluxDB. The InfluxDB URL, the database name, and its username and password are supplied through environment variables and set in the Telegraf configuration file, so that the data source can verify all of these parameters. This is done by configuring the output plugin named <outputs.influxdb> in the Telegraf configuration file. The Telegraf agent then posts its metrics to InfluxDB. Similarly, the <inputs.docker> plugin is configured to collect Docker metrics.
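As a rough illustration of that configuration, the following sketch renders the two plugin sections from a mapping of settings such as `os.environ`. The variable names `INFLUX_URL`, `INFLUX_DB`, `INFLUX_USER`, and `INFLUX_PASSWORD` are hypothetical, and the Docker socket path is the usual default rather than something stated in this article. A real telegraf.conf normally contains more sections (agent settings, system inputs) than shown here.

```python
import json

# Hypothetical setting names; the article does not name its variables.
REQUIRED_KEYS = ("INFLUX_URL", "INFLUX_DB", "INFLUX_USER", "INFLUX_PASSWORD")


def render_telegraf_conf(env):
    """Render minimal outputs.influxdb and inputs.docker sections as TOML text."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    # JSON string quoting is adequate for simple ASCII values in TOML.
    quote = json.dumps
    lines = [
        "[[outputs.influxdb]]",
        f"  urls = [{quote(env['INFLUX_URL'])}]",
        f"  database = {quote(env['INFLUX_DB'])}",
        f"  username = {quote(env['INFLUX_USER'])}",
        f"  password = {quote(env['INFLUX_PASSWORD'])}",
        "",
        "[[inputs.docker]]",
        '  endpoint = "unix:///var/run/docker.sock"',
        "",
    ]
    return "\n".join(lines)
```

For example, `render_telegraf_conf(os.environ)` (after `import os`) would produce the text to write to the configuration file, failing early if a setting is missing.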
For the custom-made dashboard, a JSON definition is imported that visualizes every aspect of the containers as well as the system.
Custom-Made Dashboard:
The templated dashboard uses Telegraf as the collector and InfluxDB as the data source. It gives a quick overview of container and system monitoring.
- Number of containers
- Number of images
- Number of containers per image
- CPU utilization per container
- Memory usage per container
- Disk I/O per container
- CPU utilization
- Memory utilization
...and almost everything else that can be extracted from a server.
The Docker host and Server fields provide a drop-down menu when multiple instances are monitored and configured with the same Elastic IP in the <outputs.influxdb> plugin URL.
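A drop-down like this is typically backed by a tag-values query. The sketch below shows one way to list the reporting hosts through the InfluxDB 1.x `/query` HTTP endpoint, assuming Telegraf tags every point with `host` and writes a `system` measurement. The measurement and tag names are assumptions here, and the queries in the imported dashboard may differ.

```python
import json
import urllib.parse


def host_query_url(influx_url, database, measurement="system"):
    """Build the /query URL that lists every value of the 'host' tag."""
    query = f'SHOW TAG VALUES FROM "{measurement}" WITH KEY = "host"'
    params = urllib.parse.urlencode({"db": database, "q": query})
    return f"{influx_url.rstrip('/')}/query?{params}"


def hosts_from_response(body):
    """Extract the distinct host names from an InfluxDB 1.x JSON response."""
    data = json.loads(body)
    hosts = set()
    for result in data.get("results", []):
        for series in result.get("series", []):
            value_index = series["columns"].index("value")
            hosts.update(row[value_index] for row in series["values"])
    return sorted(hosts)
```

Fetching the URL with any HTTP client and passing the response text to `hosts_from_response` yields the names that would populate such a menu. Each new instance running Telegraf adds one entry once it has written its first points.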
Fig 4: Overview of Container and System Monitoring
Fig 5: Overview of Container and System Monitoring
Fig 6: CPU Usage Monitored for the System
Fig 7: Per Container Monitoring with CPU usage, I/O, Memory Usage and Network
Fig 8: Per Container Monitoring with CPU usage, I/O, Memory Usage and Network
Fig 9: Per Container Monitoring with CPU usage, I/O, Memory Usage and Network
Fig 10: Memory Usage in Detail for the System
Fig 11: Kernel, Swap, Disk Space Usage, etc. Represented Graphically for System Monitoring
Fig 12: Kernel, Swap, Disk Space Usage, etc. Represented Graphically for System Monitoring
Fig 13: Kernel, Swap, Disk Space Usage, etc. Represented Graphically for System Monitoring
Any number of ECS-optimized application instances can be launched with the same role attached, as mentioned above, and monitored by the existing monitoring solution simply by installing Telegraf:
The instances can be in the same cluster or in different clusters. None of these additional instances requires Grafana or InfluxDB, because the Telegraf configuration file is customized to point to the existing monitoring instance. Once the Telegraf service is started, it posts metrics to InfluxDB, and the new instance appears in the Grafana dashboard.
Note: The Telegraf configuration file is stored in an S3 bucket and is copied from there to /etc/telegraf/telegraf.conf.
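The copy step could be scripted roughly as follows. This is a sketch that assumes the AWS CLI is installed on the instance, that the instance role is allowed to read the object, and that Telegraf runs as a systemd service. The bucket and key are placeholders supplied by the caller, since the article does not name them.

```python
import subprocess

TELEGRAF_CONF_PATH = "/etc/telegraf/telegraf.conf"


def build_commands(bucket, key, dest=TELEGRAF_CONF_PATH):
    """Return the commands that fetch the config from S3 and restart Telegraf."""
    return [
        ["aws", "s3", "cp", f"s3://{bucket}/{key}", dest],
        # Assumes systemd; older images may need `service telegraf restart`.
        ["systemctl", "restart", "telegraf"],
    ]


def deploy_config(bucket, key, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for command in build_commands(bucket, key):
        runner(command, check=True)
```

The `runner` parameter exists only so the sequence can be checked without touching a real host. In normal use `deploy_config("<bucket>", "<key>")` runs the commands directly and needs root privileges to write under /etc.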
One can easily set up a monitoring tool with different data types and their respective node collectors. We have used InfluxDB as the data source; it is an open-source tool with a push-based architecture that supports both integer and string data types. It requires only one node collector for both container and system monitoring, which reduces complexity. Alerting with InfluxDB would normally require Kapacitor, but instead of adding another tool one can simply use Grafana's alerting system. By saving time and memory and reducing complexity, we can easily monitor any container across any number of servers.
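To illustrate the push model and the mixed field types, here is a small sketch that formats one point in InfluxDB line protocol, the text format a collector sends to InfluxDB's write endpoint. Integers carry an `i` suffix and strings are double-quoted. The measurement, tag, and field names used with it are made up for the example and are not the ones Telegraf emits.

```python
def _escape(text, chars):
    """Backslash-escape each of the given characters in text."""
    for char in chars:
        text = text.replace(char, "\\" + char)
    return text


def _format_field(value):
    """Render a field value using line-protocol type conventions."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"           # integers are marked with a trailing 'i'
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'            # strings are double-quoted


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one point as: measurement,tags fields [timestamp]."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):         # sorted tag keys are the recommended order
        head += f",{_escape(key, ',= ')}={_escape(str(tags[key]), ',= ')}"
    body = ",".join(
        f"{_escape(name, ',= ')}={_format_field(value)}"
        for name, value in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line
```

A collector would send such lines in the body of a POST to `/write?db=<database>` on InfluxDB 1.x. Telegraf's InfluxDB output handles this batching and formatting itself, so the sketch is only meant to show what travels over the wire.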