Prometheus + Grafana + Node
Playing with Prometheus and Grafana. Testing Node Exporter, Alertmanager and Slack notifications.
The Goal
- Create a website with Node
- Export custom metrics from the site to Prometheus
- Also export system metrics with Node Exporter
- Display metrics in Grafana
- Create alert rules on specific metrics
- Receive a Slack notification when an alert fires
- Redo everything with docker-compose
Install, set up and explore the project
Get the code from this GitHub repository :
To set up the project, run the following command :
# install stress + docker pull prometheus + node-exporter + alertmanager + grafana ...
$ make setup
Configuring Slack notifications
The project uses Slack notifications.
In my Slack account, I have two channels, #my-channel and #another-channel.
They are configured to receive notifications with Incoming Webhooks :
I look up the Webhook URL of each channel :
I then edit my two files, local-alert.yaml and compose-alert.yaml, replacing the api_url values with my Webhook URLs :
The configuration is complete.
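Before relying on Alertmanager, it can help to check that a webhook accepts a message at all. Here is a minimal sketch in JavaScript, assuming Node 18+ (for the global `fetch`). The webhook URL is a placeholder, and `buildSlackRequest` and `sendTestMessage` are hypothetical helper names, not part of the project :

```javascript
// Placeholder: substitute the Webhook URL of your own channel.
const WEBHOOK_URL = 'https://hooks.slack.com/services/XXX/YYY/ZZZ';

// Build the HTTP request an Incoming Webhook expects:
// a POST whose JSON body carries the message in a `text` field.
function buildSlackRequest(webhookUrl, text) {
  return {
    url: webhookUrl,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text }),
    },
  };
}

// Send the message. Not called here, so nothing is posted by accident.
async function sendTestMessage(webhookUrl, text) {
  const { url, options } = buildSlackRequest(webhookUrl, text);
  const response = await fetch(url, options);
  return response.ok;
}

const request = buildSlackRequest(WEBHOOK_URL, 'Test message from my laptop');
console.log(request.options.body);
```

Calling `sendTestMessage(WEBHOOK_URL, 'hello')` with a real URL should make the message appear in the corresponding channel.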
Exploring the website
The website uses the npm module prom-client.
The server uses 3 types of metrics :
- Counter : a counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart.
- Gauge : a gauge is a metric that represents a single numerical value that can arbitrarily go up and down.
- Histogram : a histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values.
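To make these three definitions concrete, here is a small sketch in plain JavaScript. It only models the semantics described above and is not how prom-client is implemented. The class names mirror the metric types, while the variable names and bucket limits are assumptions chosen for illustration :

```javascript
// Counter: the value can only go up.
class Counter {
  constructor() { this.value = 0; }
  inc(amount = 1) {
    if (amount < 0) throw new Error('a counter can only increase');
    this.value += amount;
  }
}

// Gauge: the value can go up and down freely.
class Gauge {
  constructor() { this.value = 0; }
  inc(amount = 1) { this.value += amount; }
  dec(amount = 1) { this.value -= amount; }
  set(value) { this.value = value; }
}

// Histogram: counts observations in cumulative buckets and keeps a running sum.
class Histogram {
  constructor(buckets) {
    // Upper bounds sorted ascending; +Inf catches every observation.
    this.bounds = [...buckets].sort((a, b) => a - b).concat(Infinity);
    this.counts = this.bounds.map(() => 0);
    this.sum = 0;
    this.count = 0;
  }
  observe(value) {
    // An observation is counted in every bucket whose upper bound is >= value.
    this.bounds.forEach((le, i) => { if (value <= le) this.counts[i] += 1; });
    this.sum += value;
    this.count += 1;
  }
}

const requestCount = new Counter();
requestCount.inc();
requestCount.inc();

const queueSize = new Gauge();
queueSize.inc();
queueSize.inc();
queueSize.dec();

const requestDuration = new Histogram([0.1, 0.5, 1]);
[0.0625, 0.25, 0.75, 2].forEach((seconds) => requestDuration.observe(seconds));

console.log(requestCount.value);     // 2
console.log(queueSize.value);        // 1
console.log(requestDuration.counts); // [ 1, 2, 3, 4 ]
console.log(requestDuration.sum);    // 3.0625
```

Note that the bucket counts are cumulative, so the last bucket always equals the total number of observations.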
Let’s start the website :
# local development (by calling npm script directly)
$ make dev
By opening the address http://localhost:5000 you can see this tiny website :
Visiting the counter page increases the request_count metric :
Visiting the push page increases the queue_size metric :
Visiting the pop page decreases the queue_size metric :
Visiting the wait page changes the request_duration_bucket, request_duration_sum and request_duration_count metrics :
The metrics page displays all the metrics :
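A single histogram shows up as three families of series (`_bucket`, `_sum` and `_count`) because of the way it is written in the Prometheus text exposition format. The sketch below renders one histogram in that format. `renderHistogram` is a hypothetical helper, and the bucket limits and values are made up for the example :

```javascript
// Render one histogram in the Prometheus text exposition format.
// `bounds` are the bucket upper limits, `counts` the cumulative count per bucket.
function renderHistogram(name, bounds, counts, sum, count) {
  const lines = [`# TYPE ${name} histogram`];
  bounds.forEach((le, i) => {
    const label = le === Infinity ? '+Inf' : String(le);
    lines.push(`${name}_bucket{le="${label}"} ${counts[i]}`);
  });
  lines.push(`${name}_sum ${sum}`);
  lines.push(`${name}_count ${count}`);
  return lines.join('\n');
}

const exposition = renderHistogram(
  'request_duration',
  [0.1, 0.5, 1, Infinity],
  [1, 2, 3, 4],
  3.0625,
  4,
);
console.log(exposition);
```

Each `le` label is the upper limit of a bucket, and the `+Inf` bucket always carries the same value as `request_duration_count`.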
Prometheus
Let’s start Prometheus :
# run local prometheus
$ make local-prometheus
This command does this :
$ docker run --detach \
    --name=prometheus \
    --network host \
    --volume $(pwd)/local-prometheus.yaml:/etc/prometheus/prometheus.yaml \
    --volume $(pwd)/local-rules.yaml:/etc/prometheus/rules.yaml \
    prom/prometheus \
    --config.file=/etc/prometheus/prometheus.yaml
Prometheus is configured with the local-prometheus.yaml file :
Let’s detail this configuration :
- Prometheus will retrieve metrics from 0.0.0.0:5000, those emitted by our website on localhost.
- It will also retrieve metrics from 0.0.0.0:9100. These are the metrics emitted by Node Exporter, which we will install later.
- It uses Alertmanager, which listens on port 9093 and which we will install later.
- It declares rules via the rules.yaml file.
Here is the content of the rules file :
Let’s detail this configuration :
- We define the recording rule node_cpu_seconds_total:avg
- Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series.
- The associated expression expr calculates the CPU usage
- We define 2 alerting rules, which will be triggered if node_cpu_seconds_total:avg > 45 or if node_cpu_seconds_total:avg > 80
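To make the recording rule and the two thresholds concrete, here is a sketch in JavaScript of the arithmetic involved. It assumes the expression follows the common pattern `100 - avg(rate(node_cpu_seconds_total{mode="idle"}[...])) * 100`, which may differ from the project's actual expr. The function names, the sample values and the `warning` / `critical` severity names are assumptions made for the example :

```javascript
// CPU usage (%) between two scrapes, computed from per-core idle-second counters.
// Each sample: { timestamp: seconds, idleSeconds: [one counter per core] }.
function cpuUsagePercent(prev, curr) {
  const elapsed = curr.timestamp - prev.timestamp;
  // Fraction of the interval each core spent idle (the "rate" of the idle counter).
  const idleFractions = curr.idleSeconds.map(
    (idle, core) => (idle - prev.idleSeconds[core]) / elapsed,
  );
  const avgIdle = idleFractions.reduce((a, b) => a + b, 0) / idleFractions.length;
  return 100 - avgIdle * 100;
}

// The two alerting thresholds from the rules: > 45 and > 80.
function firingAlerts(usage) {
  const alerts = [];
  if (usage > 45) alerts.push('warning');
  if (usage > 80) alerts.push('critical');
  return alerts;
}

const before = { timestamp: 0, idleSeconds: [100, 200] };
const moderate = { timestamp: 60, idleSeconds: [130, 215] };
const heavy = { timestamp: 60, idleSeconds: [107.5, 207.5] };

const moderateUsage = cpuUsagePercent(before, moderate);
const heavyUsage = cpuUsagePercent(before, heavy);

console.log(moderateUsage, firingAlerts(moderateUsage)); // 62.5 [ 'warning' ]
console.log(heavyUsage, firingAlerts(heavyUsage));       // 87.5 [ 'warning', 'critical' ]
```

Because the second threshold sits above the first, crossing 80 fires both alerts at once.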
Prometheus is now running. We can see the rules by opening http://localhost:9090/rules :
We can see the alerts by opening http://localhost:9090/alerts :
We can now use Prometheus to display our metrics.
The URL http://localhost:9090/graph?g0.range_input=5m&g0.expr=queue_size&g0.tab=0 displays the evolution of the queue_size metric, which we varied by visiting the push and pop pages of our website :
Installing Node Exporter
We install Node Exporter :
# run local node-exporter
$ make local-node-exporter
This command does this :
$ docker run --detach \
    --name node-exporter \
    --restart=always \
    --network host \
    prom/node-exporter
Once installed, the metrics are available at the address http://localhost:9100/metrics :
Now all these new metrics are available in Prometheus :
The URL http://localhost:9090/graph?g0.range_input=5m&g0.expr=node_cpu_seconds_total&g0.tab=1 displays the evolution of the node_cpu_seconds_total metric :
Installing Alertmanager
Let’s install Alertmanager :
# run local alertmanager
$ make local-alertmanager
This command does this :
$ docker run --detach \
    --name=alertmanager \
    --network host \
    --volume $(pwd)/local-alert.yaml:/etc/alertmanager/local-alert.yaml \
    prom/alertmanager \
    --config.file=/etc/alertmanager/local-alert.yaml
Once installed, Alertmanager is available at the address http://localhost:9093 :
Stress test
We will trigger an alert by stressing our CPU with the stress executable :
# hot !
$ stress --cpu 2
The URL http://localhost:9090/graph?g0.range_input=1h&g0.expr=node_cpu_seconds_total%3Aavg&g0.tab=0 displays the evolution of our custom metric node_cpu_seconds_total:avg :
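The `%3A` in that URL is simply the colon of the rule name, percent-encoded. As a sketch, the same graph URLs can be built in JavaScript with the standard `URLSearchParams` class. `graphUrl` is a hypothetical helper, and the default base address assumes the local Prometheus used in this article :

```javascript
// Build a Prometheus graph URL for a given expression and time range.
function graphUrl(expr, range = '5m', tab = 0, base = 'http://localhost:9090') {
  const params = new URLSearchParams({
    'g0.range_input': range,
    'g0.expr': expr, // special characters such as ':' are percent-encoded
    'g0.tab': String(tab),
  });
  return `${base}/graph?${params}`;
}

const cpuGraphUrl = graphUrl('node_cpu_seconds_total:avg', '1h');
console.log(cpuGraphUrl);
```

Calling `graphUrl('queue_size')` gives the URL used earlier for the queue_size metric.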
CPU usage exceeds 45%, so an alert is triggered and displayed in Alertmanager :
You can also see the triggered alert in the Prometheus interface :
And our Slack channel has been notified :
Installing Grafana
Grafana allows us to display this data in the form of a pretty dashboard.
Let’s start Grafana :
$ make local-grafana
This command does this :
$ docker run --detach \
    --env GF_AUTH_BASIC_ENABLED=false \
    --env GF_AUTH_ANONYMOUS_ENABLED=true \
    --env GF_AUTH_ANONYMOUS_ORG_ROLE=Admin \
    --name=grafana \
    --network host \
    grafana/grafana
After a few seconds of initialization, Grafana is available at the address http://localhost:3000 :
Grafana however still needs to be configured :
- First, add a datasource
- Then add a dashboard
We can configure it with this command:
$ make local-grafana-configure
This command adds Prometheus as a datasource like this :
$ curl http://localhost:3000/api/datasources \
    --header 'Content-Type: application/json' \
    --data @local-datasource.json
The local-datasource.json file is simple :
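As a rough sketch of what such a datasource definition contains, here it is written as a JavaScript object. It assumes the usual fields accepted by Grafana's `/api/datasources` endpoint. The values are assumptions based on the local setup of this article, not copied from the project's file :

```javascript
// Assumed shape of a Prometheus datasource definition for Grafana.
const datasource = {
  name: 'Prometheus',
  type: 'prometheus',
  url: 'http://localhost:9090', // the local Prometheus started earlier
  access: 'proxy',              // Grafana's backend queries Prometheus on our behalf
  isDefault: true,
};

const datasourceJson = JSON.stringify(datasource, null, 2);
console.log(datasourceJson);
```

The printed JSON is the kind of payload that the curl command above sends with `--data`.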
The same command then installs a dashboard designed specifically for Node Exporter data.
It downloads the dashboard JSON from https://grafana.com/api/dashboards/1860 :
# create dashboard-1860.json
$ curl https://grafana.com/api/dashboards/1860 | jq '.json' > dashboard-1860.json
Before it can be imported into Grafana, the JSON must be wrapped like this :
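As a sketch of that wrapping step, assuming the `dashboard` / `overwrite` envelope accepted by Grafana's `/api/dashboards/db` endpoint, the transformation could look like this in JavaScript. `wrapDashboard` is a hypothetical helper, and the `raw` object is a tiny stand-in for the real content of dashboard-1860.json :

```javascript
// Wrap a raw dashboard definition in the envelope used by POST /api/dashboards/db.
function wrapDashboard(dashboard) {
  return {
    // id: null asks Grafana to create a new dashboard instead of updating one.
    dashboard: { ...dashboard, id: null },
    // Replace an existing dashboard with the same uid or title, if any.
    overwrite: true,
  };
}

// Stand-in for the downloaded JSON (the real file is much larger).
const raw = { id: 1860, title: 'Node Exporter Full', panels: [] };
const wrapped = wrapDashboard(raw);

console.log(JSON.stringify(wrapped, null, 2));
```

The original object is left untouched. Only the copy placed under `dashboard` has its id reset.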
We now add it to Grafana :
# add dashboard-1860-modified
$ curl http://localhost:3000/api/dashboards/db \
    --header 'Content-Type: application/json' \
    --data @dashboard-1860-modified.json
A second dashboard, specific to the metrics of our website, is also added.
After reloading the browser, we see that the dashboards have been added :
We display the Node Exporter Full dashboard :
We display the My dashboard dashboard :
A new stress test
We are going to stress our CPU again, this time a little harder :
$ stress --cpu 3
We see that CPU usage now exceeds 80% :
Our 2 alerts are triggered and displayed in Alertmanager :
We can also see our alerts triggered in the Prometheus interface :
The #my-channel Slack channel has received the warning notification :
The #another-channel Slack channel has received the critical notification :
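The behaviour observed here, with warnings going to one channel and critical alerts to the other, amounts to routing on a label of the alert. Here is a sketch of that idea in JavaScript. It assumes the label is called `severity`. The alert names are invented, and `channelFor` is a hypothetical helper, not Alertmanager's actual implementation :

```javascript
// Hypothetical routing table mirroring the behaviour observed above.
const routes = {
  warning: '#my-channel',
  critical: '#another-channel',
};

// Pick the Slack channel for an alert from its severity label.
function channelFor(alert) {
  return routes[alert.labels.severity] ?? null;
}

// Invented alerts, standing in for the two alerts fired by the stress test.
const firedAlerts = [
  { labels: { alertname: 'CpuAbove45', severity: 'warning' } },
  { labels: { alertname: 'CpuAbove80', severity: 'critical' } },
];

const deliveries = firedAlerts.map((alert) => ({
  alert: alert.labels.alertname,
  channel: channelFor(alert),
}));
console.log(deliveries);
```

An alert with an unknown severity gets `null` in this sketch, whereas a real configuration would normally fall back to a default receiver.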
Our tests are now complete, so we can remove the running containers :
# remove all running containers
$ make rm
Using docker-compose
The goal is to set up a similar environment using docker-compose.
One command is enough :
# docker-compose up
$ make compose-up
The configuration files are worth a look.
The docker-compose.yaml file :
Note how the Prometheus configuration has been modified :