Jenkins agent monitoring with Prometheus

Rudolf Horváth
Tresorit Engineering
May 17, 2021 · 5 min read
Jenkins / Prometheus

Following this guide, you can create a Jenkins job that forwards basic monitoring metrics about the connected Jenkins agents to a Prometheus service.

This is useful because you save the deployment and maintenance time/cost of a monitoring agent on your connected machines, yet still get preemptive alerts that can help avoid issues related to the agents themselves. Dashboards and queries can also help with later investigations.

The Prometheus Jenkins plugin does not serve agent-centric metrics, so this guide adds value on top of it.

Overview and capabilities

A job with a System Groovy script generates a plaintext page, from the metrics already collected by the agent.jar processes, to be scraped by Prometheus.

Prometheus is a mature monitoring service that is worth using even in small and mid-sized environments. Its HTTP-based scraping mechanism keeps exporters readable and easy to develop.
An exporter only needs to expose the gathered metrics to the Prometheus server on a plaintext web page, which can even be the last successful build’s archived artifact view of a Jenkins job, since that has a fixed URL.

In such a job with a System Groovy script, you can collect useful metrics of the connected agents through the Computer class.
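
For illustration, here is a minimal, simplified sketch of that idea, assuming the metric and label names from the example output shown later in this article; the real AgentExporter.groovy also exports node labels, IP addresses and the disk/memory monitors, and writes the result into the archived artifact file instead of the build log.

// Simplified System Groovy sketch: iterate the connected computers and print
// the cached monitor values in the Prometheus text format.
import jenkins.model.Jenkins

def lines = []
Jenkins.instance.computers.each { computer ->
    def name = computer.name ?: 'built-in'   // the controller itself has an empty name
    lines << "jenkins_agent_online{agent_name=\"${name}\"} ${computer.online ? 1 : 0}"
    lines << "jenkins_agent_executors_total{agent_name=\"${name}\"} ${computer.numExecutors}"
    lines << "jenkins_agent_executors_busy{agent_name=\"${name}\"} ${computer.countBusy()}"

    // Node monitors cache their last measurement per computer; getMonitorData()
    // returns these values keyed by the monitor class name.
    def clock = computer.monitorData['hudson.node_monitors.ClockMonitor']
    if (clock != null) {
        lines << "jenkins_agent_clock_diff{agent_name=\"${name}\"} ${clock.diff}"   // milliseconds
    }
}
println lines.join('\n')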

Prerequisites

  • Jenkins instance with Groovy plugin and connected agents
  • Existing Prometheus setup

Gathered metrics

  • Metrics not available from node_exporter
    (default Prometheus metric collector)
    ⤷ Connected/Online Status
    ⤷ Total/Busy Executor Number
    ⤷ Response Time
  • Clock Difference
  • Total/Free Disk Space (only the disk where the workspace resides)
  • Total/Free Temp Space
  • Total/Free Memory Space
  • Total/Free Swap Space

All of these metrics are gathered by the Jenkins agent.jar processes by default and are available at the https://${jenkins_base_url}/computer/ page.
With the Monitoring Jenkins plugin, you can collect even more metrics; more on that later.

Step by Step

This section will show how to create the Jenkins job and the matching Prometheus configuration to forward the metrics.

Create the Jenkins job

If you are familiar with Jenkins, the job is a no-brainer:
check this jobDSL script and skip the list below.
Otherwise, this summarizes the job setup (a Job DSL sketch also follows the list):

  • Build retention: I recommend keeping builds for one day at most, depending on how quickly alerts are acted upon.
  • Throttle Concurrent Builds: At most one build should run at any time.
  • Restrict where this project can be run: These metrics are stored in the Jenkins JVM itself, so I recommend ‘master’, but other nodes are fine too.
  • Build periodically: My suggestion is roughly every 5 minutes, as these metrics do not change fast: “H/5 * * * *”
  • Abort the build if it’s stuck: I recommend 1 minute; the metrics are not collected from the agents at runtime, only already cached values are “printed”.
  • Build: Execute system Groovy script
    You can get the script from GitHub: AgentExporter.groovy
    The script does not support sandbox mode; maybe it could be refactored to support it. Suggestions are welcome in the comments.
  • Post-build Actions: Archive the artifacts
    Filename: “prometheus”
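
If you prefer Job DSL, a sketch of the settings above could look roughly like this; the job name is made up, the linked jobDSL script is the authoritative version, and the timeout and systemGroovyCommand methods require the Build Timeout and Groovy plugins.

// Job DSL sketch of the exporter job described above.
job('prometheus-agent-exporter') {
    label('master')                    // restrict where the job can run
    concurrentBuild(false)             // at most one build at a time
    logRotator {
        daysToKeep(1)                  // keep builds for one day at most
    }
    triggers {
        cron('H/5 * * * *')            // run roughly every 5 minutes
    }
    wrappers {
        timeout {
            absolute(1)                // abort the build if it is stuck for 1 minute
        }
    }
    steps {
        // Read AgentExporter.groovy from the seed job's workspace and run it
        // as an "Execute system Groovy script" build step.
        systemGroovyCommand(readFileFromWorkspace('AgentExporter.groovy'))
    }
    publishers {
        archiveArtifacts('prometheus') // the plaintext file Prometheus will scrape
    }
}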

Try it out and have a look at the resulting artifact at the URL: https://${job_url}/lastSuccessfulBuild/artifact/prometheus/*view*/

You should see something like this:

# HELP jenkins_agent_clock_diff Agent system time difference in ms relative to Jenkins master's
# TYPE jenkins_agent_clock_diff gauge
jenkins_agent_clock_diff{agent_name="agent1", node_labels="linux x64", ip="10.0.210.151"} 26
# HELP jenkins_agent_response_time Agent round-trip response time in ms from Jenkins master
# TYPE jenkins_agent_response_time gauge
jenkins_agent_response_time{agent_name="agent1", node_labels="linux x64", ip="10.0.210.151"} 89
...

Add Prometheus target

Create or select an existing functional (non-user) Jenkins account for this, and go to https://${jenkins_base_url}/user/${username}/configure
Generate your secret under the API Token section: Add new token.

You have two options for forwarding this secret to Prometheus:

  • Recommended: Store it in a file and refer to that file in the prometheus.yml config
    ⤷ The file must be made available to the Prometheus service
    ⤷ Keeping the token in a separate file makes it easier to version control the config while storing the secret separately.
  • Write the secret into the prometheus.yml config directly
    ⤷ Use password instead of password_file in the example

Here is a config part for the secret-file option: prometheus.job-config.yml
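
In sketch form, such a scrape config could look roughly like the snippet below; the job name, hostname, functional username and token file path are placeholders for your own setup, and the linked file remains the authoritative example.

scrape_configs:
  - job_name: 'jenkins_agents'
    scheme: https
    # Fixed artifact URL of the exporter job's last successful build
    # (replace 'agent-exporter' with your job's actual name/path).
    metrics_path: '/job/agent-exporter/lastSuccessfulBuild/artifact/prometheus/*view*/'
    basic_auth:
      username: 'prometheus-bot'                           # the functional Jenkins account
      password_file: '/etc/prometheus/jenkins_api_token'   # file containing the API token
    static_configs:
      - targets: ['jenkins.example.com']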

Restart the Prometheus service, or just reload the configuration, and check whether the new job and its target appear. Next, you can play around with the graphs and find out how it best serves your use cases. 😉

Script details, extension points

The script that generates the plaintext file to be scraped by Prometheus: AgentExporter.groovy

Inspiration

The following scripts inspired the final solution; maybe they will spark more ideas for you:

Extension possibilities

If you use more monitoring-related plugins in Jenkins, you may add further metrics to be exported. As System Groovy scripts have access to the complete JVM of the Jenkins service, any metric gathered by a plugin should be reachable from them (see the small sketch after this list).

  • Disk usage plugin: Reports, for each job, a summary of how much disk space it consumes, plus information about the JENKINS_HOME directory and its typical subdirectories.
  • Monitoring plugin: Provides JavaMelody metrics on Jenkins pages, but it has examples of how they can be reached from a script: MonitoringScripts.md
    Possible further metrics:
    ⤷ CPU load
    ⤷ JVM related: memory metrics, threads, GC
  • doScriptText(): In theory, you can run an arbitrary Groovy script on the agents and parse the returned text, but it feels… alarming to me. The Jenkins agent connection was not designed for custom metric gathering. Of course, it was designed to execute arbitrary build steps, but this call works without the safety net that a normal build step has (file/process cleanup, parallel execution, etc.).
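
To illustrate the JVM-access point mentioned before the list: any value reachable from the Jenkins master JVM can simply be appended to the exporter's output as an extra metric line, for example the current build queue length from the core API (the metric name here is made up).

// Hypothetical extension inside the exporter script: append a metric that is
// not agent-specific but is readily available from the Jenkins master JVM.
import jenkins.model.Jenkins

def extraLines = []
extraLines << '# TYPE jenkins_queue_length gauge'
extraLines << "jenkins_queue_length ${Jenkins.instance.queue.items.length}"
println extraLines.join('\n')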

Security

Due to its pull methodology, the Prometheus server needs to reach the Jenkins webserver. This typically does not require any firewall/networking modification. Authentication can be achieved with a functional user’s access token, as described in the “Add Prometheus target” section above.

For some, the System Groovy script may seem to have too much privilege. As I see it, this only depends on correct role handling: who is allowed to create or modify such a job.

Preemptive alert recommendations

Here I share the Prometheus alerts I found useful based on the new metrics: alerts.yml

  • Alert if systems’ clock skew is above a threshold
  • Alert if the available workspace is below a threshold
  • Alert if metrics are older than a threshold
  • Alert if Node is not connected but was not put offline by admins

The last_collection_time metric supports alerting when the data is too old (see the examples), but a post-build step sending a Slack/email notification in case of build failure could be useful too.
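
As an illustration of the first alert in the list, a clock-skew rule could look roughly like this; the threshold, duration and severity label are assumptions, and the linked alerts.yml remains the authoritative version.

groups:
  - name: jenkins_agents
    rules:
      - alert: JenkinsAgentClockSkew
        # clock_diff is exported in milliseconds; 30 seconds is only an example threshold
        expr: abs(jenkins_agent_clock_diff) > 30000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Clock skew on {{ $labels.agent_name }} is above 30 seconds"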

Closing thoughts

What you really save with this solution is the deployment and maintenance of the metric collector service, as these are already handled by the agent.jar. Thanks to its simplicity, the networking setup can also be easier than in other solutions. However, the gathered metrics are not as rich as with a regular node_exporter.
