The Case for Metrics In Jenkins

Simone Payne
Aug 9 · 10 min read

At Hootsuite, we know that we run up to 3000 Jenkins jobs per weekday. On weekends, it drops all the way down to around 40 jobs (there are automated jobs that are always running). Out of those jobs, 32.02% are built on master branches; they build changes for production services. The rest are tests, scheduled jobs, GitHub pull requests, and everything else we need to keep our pipelines and services running both smoothly and reliably.

We also know that Go projects account for roughly 20% of all the Jenkins jobs we build. Scala and Ruby projects each account for roughly 10%. Node projects follow at 9%. The rest use the base JNLP pod and have no specific language requirements, with the exception of a few PHP and other specialized jobs, as can be seen below.

Pie chart of the commonly run Jenkins pods

We know all of this because these metrics are tracked. Tracking them allows us to better serve the Hootsuite developers who rely on our internal platforms. The metrics tell us which languages Jenkins needs to support, which projects are undergoing the most changes, and, most importantly, where our attention and maintenance are most needed.

This wasn’t always the case.

Pipelines of the Past

Just four years ago, in 2018, we didn’t have any regular metrics on our Jenkins instance. We only had irregular metrics, run once or twice for a few data points at a time. With no one regularly looking at them, and no maintenance done to retain them, the data was quickly lost. If we were asked how Jenkins was performing, or even whether it was meeting our needs, we didn’t have an answer. We also didn’t know whether our updates and changes improved Jenkins’ performance for developers or did the opposite. Instead of being guided by metrics, work on the Jenkins instance was dictated entirely by whatever complaints we received at any point in time. How many development or human resources did we need? Were we spending more on servers or development time than needed? Were jobs failing due to incorrect configuration, or not starting up at all? We didn’t have any way to tell. Our goal was to be able to answer all of these questions.

When we first set out to collect meaningful data from Jenkins, we started with the question we were asked most often: are our Jenkins pipelines able to reliably deploy?

At this time, we had many different pipelines, all with very different deploy patterns and stages, which made it difficult to know where to start. Without a common deployment pattern to use, we decided instead to focus on the job that saw the most developer traffic. This was much easier to identify, as we could see which projects on GitHub had the highest rate of commits. And so we were able to choose a starting project: ‘The Dashboard’ pipeline.

As it happened, The Dashboard pipeline was also the pipeline we received the most complaints about. We could quickly make a big splash! Collecting metrics from The Dashboard would show a large audience of developers just how valuable consistent metrics could be. In the interest of making the biggest impression we could, we collected our first metrics for the Integration Test stage of The Dashboard pipeline. Every time this stage failed, it sent a Slack alert to all of the developers working on the project, and at this time there were many, many Slack messages sent for this stage. It alone accounted for the majority of complaints we received. But just how bad was it? We were about to find out.

The Dashboard is a very high-traffic project at Hootsuite. It also had the greatest number of smoke tests and, as smoke tests often do, they failed frequently. Every time these tests failed, they would block the entire pipeline while multiple developers tried to get their changes out. It was a huge roadblock. We knew it wasted developer time, but we wanted to know how much. Below is an old slide from four years ago; it shows how often the build failed and the general build duration. The units are in minutes.

Our first metrics for the Dashboard Pipeline

This was just the preliminary data we received, but the picture it painted was consistent with the data we continued to receive after the metrics were rolled out. As you can see, for every successful build of The Dashboard there were two failures. With build times in the ‘Build Duration’ section averaging more than 60 minutes, rerunning these tests was wasting hours of dev time every day.

In the data we continued to collect, we learned that in the month of February four years ago, more than 290 hours were spent just rerunning tests for that single project. The Jenkins job for The Dashboard itself succeeded on the first run only 77 out of 218 times, which translates to an astonishing 65% build failure rate. At last, we had the data to prove that the Jenkins pipeline was in serious need of maintenance. After this, we were able to expand the metrics we gathered on Jenkins and build a better understanding of the most common issues our pipelines encountered.
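As a quick check of the arithmetic, the failure rate follows directly from the two counts quoted above. This is only an illustrative sketch; the helper name is made up for the example and is not part of our pipeline code.

```groovy
// Hypothetical helper: percentage of first-run builds that did not succeed.
def failureRatePercent(int succeeded, int total) {
    if (total <= 0) {
        throw new IllegalArgumentException("total must be positive")
    }
    return (total - succeeded) * 100.0 / total
}

// 77 first-run successes out of 218 builds, the figures quoted above.
def rate = failureRatePercent(77, 218)
println(String.format("%.1f%% of first-run builds failed", rate))
```

The result is a little under 65%, which matches the rounded figure in the text.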

This realization officially began our metrics journey four years ago. Since then, we have made many changes and improvements to how we collect our data. These days we use Prometheus to collect metrics and Grafana for our metrics dashboards. We have also moved our Jenkins instance onto Kubernetes. As a result, the number of meaningful metrics we collect, and our ability to use the data, have increased significantly.

So, what do we do now?

New and Improved

These days we collect many different metrics, which together give us a full picture of how Jenkins is performing. They are collected in different ways to accommodate the different pipelines and stages that we run, ensuring that all of our pipelines are covered. Currently, our two main sources of metrics are the Jenkins Prometheus plugin and a Jenkins Shared Library function we built to send metrics to a push gateway, from which Prometheus then collects them.

How we get our metrics from Jenkins to Grafana

The first way we collect metrics is via the Jenkins Prometheus plugin installed on our Jenkins instance. We use a forked version of the plugin so that it better meets our requirements. We run our Jenkins instance on Kubernetes (using the Kubernetes plugin) and, while the Prometheus plugin is great on its own, we found that its default naming scheme created a separate metric for every Kubernetes pod that was spun up. This was a consequence of our Jenkins agent design, in which we have many different kinds of pods for the many different kinds of jobs that we build. Every pod also includes a random string in its name on creation, and together this resulted in excessively high cardinality in our metrics. So, in our fork, we removed the pod name from the metric name and instead attach pod names as labels, both to reduce the load and to match our internal metric naming conventions.
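To make the naming problem concrete, here is a minimal sketch of the difference between putting the pod name in the metric name and carrying it as a label. The pod names and metric names are hypothetical, chosen only for illustration; they are not the names used by the plugin or by our fork.

```groovy
// Hypothetical pod names: a pod template name plus a random suffix.
def pods = ["go-build-x7k2p", "go-build-q9m4z", "scala-build-b3n8w"]

// Prometheus metric names may only contain [a-zA-Z0-9_:], so a name-based
// scheme has to sanitise the pod name before embedding it.
def sanitise = { String s -> s.replaceAll(/[^a-zA-Z0-9_:]/, "_") }

// Before: the pod name is part of the metric name, so every new pod
// creates a brand-new metric.
def byName = pods.collect { "jenkins_${sanitise(it)}_build_duration_seconds".toString() }

// After: a single metric name, with the pod name carried as a label value.
def byLabel = pods.collect { "jenkins_build_duration_seconds{pod=\"${it}\"}".toString() }

// Count distinct metric names (the part before any '{').
def distinctNames = { List series -> series.collect { it.tokenize('{')[0] }.unique().size() }

println("metric names, pod in name:  ${distinctNames(byName)}")
println("metric names, pod in label: ${distinctNames(byLabel)}")
```

With the label-based scheme every pod is still its own time series, but all of them share one metric name, so they can be queried and aggregated together.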

The plugin exposes these metrics on the /prometheus path, and a metrics sidecar on the Jenkins instance serves them on port 9191. A Prometheus scraper, running as a separate Kubernetes pod, then collects them and exports them into Prometheus.

The metrics we collect using the Prometheus plugin give us insight into how Jenkins itself is handling the jobs that are run. For example, the metrics we use include:

  • The queue time for jobs waiting to be built. This lets us know if pods for specific build types are taking an excessive amount of time to start up. Causes could include errors in the Docker image used to create the pod on Kubernetes, blocked queues due to locked jobs, and misconfigurations in the pod templates.
  • The percentage of successful builds per pipeline. This alerts us if there are builds on Jenkins that consistently fail; and if we start seeing high rates of failure across all builds, we can investigate what is causing the issues.
  • The total number of active pipeline jobs. This keeps track of any dead jobs that should either be maintained or deleted, and also helps us identify the total number of builds we are actively supporting.
  • The rate of builds per hour. This lets us know at what times Jenkins is most heavily used and gives us insight into how many resources we need to support at different times.

Our second collection method uses a Jenkins Shared Library function that we built in-house. This function can be called in any pipeline and provides more job-specific metrics so that developers can track how their project is performing. It lives in a shared library that is used by all pipeline jobs on Jenkins, and it is integrated into the default build flow of all jobs. It also supports sending custom metrics for jobs that developers want additional insight into.

To collect these metrics we use two gateways: the Weaveworks Prometheus aggregation gateway, which collects the counters and histograms built in the JSL function, and the Prometheus Pushgateway, which collects gauges. For both gateways, we build out our metrics to match the Prometheus notation and data model. This allows us to name our metric, assign any labels we want (such as job name and branch), and set the value we want to send for the metric. These metrics are then scraped in the same way as those from the Prometheus plugin and are likewise available in Grafana.

// url (String): URL for either the aggregating-pushgateway server
//               or the pushgateway server. Gauges are sent to the
//               pushgateway; all other metrics are sent to the
//               aggregating-pushgateway.
// msg (String): The data to be sent to Prometheus in notation form
def send_message(String url, String msg) {
    def pushGateway = new URL(url).openConnection()
    pushGateway.setRequestMethod("POST")
    // A request body can only be written once output is enabled
    pushGateway.setDoOutput(true)
    pushGateway.getOutputStream().write(msg.getBytes("UTF-8"))
    // Reading the response code sends the request
    def postRC = pushGateway.getResponseCode()
    if (postRC != 200) {
        println("Unable to send metrics (HTTP ${postRC})")
    }
}
Example of the code used to send metrics to the pushgateways, and the results
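The msg argument is plain text in the Prometheus exposition format: a metric name, an optional set of labels in braces, a value, and a trailing newline. The sketch below shows one way such a message could be built. The function names, metric name, label values and URL are assumptions made for the example, not the ones in our library.

```groovy
// Escape a label value for the Prometheus text exposition format:
// backslash, double quote and newline must be escaped.
String escapeLabelValue(String value) {
    return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n")
}

// Build one sample line of the form: name{label="value",...} value
// (hypothetical helper, for illustration only).
String formatMetric(String name, Map<String, String> labels, Number value) {
    def labelText = labels.collect { k, v -> "${k}=\"${escapeLabelValue(v)}\"" }.join(",")
    def series = labelText ? "${name}{${labelText}}" : name
    return "${series} ${value}\n".toString()
}

def msg = formatMetric("jenkins_stage_duration_seconds",
        [job: "the-dashboard", branch: "master", stage: "integration_test"], 42.5)
print(msg)

// The result could then be handed to send_message, for example:
//   send_message("http://pushgateway.example:9091/metrics/job/the-dashboard", msg)
// The host is made up; /metrics/job/<name> is the Pushgateway's push path.
```

Escaping label values matters because job and branch names come from developers and may contain quotes or backslashes, which would otherwise produce a message the gateway rejects.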

The metrics we collect using the Jenkins Shared Library include the duration of the different stages in a build pipeline, the success and failure rate of each of those stages, the percentage of test coverage for different jobs, and how often each of our Jenkins Shared Library functions is called. The duration of stages tells us which areas are slowest and could benefit from caching or increased resources. The success and failure rate of stages lets us know where pipelines are most likely to fail and could use reliability improvements. Test coverage tells developers how much confidence they can have that their deployment is stable. And our metric for tracking how often our custom functions are called lets us know what developers are finding most useful. If a function has low usage, either we have not shared it well enough with developers or it does not meet their requirements.
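Stage duration and stage success can both be captured by wrapping the stage body in a small timing helper. The following is a minimal sketch in plain Groovy, assuming a hypothetical timedStage wrapper and a report callback that stands in for whatever actually sends the metric; it is not our library's implementation.

```groovy
// Hypothetical wrapper: run a stage body, then report how long it took and
// whether it succeeded. 'report' stands in for the code that sends the metric.
def timedStage(String stageName, Closure report, Closure body) {
    long start = System.nanoTime()
    String result = "success"
    try {
        return body()
    } catch (Exception e) {
        result = "failure"
        throw e  // the stage still fails; we only record the outcome
    } finally {
        double seconds = (System.nanoTime() - start) / 1.0e9d
        report(stageName, result, seconds)
    }
}

// Usage: collect the reports in a list instead of sending them anywhere.
def reported = []
def value = timedStage("unit_test", { name, result, secs -> reported << [name, result, secs] }) {
    21 * 2
}
println("stage returned ${value}; reported ${reported}")
```

Reporting from the finally block means a duration and an outcome are recorded even when the stage throws, which is what makes a per-stage failure rate possible.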

Together these two methods of metrics collection provide both our team and developers with increased insight into how jobs are performing on Jenkins and where our time would be best spent to improve developer experience.

We also plan to continue improving our metrics and ensuring that our data is useful. This will allow us to improve the experience developers have when interacting with a Jenkins pipeline. We also plan to revisit test metrics to investigate the value of data not only on the percentage of test failures and test coverage, but also on how often different tests for different features are run, and the percentage of different types of tests (smoke, API, etc.) that are run. With this data we want to build a better understanding of how testing is done at Hootsuite.

The Benefits of a Well-Oiled Pipeline

With these changes we have seen an increase in the reliability of our running Jenkins jobs. Our new metrics system lets us accurately track issues that arise in the many pipelines we run, and far more quickly than word of mouth and complaints ever did. The metrics not only allow us to act quickly when an incident occurs but also make it easy to view trends in our pipelines, often letting us prevent incidents before they happen. We now also use them as part of our SLOs: they allow us to make promises to developers about how quickly their pipelines will begin building, the reliability of the different Jenkins agents that we use, and the availability of Jenkins itself. Finally, our metrics are also available on the pipelines that developers at Hootsuite manage themselves. This gives them better insight into their own code bases and the reliability of their own projects that run on Jenkins.

Back to the example of The Dashboard project: there is now a dedicated team maintaining and improving the pipeline and its related services. It is in a much better state than it was before. Just look at our new metrics for the proof. Not only are builds passing with a success rate of 95% instead of 50%, but we also now have metrics on the Dashboard-specific tests.

Current metrics for the Dashboard Pipeline

Hootsuite Engineering