Hootsuite is a big advocate of continuous integration (CI) and continuous delivery (CD). These practices allow developers to ship value to the customer quickly, reliably, and regularly. Every service, from our main dashboard to dozens of customer-facing microservices to internal tools, is built and shipped using an automated pipeline. This encourages a culture of agility and of taking calculated risks.
My team’s mission was to improve developer productivity and satisfaction, and to enable developers to deliver stable and reliable software to our customers as quickly as possible. Because CI and CD are such a big part of developers’ day-to-day workflow, we decided we needed a tool to help us measure and identify the pain points and delays developers have to deal with. Developers should not spend their time debugging tests that would pass on a re-run, or be blocked for an extended period of time while waiting for their code to deploy.
Jenkins Metrics Phaser
Jenkins Metrics Phaser is our internal tool for collecting all metrics related to our build and deploy pipelines. Its ultimate goal is to enhance developer productivity and satisfaction. It aims to do so by tracking pipeline performance, flagging pain points, and collecting feedback from developers. It also gives us visibility into all our pipelines across the variety of microservices and tools we have.
After years of using Jenkins, we have ended up with a very diverse Jenkins setup. Our legacy dashboard Jenkins is self-hosted on our local servers. We have cloud-hosted Jenkins instances for a couple of our teams. Finally, all of our newer instances are hosted in Mesos clusters, with each instance belonging to its designated team. We needed Phaser to aggregate all of the data from these instances into one platform.
Getting information from Jenkins
We use Amazon Simple Notification Service (SNS) and Amazon Simple Queue Service (SQS) to reliably deliver data from Jenkins. All of our Jenkins instances are preconfigured with a custom-made Jenkins plugin that sends data to our SNS topic after the completion of each job. The SNS topic immediately pushes these messages to the SQS queue to allow for real-time processing.
This setup has proven to be reliable even when things break. One time, a third-party dependency used by Phaser introduced a breaking change and caused our service to be down for a few hours. As we worked on a fix, we weren’t worried about data loss: we knew the queue would just grow larger while the messages remained in it. And indeed, once we brought Phaser back up, it immediately picked up from where it left off and fully recovered within minutes, without any data loss.
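As a rough illustration of the consuming side, the sketch below unwraps an SNS notification as it would appear in an SQS message body and decodes the job payload inside it. It assumes raw message delivery is disabled on the subscription, so the payload arrives as a string in the envelope's `Message` field. The `JobEvent` type and its field names are hypothetical, since the post does not describe what the plugin actually sends.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// snsEnvelope mirrors the subset of the SNS notification JSON that lands in
// the SQS message body when raw message delivery is disabled.
type snsEnvelope struct {
	Type      string `json:"Type"`
	MessageID string `json:"MessageId"`
	Message   string `json:"Message"` // publisher payload, carried as a string
}

// JobEvent is a hypothetical payload shape used only for illustration; the
// real plugin's fields are not described in the post.
type JobEvent struct {
	Instance    string `json:"instance"`
	Job         string `json:"job"`
	BuildNumber int    `json:"build_number"`
	Result      string `json:"result"`
	DurationMS  int64  `json:"duration_ms"`
}

// decodeJobEvent unwraps the SNS envelope found in an SQS message body and
// then decodes the inner payload.
func decodeJobEvent(sqsBody string) (JobEvent, error) {
	var env snsEnvelope
	if err := json.Unmarshal([]byte(sqsBody), &env); err != nil {
		return JobEvent{}, fmt.Errorf("decode SNS envelope: %w", err)
	}
	var ev JobEvent
	if err := json.Unmarshal([]byte(env.Message), &ev); err != nil {
		return JobEvent{}, fmt.Errorf("decode job event: %w", err)
	}
	return ev, nil
}

func main() {
	body := `{"Type":"Notification","MessageId":"example-id","Message":"{\"instance\":\"team-a\",\"job\":\"dashboard-build\",\"build_number\":42,\"result\":\"FAILURE\",\"duration_ms\":91500}"}`
	ev, err := decodeJobEvent(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s #%d on %s: %s\n", ev.Job, ev.BuildNumber, ev.Instance, ev.Result)
}
```

A consumer built this way would delete a message from the queue only after it has been decoded and stored, which is what lets the queue act as a buffer during an outage like the one described above.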
Saving the information in our database
Each batch of SNS messages read from the queue is distributed across goroutines for concurrent processing. Every message is stripped down to its essential information and passed along to our main controllers. The controllers fall into two main categories, each corresponding to its own database tables:
- Build and Deploy: responsible for storing information about all pipeline jobs. Each table row represents a single Jenkins job, with information such as result, time, and duration. One key detail is that each job also contains a relational link to its triggering (parent) job. This allows us to use the data to visualize an entire pipeline workflow.
- Test: responsible for describing the performance of unit and integration tests of all pipelines. After identifying that a job is in fact a test job, Phaser calls the Jenkins API to fetch the test results in JUnit XML format and parses them down to individual test units. We convert all test results, regardless of the language the tests were written in, to JUnit XML format.
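The flow above can be sketched in a few lines of Go: a batch of JUnit XML reports is fanned out to goroutines, and each report is parsed down to individual test results. This is a minimal sketch rather than Phaser's actual code. The type and function names are hypothetical, the reports are assumed to have a single `<testsuite>` root element, and database writes are left out.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"sync"
)

// junitSuite and junitCase cover only the common JUnit XML elements used here.
type junitSuite struct {
	Name  string      `xml:"name,attr"`
	Cases []junitCase `xml:"testcase"`
}

type junitOutcome struct {
	Message string `xml:"message,attr"`
}

type junitCase struct {
	Name    string        `xml:"name,attr"`
	Time    float64       `xml:"time,attr"`
	Failure *junitOutcome `xml:"failure"`
	Error   *junitOutcome `xml:"error"`
	Skipped *junitOutcome `xml:"skipped"`
}

// TestResult is a hypothetical row shape for a single test unit.
type TestResult struct {
	Suite, Name, Status string
	Seconds             float64
}

// parseJUnit turns one JUnit XML report into individual test results.
func parseJUnit(doc []byte) ([]TestResult, error) {
	var suite junitSuite
	if err := xml.Unmarshal(doc, &suite); err != nil {
		return nil, fmt.Errorf("parse JUnit XML: %w", err)
	}
	results := make([]TestResult, 0, len(suite.Cases))
	for _, c := range suite.Cases {
		status := "passed"
		switch {
		case c.Failure != nil:
			status = "failed"
		case c.Error != nil:
			status = "error"
		case c.Skipped != nil:
			status = "skipped"
		}
		results = append(results, TestResult{suite.Name, c.Name, status, c.Time})
	}
	return results, nil
}

// processBatch parses every report in its own goroutine. Each goroutine writes
// only to its own slice index, so no mutex is needed.
func processBatch(reports [][]byte) ([][]TestResult, []error) {
	results := make([][]TestResult, len(reports))
	errs := make([]error, len(reports))
	var wg sync.WaitGroup
	for i, r := range reports {
		wg.Add(1)
		go func(i int, r []byte) {
			defer wg.Done()
			results[i], errs[i] = parseJUnit(r)
		}(i, r)
	}
	wg.Wait()
	return results, errs
}

func main() {
	report := []byte(`<testsuite name="api"><testcase name="TestLogin" time="0.12"/><testcase name="TestPost" time="1.5"><failure message="timeout"/></testcase></testsuite>`)
	results, errs := processBatch([][]byte{report})
	fmt.Println(results[0], errs[0])
}
```

Writing each goroutine's output to a fixed index keeps results in message order and makes per-message errors easy to report without failing the whole batch.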
Notifying users via Slack and classifying pipeline failures
Upon receiving a failed job on a master branch build, Phaser sends the failed job information to our Slack bot through a completely non-blocking Go channel. The bot uses the git commit information to get the developer’s email, which is then matched to their Slack user. Phaser then sends an interactive Slack message with a summary of the error, a link to the job, and the option to classify the error. We use this data to identify common errors and specific pain points that we can tackle to make the overall experience better for our developers.
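The post does not say how the channel is kept non-blocking. One common way to get that behavior in Go is a buffered channel combined with a `select` that has a `default` case, sketched below. The `Notifier` and `FailedJob` names are hypothetical, and dropping a notification when the buffer is full is an assumption about acceptable behavior, not something the original states.

```go
package main

import "fmt"

// FailedJob is a hypothetical summary of a failed master-branch build.
type FailedJob struct {
	Job    string
	Commit string
	URL    string
}

// Notifier hands failed jobs to a Slack worker over a buffered channel.
type Notifier struct {
	queue chan FailedJob
}

// NewNotifier creates a notifier whose channel can hold `buffer` pending jobs.
func NewNotifier(buffer int) *Notifier {
	return &Notifier{queue: make(chan FailedJob, buffer)}
}

// Enqueue attempts to queue the job without ever blocking the caller.
// It reports false when the buffer is full and the job was not queued.
func (n *Notifier) Enqueue(j FailedJob) bool {
	select {
	case n.queue <- j:
		return true
	default:
		return false
	}
}

// Jobs exposes the receive side for the goroutine that talks to Slack.
func (n *Notifier) Jobs() <-chan FailedJob {
	return n.queue
}

func main() {
	n := NewNotifier(2)
	for _, job := range []string{"build-1", "build-2", "build-3"} {
		ok := n.Enqueue(FailedJob{Job: job})
		fmt.Printf("%s queued=%v\n", job, ok)
	}
}
```

With this shape, a slow or unavailable Slack API can never stall message processing; the trade-off is that notifications may be dropped under sustained backlog, so the return value is worth logging.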
Visualizing the data
We developed an internal API and UI platform, used by both our team and developers, that showcases metrics for all pipelines. Each individual pipeline has its own sub-platform with a variety of metrics, such as build times, longest tests, most failing tests, merge freezes, and job stability.
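To give a sense of how a metric like "most failing tests" could be derived from stored test results, here is a small sketch. It works on an in-memory slice purely for illustration; the `TestRun` and `FailureCount` types are hypothetical, and a real implementation would more likely push the aggregation into a database query.

```go
package main

import (
	"fmt"
	"sort"
)

// TestRun is a hypothetical record of one execution of one test.
type TestRun struct {
	Test   string
	Passed bool
}

// FailureCount pairs a test name with how many times it failed.
type FailureCount struct {
	Test     string
	Failures int
}

// mostFailing returns up to n tests ordered by failure count, highest first.
// Ties are broken alphabetically so the output is deterministic.
func mostFailing(runs []TestRun, n int) []FailureCount {
	counts := map[string]int{}
	for _, r := range runs {
		if !r.Passed {
			counts[r.Test]++
		}
	}
	out := make([]FailureCount, 0, len(counts))
	for test, c := range counts {
		out = append(out, FailureCount{Test: test, Failures: c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Failures != out[j].Failures {
			return out[i].Failures > out[j].Failures
		}
		return out[i].Test < out[j].Test
	})
	if n >= 0 && n < len(out) {
		out = out[:n]
	}
	return out
}

func main() {
	runs := []TestRun{
		{"TestLogin", false}, {"TestLogin", true}, {"TestLogin", false},
		{"TestPost", false}, {"TestFeed", true},
	}
	for _, fc := range mostFailing(runs, 2) {
		fmt.Printf("%s failed %d time(s)\n", fc.Test, fc.Failures)
	}
}
```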
Our focus with Phaser was to build a unified platform for monitoring all of our pipelines, with the ultimate goal of finding ways to improve the developer experience. Together, Go and AWS allowed us to develop a concurrent, fault-tolerant system. Slack gave us an easy and convenient way to deliver important information directly to developers, and gave developers a way to respond.
About the Author
Elad Michaeli is a Software Developer Co-op at Hootsuite. Elad studies Computer Science at the University of British Columbia. Connect with him on LinkedIn.