Improving your flaky builds with data

Reuben Sutton
Published in Buildtrakr · 3 min read · Jun 4, 2020

At Buildtrakr, we speak to engineering leaders constantly and we hear a lot about how they define success, what they need to know to become high performing, and what keeps them up at night.

Nearly everyone we speak to defines success as predictably releasing great software to their customers, and their biggest fear is that they have an unknown problem forming that will prevent them from being able to release frequently.

One of these hidden problems is a high rate of failed builds. When your build process is flaky, it's very hard to keep track of an architecture with tens or even hundreds of independent repositories all building continuously. Flaky build processes are toxic to teams.

In the early days of a flaky build process, the developers on the team will swarm on the build to try to get it to succeed, taking their time away from delivering valuable work. After a while, though, that motivation begins to ebb away, the time it takes to get a working master branch back creeps up, and it stays that way until the team invests in improving the codebase and the build pipeline.

There are few things worse for a team's morale than being hindered by processes that hold them back from hitting their targets. That leaves us with the question of how you know which builds are your highest priority to fix, and when you need to invest the time in improving them.

Rate of failure is creeping up and there are hundreds of failed builds

Above are a couple of metrics from Buildtrakr showing a commercial open source project. It looks terrible, but rates like these are more common than you might think. Can you imagine having to work in an environment where, in a good week, builds fail 20% of the time, and in a bad week it's 50%?

Clearly, this would be enough to encourage you to take action — but what do you do? There’s not enough information in the overall number. So, let’s try breaking the build failures down by branch.
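
If you'd like to sanity-check the same number yourself, here's a minimal sketch of the per-branch breakdown, assuming you've already exported your build history from your CI provider into simple records (the field names here are illustrative, not any particular provider's API):

```python
from collections import Counter

# Minimal sketch of the per-branch breakdown. It assumes you've exported
# your build history into records with "branch" and "status" keys; these
# field names are illustrative, not a real CI provider's response shape.
builds = [
    {"branch": "master", "status": "success"},
    {"branch": "dev", "status": "failed"},
    {"branch": "dev", "status": "success"},
    # ... the rest of your exported build history
]

totals = Counter(b["branch"] for b in builds)
failures = Counter(b["branch"] for b in builds if b["status"] == "failed")

# Failure rate per branch, worst first.
rates = {branch: failures[branch] / totals[branch] for branch in totals}
for branch, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{branch}: {rate:.0%} of builds failed")
```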

The dev branch is failing 35% of the time!

We can see that the dev branch is far more broken than master — but in a git-flow environment, dev is where the most active work is happening and master is a “release” branch. Let’s dig a little further into what’s going on in this dev branch by breaking down individual pipelines in this repository.
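
If you're doing this by hand instead of in a dashboard, the same grouping works one level down: filter to the branch you care about and count failures per workflow or job. A rough sketch, again with assumed field names:

```python
from collections import Counter

def failure_rate_by_workflow(builds, branch="dev"):
    """Failure rate per workflow/job on one branch.

    `builds` is the same kind of exported record list as in the sketch
    above, here assumed to also carry a hypothetical "workflow" field.
    """
    on_branch = [b for b in builds if b["branch"] == branch]
    totals = Counter(b["workflow"] for b in on_branch)
    failures = Counter(b["workflow"] for b in on_branch if b["status"] == "failed")
    return {w: failures[w] / totals[w] for w in totals}

# For the project in the figures, the result would look something like
# {"macos_jobs": 0.68, "linux_jobs": 0.12, ...} (illustrative numbers).
```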

macos_jobs fails nearly 68% of the time!

So, now, with a few clicks, we know which pipeline is the troublesome one and how bad the problem is, and we can start digging into where these failures are happening.

See a heat chart of flaky files

Once the team starts working on the problem, we can also tip them off to sensible places to start refactoring the codebase. Here you can see a few key code hotspots: files that likely have fragile test coverage, are in desperate need of refactoring, or both.
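
If you wanted to approximate that kind of heat map yourself, one simple approach is to count how often each file shows up in failed builds, for example the files touched by the failing commit or named in the failing test report. A rough sketch, with the record shape again assumed:

```python
from collections import Counter

def file_hotspots(failed_builds, top_n=10):
    """Rank files by how often they appear in failed builds.

    Each record is assumed to carry a hypothetical "files" list, e.g. the
    files changed in the failing commit or named in the failing test report.
    """
    counts = Counter(path for b in failed_builds for path in b.get("files", []))
    return counts.most_common(top_n)

# Illustrative records, not data from the project in the figures:
failed = [
    {"files": ["Sources/Networking/APIClient.swift", "Tests/APIClientTests.swift"]},
    {"files": ["Tests/APIClientTests.swift"]},
]
for path, hits in file_hotspots(failed):
    print(f"{hits:3d}  {path}")
```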

Hopefully, this inspires you to find out what your build failure rate looks like, and to start improving it!

If you use CircleCI, you can give Buildtrakr a try at https://www.buildtrakr.com. It comes with a 30-day free trial, and is free forever for teams of up to 10 developers.

I’d be more than happy to discuss anything in this post — you can reach me at reuben@buildtrakr.com
