Use a repository as your CI database

After your CI server builds your product, what does it do with the build artifacts? In my experience, it stores them on a disk somewhere, using some ad-hoc version scheme to distinguish between different builds. It might also publish them to some artifact server, which again uses some ad-hoc version scheme to allow users and developers to download different versions.

I suggest that the CI server should instead use a repository for versioning and distribution of build artifacts. Specifically, it should do something like this:

  1. Check out source: git checkout e3125fa.
  2. Run build system: ./configure && make foobar.
  3. Commit build artifacts: git add -A && git commit. (Might need to override .gitignore files.)
  4. Give the build an ID: git tag build/610.
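  5. Publish the build to an artifacts server: git push artifacts build/610. (Pushing the tag also sends the build commit it points to. Step 1 leaves HEAD detached, so there is no branch for a plain git push to publish.)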
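The five steps above can be sketched as one shell function. This is only a rough sketch under some assumptions of mine: the function name ci_build, the BUILD_CMD variable, the build.log file name and the CI identity are all made up for illustration, and an artifacts remote is assumed to exist already.

```shell
# Hypothetical helper: build one source commit and publish it as build/<id>.
# BUILD_CMD, build.log and the "CI server" identity are illustrative assumptions.
ci_build() {
    src_commit=$1    # e.g. e3125fa
    build_id=$2      # e.g. 610
    build_cmd=${BUILD_CMD:-"./configure && make foobar"}

    # 1. Check out the source commit with a detached HEAD, so no branch moves.
    git checkout --quiet --detach "$src_commit" || return 1

    # 2. Run the build system, keeping its output as an artifact too.
    sh -c "$build_cmd" > build.log 2>&1
    build_status=$?

    # 3. Commit the artifacts; --force also adds files matched by .gitignore.
    git add --all --force . || return 1
    git -c user.name="CI server" -c user.email="ci@example.invalid" \
        commit --quiet -m "Build $build_id of $src_commit (exit $build_status)" || return 1

    # 4. Give the build an ID.
    git tag "build/$build_id" || return 1

    # 5. Publish the tag; the build commit it points to is sent along with it.
    git push --quiet artifacts "refs/tags/build/$build_id" || return 1

    # Report the build's own exit status to the caller.
    return "$build_status"
}
```

A failed build is committed and tagged as well, so that its log can later be compared with a successful one. Called as ci_build e3125fa 610, the function returns the exit status of the build command.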

The advantages of this approach are several:

  • Easy checkout of builds. Run git fetch artifacts tag build/609 to fetch a particular build along with its tag.
  • Every build identifies the commit it was built from. It’s just the parent commit! You don’t even have to check out the source commit — the source is there in the build commit.
  • Easy comparison of different builds. Run git diff build/609 build/610. To compare your local build to a CI build, run git diff build/609. To find out where a failed build went wrong, git diff the failed build log against the last successful build log.
  • Efficient storage of build artifacts. Your CI server probably stores ‘build 609’ in a directory next to ‘build 610’, even though 99% of the content of those builds is identical. Using git’s content-addressable file store and intelligent pack files, de-duplication and compression are automatic. You might find that you have to delete fewer old builds due to disk space problems.
  • Easy signed build artifacts, and other metadata. Just use a signed tag! All tag/commit properties are available; e.g. set the commit Author to the CI server to give your builds a provenance.
  • Simplicity! Sometimes it feels like our CI servers and repositories are stuck together with tape and string. If you use Git for storing builds, why not store all other CI configuration in Git — Git becomes the CI database, and the CI server becomes completely stateless!
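A few of these advantages boil down to one-line git commands. Here is a hedged sketch of some helpers; the function names and the build.log file name are my own assumptions, not part of any existing tool.

```shell
# Hypothetical helpers around build/<id> tags; the names are illustrative only.

# Source commit a build was made from: the build commit's parent.
build_source() {
    git rev-parse --verify --quiet "build/$1^"
}

# Names of the artifacts that differ between two builds.
build_changed_files() {
    git diff --name-only "build/$1" "build/$2"
}

# Diff only the build log between two builds, e.g. last good vs. failed.
# BUILD_LOG defaults to build.log, which is an assumed file name.
build_log_diff() {
    git diff "build/$1" "build/$2" -- "${BUILD_LOG:-build.log}"
}
```

For example, build_log_diff 609 610 shows how the log of build 610 departs from that of build 609. For signed builds, git tag -s could replace a plain git tag when the tag is created, assuming the CI server has a signing key configured.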

(Really, I’m just re-hashing the advantages of git, but applied to build artifacts rather than build sources. So feel free to extend this list of advantages using your preferred git evangelism arguments.)

‘I was told not to commit build artifacts!’

Yeah. When people say that, they’re referring to a scheme where your build directory is tracked just like the source directory, and prior to committing something, you first run the build system. And indeed this is a bad idea. It’s bad because it forces you to run the build system for every commit, which interrupts your workflow. It’s bad because in reality people don’t run the build system for every commit, so you end up with lots of commits where the source and build directories are in inconsistent states. It’s bad because it forces developers to keep every build artifact on their local machine. And it’s bad because it presupposes that for each given source state, there exists one exact deterministic build output, but in reality this is not the case.

What I’m suggesting is only a slight variant of this scheme, but it does not suffer from the problems associated with it. It doesn’t require you, the developer, to run the build system for every commit; the CI server does it. It doesn’t require you to remember to do any building; the CI server does it. It doesn’t require you to store all build artifacts on every development machine; developers can pull down whichever build commits they want. And it doesn’t presuppose deterministic builds; you can easily have two build commits, build/609 and build/610, which have the same parent commit.

Despite the weaknesses of the “commit the build artifacts with the source” scheme, I see lots of projects using it — particularly front-end web libraries. I think they do this because it provides an easy distribution mechanism — users just grab a compiled, minified version straight from the source. I think my scheme provides this simplicity without the significant disadvantages.

‘But my artifact is just a big binary blob!’

No it isn’t. You have a test suite, right? The test results are also build artifacts, and so you are committing those too. The test results are effectively a textual, static representation of your binary program. The test results express the behavior of your program in all its facets. A diff between test results, then, perfectly expresses changes in the behavior of your program. And you probably have other textual build artifacts — the build log, your generated documentation, et cetera.

Tracking multiple builds per commit

You ran several builds for commit e3125fa. These are tagged build/610, build/623, build/645, and build/646. How do we make it more obvious that these builds are all built against the same commit? A simple way would be to change the naming scheme. The incrementing numeric scheme that our traditional CI servers use could be made more semantic. How about instead tagging them as build/e3125fa/1 through build/e3125fa/4?
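One way to generate such names is to count the tags that already exist for the commit. The sketch below assumes a 7-character abbreviated hash and a hypothetical function name, next_build_tag. It ignores races between concurrent builds, so treat it as an illustration rather than a finished scheme.

```shell
# Hypothetical helper: print the next free build/<short-sha>/<n> tag name.
next_build_tag() {
    # Abbreviate the source commit (git lengthens it if 7 digits are ambiguous).
    short=$(git rev-parse --verify --short=7 "$1^{commit}") || return 1

    # Count the builds already tagged for this commit.
    n=$(git tag --list "build/$short/*" | wc -l)

    echo "build/$short/$((n + 1))"
}
```

The CI server would then tag the build commit with something like git tag "$(next_build_tag e3125fa)".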

Tracking builds for a source branch

We might also want to group our builds according to which source branch they were built from. Can we have a build/master branch which tracks the builds for the master branch? Sure: every time a build from the master branch is ‘successful’, update the build/master branch to point to its build commit. Now your application servers can just pull from the build/master branch as part of your deployment process. This works just as well for ‘compiled’ applications as for ‘interpreted’ applications. There is a slight snag here: updates to build branches are not fast-forwards, since each build commit branches off from the mainline source branch. You might solve this by constructing new commits on the build branch so that they have build commits as parents as well as source commits; that is, build/master repeatedly merges from master and commits the updated build artifacts.
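The merge-style build branch could be constructed with git plumbing rather than a real git merge. This is a minimal sketch of that idea, assuming builds are tagged build/<id> as above. The function name advance_build_branch and the CI identity are hypothetical.

```shell
# Hypothetical helper: advance build/master to a commit that carries the
# artifacts of the given build and has both the previous build-branch tip
# and the source commit as parents, so every update is a fast-forward.
advance_build_branch() {
    build_tag=$1                  # e.g. build/610
    branch=${2:-build/master}

    tree=$(git rev-parse --verify "$build_tag^{tree}") || return 1
    src=$(git rev-parse --verify "$build_tag^") || return 1   # source commit

    if prev=$(git rev-parse --verify --quiet "refs/heads/$branch"); then
        # Parents: previous build-branch tip first, then the source commit.
        new=$(git -c user.name="CI server" -c user.email="ci@example.invalid" \
            commit-tree -p "$prev" -p "$src" -m "Artifacts of $build_tag" "$tree") || return 1
    else
        # First build on this branch: only the source commit is a parent.
        new=$(git -c user.name="CI server" -c user.email="ci@example.invalid" \
            commit-tree -p "$src" -m "Artifacts of $build_tag" "$tree") || return 1
    fi

    git update-ref "refs/heads/$branch" "$new"
}
```

Using git commit-tree avoids touching the working tree, and the resulting commit has exactly the tree of the build it records. Whether the numbered build tags should then point at these new commits instead is a design choice I leave open.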

Tracking deployments with branches

Say you have two application servers. How do you track which version of your application they are running? A dead simple way is to have two branches, deploy/app1 and deploy/app2. When the CI server deploys the latest build/master to app server 1, it fast-forwards the deploy/app1 branch to that commit. Obvious? Seems like a perfect use for branches and I’m surprised I don’t see everyone doing it.
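Recording a deployment then amounts to moving one ref. Here is a small sketch, assuming the deploy/<server> naming from above. The function name record_deploy is hypothetical, and the actual deployment step is left out.

```shell
# Hypothetical helper: fast-forward deploy/<server> to the deployed build.
record_deploy() {
    server=$1                      # e.g. app1
    build_ref=${2:-build/master}

    new=$(git rev-parse --verify "$build_ref^{commit}") || return 1
    ref="refs/heads/deploy/$server"

    # If the branch already exists, only allow fast-forward updates.
    if old=$(git rev-parse --verify --quiet "$ref"); then
        if ! git merge-base --is-ancestor "$old" "$new"; then
            echo "refusing non-fast-forward update of deploy/$server" >&2
            return 1
        fi
    fi

    git update-ref "$ref" "$new"
}
```

After record_deploy app1, running git rev-parse deploy/app1 tells you which build app server 1 is on. Because this sketch refuses anything that is not a fast-forward, a rollback to an older build would need a separate, deliberate path.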

Thoughts?

Problems you anticipate with this? Suggestions for other schemes? Are people already doing something like this? Let me know!

Thanks to Arialdo Martini for the sensible suggestion of using a separate remote to categorize ‘artifact’ commits as distinct from source commits.