Speeding up CircleCI Checkouts in a Monorepo
An investigation into the depths of git in 5 easy steps!
This is part of a blog series on managing monorepos. If you haven’t already, check out some of our previous posts on the subject. More to come in the next few weeks!
We’re pretty heavy users of the monorepo git repository design at Compass. With over 350 engineers collaborating on various parts of our stack, monorepos help us maintain high code quality and good standards across the organization.
However, as the number of engineers (and the length of git history) has scaled, we started to notice that some parts of our infrastructure were getting a bit creaky and slow. Such was the case with cloning our Frontend repository via CircleCI.
Slow your (checkout) roll
CircleCI is a very popular CI/CD platform which we use at Compass to run our suite of automated continuous integration jobs. The setup is conceptually simple: each job starts up in an isolated environment, checks out the code it needs from your git repository, runs your suite of tests, and then reports back on success or failure. A simple configuration might look something like this:
jobs:
  test:
    steps:
      - checkout
      - run_tests
The benefits of having a completely fresh and isolated environment each time a job runs are many. But one thing we noticed is that over time, each job started taking longer and longer.
After some investigation, we noticed that the culprit was the checkout step, which was now taking upwards of 3 minutes (!) and downloading 2.3GB (!!!) for every job. That’s 3 minutes a developer is waiting around before their tests can even start running. Test jobs are usually under 1 minute, so checking out the repo was at times accounting for 75% of the execution time of the job!
It was time to dig in.
Step one: be wrong at least once
Our first reaction was to blame our monorepo. Surely the fact that this git repository hosts all of our Frontend Node.js and JavaScript code was to blame. All of that code must add up!
A fresh clone of our repo locally confirmed what we were seeing in CircleCI: we were pulling down over 2.3GB of data when cloning the repo:
Cloning into 'frontend-test'...
remote: Enumerating objects: 473, done.
remote: Counting objects: 100% (473/473), done.
remote: Compressing objects: 100% (158/158), done.
remote: Total 1765770 (delta 356), reused 347 (delta 314), pack-reused 1765297
Receiving objects: 100% (1765770/1765770), 2.37 GiB | 5.66 MiB/s, done.
Resolving deltas: 100% (1337187/1337187), done.
Updating files: 100% (37753/37753), done.
Had we really allowed our repo to get that large? Fearing the worst, we quickly ran the du command to display disk usage in our freshly cloned repo (which won’t have things like node_modules/ throwing off the stats yet):
$ cd frontend-test/
$ du -hs * | sort -hr
710M apps
188M packages
4.6M scripts
1.1M lambdas
52K codecov.yml
28K deploy-config
12K README.md
But something was (quite literally) not adding up here. According to these stats, the repo was around 900MB (still large, but way more manageable). A quick follow-up command gave us a clue:
$ du -hs .git
2.4G .git
Aha! So over 60% of what we were pulling down was not files currently being worked on in the master branch, but data related to the git repository itself.
This must mean we had some large files somewhere in git history.
Step two: be wrong again!
Maybe if we looked through the git history and deleted some of the larger files, we could get the size down, and therefore the checkout time down.
With a new hypothesis in hand, we set about finding what we could safely delete from git history that was old enough that it wouldn’t be relevant anyway. We reached for the wonderful git-filter-repo tool and had it run an analysis of our repo:
$ cd frontend
$ git-filter-repo --analyze
Writing reports to .git/filter-repo/analysis... done.
Unfortunately, the reports didn’t line up with our hypothesis. The largest file in our entire git history was only maybe 40MB, and the maximum file sizes dropped off pretty quickly from there. It certainly didn’t add up to the extra difference we were seeing between our checked-in files and the .git/ directory.
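For anyone wanting to double-check this kind of result without extra tooling, raw git can produce a similar largest-blob report. This is a sketch, run from inside the repo:

```shell
# List the 20 largest blobs reachable from any ref.
# rev-list emits every object (with its path for blobs/trees);
# cat-file annotates each with its type and size; we keep only
# blobs and sort by size, largest first.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $3, $4 }' |
  sort -rn |
  head -20
```

In our case this told the same story as git-filter-repo: no single file came close to explaining the extra gigabytes.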
Step three: just… try things?
If you’ve made it this far you’ve probably gotten here while yelling at your screen: “Why didn’t you just do a shallow clone?!?!”. Well, I’m here to tell you why!
A git shallow clone checks out the repo to only a certain depth in git history. For example, if you just want the files from the tip of your repo, you can provide a depth of 1:
git clone git@github.com:some-repo.git --depth=1
After your repo is checked out, if you run a git log you’ll notice that you only have 1 commit in your history. Problem solved, right?
Unfortunately, not without some additional pain. Our monorepo design isolates all of our apps and packages into their own directories, so we run git log commands limited to just those directories to decide some pretty important things (like what/when to publish or deploy). The --depth flag when cloning a repo doesn’t let us say “we want a shallow depth, but only for this one folder!”. It’s repo-wide, so there’s no number we could provide that would ensure we’d have full git history for a given sub-directory.
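For context, the kind of per-directory history query our tooling depends on looks roughly like this (the directory and tag names here are illustrative, not our real ones):

```shell
# Most recent commit that touched a single app; publish/deploy
# decisions are made from queries like this. Under a shallow clone
# with --depth=1, this would only ever see the one fetched commit.
git log -1 --format='%H %ci' -- apps/app-one/

# Everything that changed in a package since a given tag or commit
git log --oneline v1.2.3..HEAD -- packages/some-package/
```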
But, we could at least try it to see what our gains would theoretically be…
Cloning into 'frontend-test'...
remote: Enumerating objects: 40610, done.
remote: Counting objects: 100% (40610/40610), done.
remote: Compressing objects: 100% (34133/34133), done.
remote: Total 40610 (delta 8511), reused 21245 (delta 4300), pack-reused 0
Receiving objects: 100% (40610/40610), 530.50 MiB | 4.90 MiB/s, done.
Resolving deltas: 100% (8511/8511), done.
Updating files: 100% (37753/37753), done.
Huzzah! This time we only pulled down 530MB and completed in around 45 seconds! While this couldn’t be a full solution to our problem, it at least confirmed our suspicions: something about our git history was clogging up the works. It also gave us a target best-case scenario.
Step four: Google all the things
A quick Google and Stack Overflow search led us to sparse checkouts. Built in to git as of 2.25.0 (and available in slightly older versions via a different API), a sparse checkout seemed tailor-made to fix our problem. It allows checking out specific directories in a repository while maintaining their git history. Since our monorepo is designed in such a way that we enforce encapsulation and isolation of our apps and packages, we could simply check out one app or package (plus the scripts folder) and be able to run our entire CI/CD pipeline as-is.
The problem, of course, remains that we need to figure out which folders to check out before the clone, but that’s a problem for Tomorrow Joe. Today Joe is happy to just try something new:
$ mkdir frontend-test && cd frontend-test
$ git init
$ git sparse-checkout init --cone
$ git sparse-checkout set scripts/ apps/app-one
# The -f flag performs a fetch immediately after adding the remote
$ git remote add -f origin "your-git-remote-here.git"
$ git pull origin master
Not exactly a one-liner like git clone, but we can always turn it into a script later if we need to. Now to bask in the glory of our success:
Updating origin
remote: Enumerating objects: 370, done.
remote: Counting objects: 100% (370/370), done.
remote: Compressing objects: 100% (33/33), done.
remote: Total 1763309 (delta 354), reused 338 (delta 337), pack-reused 1762939
Receiving objects: 100% (1763309/1763309), 2.37 GiB | 5.03 MiB/s, done.
Resolving deltas: 100% (1335565/1335565), done.
Well crud, we’re back to where we started: 2.37GB again. On the bright side, once pulled down we did only end up with those two directories on disk, but the entire problem we were hoping to solve (reducing the size of the download) was still a problem while doing a sparse checkout.
Step five: If you try sometimes, you’ll find, you get (just) what you need
At this point, after Googling some more to no avail, I stopped for a second to think about the actual problem at hand. The issue was that we were pulling down too much git data with the default checkout command. If we limited it to a shallow clone, we were able to cut that data down, but lost git history. But a sparse checkout didn’t help, meaning it wasn’t the history of individual directories that was causing the extra data.
We needed to better understand why shallow clones helped when sparse checkouts did not. So I pulled up the git-clone documentation and got to reading, and almost immediately a sentence in the description of the --depth flag caught my eye:

“Implies --single-branch unless --no-single-branch is given to fetch the histories near the tips of all branches.”
And that’s when it hit me like a bolt of lightning: branches. You see, not only do we operate a monorepo, we also use Pull Request branches (instead of forks) when doing our code review. Developers create branches for their own work, then push those local branches to remote branches, where it’s reviewed and merged into master.
Why is this relevant? Well, it turns out that by default, a git clone pulls down all of the commit data on every branch of the remote. We have over 150 developers actively working on this repo, and a current count of about 2,500 (!!!) remote branches and over 25,000 tags (!!!!!!). I think we just found our extra 1.5GB of data. Our CI/CD pipeline only cares about the current branch it’s testing, leaving 2,499 irrelevant branches being pulled down.
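If you want to see the equivalent numbers for your own repo, git can report them without cloning anything (a sketch; substitute your own remote):

```shell
# Count the branches and tags advertised by the remote.
# ls-remote only reads the ref advertisement; no objects are fetched.
git ls-remote --heads origin | wc -l   # remote branches
git ls-remote --tags origin | wc -l    # tags
```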
So I immediately tried the flag git was warning me about, git clone --single-branch, and what do you know?!
Cloning into 'frontend-test'...
remote: Enumerating objects: 65848, done.
remote: Counting objects: 100% (65848/65848), done.
remote: Compressing objects: 100% (5536/5536), done.
remote: Total 557681 (delta 63786), reused 60313 (delta 60312), pack-reused 491833
Receiving objects: 100% (557681/557681), 938.72 MiB | 5.54 MiB/s, done.
Resolving deltas: 100% (414913/414913), done.
Updating files: 100% (37754/37754), done.
Now we’re talking! We’re pulling down just 938MB this time, and the entire clone completed in around 50–55 seconds. Not quite as good as the 530MB/45s from doing a shallow clone, but this is real progress!
All that was left was adding a custom checkout command to our CircleCI config file that ran the following script:

branch="${CIRCLE_BRANCH:-master}"
sha="${CIRCLE_SHA1:-master}"

mkdir -p "${CIRCLE_PROJECT_REPONAME}"
cd "${CIRCLE_PROJECT_REPONAME}"

git init
git remote add origin "${CIRCLE_REPOSITORY_URL}" -t master -t "${branch}" -f --no-tags

if [ -n "${CIRCLE_TAG}" ]; then
  git fetch --tags
  git checkout -q "${CIRCLE_TAG}"
elif [ -n "${branch}" ]; then
  git checkout -B "${branch}" "${sha}"
fi

git reset --hard "${sha}"
You’ll notice a few small changes. First, we were actually unable to use the --single-branch flag directly. The reason is it works exactly as it says on the tin: it pulls down a single branch. The problem is that when we’re pulling down a non-master branch, we want to do some git diff operations against master so that we know what’s changed in the branch. So by using the -t flag on git remote add, we were able to specify two branches: the current branch we’re checking out, and the master branch. You’ll also notice we added the --no-tags flag. We have an incredible number of tags (the repo is 3 years old, and we’ve made a lot of releases in that time), so we don’t pull them down for every job. Jobs that need tags just run git fetch --tags at the start.
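As an example of why master needs to be fetched alongside the job’s branch, downstream steps run diffs along these lines (the paths here are illustrative) to work out which apps and packages actually changed:

```shell
# Files changed on this branch relative to where it diverged from
# master. The three-dot form diffs against the merge base rather
# than master's tip, so unrelated commits on master don't show up.
git diff --name-only origin/master...HEAD -- apps/ packages/
```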
This simple change has provided enormous upside for our entire organization. First of all, developers spend way less time waiting around. A quick look at our stats for the previous month showed we ran 64,621 jobs that month for just our frontend repo. If we do some quick back-of-the-envelope math, assuming around 30 days in a month and around 100 seconds saved per job:
64621 jobs / 30 days * 100 seconds / 60 = ~3590 minutes
We come out to a whopping 3,590 minutes saved per day! Obviously a lot of these jobs run in parallel, but still an impressive metric.
And it’s not just developer productivity we’re saving here. CircleCI’s billing model is based on credits, which is heavily affected by the amount of time your jobs take to run (among other things), so by saving on job build-time we’re also saving the company a lot on our billing.
If you’ve got a git repository that’s worked on by a large team, give this a try. You might be surprised how much it can save you. And if you’re looking for a fun place to work tackling interesting challenges, let us know, we’re hiring!