Implementing “build from source” with a mono-repo vs a poly-repo

Emmanuel Debanne · Criteo Tech Blog · Mar 31, 2020
Mono-arch bridges can require a lot of tooling (source)

Introduction

In a previous article, we explained that the real need behind mono-repos is to implement the pattern we named “build from source”. However, one could argue that a mono-repo provides an obvious and straightforward implementation of this pattern. We are going to compare the impact of both organizations of the code, mono-repo vs poly-repo, when implementing “build from source”: first in terms of features, then in terms of implementation cost.

NB1: Please keep in mind that the views expressed here are influenced by the codebase we host at Criteo: a large number of web applications and internal services that allow and require frequent upgrades, organized into hundreds of repositories.

NB2: Part of the material presented here comes from a Meetup event — Continuous Delivery at Criteo — that took place in June 2017.

The comparison regarding the ideal developer experience

Let’s imagine a company with N teams, each one working on a code set. Usually, with multiple repositories, and when not “building from source”, each code set:

  • is “owned” by its team — i.e., only the members of the team decide whether or not to merge commits
  • is reviewed independently from the other code sets, even when a change impacts multiple code sets
  • has a dedicated repository
  • requires an independent commit, even when a change impacts multiple code sets
  • is built independently from the other code sets

This can be summarized with this table:

    Ownership     per team
    Review        independent per code set
    Repository    dedicated per code set
    Commit        independent per code set
    Build         independent per code set

When “building from source” with a straightforward implementation based on a mono-repo, each code set:

  • is owned by the whole company
  • cannot be reviewed independently from the other code sets
  • belongs to a common repository
  • can be impacted via a single commit common to all code sets
  • is built in a common pipeline

This is different from the ideal situation where each code set:

  • is owned by its team
  • can be reviewed independently from the other code sets
  • can be impacted via a single commit (thus atomic) common to all code sets
  • is built in a common pipeline

This ideal situation is not defined by the number of repositories but by some important features:

  • fine-grained ownership/reviews,
  • ability to test and merge a change that impacts several code sets.

At Google, “build from source” is implemented with a mono-repo and a lot of tooling to keep fine-grained ownership, notably per-directory OWNERS files. That leads to a better situation than the straightforward implementation: ownership and reviews stay per team, while the repository, the commit, and the build pipeline stay common.

There is even a tool — named Rosie — to split a big change in the mono-repo into several commits, each one reviewed by the owners of the code it touches.

At Criteo, we kept the multiplicity of repositories and created tools to build them from source in a single pipeline.

Technically, there are N commits, but they are merged and built simultaneously as if they were a single “atomic” commit. The code reviews that are aggregated to simulate a single commit are called “cross-repo reviews”.
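To make this concrete, here is a minimal sketch of such an aggregation in plain Git. The repository names and the review-123 branch convention are hypothetical (this is not Criteo’s actual tooling), and it ignores the review and CI integration that real tooling provides:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: merge a cross-repo review as one "atomic" change.
# Assumes each impacted repo carries a branch named after the review,
# and that CI has already built all the repos together.
set -euo pipefail

REVIEW="review-123"                 # hypothetical review identifier
REPOS=("repo-a" "repo-b" "repo-c")  # repos impacted by the change

# 1. Verify that every review branch is a fast-forward of master,
#    so that no merge can fail halfway through step 2.
for repo in "${REPOS[@]}"; do
  git -C "$repo" fetch origin "$REVIEW"
  git -C "$repo" merge-base --is-ancestor "origin/master" "origin/$REVIEW" \
    || { echo "$repo: $REVIEW is not up to date with master" >&2; exit 1; }
done

# 2. Only then advance master everywhere, so the N commits land together.
for repo in "${REPOS[@]}"; do
  git -C "$repo" push origin "origin/$REVIEW:master"
done
```

Even then, the pushes of the second step are not truly transactional: a real implementation must also handle the case where one of them fails halfway through.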

So both a mono-repo and a poly-repo organization of the code allow getting close to the ideal situation we described when “building from source”.

The comparison regarding the implementation cost to “build from source”

What about the implementation cost? Contrary to a preconceived opinion, the mono-repo implementation does not come for free: it requires tooling, just as the poly-repo one does.

Over the last year, several Medium articles have warned about the risk of underestimating these costs. The article Monorepos: Please don’t! reminded us that the “theoretical monorepo benefits […] cannot be achieved without polyrepo-style tooling”. In Effective meta-repo CI/CD pipelines, we were warned that “most of us aren’t Google and tooling is just not there yet”. The same point of view is shared in Death to the Angular Mono-Repo: “for everyone [but Google, mono-repo] is simply not scalable; a house of cards waiting to collapse”.

It is indeed important to warn software engineers that they might run into issues when applying a model designed for a codebase and a headcount typically 1,000 times larger than their own.

Let’s list the tooling required to implement “build from source” and to make it as comfortable to work with as “building with versioned artifacts”. The items are listed in decreasing order of importance. For each one, we compare the mono-repo and poly-repo implementations.

Partial checkout/build of one or more code sets

In the case of a mono-repo, you will have to implement what Google calls “sparse checkout”: a way to avoid checking out and building everything from source. As for the poly-repo structure, you will have to implement the aggregation of the checked-out repositories. Note that this is not just about which part of the source code to check out, but also about the way internal dependencies are declared and resolved at build time. Being able to switch between dependencies built locally from source and dependencies built in the CI is certainly the most complex part.
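As a rough illustration of both flavors, here is a minimal sketch with stock Git; the URLs, paths, and manifest format are made up:

```bash
# Mono-repo: "sparse checkout" of only the code sets you work on
# (the git sparse-checkout command ships with Git 2.25+).
git clone --no-checkout https://example.com/mono-repo.git
cd mono-repo
git sparse-checkout init --cone
git sparse-checkout set services/billing libs/common
cd ..

# Poly-repo: aggregate the repositories listed in a manifest file
# (here, a plain list of repository names, one per line).
while read -r repo; do
  git clone "https://example.com/$repo.git"
done < manifest.txt
```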

Also, both code organizations will benefit from distributing the build across several machines.

Ownership

Code-review applications usually have a built-in mechanism to manage ownership rules per repository, so in the case of a poly-repo, no additional tooling is required. Applications that also support ownership of sub-directories are less common, so mono-repos will often require some tooling to support ownership.
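When the platform does support it, per-directory ownership can be declared quite simply. For instance, GitHub’s CODEOWNERS file routes reviews to the owning teams (the paths and team names below are made up):

```
# CODEOWNERS file at the root of a mono-repo (GitHub syntax).
# Changes under these paths request a review from the owning team.
/services/billing/  @acme/billing-team
/libs/common/       @acme/platform-team
```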

Links between reviews and commits

In the case of a mono-repo, a single commit must be split into multiple reviews, whereas in the case of a poly-repo, the multiple commits/reviews must be aggregated to simulate a single, atomic commit.

Performance issues on very large mono-repos

Git was designed to manage the repository of the Linux kernel, which is considered big, but some companies have a much larger codebase: the literature gives examples of repositories 100 times bigger. These huge mono-repos can become difficult to maintain (regarding checkout performance, for example). Two companies with very large mono-repos — Microsoft and Google — even had to switch to virtual file systems.

At Microsoft, they used to build Windows from 40 repositories. In 2017, they switched to a mono-repo and introduced the Git Virtual File System:

“Git Virtual File System” (GVFS) enables Git to scale to very large repos by virtualizing both the .git folder and the working directory. Rather than download the entire repo and checkout all the files, it dynamically downloads only the portions you need based on what you use. GVFS relies on a new Windows filter driver (the moral equivalent of the FUSE driver in Linux).
(The largest Git repo on the planet)

At Google, a main mono-repo has been used since the early days of the company. They created their own version control system (VCS) called Piper. The code is accessed remotely via Clients in the Cloud (CitC), a cloud-based storage backend and a Linux-only FUSE file system. Local checkouts are preferred by fewer than 20% of users. The tool-chain is also online. (Cf. Why Google stores billions of lines of code in a single repository.)

The mono-repo of Facebook has a size of the same order of magnitude as the Linux kernel’s, but they still decided, for performance reasons, to switch from Git to Mercurial in 2013.
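Short of a full virtual file system, recent versions of Git also offer lighter mitigations, such as partial clone, which defers downloading file contents until they are needed (the URL below is a placeholder, and the server must support the filter):

```bash
# Partial clone: download commits and trees, fetch file contents lazily.
git clone --filter=blob:none https://example.com/big-mono-repo.git
```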

The following table provides an estimate of the size of the HEAD of the mono-repos we mentioned (approximate figures, as reported in the sources cited above):

    Linux kernel          ~25 million lines of code
    Facebook              same order of magnitude as the Linux kernel
    Microsoft (Windows)   ~3.5 million files, ~300 GB
    Google (Piper)        ~2 billion lines of code in ~9 million source files

Support of open-sourced repositories

Going for a mono-repo creates an uncomfortable situation for code that is open-sourced and thus needs to be published in its own repository. In the case of a poly-repo, an open-sourced repository can simply be one of the multiple repositories and requires no specific tooling.
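On the mono-repo side, a common workaround is to extract the history of the open-sourced sub-directory and push it to the public repository, for instance with git subtree; the paths and URL below are made up:

```bash
# Rewrite the history of the open-sourced sub-directory into its own branch...
git subtree split --prefix=libs/oss-lib -b oss-lib-export
# ...then push that branch to the public repository.
git push git@github.com:acme/oss-lib.git oss-lib-export:master
```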

Secrecy of sensitive data

As with “ownership”, applications for code review or code search rarely allow tuning read access per directory. So if some part of the code requires limited access, it is usually easier to enforce this per repository than per sub-directory.

Summary

The following table recapitulates the required tooling, in decreasing order of importance, to implement “build from source”:

    Required tooling                      Mono-repo                                    Poly-repo
    Partial checkout/build                sparse checkout                              aggregation of repositories
    Ownership                             per-directory tooling needed                 built-in, per repository
    Links between reviews and commits     split one commit into several reviews        aggregate reviews into one atomic commit
    Performance on a very large codebase  virtual file system may be needed            standard VCS performance suffices
    Open-sourced code                     export/sync tooling needed                   none, a repository among others
    Secrecy of sensitive data             per-directory read access rarely supported   built-in, per repository

Conclusion

We have shown that both a mono-repo and a poly-repo can be used to implement the “build from source” pattern. The benefits of a mono-repo become less and less obvious as more teams contribute to the codebase. In both cases, you will need to develop some tooling because, until now, our typical developer tools (Git, Maven, MSBuild, etc.) have mostly ignored the following build patterns:

  • Build from source,
  • Use a unique version for external components.

“Git sub-modules” or “snapshot versions in Maven” can be seen as early, rudimentary attempts in this direction, conceived as afterthought features. Newer tools show a willingness to make these patterns first-class citizens. For example, Gradle comes with composite builds, Bazel provides local repositories and remote builds, and Lerna’s bootstrap command is able to automatically switch from “build from released artifacts” to “build from source”. We can expect build tools to ease the adoption of these patterns more and more.
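As an illustration, here is how the Gradle and Lerna features mentioned above are invoked from the command line (the project path is made up):

```bash
# Gradle composite build: substitute the binary dependency on "my-lib"
# with a build from its local sources, without editing any build script.
gradle build --include-build ../my-lib

# Lerna: wire the repo's packages to one another so they are built from
# source instead of being resolved from the npm registry.
lerna bootstrap
```

Both commands implement the same idea: let the build tool decide, per dependency, whether to consume a released artifact or the sources.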
