Mono-Repos @ Google. Are they worth it?

Google’s Rachel Potvin made a presentation during the @scale conference titled “Why Google Stores Billions of Lines of Code in a Single Repository”.

Given that Facebook and Google have popularised mono-repos recently, I thought it would be interesting to dissect their points of view and try to settle the debate about whether mono-repos are the solution to most of our developer problems.

Unfortunately, the slides are not available online, so I took some notes, which should summarise the presentation.

Rachel starts by discussing a previous job in the gaming industry. She mentions teams working on multiple games in separate repositories, all built on top of the same engine. As you would expect, the different copies of the engine evolved independently, and when features from one game had to be made available in another, the result was a major headache and a painful merge process.

At Google, they’ve had a mono-repo since forever. I recall they were using Perforce, but they have since invested heavily in the scalability of their mono-repo. Rachel goes into some detail about that.

Their repo is huge, and they store documentation, configuration files and supporting data files in it (which all seem OK to me), but also generated source. They must have a good reason to keep generated source in the repo, but in my opinion it is not a great idea: generated files are derived from the source code, so this is just useless duplication and not a good practice.

Rachel then mentions that developers work in their own workspaces (I would assume this is a local copy of the files, in Perforce lingo).

She mentions that the mono-repo is a giant tree, where each directory has a set of owners who must approve any change to it. I would argue that having owners is not in the best interest of shared ownership, so I’m not a fan. Additionally, this is not a direct benefit of the mono-repo, as splitting the code into many repos with different owners would lead to the same result.
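
To make the ownership model concrete, here is a minimal Python sketch of how such an approval check could work. It is my own illustration, not Google's implementation: the `OWNERS` table, the paths, the names, and the rule that owners of a parent directory may also approve changes further down the tree are all assumptions.

```python
from pathlib import PurePosixPath

# Hypothetical ownership table: directory -> owners (made up for this example).
OWNERS = {
    "search": {"alice"},
    "search/indexing": {"bob"},
    "ads": {"carol"},
}

def owners_for(path):
    """Collect the owners of every directory that contains `path`."""
    found = set()
    for parent in PurePosixPath(path).parents:
        found |= OWNERS.get(str(parent), set())
    return found

def is_approved(changed_files, approvers):
    """A change passes if each file has at least one owner among the approvers."""
    return all(owners_for(f) & approvers for f in changed_files)

# A change touching two subtrees needs an owner from each of them.
print(is_approved(["search/indexing/merge.py", "ads/bid.py"], {"alice"}))
print(is_approved(["search/indexing/merge.py", "ads/bid.py"], {"alice", "carol"}))
```

Note that the keys of the table could just as well be repository names, which is why I do not see this as something specific to a mono-repo.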

They also have tests and automated checks which are performed before and after each commit (Yey!)

Custom tools developed by Google to support their mono-repo

  • Piper (custom system hosting monolithic repo)
  • CitC (“Clients in the Cloud”: cloud-based developer workspaces on top of Piper)
  • Critique (code review)
  • CodeSearch (code browsing, etc.)
  • Tricorder (static code analyser)
  • Presubmits (kind of hooks?)
  • TAP (testing before and after commits, auto-rollback)
  • Rosie (large scale change distribution and management)

Google does trunk based development (Yey!!) and branching is exceedingly rare (more yey!!). Branches are used only for releases.

An important point is that both the old and the new code paths for any new feature exist simultaneously, controlled by conditional flags. This allows for smoother deployments and avoids the need for development branches.
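
As an illustration of what such a conditional flag looks like in practice, here is a small Python sketch. The flag name `use_new_ranking`, the `FLAGS` dictionary and the two ranking functions are hypothetical; the talk did not show how Google implements its flags.

```python
# Hypothetical flag store; a real system would read this from configuration.
FLAGS = {"use_new_ranking": False}

def rank_old(results):
    """Existing behaviour: sort alphabetically."""
    return sorted(results)

def rank_new(results):
    """New behaviour under development: shortest title first, then alphabetical."""
    return sorted(results, key=lambda r: (len(r), r))

def rank(results, flags=FLAGS):
    """Both code paths live on trunk; the flag decides which one runs."""
    if flags.get("use_new_ranking", False):
        return rank_new(results)
    return rank_old(results)

print(rank(["pear", "fig", "apple"]))
print(rank(["pear", "fig", "apple"], {"use_new_ranking": True}))
```

The new path can be committed to trunk long before it is switched on, and switching it off again is a configuration change rather than a revert.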

Stated advantages of the mono-repo

1. unified versioning, one source of truth

1.1 → no confusion about which is the authoritative version of a file [This is true even with multiple repos, provided you avoid forking and copying code]

1.2 → no forking of shared libraries [This is true even with multiple repos, provided you avoid forking and copying code; forking shared libraries is probably an anti-pattern anyway]

1.3 → no painful cross-repository merging of copied code [Do not copy code please]

1.4 → no artificial boundaries between teams/projects [This is absolutely true even with multiple repos, and the fact that Google has “owners” of directories who control and approve code changes is in opposition to the stated goal here]

1.5 → supports gradual refactoring and re-organisation of the codebase [This is indeed made easier by a mono-repo, but good architecture should allow for components to be refactored without breaking the entire code base everywhere]

2. extensive code sharing and reuse [This is not related to the mono-repo]

3. simplified dependency management [Probably, though debatable]

3.1 → avoids the diamond dependency problem: the person updating a library updates all the dependent code as well

3.2 → Google statically links everything (yey!)

4. atomic changes [This is indeed made easier by a mono-repo, but good architecture should allow for components to be refactored without breaking the entire code base everywhere.]

4.1 → make large, backwards incompatible changes easily [Probably easier with a mono-repo]

4.2 → change of hundreds/thousands of files in a single consistent operation

4.3 → rename a class or function in a single commit, with no broken builds or tests

5. large scale refactoring, code base modernization [True, but you could probably do the same on many repos with adequate tooling — applies to all points below]

5.1 → single view of the code base facilitates clean-up, modernization efforts

5.1.1 → → can be centrally managed by dedicated specialists

5.1.2 → → e.g. updating the codebase to make use of C++11 features

5.2 → monolithic codebase captures all dependency information

5.2.1 → → old APIs can be removed with confidence

6. collaboration across teams [Not related to mono-repos, but to permissioning policies]

7. flexible team boundaries and code ownership [This is absolutely true even with multiple repos, and the fact that Google has “owners” of directories who control and approve code changes is in opposition to the stated goal here]

8. code visibility and clear tree structure providing implicit team namespacing [True, but you could probably do the same on many repos with adequate tooling — and BitBucket or GitHub are providing some of the required features]
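
Since the diamond dependency problem in point 3.1 is easier to see than to describe, here is a rough Python sketch of it. A depends on B and C, which both depend on D; trouble starts when B and C want different versions of D. The package names and version strings are made up, and modelling the mono-repo as a single "head" version is my own simplification of the "one source of truth" idea.

```python
from collections import defaultdict

def find_version_conflicts(requirements):
    """Return the libraries that are required at more than one version.

    `requirements` maps a package name to {dependency: required_version}.
    """
    wanted = defaultdict(set)
    for deps in requirements.values():
        for dep, version in deps.items():
            wanted[dep].add(version)
    return {dep: sorted(v) for dep, v in wanted.items() if len(v) > 1}

# Multiple repos with pinned versions: B and C disagree about D.
multi_repo = {
    "A": {"B": "1.0", "C": "1.0"},
    "B": {"D": "1.0"},
    "C": {"D": "2.0"},
}

# Single repo, modelled as everybody building against the same "head".
mono_repo = {
    "A": {"B": "head", "C": "head"},
    "B": {"D": "head"},
    "C": {"D": "head"},
}

print(find_version_conflicts(multi_repo))
print(find_version_conflicts(mono_repo))
```

The conflict disappears in the second case only because whoever changes D is also expected to fix B and C in the same change, which is a policy as much as a repository layout.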

Costs associated with the model

  1. tooling investments
  2. codebase complexity is a risk to productivity
  3. code health must be a priority. Tools have been built to:

3.1 → find and remove unused/underused dependencies and dead code

3.2 → support large scale clean-ups and refactoring
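
I do not know how Google's clean-up tools work internally, but the basic idea behind finding dead code can be sketched as a reachability check over the dependency graph. In the Python sketch below, the graph, the target names and the choice of `app/server` as the only root are all hypothetical.

```python
def find_unreachable(dependencies, roots):
    """Return targets that no root depends on, directly or transitively.

    `dependencies` maps each target to the list of targets it uses.
    """
    reachable = set()
    stack = list(roots)
    while stack:
        target = stack.pop()
        if target in reachable:
            continue
        reachable.add(target)
        stack.extend(dependencies.get(target, []))
    return sorted(set(dependencies) - reachable)

# Hypothetical build graph; the names are made up for the example.
graph = {
    "app/server": ["lib/auth", "lib/storage"],
    "lib/auth": ["lib/crypto"],
    "lib/storage": [],
    "lib/crypto": [],
    "lib/legacy_xml": ["lib/crypto"],
}

print(find_unreachable(graph, ["app/server"]))
```

Anything the function returns is a candidate for removal, provided the graph really does contain every consumer, which is the part a mono-repo makes easier.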


My conclusions

1. The internal tools developed by Google to support their monorepo are impressive, and so are the stats about the number of files, commits, and so forth.

2. I would however argue that many of the stated benefits above are simply not limited to mono-repos and would work perfectly fine in a much more natural multi-repo setup.

3. Should you have the same deep pockets and engineering fire power as Google, you could probably build the missing tools to make this work across multiple repos (for example, adequate search across many repos, or applying patches and running tests across a group of repos instead of a single repo).
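
To give an idea of what I mean by search across many repos, here is a deliberately naive Python sketch. It assumes every repo is already checked out locally and simply scans the files; the function name `search_repos` and its parameters are mine, and a real tool would need an index to be usable at any scale.

```python
from pathlib import Path

def search_repos(repo_roots, needle, suffix=".py"):
    """Yield (repo, relative_path, line_number) for each line containing `needle`.

    `repo_roots` is a list of local checkout directories.
    """
    for root in map(Path, repo_roots):
        for path in sorted(root.rglob(f"*{suffix}")):
            if not path.is_file():
                continue
            text = path.read_text(encoding="utf-8", errors="ignore")
            for number, line in enumerate(text.splitlines(), start=1):
                if needle in line:
                    yield root.name, path.relative_to(root).as_posix(), number
```

Calling `list(search_repos(["checkouts/repo_a", "checkouts/repo_b"], "old_api"))` (the paths are placeholders) would list every use of `old_api` in both checkouts, which is the information you need before removing an API.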

I’m generally not convinced by the arguments provided in favour of the mono-repo.

Appendix

Google still has a “Git infrastructure team”, mostly for open source projects: https://www.youtube.com/watch?v=cY34mr71ky8

Link to the research papers written by Rachel and Josh on “Why Google Stores Billions of Lines of Code in a Single Repository”