Monorepos: Please don’t!
Here we are at the beginning of 2019 and I’m engaged in yet another discussion on the merits (or lack thereof) of keeping all of an organization’s code in a “monorepo.” For those of you not familiar with this concept, the idea behind a monorepo is to store all code in a single version control system (VCS) repository. The alternative, of course, is to store code split into many different VCS repositories, usually on a service/application/library basis. For the purpose of this post I will call the multiple repository solution a “polyrepo.”
Some of tech’s biggest names use a monorepo, including Google, Facebook, Twitter, and others. Surely if these companies all use a monorepo, the benefits must be tremendous, and we should all do the same, right? Wrong! As the title says: please, do not use a monorepo! Why? Because, at scale, a monorepo must solve every problem that a polyrepo must solve, with the downside of encouraging tight coupling, and the additional herculean effort of tackling VCS scalability. Thus, in the medium to long term, a monorepo provides zero organizational benefits, while inevitably leaving some of an organization’s best engineers with a wicked case of PTSD (manifested via drooling and incoherent mumbling about git performance internals).
A quick note: what do I mean by “at scale?” There is no definitive answer to this question, but because I know I will be asked, let’s say for the sake of discussion that at scale means over 100 developers writing code full time.
Theoretical monorepo benefits and why they cannot be achieved without polyrepo-style tooling (or are a lie)
Theoretical benefit 1: Easier collaboration and code sharing
Monorepo proponents will argue that when all code is present within a single repository, the likelihood of code duplication is small, and it’s more likely that different teams will collaborate on shared infrastructure.
Here is the ugly truth about even a medium-sized monorepo (and this will be a recurring theme throughout this section): it quickly becomes unreasonable for a single developer to have the entire repository on their machine, or to search through it using tools like grep. Thus, any monorepo that hopes to scale must provide two things:
- Some type of virtual file system (VFS) that allows a portion of the code to be present locally. This might be accomplished via a proprietary VCS like Perforce which natively operates this way, via Google’s “G3” internal tooling, or Microsoft’s GVFS.
- Sophisticated source code indexing/searching/discovery capabilities as a service. Since no individual developer is going to have all code in a searchable state, it’s critical that there exists some capability to perform a search across the entire codebase.
Given that a developer will only access small portions of the codebase at a time, is there any real difference between checking out a portion of the tree via a VFS or checking out multiple repositories? There is no difference.
In terms of source code indexing/searching/discovery capabilities, a tool can trivially iterate over many repositories and collate the results. In fact, this is how GitHub’s own searching capabilities work as well as newer and more sophisticated indexing and collaboration tools such as Sourcegraph.
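To make "iterate over many repositories and collate the results" concrete, here is a minimal sketch in Python. It assumes the repositories are already checked out as local directories, and it scans files naively; real services build an index instead, so treat this as an illustration of the shape of the problem rather than how GitHub or Sourcegraph are implemented. The function name `search_repos` is my own invention.

```python
import re
from pathlib import Path


def search_repos(repo_roots, pattern):
    """Search every checked-out repository and collate the matches.

    Returns a list of (repo name, relative file path, line number, line text).
    """
    regex = re.compile(pattern)
    hits = []
    for root in map(Path, repo_roots):
        for path in sorted(root.rglob("*")):
            relative = path.relative_to(root)
            # Skip directories and VCS metadata.
            if not path.is_file() or ".git" in relative.parts:
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
            for lineno, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    hits.append((root.name, relative.as_posix(), lineno, line.strip()))
    return hits
```

The caller sees one merged result list and never needs to know how many repositories were behind it, which is the point: the search experience is a property of the tool, not of the storage layout.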
Thus, in terms of collaboration and code sharing, at scale, developers are exposed to subsections of code through higher layer tooling. Whether the code is in a monorepo or polyrepo is irrelevant; the problem being solved is the same, and the efficacy of collaboration and code sharing has everything to do with engineering culture and nothing to do with code storage.
Theoretical benefit 2: Single build / no dependency management
The next thing that monorepo proponents will typically say is that by having all code in a single repository, there is no need for dependency management because all source code is built at the same time. This is a lie! At scale, there is simply no way to rebuild the entirety of a codebase and run all automated tests when each change is submitted (or, more importantly and more often, in CI when a change is proposed). To deal with this problem, all of the large monorepos have developed sophisticated build systems (see Bazel/Blaze from Google and Buck from Facebook as examples) that are designed in such a way as to internally track dependencies and build a directed acyclic graph (DAG) of the source code. This DAG allows for efficient build and test caching such that only code that changes, or code that depends on it, needs to be built and tested.
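As a rough illustration of what "only code that changes, or code that depends on it" means, the sketch below walks a dependency graph in reverse to find everything that needs to be rebuilt. This is not how Bazel or Buck are implemented; the target names and the `affected_targets` function are hypothetical, and a real build system also handles caching, hashing of inputs, and much more.

```python
from collections import defaultdict, deque


def affected_targets(deps, changed):
    """Return the changed targets plus everything that transitively depends on them.

    `deps` maps each target to the list of targets it depends on.
    """
    # Invert the edges: for each target, record who depends on it.
    dependents = defaultdict(set)
    for target, requirements in deps.items():
        for requirement in requirements:
            dependents[requirement].add(target)

    # Breadth-first walk from the changed targets along the reversed edges.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical targets, for illustration only.
DEPS = {
    "lib/core": [],
    "lib/http": ["lib/core"],
    "svc/api": ["lib/http"],
    "svc/billing": ["lib/core"],
    "tools/lint": [],
}
```

With this graph, a change to `lib/http` only touches `lib/http` and `svc/api`; `svc/billing` and `tools/lint` are left alone. Nothing in the walk cares whether the targets live in one repository or five.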
Furthermore, because code that is built must actually be deployed, and not all software is deployed at the same time, it is essential that build artifacts are carefully tracked so that previously deployed software can be redeployed to new hosts as needed. This reality means that even in a monorepo world, multiple versions of code exist at the same time in the wild, and must be carefully tracked and reconciled.
Monorepo proponents will argue that even with the large amount of build/dependency tracking required, there is still substantial benefit because a single commit/SHA describes the entire state of the world. I would argue this benefit is dubious; given the DAG that already exists, it’s a trivial leap to include individual repository SHAs as part of the DAG, and in fact, Bazel can seamlessly work across repositories or within a single repository, abstracting the underlying layout from the developer. Furthermore, automated refactor tooling can trivially be built that automatically bumps dependent library versions across many repositories, thus blurring the difference between a monorepo and polyrepo in this area (more on this below).
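One way to picture "individual repository SHAs as part of the DAG" is a build cache key that mixes each target's pinned repository SHA with the keys of its dependencies. The sketch below is my own simplification, not Bazel's actual algorithm; the graph layout, repository names, and SHAs are all made up.

```python
import hashlib


def cache_key(target, graph, memo=None):
    """Hash a target's pinned (repo, sha) together with the keys of its dependencies."""
    memo = {} if memo is None else memo
    if target not in memo:
        node = graph[target]
        digest = hashlib.sha256()
        digest.update(f"{node['repo']}@{node['sha']}".encode())
        # Sort so the key does not depend on the order deps were declared in.
        for dep in sorted(node["deps"]):
            digest.update(cache_key(dep, graph, memo).encode())
        memo[target] = digest.hexdigest()
    return memo[target]


# Hypothetical polyrepo graph: each target is pinned to a SHA in its own repository.
GRAPH = {
    "lib/core": {"repo": "core", "sha": "a1b2c3", "deps": []},
    "svc/api": {"repo": "api", "sha": "d4e5f6", "deps": ["lib/core"]},
    "tools/lint": {"repo": "lint", "sha": "0789ab", "deps": []},
}
```

Bumping the SHA for `lib/core` changes the key for `svc/api` but leaves `tools/lint` untouched, so the full set of keys describes the state of the world much as a single monorepo SHA would.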
The end result is that the realities of build/deploy management at scale are largely identical whether using a monorepo or polyrepo. The tools don’t care, and neither should the developers writing code.
Theoretical benefit 3: Code refactors are easy / atomic commits
The final benefit that monorepo proponents typically tout is the fact that when all code is in a single repository, it makes code refactors much easier, due to ease of searching and the idea that a single atomic commit can span the entire codebase. This is a fallacy for multiple reasons:
- As described above, at scale, a developer will not be able to easily edit or search the entirety of the codebase on their local machine. Thus, cloning all of the code and simply doing a grep/replace is not trivial in practice.
- If we assume that via a sophisticated VFS a developer can clone and edit the entire codebase, the next question is how often does that actually happen? I’m not talking about fixing a bug in an implementation of a shared library, as this type of fix is identically carried out whether using a monorepo or polyrepo (assuming similar build/deploy tooling as described in the previous section). I’m talking about a library API change that has follow-on build breakage effects for other code. In very large code bases, it is likely impossible to make a change to a fundamental API and get it code reviewed by every affected team before merge conflicts force the process to start over again. Developers are faced with two realistic choices. First, they can give up, and work around the API issue (this happens more often than we would like to admit). Second, they can deprecate the existing API, implement a new API, and then go through the laborious process of individual deprecation changes throughout the codebase. Either way, this is exactly the same process undertaken in a polyrepo.
- In a service-oriented world, applications are now composed of many loosely coupled services that interact with each other using some type of well specified API. Larger organizations inevitably migrate to an IDL such as Thrift or Protobuf that allows for type-safe APIs and backwards compatible changes. As described in the previous section on build/deploy management, code is not deployed at the same time. It might be deployed over a period of hours, days, or months. Thus, modern developers must think about backwards compatibility in the wild. This is a simple reality of modern application development that many developers would like to ignore but cannot. Thus, when it comes to services, versus library APIs, developers must use one of the two options described above (give up on changing an API or go through a deprecation cycle), and this is no different whether using a monorepo or polyrepo.
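For readers who have not been through one, the deprecation cycle described above (deprecate the existing API, add a new one, migrate callers one by one) might look something like this for a Python library. The function names and return shapes are hypothetical; the only real API used is the standard library's `warnings.warn`.

```python
import warnings


def fetch_user_v2(user_id, *, include_profile=False):
    """New API (hypothetical): returns a dict instead of a tuple."""
    user = {"id": user_id, "name": f"user-{user_id}"}
    if include_profile:
        user["profile"] = {}
    return user


def fetch_user(user_id):
    """Old API, kept as a thin shim so existing callers keep working while they migrate."""
    warnings.warn(
        "fetch_user() is deprecated; use fetch_user_v2()",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this shim
    )
    user = fetch_user_v2(user_id)
    return (user["id"], user["name"])
```

Each caller then gets its own small, independently reviewable change from `fetch_user` to `fetch_user_v2`, and the shim is deleted once the last one lands. The steps are the same whether those callers live in one repository or many.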
In terms of actually making refactor changes across large codebases, many organizations end up developing automated refactor tooling such as fastmod, recently released by Facebook. As elsewhere, a tool such as this can trivially operate within a single repository or across multiple repositories. Lyft has a tool internally called “refactorator” which does just this. It works like fastmod but automates making changes across our polyrepo, including opening PRs, tracking review status, etc.
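To show how little such a tool cares about repository boundaries, here is a toy regex rewrite that loops over several checked-out repositories. It is not fastmod and it is not refactorator (it does not open PRs or track reviews); the function name, parameters, and the dry-run default are all my own assumptions.

```python
import re
from pathlib import Path


def rewrite_repos(repo_roots, pattern, replacement, suffix=".py", dry_run=True):
    """Apply a regex rewrite to matching files in every repository.

    Returns the list of files that changed (or would change, in a dry run).
    """
    regex = re.compile(pattern)
    changed = []
    for root in map(Path, repo_roots):
        for path in sorted(root.rglob(f"*{suffix}")):
            # Skip directories and VCS metadata.
            if not path.is_file() or ".git" in path.relative_to(root).parts:
                continue
            original = path.read_text(encoding="utf-8")
            updated = regex.sub(replacement, original)
            if updated != original:
                changed.append(path)
                if not dry_run:
                    path.write_text(updated, encoding="utf-8")
    return changed
```

Passing a single root gives the monorepo case and passing many gives the polyrepo case; the rewrite logic is identical. The hard parts in practice are review and rollout, not the loop.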
Unique monorepo downsides
In the previous section I laid out the theoretical benefits that a monorepo provides, and explained why, in order to realize them, extraordinarily complex tooling must be developed that is no different from what is required for a polyrepo. In this section, I’m going to cover two unique downsides to monorepos.
Downside 1: Tight coupling and OSS
Organizationally, a monorepo encourages tight coupling and development of brittle software. It gives developers the feeling they can easily fix abstraction mistakes, when they actually cannot in the real world due to the realities of staggered build/deploy and the human/organizational/cultural factors inherent in asking developers to make changes across the entire codebase.
Polyrepo code layout offers clear team/project/abstraction/ownership boundaries and encourages developers to think carefully about contracts. This is a subtle yet hugely important benefit: it imbues an organization’s developers with a more scalable and long-term way of thinking. Furthermore, the use of a polyrepo does not mean that developers cannot reach across repository boundaries. Whether this happens or not is a function of the engineering culture in place versus whether the organization uses a monorepo or polyrepo.
Tight coupling also has substantial implications with regard to open source. If an organization wishes to create or easily consume OSS, using a polyrepo is required. The contortions that large monorepo organizations undertake (reverse import/export, private/public issue tracking, shim layers to abstract standard library differences, etc.) are not conducive to productive OSS collaboration and community building, and also create substantial overhead for engineers within the organization.
Downside 2: VCS scalability
Scaling a single VCS to hundreds of developers, hundreds of millions of lines of code, and a rapid rate of submissions is a monumental task. Twitter’s monorepo roll-out about 5 years ago (based on git) was one of the biggest software engineering boondoggles I have ever witnessed in my career. Running simple commands such as git status would take minutes. If an individual clone got too far behind, it took hours to catch up (for a time there was even a practice of shipping hard drives to remote employees with a recent clone to start out with). I bring this up not specifically to make fun of Twitter engineering, but to illustrate how hard this problem is. I’m told that 5 years later, the performance of Twitter’s monorepo is still not what the developer tooling team there would like, and not for lack of trying.
Of course, the past 5 years have also seen development in this area. Microsoft’s git VFS, which is used internally to develop Windows, has tackled creating a real VFS for git, as I described above, as a requirement for monorepo scalability (and with Microsoft’s acquisition of GitHub it seems likely this level of git scalability will find its way into GitHub’s enterprise offerings). And, of course, Google and Facebook continue to invest tremendous resources into their internal systems to keep them running, although none of this work is publicly available.
However, why bother solving the VCS scalability problem at all when, as described in the previous section, tooling will also need to be built that is identical to what is required for a polyrepo? There is no good reason.
As is often the case in software engineering, we tend to look at tech’s most successful companies for guidance on best practices, without understanding the monumental engineering that has gone into making those companies successful at scale. Monorepos, in my opinion, are an egregious example of this. Google, Facebook, and Twitter have invested extensively in their code storage systems, only to wind up with a solution that is no different from what is required when using a polyrepo, yet leads to tight coupling and requires a substantial investment in VCS scalability.
The frank reality is that, at scale, how well an organization does with code sharing, collaboration, tight coupling, etc. is a direct result of engineering culture and leadership, and has nothing to do with whether a monorepo or a polyrepo is used. The two solutions end up looking identical to the developer. In the face of this, why use a monorepo in the first place? Please don’t!