Monorepos in the Wild
Monolithic approaches to software development have quite a bad reputation, and rightfully so. Modularization and microservices dominate the current landscape, especially in the web development world.
Still, there is a relatively new trend which, at first glance, seems to contradict the push for modularization: the Monorepo. Yet Monorepos are not in the least incompatible with modular software development practices. Quite the opposite: managing code in a single repository can tremendously simplify the development of modular software projects, such as microservices based infrastructures.
It is not only tech giants like Google and Facebook that use monolithic approaches to manage their enormous codebases. Popular open source projects like Babel and React also successfully use Monorepos to maintain all their official packages in a single repository.
Although there are many benefits to building modular software which consists of multiple packages (intended to be published in a package management system like Composer or npm) in a single repository, it is not without its own problems and challenges. The developers behind the Babel project have built a tool called Lerna which can help to streamline the npm publishing workflow of Monorepo based projects.
In this article we’re going to explore why managing modular projects in Monorepos is becoming more and more popular, not only among big companies but also among medium to large sized open source projects. Furthermore, we’ll look at the potential ups and downs of Monorepos versus Manyrepos. Last but not least, I’m going to show you how to use Lerna, which problems I could solve by transforming one of my own projects from Manyrepos to a Monorepo, and how to migrate the code of many standalone repositories into one single repository.
What exactly is a Monorepo?
Definitions vary, but speaking very broadly, a Monorepo is a single repository holding the code of multiple projects which may or may not be related in some way. The projects managed in a Monorepo can depend on each other (like React and the react-dom package), or they can be completely unrelated (like the Google search algorithm and Angular, which are related only because both are Google projects).
One example would be an organization managing the code for its website, its iOS and Android apps, and the API which powers both, in one large repository. Although this might sound scary at first, there must be a reason why big tech companies like Google, Facebook and Twitter have chosen the Monorepo approach to manage the complexity of their enormous codebases.
But managing huge codebases consisting of dozens of related and unrelated projects is not the only use case where Monorepos shine. There are also smaller projects which can benefit from a Monorepo approach.
Imagine projects like React, Ember or Symfony, which consist of a core package that can be extended by additional components and functionality in the form of packages. Managing the project’s core functionality and optional components in one single repository makes it a lot easier to maintain the code and keep everything in sync.
Because using a Monorepo to store all the code of a (tech) company and using a Monorepo approach at a project level are quite different use cases, I coined two terms to be more specific about their nature: “Monstrous Monorepos”, in regard to the sheer size to which Monorepos at organizations can grow, and “Project Monorepos”, to describe single repositories which are used to manage the core functionality of a project and all its official components.
In the following chapters of this article I’m going to focus on Project Monorepos because this is the area where I have already gained some experience. But we’re also going to take a quick look at the gigantic Monorepos which power Google, Facebook and Twitter before taking a closer look at how to use single repositories to manage medium to large scale modular projects.
Gigantic Monorepos as they are used by tech giants like Google, Facebook and Twitter are impressive in terms of their magnitude. Those repositories have grown to contain gigabytes or, in the case of Google, even terabytes of data, accounting for millions of lines of code. As is almost inevitable with codebases of this size, managing those systems involves a lot of challenges. There are technical, and sometimes even physical, limits which are reached when dealing with complexity at this level.
Twitter is using a custom build of Git which includes patches that help it deal with very large repositories in the range of multiple gigabytes. While Git is known as an excellent version control system and is famous for handling large projects like the Linux kernel very well, there seems to be a size limit beyond which you’ll face performance hits. The Linux kernel repository, at about 180 MB, is still pretty small compared to the Twitter Monorepo, which is estimated to be several gigabytes in size. Commands like `git log`, `git blame` or `git commit` become slower the bigger the Git history gets, and according to current and former Twitter employees, those essential and usually instantaneous commands can take an inordinately long time to execute.
Facebook engineers cite complex dependencies and continuous modernization, which leads to large changes throughout the codebase, as the main reasons why a Monorepo approach serves them best. Splitting the codebase into smaller parts managed in separate repositories would make it easier to handle the technical limitations of various source control systems, but doing so would make atomic refactorings more difficult, and that’s the main reason why developers at Facebook decided against it.
Facebook is using Mercurial as its version control system of choice to manage the enormous codebase, citing performance concerns as the reason for deciding against Git. In order to serve its needs even better, Facebook is heavily committed to improving Mercurial for handling repositories at ultra-large scale.
If you thought the term “Monstrous Monorepo” was a little sensational, let me tell you some facts about the Google Monorepo. Over the last 17 years of Google history, the single shared repository containing almost all of Google’s software assets has grown to include approximately one billion files. Nine million unique source files containing about two billion lines of code, together with other files, add up to approximately 86 TB of data. More than 25,000 Google engineers from dozens of offices around the globe commit changes to the codebase on a daily basis.
Because there are no tools out there capable of handling version control repositories containing multiple terabytes of files, Google built its own proprietary system named Piper to host this extraordinarily large codebase. Because of the sheer size of the system, Google software developers use so-called Clients in the Cloud to access Piper.
While technical limitations and performance considerations are a huge burden when working with Monstrous Monorepos at large scale companies like Twitter or Facebook, Project Monorepos — which are tiny in comparison — do not, or very rarely, face the same problems.
Project Monorepos are successfully used by popular open source projects like Babel or React. One reason why monolithic source control approaches are very popular among a certain type of open source project is that it can be pretty tedious to manage a large number of related projects in separate repositories. The Babel repository includes more than 100 separate npm packages. Maintaining every one of those packages in its own Git repository would be an enormous overhead and not very practical.
Large scale refactorings across all related packages can be done very quickly if every package is maintained in one single repository. By contrast, changing an API which affects packages spread across multiple repositories means making a separate commit in every one of the affected repositories.
Microservices and Monorepos
Although it seems counterintuitive at first, using a monolithic source control approach can be very powerful if you’re building a microservices based infrastructure. Whether you maintain all of your microservices in one repository and their consumers in another, or both the microservices and their consumers in one big repository, is a matter of taste. If all the consumers are controlled by one company or entity, it can be very beneficial to maintain the consumers in the same repository as the microservices which power them. On the other hand, if you’re maintaining a public API, you might consider maintaining all of its components in a Monorepo apart from your own consumers.
Why not Monorepo?
Although substantial improvements can be made when using a Monorepo approach to manage a large amount of loosely coupled packages and modules, there also are some potential downsides associated with monolithic source control approaches.
Onboarding new developers can become harder because they are immediately confronted with a huge codebase instead of discovering smaller codebases one by one.
Very large scale Monorepos have to deal with the technical limitations of certain source control systems. Depending on the system you’re using, there might be performance issues with repositories exceeding multiple gigabytes in size. Big companies like Facebook and Twitter have managed to work around those issues, but this might be a problem not so easily solved by a small to medium sized company.
Most source control systems at this time are not very well suited to the requirements of very big monolithic repositories. Managing access control and restricting access to certain parts of the codebase might be hard or even impossible to implement.
Another issue might be the integration of a Monorepo into an existing build process. Building and testing the entire codebase can take a long time when working with a giant Monorepo. But there are ways to work around this limitation by providing tools to test and build only certain parts of the codebase.
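To make the idea of building and testing only certain parts of the codebase more concrete, here is a minimal sketch in JavaScript. It is not taken from any particular build tool: the package names, the dependency map and the `packages/<name>/...` directory layout are all assumptions made up for illustration. Given a list of changed files, it works out which packages changed and which other packages depend on them, directly or transitively.

```javascript
// Hypothetical dependency map: each package lists the local packages it
// depends on. All names here are invented for illustration.
const dependencies = {
  core: [],
  'plugin-a': ['core'],
  'plugin-b': [],
  website: ['plugin-a'],
};

// Extract the package name from a changed file path such as
// "packages/core/index.js". Returns null for paths outside `packages/`.
function packageOf(filePath) {
  const match = /^packages\/([^/]+)\//.exec(filePath);
  return match ? match[1] : null;
}

// Return the changed packages plus every package that depends on them,
// directly or transitively, sorted by name.
function affectedPackages(changedFiles, deps) {
  const affected = new Set(changedFiles.map(packageOf).filter(Boolean));
  let grew = true;
  // Keep sweeping over the map until no new dependents are found.
  while (grew) {
    grew = false;
    for (const [name, needs] of Object.entries(deps)) {
      if (!affected.has(name) && needs.some((dep) => affected.has(dep))) {
        affected.add(name);
        grew = true;
      }
    }
  }
  return [...affected].sort();
}

// A change in `core` also affects `plugin-a` and, through it, `website`.
console.log(affectedPackages(['packages/core/index.js'], dependencies));
```

In a real setup the list of changed files would typically come from the version control system, and only the returned packages would be handed to the build and test tooling.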
These downsides are offset by a number of benefits. When using a Monorepo approach there is one source of truth: every employee in a company or every contributor to an open source project is always on the same page.
Code can be easily shared and reused across multiple projects. This can be a problem too if it’s done poorly, but usually it’s a good thing to reuse as much code as possible.
Large scale refactoring is very easy: changing an API which affects multiple parts of the codebase can be done with one commit or one pull request, instead of touching multiple repositories to make basically the same change over and over again.
Collaboration across teams is easier: bugs which affect different projects can be fixed by one person or team, instead of having to wait for multiple other teams to fix the same bug in their codebases.
Case study: avalanche
avalanche is a modular SASS framework. Modular means all of its features are split into standalone packages. For example, the CSS grid is a standalone package, and button styles can be loaded from another standalone package.
With the new 4.0 major release coming up, I wanted to make some huge changes, and I quickly realized that a major refactoring would be a lot of work because of the many repositories I had to touch for things like changing naming conventions.
Also, all the tooling, like linting, testing and building, would be repeated over and over again for all of those 40 repositories. For every refactoring I had to make 40 commits across all of the 40 repositories. I very quickly realized that this was not the right way to move on.
Many repos -> two repos
I started with 42 repositories, and after finishing the transformation into a Monorepo, I ended up with only two.
I chose to use one repository for the framework and all its components and a separate repository for the website. The main reason why I decided to split the project’s static website generator into a separate repository is that I wanted to decouple the repository which contains all of the packages from the code which is needed for generating the website.
│ ├── cli
│ ├── component-button
│ ├── eslint-config
│ ├── generic-box-sizing-reset
│ ├── object-aspect-ratio
│ ├── object-container
│ ├── object-grid
│ ├── ...
│ └── utility-width
Working with Lerna
Lerna is a tool that optimizes the workflow around managing multi-package Monorepos with Git and npm. Lerna was started by the people behind the Babel project to streamline the management of the more than 100 packages contained in the Babel Monorepo.
The best way to get started with Lerna is to run `lerna init` which gives you a basic directory structure with a packages directory and a `lerna.json` config file.
By default Lerna uses fixed versioning. You can change this to independent versioning if you decide that it better fits your needs. With fixed versioning, all of your packages always share the same version. With independent versioning, each of your packages can have its own version number. If you want to, you can also change the packages directory, which by default is `packages`.
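As a rough sketch of how these two settings end up in `lerna.json`, the following JavaScript helper builds the configuration object. The helper name `lernaConfig` and its options are made up for this example. The `version` and `packages` keys reflect my understanding of the `lerna.json` format in Lerna 2 and later, where the string `independent` in the `version` field switches on independent versioning, so please check the documentation of the Lerna version you are using.

```javascript
// Hypothetical helper (not part of Lerna): builds the contents of a
// lerna.json file for fixed or independent versioning.
function lernaConfig({
  independent = false,
  version = '0.0.0',
  packages = ['packages/*'],
} = {}) {
  return {
    // Glob patterns telling Lerna where the packages live.
    packages,
    // Fixed mode stores one shared version number here; independent mode
    // uses the keyword "independent" instead.
    version: independent ? 'independent' : version,
  };
}

// Print the config for independent versioning as it would appear on disk.
console.log(JSON.stringify(lernaConfig({ independent: true }), null, 2));
```

Calling `lernaConfig()` without arguments yields the fixed versioning variant with the default `packages` directory.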
Importing Git repositories
When I was moving from many repos to a Monorepo, I didn’t want to lose the Git history of my modules during the process. Luckily Lerna also provides a command for that: with `lerna import` you can import an existing repository into the packages directory of your new Monorepo. Lerna even takes care of the paths referenced in your commit history. After the import, it will look as if the imported repository has always been a part of the Monorepo.
Basic Lerna workflow
After you’ve configured Lerna and added at least one package, you can run `lerna bootstrap`.
The `lerna bootstrap` command looks at all the packages in your packages directory and resolves their dependencies. If you have cross dependencies, where one of your packages depends on another one of your packages, `lerna bootstrap` takes care of that and creates a symlink to the dependency in the package’s `node_modules` directory.
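The linking step described above can be modeled in a few lines of JavaScript. This is a simplified sketch and not Lerna’s actual implementation: the package names are hypothetical, and it ignores version ranges entirely, treating every dependency that shares a name with a package in the Monorepo as a candidate for a symlink.

```javascript
// Hypothetical manifests mirroring the relevant parts of each package.json.
const manifests = [
  { name: 'core', version: '1.0.0', dependencies: {} },
  {
    name: 'plugin',
    version: '1.0.0',
    dependencies: { core: '^1.0.0', lodash: '^4.17.0' },
  },
];

// For each package, split its dependencies into those that live in the
// Monorepo (to be symlinked) and those that come from the registry.
function linkPlan(pkgs) {
  const local = new Set(pkgs.map((pkg) => pkg.name));
  return pkgs.map((pkg) => {
    const names = Object.keys(pkg.dependencies || {});
    return {
      name: pkg.name,
      symlink: names.filter((dep) => local.has(dep)),
      install: names.filter((dep) => !local.has(dep)),
    };
  });
}

// `plugin` gets a symlink to `core`, while `lodash` is installed normally.
console.log(JSON.stringify(linkPlan(manifests), null, 2));
```

The symlink is what lets a change in one package show up immediately in the packages that depend on it, without publishing anything first.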
Finally, once you’ve finished the work on your packages and are ready to create a new release, you can run `lerna publish`, which publishes all updated packages to npm.
Monolithic source control does not necessarily result in monolithic software. Building modular software using a Monorepo approach is possible.
There are two basic types of Monorepos: huge repositories containing all the code maintained by a company, and project specific Monorepos like those of Babel, React or Symfony.
There are tools which make it easier to manage Monorepos containing multiple packages. Lerna, for example, can be used to manage Monorepos containing multiple npm packages.