The last year or so has seen a significant increase in people discussing and using the monorepo pattern for source control of projects. I suspect this is directly related to the number of ‘big tech’ companies, such as Facebook and Twitter, that actively use the pattern and tweet and blog about their successes with it. Anything the big players are using tends to gain momentum across the broader development community. I’ve worked with both mono- and multi-repo projects, so I thought it would be interesting to look at the pros and cons of each, and to discuss my experiences with both.
TL;DR: both patterns have pros and cons, and neither is a silver bullet. You should understand these benefits and limitations, and use them to come to an informed decision on what’s best for you and your project.
The Monorepo Pattern
At its core the monorepo pattern is extremely simple: create a single source control repository, and put all of the code for the project in that single repo. During the years when we were building (generally manually) and deploying (again, generally manually) monolithic applications, I’d say this was the de facto way of working. So much so that it didn’t really have a name; it was just “using version control”. It made sense to work this way, with the VCS pattern matching that of the deployed software. Everything was relatively easy to find, as it was all in one place, and one clone of the repo gave you everything you needed to get working on the project. It also promoted code sharing within the project, as everyone had access to and visibility of all of the code, regardless of who or which team had built it.
The Multirepo Pattern
As we moved from large monoliths to what we’d consider ‘more modern’ architectures for our software, generally featuring collections of smaller and more independent modules, the monorepo became a less obvious VCS pattern. It was certainly still used, but creating a collection of modules also suggested that the “repo per module”, or multirepo, pattern might be a good option. This pattern allowed each repo to be treated separately, which was great if you had a mix of languages and teams. As CI and then CD became more pervasive there were also significant advantages, as you could have a different process per repo.
The Pros and Cons
From here on in, I’m going to be sharing my experiences and thoughts on these two patterns, specifically when using them on the kinds of projects I now work on. I won’t say ‘microservices’ or ‘service-oriented’, but more broadly applications that consist of a number of discrete modules loosely coupled together to form the complete system, often with different people working on each of the modules. Also, bear in mind that this is my experience; your mileage may vary!
First, let’s take the monorepo. It clearly has simplicity on its side: everything’s in one place, you’re not creating and configuring lots of repos, and you’ll always know which repo that bit of code you urgently need to fix is in. It also means that you can have a single commit and pull request covering changes in multiple modules, which can be a significant bonus when performing wide-ranging updates and refactors. Dependency management is also much simpler: everything checks out to a single place, so you can reference everything using relative paths.
However, this ‘everything in one place’ approach also leads to what I’ve experienced to be the monorepo’s weaknesses. On large projects, everything in one place creates a mighty big pile! That can mean long clone times when starting work on a project for the first time, and then builds eating all the disk space on your PC. When you have a CI/CD process in place (you have that, don’t you?) it can also lead to long build times and complex deployments, as ‘out of the box’ you’ll be building and deploying the whole repo for every change.
And that’s the key failing of the monorepo at the moment, I think: the tooling isn’t ready yet. Generally our tooling works at repo level, and once you have a monorepo of any significant size you’re really going to want more granularity than that. You want your CI/CD pipeline to build and deploy only the modules impacted by a commit or tag, rather than everything in the repo. You’ll probably want to version modules separately, rather than labeling the whole repo for every module you release. Maybe you’ll need security so teams can only see and work on certain modules, or maybe you simply want to clone only the module you actually want to work on. Right now most of the commonly used tools won’t easily or cleanly allow you to do that.
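To make the ‘only build what changed’ idea concrete, here is a minimal Python sketch of the first step: mapping the files touched by a commit to the modules that need rebuilding. It assumes a hypothetical layout where each module lives in its own directory under a top-level modules/ folder; the module and file names are made up for illustration. In a real pipeline the list of changed files might come from something like git diff --name-only between two commits.

```python
from pathlib import PurePosixPath

# Assumption: every module lives in its own directory under modules/.
MODULES_ROOT = "modules"


def affected_modules(changed_files):
    """Return the sorted names of the modules touched by the changed files."""
    affected = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Only paths of the form modules/<name>/... belong to a module.
        if len(parts) >= 3 and parts[0] == MODULES_ROOT:
            affected.add(parts[1])
    return sorted(affected)


# Hypothetical list of files changed by a commit.
changed = [
    "modules/billing/src/invoice.py",
    "modules/billing/README.md",
    "modules/search/index.py",
    "docs/architecture.md",
]
print(affected_modules(changed))  # ['billing', 'search']
```

This only catches modules that were edited directly. A real pipeline would also have to rebuild anything that depends on those modules, and that is where a simple script like this stops being enough.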
The ‘big’ monorepo users have gotten around this by building their own optimized tool chains, such as Google’s Piper VCS, and the Pants, Buck and Bazel build systems used by Twitter, Facebook and Google respectively. I’m sure we’ll end up with a well-established and fully featured open source or SaaS tool set in the next couple of years, but for now I don’t feel that it’s there. That ultimately leads to one major consideration: do you want to spend your time working around your tool set’s limitations, or worse yet writing your own tools to get a monorepo working just how you want? Does the increased effort pay off for you and your project?
Experiences and Thoughts
Based on my experience of running small and large projects using both patterns, I think monorepos are a great way to get up and running quickly, especially with smaller teams working across the entire code base. However, I think they quickly run into problems as the number of teams grows and the code base gets larger. You’ll almost certainly run into a number of the issues discussed above, and wind up with three choices:
- Live with it, and accept the pain
- Build a custom tool set optimized for your process
- Refactor your monorepo into a number of smaller repos
In my experience, living with it is pretty common at first. You’ll find that at almost every sprint retro (you’re doing them, right?) someone will raise that the CI pipeline is taking ages to run, or that doing a local build takes at least two cups of coffee now, and generally everyone will feel like they’re getting less productive. So long term, probably not a winner!
That leaves you with two options: build your own tooling, or refactor. Building your own tooling is pretty extravagant; it’s not really your raison d’être, so it’s a resource and cost drain on your project, and means you won’t be releasing the changes your customers and product owners want as fast as you could be.
Refactoring may take some time, but if you’ve done a good job of structuring your code and projects, it’s probably simpler and less disruptive. Sure, you’ll lose atomic commits and your developers won’t always have the complete code base, but you’re developing loosely coupled modules with well-defined boundaries and interfaces (you are, aren’t you?), so really that shouldn’t be too much of a loss. Code sharing requires a little more work. Generally, implementing some form of private package manager, such as an NPM server or Artifactory, will help with that, and as an added benefit it pushes you to decouple your code bases further. Switching to managed and versioned packages with good readme files, and then promoting an active internal open-source culture, is a liberating approach which I’ve found successful over the last couple of years. It also makes truly open-sourcing your modules much easier if you ever decide to go that route for some of your code.
One thing you’ll notice is that I deliberately mentioned refactoring into a number of smaller repos. These may in fact be ‘semi-monorepos’ themselves; for example, you might split each bounded context into its own repo, but that repo may contain code for several closely related modules. So, in effect, each team winds up with a monorepo and gains the benefits that brings, with less of the pain that having the entire codebase in one repo would bring.
What I can say from bitter experience is that an un-automated multirepo for a monolithic application is probably a highway to hell! I worked on one of these back in the late ’90s, a big C++ project in Visual Studio 1.5, and it was a total nightmare. Building the app (manually, we didn’t do CI back then) first meant building a bunch of DLLs, those DLLs requiring a bunch of libraries, which in turn required more libraries, and so on. So, you often needed to build 20–30 projects in exactly the right order to get your executable. We got pretty close to wheeling developers out of the office in straitjackets. The VCS in use on the project was PVCS (urg), so in the end I wound up creating a database of dependencies, a recursive algorithm to create a build tree, and then some hacky C++ code to do the checkouts and builds in order. It worked, and probably saved some folks’ sanity, but in hindsight it clearly demonstrated that we really needed a monorepo… which we did eventually build.
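For anyone curious what that kind of recursive build ordering looks like, here is a minimal Python sketch of the general technique (a depth-first walk of a dependency map). It is an illustration only, not the original C++ tool: the project names are hypothetical, and a plain dictionary stands in for the dependency database.

```python
def build_order(dependencies, target):
    """Return projects ordered so each comes after everything it depends on."""
    order = []
    done = set()
    in_progress = set()

    def visit(project):
        if project in done:
            return
        if project in in_progress:
            raise ValueError(f"circular dependency involving {project!r}")
        in_progress.add(project)
        # Build every dependency first, recursing as deep as needed.
        for dep in dependencies.get(project, []):
            visit(dep)
        in_progress.remove(project)
        done.add(project)
        order.append(project)

    visit(target)
    return order


# Hypothetical dependency data: project -> projects it needs built first.
deps = {
    "app.exe": ["ui.dll", "core.dll"],
    "ui.dll": ["core.dll", "gfx.lib"],
    "core.dll": ["util.lib"],
}
print(build_order(deps, "app.exe"))
# ['util.lib', 'core.dll', 'gfx.lib', 'ui.dll', 'app.exe']
```

Each project appears exactly once, after all of its dependencies, so checking out and building the list from left to right gives a valid order. On Python 3.9 and later, the standard library’s graphlib.TopologicalSorter does the same job without hand-written recursion.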
In closing, I think that right now making the right choice of pattern is pretty important. But with projects like Lerna starting to allow a monorepo to be used in a multirepo-like fashion for some functions, it’s clear that the monorepo is here to stay, and that in a couple of years this choice may look very different as the associated tooling and techniques fill the current gap.