The Problem with Shared Code

This is the first of a 4-part series on the best practices for organizing code repositories.

  • Part 1: The Problem with Shared Code (this article)
  • Part 2: How the Best Tech Companies Organize Code (coming soon)
  • Part 3: Monorepo vs Multirepo Survey Results (coming soon, but fill out the survey here)
  • Part 4: Repo Decision Framework (coming soon)

In this first article we define the problem and understand the challenges involved with sharing code.


Earlier this year, my company GetHuman raised money and grew from 2 engineers to 8. The good thing about this is that we were able to start writing a lot more code. The bad thing about this is that we started writing a lot more code.

The Problem

Back when it was just my co-founder and me, there were very few formal development processes. We just hacked away and pushed into production at will. It worked because we both knew every line of code. It was easy to coordinate changes because we were essentially single threaded on most projects.

That all changed after we hired 6 more developers and the new team started cranking. Not only were there more commits on a daily basis, but the types of commits became more diverse and complex. The interesting thing is that while our overall output increased, the productivity of each individual developer actually decreased. The primary reason for this can be boiled down to one simple statement:

Sharing code efficiently at scale is hard.

To be clear, sharing code is not hard. There are many different ways to share code in general and most of them are relatively easy to implement. The hard part is doing it efficiently at scale, which means:

  1. Multiple code modules (that share some code)
  2. Multiple team members
  3. High rate of change
  4. Little to no loss of individual productivity

When your goal is to have all four of these conditions exist at the same time, you will almost certainly run into some challenges.

Scalability Challenges

Over the past year, most of the issues we had scaling our codebase fall into one of the following categories:

  • Refactoring
  • Versioning
  • Testing
  • Reviewing
  • Consistency
  • Deploying
  • Size

Each category is described in more detail below.

Refactoring

Let’s say that we want to rename a field in a shared data model from “isSubscriber” to “subscriberFlag”. We could publish that one-line change and let any code using the data model break, or we could try to update every reference to “isSubscriber” throughout the codebase at the same time. Many code editors are good at this kind of rename, but only if all of the code is loaded in the editor at once.

This can be a major source of pain as your codebase grows larger. For example, at one point, we had 42 npm modules split among 15 different repos. It was nearly impossible to use WebStorm with all of this code (it would lock up on a regular basis to re-index). VS Code is better because it generally doesn’t lock up, but IntelliSense is often unavailable (I suspect because of a similar need to constantly re-index code in the background).
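To make the rename above concrete, here is a minimal sketch of one way to soften the break while consumers in other repos catch up. The interface names (UserV1, UserV2) and the helper readSubscriberFlag are hypothetical and only for illustration; they are not from our actual codebase.

```typescript
// Hypothetical shared data model, before and after the rename.
interface UserV1 {
  name: string;
  isSubscriber: boolean;
}

interface UserV2 {
  name: string;
  subscriberFlag: boolean;
}

// Transitional reader that accepts either shape, so consumers that have not
// been updated yet keep working until every reference is renamed.
function readSubscriberFlag(user: UserV1 | UserV2): boolean {
  if ("subscriberFlag" in user) {
    return user.subscriberFlag;
  }
  return user.isSubscriber;
}

const oldRecord: UserV1 = { name: "Ada", isSubscriber: true };
const newRecord: UserV2 = { name: "Grace", subscriberFlag: false };

console.log(readSubscriberFlag(oldRecord)); // true
console.log(readSubscriberFlag(newRecord)); // false
```

A shim like this trades the immediate breakage for extra code that someone has to remember to delete later, which is its own kind of overhead.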

Versioning

Whenever a shared dependency is updated, you have a choice:

  1. Let the modules that depend on it upgrade at some point in the future
  2. Force all dependent modules to upgrade immediately

In either case, there is some overhead. With option 1, you must maintain multiple versions of the shared dependency. With option 2, you need to update and test all dependent modules right away.
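The cost of each option can be sketched with a few lines of code. This is only an illustration: the module names and version numbers below are made up, and the version parsing assumes plain "major.minor.patch" strings with no ranges or pre-release tags.

```typescript
// Hypothetical snapshot: consumer module -> pinned version of one shared dependency.
type Pins = Record<string, string>;

// Split "major.minor.patch" into numbers (assumes plain versions only).
function parseVersion(version: string): number[] {
  return version.split(".").map((part) => Number(part));
}

// True when version a is strictly older than version b.
function isOlder(a: string, b: string): boolean {
  const left = parseVersion(a);
  const right = parseVersion(b);
  for (let i = 0; i < 3; i++) {
    const l = left[i] ?? 0;
    const r = right[i] ?? 0;
    if (l !== r) return l < r;
  }
  return false;
}

// Option 1 cost: every distinct version still in use has to be maintained.
function versionsInUse(pins: Pins): string[] {
  return Array.from(new Set(Object.values(pins))).sort((a, b) =>
    isOlder(a, b) ? -1 : isOlder(b, a) ? 1 : 0
  );
}

// Option 2 cost: every consumer behind the latest version must be
// updated and tested right away.
function consumersToUpgrade(pins: Pins, latest: string): string[] {
  return Object.entries(pins)
    .filter(([, version]) => isOlder(version, latest))
    .map(([name]) => name);
}

const pins: Pins = {
  "web-app": "1.2.0",
  "api-server": "2.0.0",
  "admin-tool": "1.2.0",
  "mailer": "1.4.1",
};

console.log(versionsInUse(pins)); // [ '1.2.0', '1.4.1', '2.0.0' ]
console.log(consumersToUpgrade(pins, "2.0.0")); // [ 'web-app', 'admin-tool', 'mailer' ]
```

With only four consumers the lists are short, but both grow with every module and every release.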

Perhaps an even bigger issue with versioning, however, is when you need to go back in time. This can be a huge (i.e. nearly impossible) challenge if your codebase is split among dozens of different repos and you need to figure out where and when a bug was introduced into the system.

Testing (for JavaScript)

In general, integration and end-to-end (e2e) tests can be tricky. For JavaScript, the challenge with testing multiple local dependencies can be best summarized by this well-known JavaScript adage:

“npm linking sucks, yo”

Theoretically, you should be able to npm link all your local dependencies while you make changes. In practice, however, this rarely works well when you have a large number of interdependent npm modules. The overhead involved is non-trivial (e.g. see how long it takes you to npm link 3+ modules together) and a wide variety of tricky issues crop up on a seemingly regular basis (e.g. this nasty TypeScript issue or this annoying Angular issue or this Babel issue or this Webpack issue or this other Webpack issue…you get the point).

Reviewing

We always try to keep each Pull Request (PR) as small and specific as possible. However, sometimes even small, targeted updates require changes across several different repos. Each repo involved requires the overhead of another PR. On top of that, it is hard for reviewers to understand the full scope of the change since they have to look across all the PRs involved.

There is also a question of ownership. Who is responsible for reviewing a particular PR? At scale it is not feasible to have everyone review every PR. There are actually two levels of issues here.

  1. First, how do you set up a system where the right people can review the right PRs at the right time?
  2. Second, how do you ensure that whatever system you set up to answer the first question doesn’t create an untenable matrix of gatekeepers who grind productivity to a halt?
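As one possible answer to the first question, reviewers can be derived from the paths a PR touches. The sketch below is an assumption about how such a system might work, not a description of what we use; the rule format, paths, and reviewer names are all hypothetical.

```typescript
// Hypothetical ownership rule: files under pathPrefix are reviewed by reviewers.
interface OwnershipRule {
  pathPrefix: string;
  reviewers: string[];
}

// For each changed file, pick the most specific (longest) matching prefix
// and collect its reviewers. Files with no matching rule add nobody.
function reviewersFor(changedFiles: string[], rules: OwnershipRule[]): string[] {
  const selected = new Set<string>();
  for (const file of changedFiles) {
    let best: OwnershipRule | undefined;
    for (const rule of rules) {
      const matches = file.startsWith(rule.pathPrefix);
      if (matches && (best === undefined || rule.pathPrefix.length > best.pathPrefix.length)) {
        best = rule;
      }
    }
    if (best !== undefined) {
      for (const reviewer of best.reviewers) {
        selected.add(reviewer);
      }
    }
  }
  return Array.from(selected).sort();
}

const rules: OwnershipRule[] = [
  { pathPrefix: "packages/", reviewers: ["bob"] },
  { pathPrefix: "packages/billing/", reviewers: ["alice"] },
];

const changedFiles = ["packages/billing/invoice.ts", "packages/search/index.ts"];

console.log(reviewersFor(changedFiles, rules)); // [ 'alice', 'bob' ]
```

Note that this only addresses the first question. Every rule added here is also a potential gatekeeper, which is exactly the concern raised in the second.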

Individual repo owners acting as gatekeepers can be a real problem in some organizations. When I worked at Wells Fargo, the overhead involved with convincing another team to allow us to make a particular change in “their” repo often made us think twice about making the change in the first place.

Consistency

Somewhat related to the review process is how to best keep a consistent set of standards across the entire code base. This is pretty easy with 2 or 3 developers when everyone can review every change, but what about when there are dozens of developers (or more) and they push in hundreds of changes (or more) every day?

Deploying

While we started off with many different smaller repos, we experienced some of the issues described above and at one point we tried to consolidate everything into one larger repo. We quickly ran into a major issue, however. It turns out that a large majority of the most popular deployment tools are completely incapable of effectively handling multiple disparate deployment configurations. They all expect exactly one configuration file. For example:

  • CircleCI
  • TravisCI
  • Codeship

You can, of course, deploy multiple packages, but they must all be part of the same configuration. This is a problem if you want to independently manage different deployable projects in one repo.
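If everything has to live in one configuration, that configuration needs some way to decide which projects a given change actually affects. The sketch below shows one way that decision could be made from a dependency graph. It is not a feature of any of the tools named above, and the package names and graph are hypothetical.

```typescript
// Hypothetical map: package name -> local packages it depends on.
type DependencyGraph = Record<string, string[]>;

// Return the changed packages plus everything that depends on them,
// directly or transitively.
function affectedPackages(graph: DependencyGraph, changed: string[]): string[] {
  const affected = new Set<string>(changed);
  let grew = true;
  // Repeat until a full pass adds nothing new (simple fixed-point iteration).
  while (grew) {
    grew = false;
    for (const [name, deps] of Object.entries(graph)) {
      if (!affected.has(name) && deps.some((dep) => affected.has(dep))) {
        affected.add(name);
        grew = true;
      }
    }
  }
  return Array.from(affected).sort();
}

const graph: DependencyGraph = {
  "shared-models": [],
  "api-client": ["shared-models"],
  "web-app": ["api-client"],
  "mailer": [],
};

// Only some packages are deployable projects; the rest are libraries.
const deployable = ["web-app", "mailer"];

const affected = affectedPackages(graph, ["shared-models"]);
const toDeploy = deployable.filter((name) => affected.includes(name));

console.log(affected); // [ 'api-client', 'shared-models', 'web-app' ]
console.log(toDeploy); // [ 'web-app' ]
```

Even with logic like this, the projects are still coupled through a single configuration, so it does not give each one independent control over its own deployment.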

Size

Another major challenge with trying to use one repo for everything is size. The common wisdom with hardware is that it is generally easier in the short term to scale vertically (i.e. add more memory or more CPU), but it is better in the long term to scale horizontally (i.e. add more machines). This same logic holds true for code repositories as well.

The size of any given repo does not generally matter until you get to the point where day-to-day operations are affected. Some of the signs that your repo may be getting too big include:

  • Cloning takes minutes to complete and sometimes fails
  • Merging takes too long
  • The repo does not fit on some developer laptops

In general, dealing with a large repo can be troublesome at best and will only get worse over time unless steps are taken to deal with the size (more on that in Part 2 of this series).

OK, so what do we do about all this?

I am fairly certain that almost all professional software developers have encountered at least some of the issues described in this article at some point in their careers. Many of these developers do truly care about finding good solutions so that they can share code efficiently at scale.

The weird thing, however, is that as important and prevalent as this issue is, there are not a lot of great resources out there. There are no doubt many articles written about specific solutions at specific companies, but I have found almost all of them lacking in the detail and flexibility needed to implement the solution I need for my current situation.

So, I have two very specific goals for this 4-part blog post series:

  1. Create a decision framework that anyone and everyone can use to help them figure out which solution(s) would work best for their current situation.
  2. Identify specific tooling and implementation details for each potential solution.

I don’t think that there is one perfect solution that will work for everyone, but I do think that many different companies have figured out solutions that work well for them. The key is to coalesce this knowledge into one cohesive and digestible package. That is exactly what I hope to accomplish with the next part in this blog post series, “How the Best Tech Companies Organize Code”.


I haven’t published the other 3 articles yet, but once I do, I will update the series list at the top of this article with links.

In the meantime, please add comments below and/or hit me up on Twitter with your thoughts and opinions on this topic.

Written by Jeff Whelpley.