How to build reliable environments, share internal libraries, and migrate major versions in a growing organization.
Managing dependencies is hard. Managing 3rd-party dependencies, which might not share your interpretation of semver, is even harder. Doing this with many internal libraries, each with their own dependency constraints, each maintained by independent programming teams, makes for a collection of fun puzzles.
In the Beginning
Consider the starter approach that a small monolithic application might take. This could be a web app built on flask, or a command-line tool built on click. In the beginning, the application specifies its dependencies in requirements.txt. A minimal requirements.txt could look like:
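For a Flask app, for instance, it might contain a single unpinned entry:

```
flask
```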
This setup would not last very long. As release cadence increases, the risk of pulling in a breaking version of some updated package becomes intolerable. The programming team immediately succumbs to the temptation to pin all dependencies, only to run into pip#988, as well as other diamond dependency problems.
We should just pin everything.
— every Flask developer, 3 weeks into it
# with pinned versions of immediate dependencies (versions illustrative)
flask==1.0
Furthermore, notice that Flask depends on jinja>=2.10. Without an upper bound on jinja, the latest version of jinja will always be pulled during a rebuild of the environment. There is no guarantee that even after pinning the versions of immediate dependencies, these packages would not pull in breaking versions of transitive dependencies.
Building Reproducible Environments
At this point, it becomes apparent that we need to start using lock files to enumerate and pin all transitive dependencies — Python's equivalent of Gemfile.lock in Ruby, or Cargo.lock in Rust.
In Python, we can use pip-tools:
$ pip-compile --output-file requirements.txt requirements.in
or the newer pipenv:
$ pipenv lock # newer alternative
The produced lock file is committed to SCM and contains the pinned versions of all immediate and transitive dependencies. In our example above, the application’s lock file would look like:
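Continuing the Flask example, the generated file would read something like this (exact versions and annotations are illustrative and will vary):

```
# autogenerated by pip-compile
click==6.7            # via flask
flask==1.0
itsdangerous==0.24    # via flask
jinja2==2.10          # via flask
markupsafe==1.0       # via jinja2
werkzeug==0.14.1      # via flask
```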
The application also needs some reasonable distribution mechanism that packages the environment with the application code, e.g. docker images or debian/rpm packages¹. At this point, we have a versioned artifact that
- contains the entire application,
- pins all dependencies required to run the application,
- is reproducible from source control, and
- can be deployed onto multiple hosts.
This is the minimum required to create an environment that is reproducible in CI and in production. We call this the monolith.
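As a sketch, a Dockerfile along these lines (base image, paths, and entry point are illustrative) bakes the locked environment and the application code into one deployable image:

```dockerfile
# illustrative Dockerfile for the monolith
FROM python:3.6-slim
WORKDIR /app
# install the locked environment first, to maximize layer caching
COPY requirements.txt .
RUN pip install -r requirements.txt
# then add the application code
COPY . .
CMD ["python", "-m", "myapp"]
```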
Sharing Internal Libraries
The above approach would work for a monolithic application. Soon, the organization grows and develops more applications and (micro)services. With the right APIs, sharing internal libraries can be very useful, not just for code reuse, but to ensure implementation consistency. Predictably, some internal libraries emerge.
Don’t Repeat Yourself.
— Some C.S. sophomore
We surmise that:
- These applications and libraries would probably live within a monorepo.
- The applications would probably depend on internal libraries at HEAD of trunk.
- Each application would run from its own (virtual) environment, but internal libraries can be installed in more than 1 application.
We went from having 1 dependency management problem to having n dependency management problems. Consider this conundrum: how can an internal library update its dependencies? Would this have to be done in lock-step, which forces us to update all the applications all at once?
Managing Multiple Applications
One possible approach to avoid lock-step upgrades is to require all internal libraries to support a sufficiently wide range of versions of their 3rd-party dependencies. The organization can enforce, using a linter, that all internal libraries use the compatible release notation ~= instead of == when specifying 3rd-party dependencies.
For example, suppose some internal library depends on requests. This library may declare its dependency on requests like so:
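In a Pipfile, for instance, the constraint might read (the bound itself is illustrative):

```toml
# Pipfile of the internal library
requests = '~=2.10'
```

Per PEP 440, ~=2.10 is equivalent to >=2.10,<3.0, so any later release in the 2.x series satisfies the constraint.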
When the maintainer of said internal library wishes to upgrade to a newer version of requests, say 2.19, they can work with each application to upgrade them one at a time. With pipenv, this can be done using:
$ pipenv update requests # updates Pipfile.lock
$ pipenv lock --keep-outdated
Because the library's constraint remains ~=2.10, some applications can stay on the older version of requests, while others get upgraded. The maintainer does not have to upgrade all applications at once. The compatible release notation buys us this flexibility, and both pip-tools and pipenv will accept that the modified lock files meet all the specified version constraints.
Once all applications have been upgraded, the increased functionality that requires requests>=2.19 can be introduced into the internal library. Our intrepid maintainer then updates the library's dependencies to reflect this fact:
# Pipfile example
requests = '~=2.19'
With this approach, a conservative maintainer would also run the test suite of the internal library against each version of requests that they are supporting in the interim.
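One way to automate this interim test matrix, sketched with tox (env names and pins are illustrative):

```ini
# tox.ini of the internal library (illustrative)
[tox]
envlist = requests210, requests219

[testenv]
commands = pytest

[testenv:requests210]
deps =
    pytest
    requests~=2.10.0

[testenv:requests219]
deps =
    pytest
    requests~=2.19.0
```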
Maybe we shouldn’t add that dependency.
— C.S. sophomore from above, now a Tech Lead
If a 3rd-party library breaks API compatibility within the same major version, we are compelled to add an upper bound to the version constraint, e.g. ~=2.10,<=2.19. When that happens, a lock-step upgrade will be required to adopt the breaking version, unless the internal library is able to provide an adapter. Welcome to software development.
Migrating Major Versions with Shading
Occasionally, maintainers of 3rd-party libraries break compatibility with a new major version. The most popular libraries can be deeply embedded across multiple applications and internal libraries, often in many call sites within each, making the migration tedious. Moreover, without a full understanding of all the breaking changes, upgrading in one fell swoop can be risky.
At Affirm, we have successfully borrowed the shading technique from maven to carry out one such migration. By cloning the 3rd-party library (in our case, luigi), renaming it, and releasing it to an internal Artifactory registry, we were able to allow 2 versions of the library to coexist in the same environment.
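The mechanics can be sketched in miniature: the snippet below fabricates two trivial packages to stand in for the old version and its renamed (shaded) clone, then imports both into one process. Package names and versions here are hypothetical.

```python
# Minimal illustration of shading: two versions of the "same" library
# coexist in one environment under different import names.
import os
import sys
import tempfile

root = tempfile.mkdtemp()

def make_package(name, version):
    """Create a trivial package exposing only __version__."""
    pkg_dir = os.path.join(root, name)
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("__version__ = %r\n" % version)

make_package("mylib", "1.9.0")   # the old major version keeps its name
make_package("mylib2", "2.0.0")  # the shaded clone is renamed

sys.path.insert(0, root)
import mylib
import mylib2  # both resolve fine; call sites migrate one at a time

print(mylib.__version__, mylib2.__version__)  # -> 1.9.0 2.0.0
```

In the real migration, the clone is published to the internal registry under the new name, and each call site switches its import when its maintainer is ready.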
This shading technique has 2 advantages:
- the migration work can be farmed out to the various maintainers of applications and internal libraries which use luigi, since we no longer have to upgrade all at once; and
- each chunk of migration is easy to roll back, and with each incremental migration, the risk decreases as we accumulate experience.
Unfortunately, there is one important drawback: should the 3rd-party library have a large set of transitive dependencies, it may be necessary to shade all of them to obtain a viable installation of both major versions in the same environment. This is not always feasible.
Solving Dependency Puzzles
Managing dependencies is hard, especially in a fast-moving organization with multiple products and applications. While we strive to keep our dependency trees as small as we can, some 3rd-party dependencies are unavoidable. With some smart rules and workflows, it is possible to construct something sane without spending all your time sorting out dependency conflicts.
If you know of other/better techniques for managing dependencies in a mono-repo with >100 developers, I would love to hear about them.
¹on deb/rpm: note that virtualenvs are not quite relocatable.
- Yehuda Katz’s explanation of gem vs app, and gemspec vs Gemfile vs Gemfile.lock circa 2010.
- Donald Stufft’s analog of the above, for Python: setup.py vs requirements.txt.
- Kenneth Reitz’s talk at PyCon 2018: The Future of Python Dependency Management, where he introduced pipenv.
- From The Cargo Book for Rust: Why do binaries have Cargo.lock in version control, but not libraries?