So you want to write a package manager

Package management is awful, you should quit right now

  1. software is terrible
  2. people are terrible
  3. there are too many different scenarios
  4. nothing will really work for sure
  5. it’s provable that nothing will really work for sure
  6. our lives are meaningless perturbations in a swirling vortex of chaos and entropy

LOLWUT is “Package Manager”

  • OS/system package manager (SPM): this is not why we are here today
  • Language package manager (LPM): an interactive tool (e.g., `go get`) that can retrieve and build specified packages of source code for a particular language. Bad ones dump the fetched source code into a global, unversioned pool (GOPATH), then cackle maniacally at your fervent hope that the cumulative state of that pool makes coherent sense.
  • Project/application dependency manager (PDM): an interactive system for managing the source code dependencies of a single project in a particular language. That means specifying, retrieving, updating, arranging on disk, and removing sets of dependent source code, in such a way that collective coherency is maintained beyond the termination of any single command. Its output — which is precisely reproducible — is a self-contained source tree that acts as the input to a compiler or interpreter. You might think of it as “compiler, phase zero.”
  1. divine, from the myriad possible shapes of and disorder around real software in development, the set of immediate dependencies the developer intends to rely on, then
  2. transform that intention into a precise, recursively-explored list of source code dependencies, such that anyone — the developer, a different developer, a build system, a user — can
  3. create/reproduce the dependency source tree from that list, thereby
  4. creating an isolated, self-contained artifact of project + dependencies that can be input to a compiler/interpreter.

We Have Met The Enemy, And They Are Us

  • I have some unit of software I’m creating or updating — a “project.” While working on that project, it is my center, my home in the FLOSS world, and anchors all other considerations that follow.
  • I have a sense of what needs to be done on my project, but must assume that my understanding is, at best, incomplete.
  • I know that I/my team bear the final responsibility to ensure the project we create works as intended, regardless of the insanity that inevitably occurs upstream.
  • I know that if I try to write all the code myself, it will take more time and likely be less reliable than using battle-hardened libraries.
  • I know that relying on other people’s code means hitching my project to theirs, entailing at least some logistical and cognitive overhead.
  • I know that there is a limit beyond which I cannot possibly grok all code I pull in.
  • I don’t know all the software that’s out there, but I do know there’s a lot, lot more than what I know about. Some of it could be relevant, almost all of it won’t be, but searching and assessing will take time.
  • I have to prepare myself for the likelihood that most of what’s out there may be crap — or at least, I’ll experience it that way. Sorting wheat from chaff, and my own feelings from fact, will also take time.

People are going to do risky activities. Instead of saying YOU’RE WRONG TO DO THAT JUST DON’T DO THAT, we can choose to help make those activities less risky.

States and Protocols

Meet the Cast

The on-disk states that matter to a PDM
  • Project code: the source code that’s being actively developed, for which we want the PDM to manage its dependencies. Being that it’s not currently 1967, all project code is under version control. For most PDMs, project code is all the code in the repository, though it could be just a subdirectory.
  • Manifest file: a human-written file (though typically with machine help) that lists the depended-upon packages to be managed. It’s common for a manifest to also hold other project metadata, or instructions for an LPM that the PDM may be bundled with. (This must be committed, or else nothing works.)
  • Lock file: a machine-written file with all the information necessary to [re]produce the full dependency source tree. Created by transitively resolving all the dependencies from the manifest into concrete, immutable versions. (This should always get committed. Probably. Details later.)
  • Dependency code: all of the source code named in the lock file, arranged on disk such that the compiler/interpreter will find and use it as intended, but isolated so that nothing else would have a reason to mutate it. Also includes any supplemental logic that some environments may need, e.g., autoloaders. (This needn’t be committed.)

Pipelines within Pipelines

The compiler or interpreter neither knows nor cares about the PDM. All it sees is source code inputs.
Source code (and PDM actions thereupon) are on the X axis. Commit history, as organized by a version control system, is on the Y.
Project code and the manifest express the user’s intentions. The lock file and dependency source code are the PDM’s attempts to fulfill those intentions.
Don’t even pretend it’s not true

To Manifest a Manifest

Also known as, “normal development”
A slightly tweaked version of the Cargo manifest for Rust’s iron crate

Central Package Registry

[package]
version = "0.2.6"

Parameterization

Dependencies

  • Adding or removing a dependency
  • Changing the desired version of an existing dependency
  • Changes to the parameters or source types of a dependency
<tool> add <identifier>@<version>
<tool> add <identifier>
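
A minimal sketch of how a tool might split the argument of those two command forms. The names `AddRequest` and `parse_add_argument` are hypothetical, and treating the last `@` as the separator is an assumption made for illustration, not something any particular tool is known to do.

```python
from typing import NamedTuple, Optional


class AddRequest(NamedTuple):
    identifier: str
    version: Optional[str]  # None means "no version given; the tool picks a default"


def parse_add_argument(arg: str) -> AddRequest:
    """Split '<identifier>@<version>' into its parts; the version is optional."""
    identifier, sep, version = arg.rpartition("@")
    if not sep:
        # No '@' at all: the whole argument is the identifier.
        return AddRequest(arg, None)
    if not identifier or not version:
        raise ValueError(f"malformed package argument: {arg!r}")
    return AddRequest(identifier, version)
```

With this, `parse_add_argument("iron@0.2.6")` carries an explicit version, while `parse_add_argument("iron")` leaves the version as `None` for the tool to fill in.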

Hang on, we need to talk about versions

I know there is a limit beyond which I cannot possibly grok all code I pull in.

  • Evolutions in the software, via a well-defined ordering relationship between any two versions
  • The general readiness of a given version for public use (i.e., <1.0.0, pre-releases, alpha/beta/rc)
  • The likelihood of different classes of incompatibilities between any given pair of versions
  • Implicitly, that if there are versions, but you use a revision without a version, you may have a bad time
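
To make the "well-defined ordering relationship" concrete, here is a rough sketch of semver-style precedence, assuming the versioning scheme is Semantic Versioning 2.0.0. The `Version` class and its `is_stable` helper are hypothetical names for illustration; build metadata is accepted but ignored when ordering.

```python
import re
from functools import total_ordering

_SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$"
)


@total_ordering
class Version:
    def __init__(self, text: str):
        m = _SEMVER.match(text)
        if m is None:
            raise ValueError(f"not a semantic version: {text!r}")
        self.text = text
        self.core = tuple(int(g) for g in m.group(1, 2, 3))
        self.pre = tuple(m.group(4).split(".")) if m.group(4) else ()

    def _key(self):
        # Numeric pre-release identifiers compare numerically and sort
        # before alphanumeric ones; a plain release sorts after all of
        # its own pre-releases.
        pre_key = tuple(
            (0, int(p), "") if p.isdigit() else (1, 0, p) for p in self.pre
        )
        return (self.core, 0 if self.pre else 1, pre_key)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

    def is_stable(self) -> bool:
        """Rough 'ready for public use' signal: >= 1.0.0 and not a pre-release."""
        return self.core[0] >= 1 and not self.pre
```

Sorting a list of `Version` objects then gives the timeline a PDM needs, and `is_stable()` is one crude way to surface the readiness signal from the second bullet.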

I have to prepare myself for the likelihood that most of what’s out there will probably be crap. Sorting wheat from chaff will also take time.

Non-Version Versions

Note that only one of these concepts can be represented with pictures from iStockPhoto. HMMM.

I know that relying on other people’s code means hitching my project to theirs, entailing at least some logistical and cognitive overhead.

The Unit of Exchange

$ ls -a
.
..
.git
MANIFEST
<ur source code heeere>
  • The manifest (and the lock file) take on a particularly meaningful relationship to their neighboring code. Generally, the manifest then defines a single ‘unit.’
  • It is still ABSOLUTELY NECESSARY that your unit of exchange be situated on its own timeline — and you can’t rely on the VCS anymore to provide it. No timeline, no universes; no universes, no PDM; no PDM, no sanity.
  • And remember: software is hard enough without adding a time dimension. Timeline information shouldn’t be in the source itself. Nobody wants to write real code inside a tesseract.

Other Thoughts

  • Choose a format primarily for humans, secondarily for machines: TOML or YAML, else (ugh) JSON. Such formats are declarative and stateless, which makes things simpler. Proper comments are a big plus — manifests are the home of experiments, and leaving notes for your collaborators about the what and why of said experiments can be very helpful!
  • TIMTOWTDI, at least at the PDM level, is your arch-nemesis. Automate housekeeping completely. If PDM commands that change the manifest go beyond add/remove and upgrade commands, it’s probably accidental, not essential. See if it can be expressed in terms of these commands.
  • Decide whether to have a central package registry (almost certainly yes). If so, jam as much info for the registry into the manifest as needed, as long as it in no way impedes or muddles the dependency information needed by the PDM.
  • Avoid having information in the manifest that can be unambiguously inferred from static analysis. High on the list of headaches you do not want is unresolvable disagreement between manifest and codebase. Writing the appropriate static analyzer is hard? Tough tiddlywinks. Figure it out so your users won’t have to.
  • Decide what versioning scheme to use (probably semver, or something like it, or an enhancement of it with a total order). It’s probably also wise to allow things outside the base scheme: maybe branch names, maybe immutable commit IDs.
  • Decide if your software will combine PDM behavior with other functionality like an LPM (probably yes). Keep any instructions necessary for that purpose cleanly separated from what the PDM needs.
  • There are other types of constraints — e.g., required minimum compiler or interpreter version — that may make sense to put in the manifest. That’s fine. Just remember, they’re secondary to the PDM’s main responsibility (though it may end up interleaving with it).
  • Decide on your unit of exchange. Make a choice appropriate for your language’s semantics, but absolutely ensure your units all have their own timelines.

The Lockdown

In which we jump the gap. Are you watching closely?

I know that I/my team bear the final responsibility to ensure the project we create works as intended, regardless of the insanity that inevitably occurs upstream.

The algorithm

  • Build a dependency graph (so: directed, acyclic, and variously labeled) by recursively following dependencies, starting from those listed in the project’s manifest
  • Select a revision that meets the constraints given in the manifest
  • If any shared dependencies are found, reconcile them with <strategy>
  • Serialize the final graph (with whatever extra per-package metadata is needed), and write it to disk. Ding ding, you have a lock file!
Our project depends directly on A and B, which depend on C, which depends on D, and E, which depends on F.
The project’s lock file should record all of A, B, C, D, E, and F.
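
Here is a simplified sketch of that walk: recursively follow dependencies from the manifest, visit shared ones once, and serialize the result. The `REGISTRY` table is hypothetical. Its edges are one plausible reading of the example above, and every version string is invented for illustration. Real resolution also has to pick versions against constraints, which this skips entirely.

```python
import json

# Hypothetical registry: package name -> (pinned version, direct dependencies).
# Assumed edges: A and B both need C; C needs D and E; E needs F.
REGISTRY = {
    "A": ("1.0.0", ["C"]),
    "B": ("1.0.0", ["C"]),
    "C": ("1.0.0", ["D", "E"]),
    "D": ("1.0.0", []),
    "E": ("1.0.0", ["F"]),
    "F": ("1.0.0", []),
}


def resolve(direct_deps, registry):
    """Return {name: version} for every package reachable from direct_deps."""
    locked = {}
    stack = list(direct_deps)
    while stack:
        name = stack.pop()
        if name in locked:
            continue  # shared dependency already visited (e.g. C via A and B)
        version, deps = registry[name]
        locked[name] = version
        stack.extend(deps)
    return locked


def write_lock(locked):
    """Serialize deterministically: the same graph always yields the same text."""
    return json.dumps({"packages": locked}, indent=2, sort_keys=True)
```

Calling `write_lock(resolve(["A", "B"], REGISTRY))` produces a lock listing all six packages, regardless of the order in which the manifest names A and B.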
  • The user expressly indicated to ignore the lock file
  • A floating version, like a branch, is the version specifier
  • The user is requesting an upgrade of one or more dependencies
  • The manifest changed and no longer admits them
  • Resolving a shared dependency will not allow it

Diamonds, SemVer and Bears, Oh My!

The set in blue form a happy diamond.
A broken diamond. Also noteworthy: while the happy diamond is merely a graph, this is also a tree. Y’know what else are trees? Filesystems. Do you smell a useful isomorphism? I smell a useful isomorphism.
  • Highlander: Analyze the A->C relationship and the B->C relationship to determine if A can be safely switched to use C-1.1.1, or B can be safely switched to use C-1.0.3. If not, fall back to realpolitik.
  • Realpolitik: Analyze other tagged/released versions of C to see if they can satisfy both A and B’s requirements. If not, fall back to elbow grease.
  • Elbow grease: Fork/patch C and create a custom version that meets both A and B’s needs. At least, you THINK it does. It’s probably fine. Right?
  • Phone a friend: ask the authors of A and B if they can both agree on a version of C to use. (If not, fall back to Highlander.)
  • Just because the semver ranges suggest solution[s], doesn’t mean I have to accept them.
  • A PDM tool can always further refine semver matches with static analysis (if the static analyses feasible for the language have anything useful to offer).
  • No matter which of the compromise solutions is used, I still have to do integration testing to ensure everything fits for my project’s specific needs.
  • The goal of all the compromise approaches is to pick an acceptable solution from a potentially large search space (as large as all available revisions of C). Reducing the size of that space for zero effort is beneficial, even if occasional false positives are frustrating.
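
The "realpolitik" step can be sketched as a search over released versions of C for one that every dependent accepts. This assumes A and B express their needs as caret-style ranges (same major version, at least the stated version), which is a convention borrowed from tools like npm and Cargo rather than anything this article prescribes. The function names and the list of available versions are hypothetical.

```python
def parse(version):
    """'1.0.3' -> (1, 0, 3); plain MAJOR.MINOR.PATCH only."""
    return tuple(int(part) for part in version.split("."))


def caret(version):
    """Half-open range [version, next major). Assumes major >= 1."""
    low = parse(version)
    return low, (low[0] + 1, 0, 0)


def satisfies(version, version_range):
    low, high = version_range
    return low <= parse(version) < high


def realpolitik(available, ranges):
    """Highest available version accepted by every range, or None."""
    acceptable = [
        v for v in available if all(satisfies(v, r) for r in ranges)
    ]
    return max(acceptable, key=parse) if acceptable else None


# Hypothetical releases of C; A was built against 1.0.3, B against 1.1.1.
C_RELEASES = ["1.0.3", "1.1.1", "1.2.0", "2.0.0"]
choice = realpolitik(C_RELEASES, [caret("1.0.3"), caret("1.1.1")])
```

A `None` result is the signal to fall back to elbow grease. A non-`None` result is only a candidate: as the bullets above say, the ranges narrow the search space, and integration testing still decides.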

Dependency Parameterization

Compiler, phase zero: Lock to Deps

All the lifting, none of the thinking

The Dance of the Four States

  • init: Create a manifest file, possibly populating it based on static analysis of the existing code.
  • add: Add the named package[s] to the manifest.
  • rm: Remove the named package[s] from the manifest. (Often omitted, because text editors exist).
  • update: Update the pinned version of package[s] in the lock file to the latest available version allowed by the manifest.
  • install: Fetch and place all dep sources listed in the lock file, first generating a lock file from the manifest if it does not exist.
Each command reads the state at the arrow’s source in order to mutate the state at the arrow’s target. add/rm might be triggered from static analysis, but typically come from the user’s knowledge of what needs doing. install implicitly creates a lock file, if needed.

A New-ish Idea: Map, Sync, Memo

  • f : P → M: To whatever extent static analysis can infer dependency identifiers or parameterization options from the project’s source code, this maps that information into the manifest. If no such static analysis is feasible, then this ‘function’ is really just manual work.
  • f : M → L: Transforms the immediate, possibly-loosely-versioned dependencies listed in a project’s manifest into the complete reachable set of packages in the dependency graph, with each package pinned to a single, ideally immutable version.
  • f : L → D: Transforms the lock file’s list of pinned packages into source code, arranged on disk in a way the compiler/interpreter expects.
Manifest and lock are out of sync, but lock and deps are still in sync because each function is narrowly scoped
If there’s a lock file, but the dep sources don’t exist…well then, duh, they’re out of sync
  • does not exist
  • exists, but desynced from predecessor
  • exists and in sync with predecessor
All commands are still oriented towards the same type of state mutation, but they rest atop a pervasive sync operation that ensures a computable pre- and post-mutation environment. (`install` might not even be needed anymore!)
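
One way to make the three sync states computable is to have each output remember a digest of the input it was derived from. The sketch below does this for the manifest-to-lock step only. It assumes the "memo" is a hash stored inside the lock file, and the names `SyncState`, `make_lock`, and `lock_state`, along with the field layout, are hypothetical.

```python
import hashlib
import json
from enum import Enum


class SyncState(Enum):
    MISSING = "does not exist"
    DESYNCED = "exists, but desynced from predecessor"
    IN_SYNC = "exists and in sync with predecessor"


def digest(state):
    """Stable hash of any JSON-serializable state."""
    blob = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def make_lock(manifest, pinned):
    """Build a lock that memoizes the manifest it was computed from."""
    return {"memo": digest(manifest), "packages": pinned}


def lock_state(manifest, lock):
    """Classify the lock relative to its predecessor, the manifest."""
    if lock is None:
        return SyncState.MISSING
    if lock.get("memo") != digest(manifest):
        return SyncState.DESYNCED
    return SyncState.IN_SYNC
```

The same pattern could apply one step further down, with the dependency tree recording a digest of the lock it was built from. Any command can then check state before and after it mutates anything, without re-running resolution.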

Dénouement

A PDM for Go

  • A PDM for Go is doable, needn’t be that hard, and could even integrate nicely with `go get` in the near term
  • A central Go package registry could provide lots of wins, but any immediate plan must work without one
  • Monorepos can be great for internal use, and PDMs should work within them, but monorepos without a registry are harmful for open code sharing
  • I have an action plan that will indubitably result in astounding victory and great rejoicing, but to learn about it you will need to read the bullet points at the end
  • GOPATH manipulation is a horrible, unsustainable strategy. We all know this. Fortunately, Go 1.6 enables the vendor directory by default (Go 1.5 shipped it as an opt-in experiment), which opens the door to encapsulated builds and a properly project-oriented PDM.
  • Go’s linker will allow only one package per unique import path. Import path rewriting can circumvent this, but Go also has package-level variables and init functions (aka, global state and potentially non-idempotent mutation of it). These two facts make npm-style, “broken diamond” package duplication an unsound strategy for handling shared deps.
  • Without a central registry, repositories must continue acting as the unit of exchange. This intersects quite awkwardly with Go’s semantics around the directory/package relationship.
  • Approaches to package management must be dual-natured: both inward-facing (when your project is at the ‘top level’ and consuming other deps), and outward-facing (when your project is a dep being consumed by another project).
  • While there’s some appreciation of the need for harm reduction, too much focus has been on reducing harm through reproducible builds, and not enough on mitigating the risks and uncertainties developers grapple with in day-to-day work.

Upgrading `go get`

.git                 # so, this is the repo root
main.go              # a main package
cmd/
  bar/
    main.go          # another main package
foo/
  foo.go             # a non-main package
  foo_amd64.go       # arch-specific, may have extra deps
  foo_test.go        # tests may have extra deps
LOCKFILE             # the lockfile, derp
vendor/              # `go get` puts all (LOCKFILE) deps here
  1. Walk back up from the specified subpath (e.g. `go get <repoaddr>/cmd/bar`) until a LOCKFILE is found or repository root is reached
  2. If a LOCKFILE is found, dump deps into adjacent vendor dir
  3. If no LOCKFILE is found, fall back to the historical GOPATH/default branch tip-based behavior
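
The walk-up in those three steps can be sketched as below. This is Python operating on an in-memory set of repository-relative paths, not a patch to `go get`. The function name `find_lockfile_dir` and the two example file sets are hypothetical, modeled loosely on the layouts in this section.

```python
from pathlib import PurePosixPath


def find_lockfile_dir(repo_files, subpath):
    """Walk up from subpath toward the repo root.

    Returns the directory (relative to the root, '.' for the root itself)
    that holds a LOCKFILE, or None if the root is reached without one.
    """
    current = PurePosixPath(subpath)
    while True:
        if str(current / "LOCKFILE") in repo_files:
            return str(current)
        if current == current.parent:
            return None  # reached the top: caller falls back to GOPATH behavior
        current = current.parent


# Hypothetical layout with a LOCKFILE at the repository root.
ROOT_LOCKED = {
    "main.go", "cmd/bar/main.go",
    "foo/foo.go", "foo/foo_amd64.go", "foo/foo_test.go",
    "LOCKFILE",
}

# Hypothetical layout where only cmd/ carries a LOCKFILE.
NESTED = {"foo.go", "bar/bar.go", "cmd/LOCKFILE", "cmd/main.go"}
```

When a directory is returned, the deps would go into a `vendor` directory next to the LOCKFILE found there. When `None` comes back, the caller falls through to step 3.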
.git
README.md
src/
  LOCKFILE
  main.go
  foo/
    foo.go

.git
foo.go
bar/
  bar.go
cmd/
  LOCKFILE
  main.go

$GOPATH/github.com/sdboyer/example/
  .git
  foo.go
  bar/
    bar.go
  cmd/
    LOCKFILE
    main.go

vendor/github.com/sdboyer/example/
  foo.go
  bar/
    bar.go
  cmd/               # The PDM could omit from here on down
    LOCKFILE
    main.go

Sharing, Monorepos, and The Fewest “Don’t Do It”s I can manage

  • Open repositories are safe and sane to pull in as a dependency; closed repositories are not. (Just reiterating.)
  • Open repositories should have one manifest, one lock file, and one vendor directory, ideally at the repository root.
  • Open repositories should always commit their manifest and lock, but not their vendor directory.
  • Closed repositories can have multiple manifest/lock pairs, but it’ll be on you to arrange them sanely.
  • Closed repositories can safely commit their vendor directories.
  • Open or closed, you should never directly modify upstream code in a vendor directory. If you need to, fork it, edit, and pull in (or alias) your fork.

Semantic Versioning, and an Action Plan

  • Interested folks should come together to create and publish a general recommendation on what types of changes necessitate what level of semver version bumps.
  • These same interested folks could come together to write a tool that does basic static analysis of a Go project repository to surface relevant facts that might help in deciding on what semver number to initially apply. (And then…post an issue automatically with that information!)
  • If we’re feeling really frisky, we could also throw together a website to track the progress towards the semver-ification of Go-dom, as facilitated by the analyze-and-post tool. Scoreboards, while dumb, help spur collective action!
  • At the same time, we can pursue a simplest-possible case — defining a lock file, for the repository root only, that `go get` can read and transparently use, if it’s available.
  • As semver’s adoption spreads, the various community tools can experiment with different workflows and approaches to the monorepo/versioning problems, but all still conform to the same lock file.
