Global Supply Chain Security — PURL and Namespace — Emerging Conventions

Shawn Hartsock
5 min read · Oct 6, 2021


source: https://github.com/package-url/purl-spec

reference: https://github.com/package-url/purl-spec

I had a problem. We wanted to be able to specify a “software package” universally between all build and repository systems. The problem is that the concept of a “software package” is itself context-sensitive.

Some package formats:

  • an RPM file or a DEB file is a package and there will be one per platform or dependency combination
  • a JAR file is a package and there may be one for each byte code type or some combination of dependencies.
  • a WAR file is a package that can contain a JAR … and we can recurse again into the EAR format (but this is less common now)
  • a Ruby Gem or Python wheel/egg/pex … are also packages and they usually are flat like an RPM or DEB
  • NuGet packages, Windows MSI, Windows installer executables, …
  • Go packages reference source code repositories directly but can sub-divide the internal contents to use only portions of a package
  • … and so forth … more than I can possibly list

From this quick package format survey I can derive a few facts about package management as a problem domain in general.

  • Packages can themselves contain packages (or not) and the sub-packages may or may not stand as their own package.
  • All packages universally resolve to a single file at some point. This file may or may not itself be executable and may have to vary based on compile targets, runtime dependencies, or a multitude of other platform-specific concerns.
  • Virtually all package management systems support the concept of mirrors. Mirrors allow the addressing of a package by its contents and not its location.

From this quick and dirty analysis it begins to feel like a good universal package location system would exhibit some properties. Maybe I can drop some of these … maybe I have yet to identify all the issues to solve.

  • content not location based specification … I don’t care if you get this from the definitive source or not as long as it is unaltered from that source.
  • some sister-service for determining tamper evidence in a package file or files once they are obtained … once I have some content name … I need some way to use that name to ask: is this the same content as what the publisher intended?
  • some mechanism for handling dependency mapping and recursion where packages contain packages … and this itself must be able to use the universal package locator / name convention recursively to at least a limited extent.
  • it should deterministically translate from one package management system’s identification scheme to another package management system’s scheme without conflict, since the identifier itself should be universal
  • any package name and identifier system should be able to handle the complexity of package content variation based on dependencies like target system byte-code, dependency versions, runtimes, etc.
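The “same content as the publisher intended” property in the list above can be sketched with a plain content digest. This is a minimal illustration, assuming the publisher announces a SHA-256 digest through some trusted channel; the helper names and the sample bytes are hypothetical, and a real sister-service would also need signatures and key management.

```python
import hashlib
import hmac


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the package bytes."""
    return hashlib.sha256(data).hexdigest()


def is_unaltered(data: bytes, published_digest: str) -> bool:
    """True if the bytes hash to the digest the publisher announced."""
    return hmac.compare_digest(sha256_of(data), published_digest.lower())


# Hypothetical package contents; a real check would read the file from disk.
package_bytes = b"pretend this is the content of curl-7.50.3-1.deb"
published = sha256_of(package_bytes)  # stands in for the publisher's digest

print(is_unaltered(package_bytes, published))         # True
print(is_unaltered(package_bytes + b"!", published))  # False
```

Because the check depends only on the bytes, it gives the same answer whether the file came from the definitive source or from a mirror.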

I stumbled upon the Package URL spec a few weeks ago. I was trying to solve a related package naming and identification problem. I had conjectured that the URL/URI/IRI concepts were an appropriate fit. It looks like the pURL project founders agreed.

I use “URL” loosely here; for this discussion it is interchangeable with Uniform Resource Identifier (URI), Internationalized Resource Identifier (IRI), and so on.

The pURL specification leans fairly heavily on the URL/URI specification in RFC 1738, which allows for multiple arbitrary schemes, optional credentials, namespaces, optional ports, paths, and, importantly, optional field and value pairs.

This capacity for a URL to be extended with additional field and value pairs means that a specific sub-syntax can itemize target architectures like arch=i386 or arch=amd64, or, more generally, express any query a package management system might need in order to sub-divide a package.
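As a rough illustration of how much of this comes for free from ordinary URL handling, Python's standard urllib.parse can already pull the field and value pairs out of a package URL. The example string is one of the spec's own; this is only a sketch, not the spec's official parsing procedure.

```python
from urllib.parse import parse_qsl, urlsplit

purl = "pkg:deb/debian/curl@7.50.3-1?arch=i386&distro=jessie"

# Generic URL splitting: the scheme, the path, and the query string.
parts = urlsplit(purl)
qualifiers = dict(parse_qsl(parts.query))

print(parts.scheme)  # pkg
print(parts.path)    # deb/debian/curl@7.50.3-1
print(qualifiers)    # {'arch': 'i386', 'distro': 'jessie'}
```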

If we look at the existing specification under discussion we can see that the spec fairly generously encompasses a number of real world package management systems as seen here:
https://github.com/package-url/purl-spec

Roughly, the specification looks like:

scheme:type/namespace/name@version?qualifiers#subpath

The original figure used color highlights to mark out the delimiter characters. The qualifiers section, “field=value&”, can be repeated indefinitely.

Reproducing a few choice examples that I personally think make the case for a PURL or PIRI standard fairly succinctly:

pkg:bitbucket/birkenfeld/pygments-main@244fd47e07d1014f0aed9c
pkg:deb/debian/curl@7.50.3-1?arch=i386&distro=jessie
pkg:gem/jruby-launcher@1.1.2?platform=java
pkg:github/package-url/purl-spec@244fd47e07d1004f0aed9c
pkg:golang/google.golang.org/genproto#googleapis/api/annotations
pkg:maven/org.apache.xmlgraphics/batik-anim@1.9.1?packaging=sources
pkg:npm/%40angular/animation@12.3.1
pkg:nuget/EnterpriseLibrary.Common@6.0.1304
pkg:pypi/django@1.11.1
pkg:rpm/opensuse/curl@7.56.1-1.1.?arch=i386&distro=opensuse-tumbleweed
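To make the structure of these examples concrete, here is a simplified parser sketch. It is not the reference implementation: the Purl tuple and parse_purl are names I made up for illustration, there is no validation, and type-specific normalization rules from the spec are ignored.

```python
from typing import NamedTuple, Optional
from urllib.parse import parse_qsl, unquote


class Purl(NamedTuple):
    type: str
    namespace: Optional[str]
    name: str
    version: Optional[str]
    qualifiers: dict
    subpath: Optional[str]


def parse_purl(text: str) -> Purl:
    """Split a purl string into its components (simplified, no validation)."""
    rest, _, subpath = text.partition("#")   # subpath follows '#'
    rest, _, query = rest.partition("?")     # qualifiers follow '?'
    scheme, _, rest = rest.partition(":")
    if scheme != "pkg":
        raise ValueError(f"not a purl: {text!r}")
    ptype, _, rest = rest.strip("/").partition("/")
    rest, at, version = rest.rpartition("@")  # version follows the last '@'
    if not at:  # no '@' present, so everything is namespace/name
        rest, version = version, ""
    namespace, _, name = rest.rpartition("/")  # name is the last path segment
    return Purl(
        type=ptype.lower(),
        namespace=unquote(namespace) or None,
        name=unquote(name),
        version=unquote(version) or None,
        qualifiers=dict(parse_qsl(query)),
        subpath=subpath or None,
    )


print(parse_purl("pkg:npm/%40angular/animation@12.3.1"))
print(parse_purl("pkg:golang/google.golang.org/genproto#googleapis/api/annotations"))
```

Note how the percent-encoded %40 in the npm example decodes to the “@angular” namespace, which is why the raw “@” can be reserved as the version delimiter.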

Possible extensions to PURL or PIRI as it gains adoption:

  • the specification at minimum creates a convention that once memorized is easily re-usable in other contexts
  • if we embrace and extend the concept of schema and type for data schema and data type we can represent many different types of information about a package beyond just the name itself.

Some possible examples:

/myservice/pkg/nuget/EnterpriseLibrary.Common/6.0.1304

/myservice/pkg/rpm/opensuse/curl/7.56.1-1.1.?arch=i386&distro=opensuse-tumbleweed

/myservice/sigstore/cosign/opensuse/curl/7.56.1-1.1

/myservice/sbom/spdx/opensuse/curl/7.56.1-1.1.?arch=i386&distro=opensuse-tumbleweed

/myservice/src/git-mirror/opensuse/curl/7.56.1-1.1.?arch=i386&distro=opensuse-tumbleweed
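A small sketch of how such paths might be generated from the same pieces a PURL carries. The function name, its parameters, and “myservice” itself are all hypothetical; this only mirrors the convention in the examples above and does not percent-encode path segments.

```python
from typing import Mapping, Optional
from urllib.parse import urlencode


def service_path(
    service: str,
    schema: str,
    artifact_type: str,
    namespace: Optional[str],
    name: str,
    version: str,
    qualifiers: Optional[Mapping[str, str]] = None,
) -> str:
    """Build '/service/schema/type/[namespace/]name/version[?field=value&...]'."""
    segments = [service, schema, artifact_type, namespace, name, version]
    path = "/" + "/".join(s for s in segments if s)  # skip a missing namespace
    if qualifiers:
        path += "?" + urlencode(qualifiers)
    return path


print(service_path("myservice", "pkg", "nuget", None,
                   "EnterpriseLibrary.Common", "6.0.1304"))
# /myservice/pkg/nuget/EnterpriseLibrary.Common/6.0.1304

print(service_path("myservice", "sbom", "spdx", "opensuse", "curl", "7.56.1-1.1.",
                   {"arch": "i386", "distro": "opensuse-tumbleweed"}))
# /myservice/sbom/spdx/opensuse/curl/7.56.1-1.1.?arch=i386&distro=opensuse-tumbleweed
```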

The convention is powerful. I now have an intuitive way to start addressing queries about packages without having to learn a specific domain’s API. I do have a few things I have to just sort of memorize or off-handedly discover.

  • service name and location … specific to my local company setup
  • the “schema” here means “data schema” and that list of names is somewhat arbitrary … I’d have to discover it
  • the “type” here means “artifact type”: rpm, deb, and now extended to package signatures, SBOM formats, and even “src” for source code that my company may have mirrored for me.
  • but wait … what about the central bits … namespace and given name?

pURL and other identifier systems, like the name-based UUIDs in RFC 4122:
https://datatracker.ietf.org/doc/html/rfc4122#section-4.3

do NOT appear to spell out this idea of “namespace” and how it _could_ be managed. This appears to be related to the idea of a DNS name as seen in a URL. The problem with DNS is that the IP address (IPv4 or IPv6, it doesn’t matter) has no intrinsic relationship with the DNS name entry.
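To show what the RFC 4122 notion of a namespace looks like in practice, here is a short sketch using Python's standard uuid module. The “example.com” and “product-line” names are placeholders. The point is that the namespace is just an input to a hash; nothing in the mechanism says who gets to own or assign one.

```python
import uuid

# RFC 4122 predefines a handful of namespace UUIDs; Python exposes them as constants.
dns_id = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
url_id = uuid.uuid5(uuid.NAMESPACE_URL, "example.com")

# A hypothetical company namespace derived from its DNS name, then used to name a product.
company_ns = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
product_id = uuid.uuid5(company_ns, "product-line")

print(dns_id == company_ns)  # True: same namespace and name always give the same UUID
print(dns_id == url_id)      # False: same name, different namespace
print(product_id.version)    # 5
```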

What’s worse, DNS is itself a whole problem domain. The name for an IP address and the IP address for a name are decided upon by governmental and business conventions. DNS names are bought and sold as property.

Is there any guidance w.r.t. “namespace” from the PURL group? From other groups? I still need to find out.

The cheapest idea would appear to be just to copy the DNS name and map that into the root of the pURL namespace … which is what companies do in the Maven ecosystem without much conflict.

A company should internally decide on:

product-line.example.com

… or …

com.example.product-line

And the beauty of the DNS name analogy is that it is fairly unambiguous, which is why Java adopted it. It serves most business-to-business cases reasonably well, since in that context these are companies that will have bought their DNS name and can re-use it in the packaging context.

And the namespace can be extended indefinitely with the addition of more “.” delimited name elements. An example of fairly arbitrary subdivision:

unit.part.product-line.example.com
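Converting between the two orderings is mechanical, which is part of the appeal. A minimal sketch, assuming the reversed DNS name is dropped straight into the namespace slot of a Maven-style PURL; the function names, the “widget” artifact, and the version are made up for illustration.

```python
def to_reverse_dns(name: str) -> str:
    """Reverse the '.'-delimited labels: 'a.example.com' -> 'com.example.a'."""
    return ".".join(reversed(name.split(".")))


def maven_style_purl(dns_name: str, artifact: str, version: str) -> str:
    """Use the reversed DNS name as the purl namespace, as Maven group IDs do."""
    return f"pkg:maven/{to_reverse_dns(dns_name)}/{artifact}@{version}"


print(to_reverse_dns("unit.part.product-line.example.com"))
# com.example.product-line.part.unit

print(maven_style_purl("product-line.example.com", "widget", "1.0.0"))
# pkg:maven/com.example.product-line/widget@1.0.0
```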

Did I just invent a Product Name Service? (I hope not.) But, the problem seems to extend into the realm of arbitrary mappings that are decided not on some empirical data but rather on subjective group consensus.

Where/when/how should we have these discussions more broadly? For now, just using whatever names your current package managers use is fairly uncontroversial. And, perhaps this is a problem to solve once we actually have it?


Shawn Hartsock

software engineer — at a hybrid cloud computing company