The brief history of package distribution in Python

Kostya Nikitin
Oct 17, 2018


pip, easy_install, distutils, distribute, wheels, eggs, dh-python — what is all that about? I have never done a packaging setup on my own, so I was curious about the details behind it. This article aims to shed some light on the package distribution landscape and to highlight the important steps in the evolution of these tools, because the original “History of Packaging” page provides only dated milestones without context.

TL;DR: if you just want to create a Python package, please visit this guide.

Imagine that it is 1996 and your friend has just shared a new Python library with you. How would you install it? You would probably copy it /to/the/disk, and then add /to/the/disk to sys.path. Even nowadays we occasionally do something like that. It is hard to imagine now, but the first step towards building a packaging infrastructure was to define a common folder where all of the libraries should be installed. This is how the site module was created. The first version of this module (17 Aug 1996) says:

Scripts or modules that need to use site-specific (third-party) modules should place “import site” somewhere near the top of their code.

Nowadays the site module is imported automatically at the moment you start an interpreter. During the load, it expands sys.path with some predefined folders, including system and user folders, and with the paths listed in *.pth files found in those folders.
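
As a minimal illustration (the /opt/mylibs folder and the mylibs.pth file are hypothetical): a .pth file dropped into one of the site folders simply lists extra paths, one per line, and site appends them to sys.path at startup.

# mylibs.pth, placed into a site-packages folder, contains the single line:
# /opt/mylibs

import site
import sys

print(site.getsitepackages())      # the predefined system folders
print(site.getusersitepackages())  # the per-user folder
print('/opt/mylibs' in sys.path)   # True once mylibs.pth is in place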

In 1998, Greg Ward committed the first version of the distutils library. It provided a uniform and easy way of:

  • packaging modules (either as a source distribution, i.e. an archive with sources that you download and install by running setup.py, or as a binary distribution with executables for different platforms),
  • versioning,
  • installing packages, which also includes compiling architecture-dependent C extensions, and a uniform way to amend the installation defaults via a broad set of command-line arguments.
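
To make this concrete, here is a minimal sketch of a distutils setup.py (the mylib name and the C extension are hypothetical):

from distutils.core import setup, Extension

setup(
    name='mylib',
    version='1.0',
    packages=['mylib'],  # pure-Python packages to install
    # an architecture-dependent C extension, compiled during installation:
    ext_modules=[Extension('mylib._speedups', sources=['src/speedups.c'])],
)

# python setup.py sdist    -> build a source distribution
# python setup.py bdist    -> build a binary distribution for this platform
# python setup.py install  -> compile and install into site-packages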

It was a significant step forward. distutils created a foundation both for pure-Python distributions and for projects using C extensions. But as time went by, libraries became more and more complex and started depending on each other. The problem with the distutils library was the lack of support for dependencies between packages, which was added only 4 years later, in PEP 314.

The first attempts to deal with dependencies were based on the fact that Python allows you to overwrite the __builtin__.__import__ function. On the one hand, this allows you to create a hook that collects all of the dependencies; on the other hand, it gives you ultimate control over the importing process. Although PEP 302 is not directly related to dependency management, it stated that import hooks are an acceptable way of managing dependencies and provided some important language extensions to move things forward:

Extending the import mechanism is needed when you want to load modules that are stored in a non-standard way. Examples include modules that are bundled together in an archive; byte code that is not stored in a pyc formatted file; modules that are loaded from a database over a network.

Packaging applications for end users is a typical use case for import hooks, if not the typical use case. Distributing lots of source or pyc files around is not always appropriate (let alone a separate Python installation), so there is a frequent desire to package all needed modules in a single file. So frequent in fact that multiple solutions have been implemented over the years.

Just a short example of how it works:

import __builtin__

# keep a reference to the original implementation, so that imports still work:
original_import = __builtin__.__import__

def my_import(module_name, globals=None, locals=None, fromlist=(), level=-1):
    print 'Hey, someone wants to import ' + module_name
    return original_import(module_name, globals, locals, fromlist, level)

__builtin__.__import__ = my_import
>>> import my.super.package
Hey, someone wants to import my.super.package
>>> from another.package import something
Hey, someone wants to import another.package

As one of the building blocks for your own implementation of __builtin__.__import__, you can use the imp module, which allows you to locate and load file-based modules.
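
A short sketch of how that looks (using the standard json module purely as an example):

import imp

# find_module searches sys.path and returns an open file object (or None for
# packages), the path it found, and a description tuple:
f, pathname, description = imp.find_module('json')
try:
    json_module = imp.load_module('json', f, pathname, description)
finally:
    if f is not None:
        f.close()

print(json_module.__name__)  # 'json'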

The ways to extend the import mechanism were quite similar, and eventually several libraries were developed to make this process easier, for example importlib or ihooks. Those libraries allow you to intercept and customize only the imports you are interested in, while the default process remains untouched.

For example, the pyinstaller library used this technique to collect all of the dependencies, put them into an archive, and distribute everything together. At the moment your library is imported, a special handler modifies the importing workflow in such a way that some imports actually read their data from that archive. The current implementation of the importing workflow, as defined by PEP 302, is based on ideas from this library.

Creating fat packages was definitely not welcomed by everyone, and in 2004 the setuptools library by Phillip Eby was released. It started as an extension of distutils and provided a new keyword for the setup function, install_requires, to specify dependencies. At the moment when setuptools (or, more specifically, easy_install, the installer application which comes with setuptools) realizes that a dependency is not satisfied, it recursively starts downloading and installing the missing packages. One more feature that setuptools had was its own distribution and installation format, called egg. An egg is a very self-sufficient format: it has all of the compiled files and metadata, aiming to be for Python what JARs are for Java.
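
A minimal sketch of how a setuptools-based setup.py declares dependencies (the mylib name and the requirement are hypothetical):

from setuptools import setup

setup(
    name='mylib',
    version='1.0',
    packages=['mylib'],
    # easy_install recursively downloads and installs these if missing:
    install_requires=['requests>=1.0'],
)

# python setup.py bdist_egg  -> build an egg
# easy_install mylib         -> install mylib together with its dependencies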

Although setuptools solved most of the problems with packaging, and probably became the most popular packaging library, it still did not become a part of the standard library. Moreover, there were some problems with its maintenance, and it was forked as the distribute package (which was eventually merged back into setuptools).

Another tool created to solve setuptools issues was pip. Started in 2008 by Ian Bicking, it was also based on the setuptools code, but had some significant improvements:

  • All packages are downloaded before installation, so a partially-completed installation cannot occur as a result.
  • The reasons for actions are kept track of. For instance, if a package is being installed, pip keeps track of why that package was required.
  • The code is relatively concise and cohesive, making it easier to use programmatically (recently, the pip authors have been discouraging its use that way).
  • Packages don’t have to be installed as egg archives, they can be installed flat (while keeping the egg metadata).
  • Native support for other version control systems (Git, Mercurial and Bazaar).
  • Uninstallation of packages.
  • Simple to define fixed sets of requirements and reliably reproduce a set of packages via a requirements.txt file (see the example after this list).
  • Better support for virtual environments.
  • Better maintenance.
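
For example, a hypothetical requirements.txt pins exact versions, and a single pip command reproduces the same set of packages on another machine:

# requirements.txt (the packages and versions are hypothetical)
requests==2.19.1
psycopg2==2.7.5

$ pip freeze > requirements.txt    # capture the current environment
$ pip install -r requirements.txt  # reproduce it elsewhere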

At this point my story finishes. I have tracked the main steps of the packaging evolution from 1996 to 2008. The next section contains some more information, which I was not able to connect with the previous parts.

As was mentioned earlier, distutils provides two ways to package libraries: source and binary. Source packages are more or less fine for Python-only code, but just imagine how many extra dependencies you have to install on a production system to be able to install, say, psycopg or numpy from a source package: compilers, linkers and lots of other tools. Also, installing from a source package requires running setup.py, i.e. executing arbitrary code, which would be nice to avoid for performance and security reasons.

So, the need for a binary package format was still there. Python eggs are not a standard and contain .pyc files inside the archive, which makes them less portable. As a result, in 2012 Daniel Holth suggested a new binary distribution format called wheel. It is a zip archive containing Python source files and compiled C-extension files. So, once a wheel is built, its installation becomes as simple as unzipping the archive and compiling the Python source files to .pyc.
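
With the wheel package installed, building and installing a wheel looks roughly like this (the project and the resulting file name are hypothetical):

$ pip install wheel
$ python setup.py bdist_wheel
$ pip install dist/mylib-1.0-py2-none-any.whl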

At the moment the proposal was approved, pip got support for wheels, but the easy_install authors decided to ignore this format.

If you are familiar with RPMs, DPKGs and other binary distribution formats provided by operating systems, you may notice that the problems wheels are solving were already solved by those tools. As a bridge between a source package and one of those formats, some helper utilities were created, for example dh-python, which generates a DPKG from the library sources, specifies dependencies, generates a script to compile *.py files, and compiles C extensions only when the DPKG is being built.
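
For instance, a minimal hypothetical debian/rules for a Python library built with dh-python's pybuild backend could look like this; running dpkg-buildpackage then produces a ready-to-install DPKG:

#!/usr/bin/make -f
%:
	dh $@ --with python2 --buildsystem=pybuild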

In the absence of wheels, this was probably the most convenient way to distribute libraries. The library maintainer is responsible only for the source distribution; binary package maintainers then use a different set of tools to transform those source packages into binary formats and upload them to repositories. Users enjoy downloading packages from the repositories and installing them using already familiar native tools.

This story is a result of some archaeological research I did using git tools and release notes. To be honest, 1996 was my first year at school, and I started my career as a Python developer in 2011. So it is possible that I have missed or misunderstood something, because all of the events described above happened when I literally had no idea about software development at all. If you have some thoughts on how to improve this article or can point out mistakes in it, I will be happy to fix them.

I am very thankful to my wife Carina, who helped me allocate time for preparing the materials and was my editor, and to the members of https://t.me/strlist and the distutils-sig@python.org mailing list, especially Vladimir Mantaskov, Eric V. Smith and Roman Shmatov.

If you are curious about improving Python packaging, you should join the Open Source event https://generalassemb.ly/education/bloomberg-open-source-weekend-pypa/london/58402 held in London on the 27th-28th of October.
