Python microlibs

By Jorge Herrera

namespace-package, gitlab monorepo & tox

In our research work, we have written a lot of python code to help in various tasks and projects. Early on we identified that, as research continued, we could benefit from some way of packaging this growing library in a reusable form, instead of keeping it as a bunch of scripts in a plain repo.

Our first approach was to combine all this functionality into a single python package, to centralize the code, avoid duplication, and ease installation and distribution within the team. Within the package, different modules contained related functionality. Life was good for a while… but it was all about to change.

As time passed, the size and complexity of the package grew. It contained everything from DSP algorithms and database communication code to geo-processing and NLP code. It quickly grew into a beast of a package, with lots of functionality, mostly unrelated. Being a monolithic package, it required all the dependencies to be installed — some of which are not straightforward to install — even if most of them were not needed for a particular use case. This made it hard for novice users to access all the awesomeness we had packaged together. Our python code had become bloated and cumbersome to install.


Moreover, as time passed, we found that different people within the company could use some of this functionality. But you don’t want people having to install a DSP dependency, which in turn depends on Cython, when all they need is a way to quickly interface with our Redshift DB. Clearly the single package approach was not appropriate anymore.

Instead of a single monolithic package, it would be ideal to break it into a set of smaller packages — micro-libraries or microlibs — decoupling most of the functionality and their dependencies. Inter-library dependencies could exist if required, although they should be kept to a minimum. Concretely, we had the following requirements:

  1. Each microlib should be “pip-installable” by itself, only installing its own dependencies.
  2. If a microlib depends on another microlib, installing the former should automatically install the latter.
  3. The code should integrate seamlessly with our continuous integration (CI) environment.
  4. The system should simplify the current process of distribution and installation, so different actors within the organization can easily use any microlib.
  5. Ideally, it should also be possible to install the whole library, including all of its constituent microlibs, as a single package.

The naive approach is to simply keep each microlib in its own repository. But this could lead to maintainability issues. For example, if a new feature in microlib A requires a change or addition to microlib B, having them in separate repos makes it very hard to iterate and debug. It would be preferable to keep all the microlibs under a single repository, allowing them to “see” each other without needing to pull from a separate repo, as long as in the end they can be installed independently from each other. This way, it is also possible to see whether a change in one library will break something in a different library, which is not trivial with the multi-repo approach.

Of course, this is not a radically new idea. In fact, Python’s namespace package concept offers a neat way to achieve some of these goals. The idea is to define a “shallow” package, whose sole purpose is to define a namespace for other sub-packages to use. Thus, microlibs are simply sub-packages that register themselves as children of the namespace defined by the top-level package. That said, we wanted everything to integrate tightly with our CI environment (we use GitLab at Shazam). In addition, we decided to make the jump and make our code python 2 and 3 compatible (it is 2017 after all!). In order to be able to test our code on both python versions, we settled on a Tox and pytest combo.

The solution I arrived at was to set up a single Git repo, using Python’s namespace packages to implement unrelated functionality in separate microlibs, each one as decoupled from the others as possible. To make each microlib “pip-installable” as a separate package, each one defines its own setup.py and requirements.txt files, so they can be distributed and installed independently. Each microlib also has its own tests, so they can be run independently of other microlibs’ tests.

The following file hierarchy depicts the layout of the proposed solution:

root/
├ .gitlab-ci.yml
├ tox.ini
├ requirements.txt
├ setup.cfg
├ setup.py
└ microlibs/
  ├ foo/
  │ ├ requirements.txt
  │ ├ setup.py
  │ ├ tox.ini
  │ ├ macrolib/
  │ │ └ foo/
  │ │   ├ __init__.py
  │ │   ├ module1.py
  │ │   ├ ...
  │ │   └ moduleN.py
  │ └ tests/
  │   ├ test1.py
  │   ├ ...
  │   └ testN.py
  ├ ...
  └ bar/
    ├ requirements.txt
    ├ setup.py
    ├ tox.ini
    ├ macrolib/
    │ └ bar/
    │   ├ __init__.py
    │   ├ module1.py
    │   ├ ...
    │   └ moduleN.py
    └ tests/
      ├ test1.py
      ├ ...
      └ testN.py

Here, the namespace package is called macrolib. All the microlibs are kept in separate directories inside the microlibs directory (e.g. microlibs/foo/ and microlibs/bar/). Each microlib contains its own tests and installation and packaging configuration.

I will now briefly describe all the moving parts, and how they connect to each other.

At the core of this project is the idea of namespace-packages. From PEP-420:

Namespace packages are a mechanism for splitting a single Python package across multiple directories on disk.

This PEP introduces a new way of implicitly defining namespace packages, starting with Python 3.3. The usual approach for regular packages is to place all the code in a directory — traditionally named the same as the package — although this is not required. Inside this directory you must place an __init__.py file. This is all you need to define a package. The __init__.py can be used to define what is visible from the package, version and author information, etc., but it can also be left empty. Sub-directories under the root become modules of the package, as long as they also contain an __init__.py file. For implicit namespace packages, under PEP 420, you still need the root (namespace) directory, but the __init__.py is not required anymore. Inside the namespace directory, you create another directory with the actual package (microlib) code.
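To see this mechanism in action, here is a small self-contained sketch (directory and attribute names are made up for illustration; requires Python 3.3+). Two separate directories on sys.path each contribute a subpackage to the same implicit macrolib namespace, mirroring what happens when two microlibs are pip-installed:

```python
import os
import sys
import tempfile

# Build two separate "distribution" directories, each contributing one
# subpackage to the same implicit namespace package. Note that there is
# no __init__.py inside the macrolib/ directories themselves (PEP 420).
root = tempfile.mkdtemp()
for sub in ('foo', 'bar'):
    pkg_dir = os.path.join(root, sub + '_dist', 'macrolib', sub)
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, '__init__.py'), 'w') as f:
        f.write("name = '{}'\n".format(sub))
    sys.path.append(os.path.join(root, sub + '_dist'))

# Both subpackages now resolve under the single macrolib namespace:
from macrolib import foo, bar
print(foo.name, bar.name)  # prints: foo bar
```

This is exactly the property the monorepo relies on: each microlib installs into its own location, yet all of them are importable under one shared namespace.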

Like any other package, the standard way to package and distribute a microlib is by defining a setup.py file. This is a big topic in itself, so I'll leave the details to you. At the very least you need to declare the microlib's name (including the namespace package, e.g. name='macrolib.foo') and the dependencies via install_requires, as arguments to the setup() call inside the setup.py script. These are usually specified for any regular python package.

What is important in the case of namespace packages is to define a couple of less common named arguments. In particular, you need to specify namespace_packages, declaring the name of the namespace (e.g. macrolib), and packages, declaring the name of the microlib, including the namespace package.

If the microlib depends on one of the other microlibs, simply add it as another regular dependency in the install_requires list, making sure to use the full name (e.g. macrolib.foo).

Here is an example setup.py:

from setuptools import setup

microlib_name = 'macrolib.foo'

setup(
    name=microlib_name,
    version="0.1.0",
    author="yourname",
    author_email="yourname@email.com",
    description="Your microlib description",
    license="TBD",
    classifiers=[
        'Private :: Do Not Upload to pypi server',
    ],
    namespace_packages=['macrolib'],
    packages=[microlib_name],
    install_requires=[
        'future',
        'six',
        'macrolib.bar',
        # add more packages if needed
    ],
)

As a side note, the classifiers argument included in the example is a hacky way of preventing accidental publication of the package to a public PyPI server.

Before diving into the details, you might be wondering why we need a requirements.txt file at all, if we already specified the dependencies in setup.py. In fact, a very common practice in the python community is to use only one of the two; I used to do that as well. If you are confused like I was, you should read this excellent article, which clearly explains the difference between abstract and concrete dependencies, why both are needed, and, more importantly, how to use them correctly.

Once you have read that article, you will understand why each microlib defines its own requirements.txt in addition to the install_requires argument in setup.py. In most cases the requirements.txt file will be as simple as

--index-url https://pypi.python.org/simple
-e .

If a microlib depends on another microlib, you’ll need to add a line with the relative path to the dependency. For example, if macrolib.foo depends on macrolib.bar, then foo's requirements.txt should be:

--index-url https://pypi.python.org/simple
../bar/
-e .

With this in place, it will be possible to run the tests for all the microlibs, for example on CI upon pushing to the remote, or locally to check that your changes won’t break anything.

While the macrolib package is simply a namespace, its setup.py file is slightly more involved than the microlibs' ones. It must "declare" all the microlibs so they are all installed when installing the macrolib in the dev environment, and also in CI. In other words, for each microlib, it must specify its name and where the source code can be found.

Here is an example of the macrolib setup.py:

import os

import pip
from six import iteritems
from setuptools import setup
from setuptools.command.develop import develop
from setuptools.command.install import install

PACKAGE_NAME = 'macrolib'

SOURCES = {
    'macrolib.foo': 'microlibs/foo',
    'macrolib.bar': 'microlibs/bar',
}


def install_microlibs(sources, develop=False):
    """ Use pip to install all microlibraries. """
    print("installing all microlibs in {} mode".format(
        "development" if develop else "normal"))
    wd = os.getcwd()
    for name, path in iteritems(sources):
        try:
            os.chdir(os.path.join(wd, path))
            if develop:
                pip.main(['install', '-e', '.'])
            else:
                pip.main(['install', '.'])
        except Exception as e:
            print("Oops, something went wrong installing", name)
            print(e)
        finally:
            os.chdir(wd)


class DevelopCmd(develop):
    """ Add custom steps for the develop command """
    def run(self):
        install_microlibs(SOURCES, develop=True)
        develop.run(self)


class InstallCmd(install):
    """ Add custom steps for the install command """
    def run(self):
        install_microlibs(SOURCES, develop=False)
        install.run(self)


setup(
    name=PACKAGE_NAME,
    version="0.1.0",
    author="yourname",
    author_email="yourname@email.com",
    description="Macrolib's description",
    license="TBD",
    classifiers=[
        'Private :: Do Not Upload to pypi server',
    ],
    install_requires=[
        'future',
        'six',
    ],
    cmdclass={
        'install': InstallCmd,
        'develop': DevelopCmd,
    },
)

Looks complicated? I’ll walk you through it. At the top, I created a dict of microlibs, mapping each microlib’s name to its directory inside the project. The function install_microlibs does just that: it installs all the microlibs. Depending on the command issued, they will be installed in normal mode (pip install .) or in development mode (pip install -e .); the classes DevelopCmd and InstallCmd take care of that (note that they also need to be added to the cmdclass argument in setup()).

The setup.py presented above is only meant to be used in dev and CI environments. It won't be able to create a distribution (e.g. a wheel) for the complete macrolib, as it doesn't include the source code or the dependencies of the microlibs. More on this later.

For testing we’ve been using pytest lately, as it is the default for Tox. The test layout is fairly standard. Each microlib contains a tests/ folder, with all its unit tests in it. As usual, to be "discoverable", tests must be named test_*.py or *_test.py. If desired, they can be organized into subfolders.

In a regular package setting, it is not necessary to “install” the package in order to run the tests, as long as the layout adheres to standard practices, so pytest can discover it. But in this case, the layout of the project makes it necessary to install the microlibs so the tests can access them. While it is possible to do this using your system python, it is highly recommended to use a virtual environment manager such as virtualenv or, better, virtualenvwrapper. We’ll be using Tox to help us automate the process, but before going into that, you can run the tests manually (and locally) by activating a virtualenv and installing the microlibs with pip install -e . from the project root. Then run the tests:

python -m pytest --color=yes microlibs -s

Or if you want to run tests only for a single microlib:

python -m pytest --color=yes microlibs/foo -s

Instead of having to manually manage your virtual environments, Tox helps us automate that task and more! In addition to automating the creation/removal of virtual envs, it also makes it easy to test on different python versions. For maximum flexibility, I decided to place individual tox.ini files in each microlib, as well as another tox.ini at the root directory of the project, to test the complete macrolib.

A minimal tox.ini file looks like this:

[tox]
envlist = py27,py36

[testenv]
deps =
    pytest
    pytest-cov
commands =
    pip install -e .
    python -m pytest --color=yes microlibs -s

Here we declare that we want to test the code on python 2.7 and python 3.6 (you could add more if you wish). Tox will create virtual environments for the specified python versions, and in each environment it will first install the dependencies and then execute the commands. With that file in place, you can run the tests by simply executing

tox

I frequently pass some extra flags, such as -r (to force the recreation of the virtual envs and make sure there’s nothing stale left over) and -e X (to override the envlist declared in tox.ini and run only environment X).

NOTE: at the moment of writing this post, there is a bug in the version of pip installed by tox for python 3.6 that makes it crash with namespace packages. If you run into this bug, you will see a trace ending in AttributeError: '_NamespacePath' object has no attribute 'sort'. A local workaround is to manually activate the py36 environment and update pip by running easy_install pip. Unfortunately this workaround is hard to replicate in CI, so I have temporarily disabled python 3.6 testing in CI, but I do run it locally, so it should be okay.

With all the previous steps in place, the only missing part is CI integration. GitLab offers a fairly flexible system that we can easily configure to integrate the proposed solution. The CI is defined in a file named .gitlab-ci.yml, which has to be placed at the root of the repo. A simple .gitlab-ci.yml could look like this:

before_script:
  - pip install tox wheel

# define an "anchor" to reuse the definition in every OS we test for
.full_test_def: &full_test
  type: test
  script:
    - tox -r

build:osx:
  tags:
    - osx
  <<: *full_test

build:centos7:
  tags:
    - centos
  <<: *full_test

This example tests the entire code base on every commit, on two different OSes. The tests are run by Tox and, in this case, I decided to use -r to recreate the virtual environments every time.

It is possible (and recommended) to add extra steps to handle tags and releases automatically, but that goes beyond the scope of this post.

The topic of python packaging is not trivial. You can read more about the current state here and here. Regarding distribution formats, there has been confusion and lots of changes over the years. In particular, both eggs and wheels are popular ways of packaging for distribution. The interested reader should refer to the official documentation for more details. Being 2017 (did I mention that already?), we will package our macrolib and microlibs using wheels.

Building a wheel for the microlibs is trivial. Simply go to the microlib’s folder (e.g. cd microlibs/foo) and run

python setup.py bdist_wheel --universal=1

This should create a dist/ directory with the wheel in it. You can pip-install this wheel directly, or you can upload it to a PyPI server, either the official PyPI server or a private one (I’ll cover how to do that shortly).

The --universal=1 flag is not mandatory, but highly recommended. If the microlibs were written to be compatible with python 2 and 3 without changes (e.g. by using six, future, etc.), then this will create a single wheel that can be installed in any python version.
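As a side note, if you would rather not pass the flag on every build, the equivalent setting can be declared once in each microlib's setup.cfg (this is a standard setuptools/wheel option, not something specific to this layout):

```ini
# setup.cfg (per microlib): with this in place, a plain
# `python setup.py bdist_wheel` produces a universal (py2.py3) wheel.
[bdist_wheel]
universal = 1
```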

About releasing macrolib as a package

While the above is essentially all you need to create the wheel for macrolib, there is one caveat. The way we wrote macrolib’s setup.py allows for easy development and testing by automating the installation of the microlibs in our dev and CI environments. But if we run python setup.py bdist_wheel as is, we won't get a wheel that can install all the microlibs.

I tried two different options to make this work:

  1. Use the packages and package_dir arguments in the macrolib's setup() call. This way, the wheel will include the source code for all the microlibs. But there are two downsides to this alternative: a) it won't automatically include the dependencies of the microlibs, as they were declared in each microlib's setup.py script; and b) we would be distributing a "copy" of every microlib, rather than the microlib itself, which will also be uploaded as a standalone package. These issues made this approach a "no go" for me.
  2. The second alternative is to alter the install_requires in the macrolib's setup() call to include all the microlibs (e.g. install_requires = ['macrolib.foo', 'macrolib.bar',]). The downside of this is that it relies on the microlibs being available for download from a PyPI server. But that is not really a biggie, as that is the plan anyway. To make this happen, I added a macrolib release step in my CI that appends these lines to a setup.cfg:
[metadata]
requires-dist =
    six
    future
    macrolib.foo
    macrolib.bar

Note that you will need to also include any other dependency specified in the original setup() call, as setup.cfg will override those.
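As a sketch of that release step (purely illustrative; the exact CI job wiring is up to you), the script can append the block to setup.cfg right before building the macrolib wheel:

```shell
# Hypothetical CI release step, run from the repo root: declare the full
# dependency list (microlibs included) in setup.cfg, then build the wheel.
cat >> setup.cfg <<'EOF'
[metadata]
requires-dist =
    six
    future
    macrolib.foo
    macrolib.bar
EOF
```

The wheel is then produced with python setup.py bdist_wheel --universal=1 as before, and uploaded to the package repo.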

To be honest, I’m not entirely happy with this solution for releasing the complete macrolib, but it was the least complicated one I could think of.

If your library is meant to be public, so it can be installed/used by anyone, then you should probably publish it to PyPI, Python’s official package index. If, on the other hand, your library is only meant for restricted access, then the recommended way is to set up a custom python package repository and publish to that one instead. In our case we host a custom repository using pypiserver, but there are other libraries available if you have custom needs.

There are several ways to publish to pypiserver; the simplest one uses setuptools. Here I assume that the package repo is served at http://<REPO_URL>:<REPO_PORT> (you can get a list of the hosted packages at http://<REPO_URL>:<REPO_PORT>/simple).

To publish the microlib macrolib.foo simply run:

cd root/microlibs/foo/
python setup.py bdist_wheel upload -r http://<REPO_URL>:<REPO_PORT>

Depending on what server you are publishing to, you might need to register and authenticate.

Now users with access to the package repo can pip-install the microlibs simply by running:

pip install -i http://<REPO_URL>:<REPO_PORT>/simple --trusted-host <REPO_URL> <PACKAGE_NAME>

where <PACKAGE_NAME> is the full dot-separated name including the namespace (e.g. macrolib.foo). Instead of typing these option flags every time, it is recommended to create a config file (e.g. ~/.pip/pip.conf on OS X):

[global]
extra-index-url = http://REPO_URL:REPO_PORT/simple/
trusted-host = REPO_URL

With the config file in place, you can simply run

pip install macrolib.foo

or

pip install macrolib

if you want to install the whole enchilada.

Although I’ve been coding in python for a while, and I have dealt with distributing packages before, I’ve never faced a challenge with this many moving parts. Needless to say, this is only one of many possible solutions; it is simply the one I found best served our needs. I’m sure there are other (possibly better) alternatives people can think of, and there’s plenty of room for improvement in the proposed one. If you have some tips or advice, please leave a comment.


Originally published at medium.com on April 12, 2017.