Private Python Package Index with Zero Hassle

This blog post is about how we host our own private Python package index with zero hassle, or in other words, in a simple, secure, cost-effective and maintenance-free manner.

We are also excited to announce that we have open sourced pypiprivate, a utility for privately publishing Python packages and an important component of our solution.

Background

Although Clojure is our primary programming language at Helpshift, Python has always been there in some capacity right from the beginning. We’ve been using Python for all kinds of use cases, from developer tooling to critical production operations such as release automation, monitoring, backups and disaster recovery. It’s also the preferred language of our data scientists, and we’ve been using it extensively to build our machine learning platform for the past two years.

With the increase in its adoption across the organization, we felt a need to enable easy code reuse in Python too, similar to what we have for Clojure. A typical way of reusing code is to group generic abstractions and common utilities into libraries. These libraries can then be included as dependencies in multiple projects. However, not all of our libraries are open source, so pypi.org isn’t the right place to publish them. In short, we needed a way to distribute private libraries securely, not just to developers but also to our automated build and deployment systems.

Existing tools

We started by evaluating the existing tools that the Python ecosystem had to offer, devpi and pypiserver being the most prominent ones. These tools provide a web application to serve the package index over HTTP and an interface (web, CLI or both) for publishing packages to the repository, plus additional features such as index search for discoverability, multiple indexes and mirroring.

Sounds pretty straightforward, right? Unfortunately, that’s hardly ever the case when it comes to running anything in production. To build a reliable system, it’s important to think about high availability, monitoring and alerts, backups and disaster recovery, and so on. As a result, the final solution ends up becoming much more complex than originally intended. Moreover, tools such as devpi come with many features that we would rarely need.

Hence, setting up a multi-node, production-ready system for hosting private packages felt like overkill to us, considering the infrastructure and maintenance costs.

PEP-503: Simple Repository API

Fortunately, the requirements for hosting a PyPI-compatible repository and index are minimal and clearly specified in PEP-503. All you need to do is store the package artifacts in a certain directory structure, generate an index for the files and put both the repository (i.e. the package artifacts) and the index behind any web server that can serve static files.

As per the specification, the directory structure looks something like this:

/simple
|-- foo
|   |-- foo-0.1.0-py2-none-any.whl
|   |-- foo-0.1.0-py3-none-any.whl
|   |-- foo-0.2.0-py2-none-any.whl
|   |-- foo-0.2.0-py3-none-any.whl
|   `-- index.html
|-- index.html
`-- bar
    |-- index.html
    |-- bar-0.3.1-py2-none-any.whl
    |-- bar-0.3.1-py3-none-any.whl
    `-- bar-0.3.1.tar.gz

2 directories, 10 files
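
To make the index generation step concrete, here is a minimal Python sketch that writes the index.html files for a directory laid out as above. It is an illustration under assumptions, not how pypiprivate or any other tool actually does it: the function names render_index and build_indexes are hypothetical, and the sketch assumes that project directory names are already normalized and that file names need no URL escaping.

```python
from __future__ import annotations

import html
from pathlib import Path


def render_index(title: str, links: list[tuple[str, str]]) -> str:
    """Render a minimal HTML page with one anchor per (href, text) pair."""
    anchors = "\n".join(
        f'    <a href="{html.escape(href)}">{html.escape(text)}</a><br/>'
        for href, text in links
    )
    return (
        "<!DOCTYPE html>\n<html>\n  <head><title>"
        + html.escape(title)
        + "</title></head>\n  <body>\n"
        + anchors
        + "\n  </body>\n</html>\n"
    )


def build_indexes(root: Path) -> None:
    """Write index.html for the root and for every project directory in it."""
    projects = sorted(p for p in root.iterdir() if p.is_dir())
    for project in projects:
        # Every file except the index itself is treated as a package artifact.
        files = sorted(
            f.name
            for f in project.iterdir()
            if f.is_file() and f.name != "index.html"
        )
        page = render_index(f"Links for {project.name}", [(n, n) for n in files])
        (project / "index.html").write_text(page, encoding="utf-8")
    # The root page links to each project directory, with a trailing slash.
    root_links = [(f"{p.name}/", p.name) for p in projects]
    (root / "index.html").write_text(
        render_index("Simple index", root_links), encoding="utf-8"
    )
```

Calling build_indexes(Path("simple")) on a local copy of the tree would regenerate all three index.html files. Regenerating the indexes after every publish keeps them in step with the artifacts.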

Once the directory structure is ready, you can set up a web server in front of it to serve the package files and index over HTTP(S).

Now, users can install private packages from this index by passing the --extra-index-url option to the pip install command as follows:

$ pip install mypackage --extra-index-url=https://my.pypi.com/simple

Securing access

The above setup will give us a working package index. However, if you remember, we started with the goal of keeping the packages secure and private.

For that, it helps to think about two types of users who need different levels of access to the repository and index:

  1. Publishers: Will publish packages to the repository when new versions are released. They will need read-write access to the repository. Ideally, your release automation system will be the publisher. Additionally, there could also be human release managers who are authorized to publish packages.
  2. Consumers: Will fetch and install packages for development or production use. These will be developers plus the build automation system. They will need read-only access to the package index served over HTTP.

Our setup

[Figure: Our setup using S3, nginx and a custom package publishing tool, pypiprivate]

In our setup, we use the following components:

Amazon S3

Package artifacts and indexes are stored in an S3 bucket which is protected by IAM credentials.

Pypiprivate

To copy files to S3 in the required directory structure (or, more precisely, an S3 key hierarchy) and generate the index, we built our own tool, pypiprivate, as no existing tool met our requirements. You can install pypiprivate from pypi.org and read its documentation on GitHub.

At present, pypiprivate supports S3 and the local file system as storage backends, but it can easily be extended to support others such as Azure Blob Storage or Google Cloud Storage.
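
To illustrate what an S3 key hierarchy means here, the following Python sketch maps a package artifact to a key and uploads it. It is not pypiprivate's actual code: the function names artifact_key and publish and the "simple" key prefix are assumptions made for this example. The normalization rule is the one given in PEP 503, and client is expected to be an S3 client such as the one returned by boto3.client("s3"), of which only the put_object call is used.

```python
import re
from pathlib import Path


def normalize(name: str) -> str:
    """Normalize a project name using the rule given in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def artifact_key(project: str, filename: str, prefix: str = "simple") -> str:
    """Return the S3 key for an artifact, mirroring the directory layout."""
    return f"{prefix}/{normalize(project)}/{filename}"


def publish(client, bucket: str, project: str, artifact: Path,
            prefix: str = "simple") -> str:
    """Upload one artifact and return the key it was stored under.

    `client` is assumed to behave like a boto3 S3 client. The publisher's
    IAM credentials need write access to the bucket for this call to succeed.
    """
    key = artifact_key(project, artifact.name, prefix)
    with artifact.open("rb") as body:
        client.put_object(Bucket=bucket, Key=key, Body=body)
    return key
```

With this scheme, publishing foo-0.1.0.tar.gz for a project named "Foo" stores it under simple/foo/foo-0.1.0.tar.gz. A real publisher would also regenerate the affected index.html objects after the upload.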

Nginx

We use nginx as the reverse proxy server sitting in front of the S3 bucket.

For secure access to S3, we’ve configured the AWS Auth Plugin, through which nginx authenticates with Amazon using read-only credentials. At the same time, HTTP Basic Authentication is configured to provide secure access to the index over HTTP(S). Our nginx conf looks something like this:

server {
    server_name my.pypi.com;
    ...
    auth_basic "Restricted";
    auth_basic_user_file /etc/nginx/.htpasswd;
    location /simple {
        # serve the index if request ends with a slash
        rewrite ^(.*)\/$ $1/index.html break;
        aws_access_key "**************";
        aws_secret_key "**************";
        s3_bucket mypypibucket;
        proxy_set_header Authorization $s3_auth_token;
        proxy_set_header x-amz-date $aws_date;
        proxy_pass https://mypypibucket.s3.amazonaws.com;
    }
}
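
The rewrite rule in the config above is what lets S3, which has no notion of a directory listing, answer index requests. As a rough illustration, its effect on a request URI can be mimicked in Python as below. The helper name rewrite_uri is hypothetical, and the sketch covers only the path rewriting, not the authentication or proxying.

```python
import re


def rewrite_uri(uri: str) -> str:
    # Mimics the nginx rule: a URI ending in a slash gets index.html
    # appended, so that it maps to an actual object in the bucket.
    # Any other URI is passed through unchanged.
    return re.sub(r"^(.*)/$", r"\1/index.html", uri)


for uri in ("/simple/", "/simple/foo/", "/simple/foo/foo-0.1.0-py3-none-any.whl"):
    print(uri, "->", rewrite_uri(uri))
```

Requests for /simple/ and /simple/foo/ are therefore served from the index.html objects, while requests for individual package files reach S3 as they are.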

To install from the private repository, HTTP basic auth credentials can be specified in the pip install command as follows:

$ pip install mypackage --extra-index-url=https://<username>:<password>@my.pypi.com/simple
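
One practical detail when building such a URL in a script: characters such as @, : or / in the username or password have to be percent-encoded, otherwise the URL is parsed incorrectly. The sketch below shows one way to do that with the standard library. The function name index_url_with_auth and the sample credentials are made up for illustration.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def index_url_with_auth(index_url: str, username: str, password: str) -> str:
    """Embed percent-encoded basic auth credentials in an index URL."""
    parts = urlsplit(index_url)
    # safe="" makes quote() encode every reserved character, including "/".
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.hostname}"
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )


# Hypothetical credentials, used only to show the encoding.
print(index_url_with_auth("https://my.pypi.com/simple", "ci-bot", "p@ss/w:rd"))
```

The resulting string can be passed to --extra-index-url. Keeping the credentials in environment variables or a secrets store, rather than in scripts, avoids leaking them through version control.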

What about the “zero hassle” part?

As you can see, by not running an application server we’ve managed to sidestep a lot of complexity. If we had been running an application server such as devpi or pypiserver, high availability and monitoring would have been additional concerns.

Since an nginx cluster was already part of our infrastructure, HA and monitoring for it were already in place.

S3, as the storage layer, is highly available and cheap, and by default it replicates data on multiple devices across availability zones. This spares us from worrying about backup and recovery.

Conclusion

We’ve been using this private package index for quite a few months. It’s a simple solution that covers our basic use case and is hassle-free for us to maintain. At the same time, there is definitely a trade-off in terms of missing all the cool features that sophisticated solutions such as devpi offer.

The purpose of this post was mainly to share the approach that has worked well for us. You may use it to host your own private package repository and index, adapting it to the cloud provider and web server of your choice.