Avoid Public PyPI Using Google Cloud Artifact Registry

Paul Balm
Google Cloud - Community
10 min read · Aug 23, 2022

Why would you want to avoid PyPI? PyPI is great: it's a massive bazaar with close to 400,000 packages, publicly available since 2003. But unfortunately, it also exercises very limited control over what is published. It's easy for people with bad intent to publish malware under a name very similar to that of an existing popular package, taking advantage of common typos. Attackers can also look for packages that the owner no longer has time to maintain, hack the owner's account, and publish a new, poisoned version of a previously valid package. There are many variations on these software supply-chain attacks, but we can protect ourselves by avoiding software repositories over which we have no control.

Larger organizations will often go further, prohibiting access to the public internet outright. Now if you need to install a Python package, what do you do? If you’re on Google Cloud, you use Artifact Registry.

A cartoon of a cloud with a lock on it.

There is a lot of information available about setting up Python repositories using Artifact Registry, typically to share internally developed Python packages. But the specific challenges faced by those in an environment without internet access are rarely covered. This post provides the details on how to overcome those obstacles.

If you set up a private package repository, as we suggest in this post, you will then have to maintain it by continuously uploading new versions of packages, so that your users can stay up to date. This burden will shrink once Artifact Registry makes its virtual repositories generally available, but until then, we will have to maintain the repository by hand.

Setting up a Python Index on Artifact Registry

Your concerned administrator will lock down your internet access, for example by preventing you from creating resources with an external IP address, such as virtual machines in Compute Engine or Jupyter notebooks in Vertex AI Workbench. If the network (the VPC) of such a resource does not have Cloud NAT enabled, the resource will not have internet access.

In this situation, when we need to install Python packages, we’re going to install them from a Python repository in Artifact Registry. We start by selecting Python packages and versions that we trust and that we will allow in our project. We will upload these packages to Artifact Registry, and the users of the project can install them when they need them.

But if we don't have internet access, how can we get the packages in order to upload them to Artifact Registry? To set up our Python package repository, we will need internet access from somewhere. We could use a virtual machine in Compute Engine that an administrator is allowed to create with an external IP address, or a VM created in a VPC that has Cloud NAT to provide internet access. In that case, we will of course have to block other users from creating resources in this VPC!
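
As an illustration, the administrator might create such a machine along these lines. This is a minimal sketch, not a prescription: the instance name, zone and machine type are placeholders, and by default the instance gets an external IP address unless --no-address is specified.

gcloud compute instances create ar-upload-vm \
--zone=europe-west1-b \
--machine-type=e2-standard-4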

For the remainder of this post, we will assume that we are working on a virtual machine in the secured project, that has access to the public internet and to the Artifact Registry in our project. We will be using the gcloud command-line interface.

In this situation we can simply follow the documentation to create a Python repository and upload our packages:

  1. Start by creating a Python repository called “medium” in the region of your choice:
gcloud auth login
gcloud config set project my_project
gcloud artifacts repositories create medium \
--repository-format=python --location=europe-west1

2. Often the list of required Python packages for a project is kept in a requirements.txt file. If you don't have one, you can create it using pip freeze:

pip freeze > requirements.txt

3. Ready to download all our packages! Or perhaps we should first check that our list doesn’t contain any known vulnerabilities… I recommend using pip-audit. If it comes back with “no vulnerabilities found”, then we are good to go! You use pip-audit as follows:

pip install virtualenv
virtualenv venv
source venv/bin/activate
pip install --upgrade pip-audit
pip-audit --requirement requirements.txt

4. Ready to download all our packages, finally. The pip tool is helpful here:

mkdir dist
pip download --destination-directory dist -r requirements.txt

5. And now we upload the downloaded packages using twine, as outlined in the Google Cloud documentation (don’t forget to replace “my_project” with the name of your project!)

pip install twine
twine upload --repository-url https://europe-west1-python.pkg.dev/my_project/medium/ dist/* --skip-existing

If twine hangs on a large number of packages to upload, upload them one by one:

for F in dist/*; do twine upload --repository-url https://europe-west1-python.pkg.dev/my_project/medium/ $F --skip-existing; done

Congratulations, your Python repository is live!

Using your private Python package repository

The documentation, and many blogs and Stack Overflow questions, will tell you to do two things: set up authentication, and configure pip to retrieve packages from your Python repository in Artifact Registry. Let's look at these two steps.

A cartoon of a traffic light on yellow.

Setting up authentication

Before installing packages from your repository in Artifact Registry using pip install, you need to enable pip to authenticate with Artifact Registry. Pip is not a Google Cloud-aware tool, so it needs some mechanism to retrieve your Google Cloud credentials. This mechanism is the keyring package together with the Google Cloud Artifact Registry plug-in. You install the keyring and the plug-in using pip install.

But wait… If I don’t have internet access and I can’t authenticate with my private repository yet, how do I install this keyring with the plug-in? The snake bites its own tail.

The easiest way

The easiest solution to this problem is to have an administrator build a custom container. If this container build process has access to the internet, then you can just pip install the packages as part of the Docker build:

RUN pip install keyring==23.7.0
RUN pip install keyrings.google-artifactregistry-auth==1.0.0

Note that this step does involve installing these packages from the public PyPI, which is what we're trying to avoid. But we need to download these packages at some point, and as long as this is controlled by an administrator, we can, for instance, unpublish the container if these packages turn out to have been compromised.

If the easiest way doesn’t apply

The easiest way involves building a custom image and having internet access during the container build process. If you build your container using Cloud Build, then the Cloud Build workers must have internet access. If either of these conditions is not met, we need to look for another solution.

The solution is for the administrator to download the packages, as we did previously with the packages from our requirements.txt file. We then install the downloaded files using pip, which enables authentication with Artifact Registry.

We just need to install two packages: keyring and keyrings.google-artifactregistry-auth. But we are also going to need their dependencies. And the dependencies of the dependencies. And so on; you get the point. Pip download will take care of fetching everything (see the sketch after this list). These are 22 packages in total, which you need to install in the right order. For the current versions, keyring 23.7.0 and keyrings.google-artifactregistry-auth 1.0.0, this is the list, in a valid order:

jeepney-0.8.0-py3-none-any.whl
typing_extensions-4.3.0-py3-none-any.whl
zipp-3.8.1-py3-none-any.whl
importlib_metadata-4.12.0-py3-none-any.whl
pycparser-2.21-py2.py3-none-any.whl
cffi-1.15.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
cryptography-37.0.4-cp36-abi3-manylinux_2_24_x86_64.whl
SecretStorage-3.3.2-py3-none-any.whl
keyring-23.7.0-py3-none-any.whl
pyasn1-0.4.8-py2.py3-none-any.whl
pyasn1_modules-0.2.8-py2.py3-none-any.whl
rsa-4.9-py3-none-any.whl
cachetools-5.2.0-py3-none-any.whl
six-1.16.0-py2.py3-none-any.whl
google_auth-2.9.1-py2.py3-none-any.whl
certifi-2022.6.15-py3-none-any.whl
charset_normalizer-2.1.0-py3-none-any.whl
idna-3.3-py3-none-any.whl
urllib3-1.26.11-py2.py3-none-any.whl
requests-2.28.1-py3-none-any.whl
pluggy-1.0.0-py2.py3-none-any.whl
keyrings.google_artifactregistry_auth-1.0.0-py3-none-any.whl
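
To obtain these wheels in the first place, the administrator can let pip resolve the dependency tree. A minimal sketch, assuming the downloads should end up in a directory called pydist:

mkdir pydist
pip download keyring==23.7.0 keyrings.google-artifactregistry-auth==1.0.0 -d pydist

This is the same pip download mechanism we used earlier for the packages in requirements.txt.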

The administrator can share these packages via Cloud Storage, and the user can retrieve and install them whenever needed. But this is a bit of a hassle, and if we are running any kind of distributed processing (using Dataflow, for example), we would need to go through this process every time a new worker starts up, which will affect processing speed. It would be better to build a Docker image and use that image to start up new workers. I won't describe that in detail here, but if you follow, for example, Using custom containers in Dataflow, you would download all the packages to a directory (called pydist, for instance) and include this in your Dockerfile:

ADD pydist pydist
RUN pip install pydist/jeepney-0.8.0-py3-none-any.whl
RUN pip install pydist/typing_extensions-4.3.0-py3-none-any.whl
RUN pip install pydist/zipp-3.8.1-py3-none-any.whl
...

And so on.
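
For users who are not working from a container, the Cloud Storage route could look something like the sketch below. The bucket name is a placeholder; pip's --no-index and --find-links options make it install only from the local directory, resolving the installation order by itself.

# Administrator: publish the downloaded wheels (bucket name is a placeholder)
gsutil -m cp -r pydist gs://my-bucket/pydist

# User: retrieve the wheels and install them without contacting any index
gsutil -m cp -r gs://my-bucket/pydist .
pip install --no-index --find-links=pydist keyring keyrings.google-artifactregistry-auth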

Once we have these packages installed, we can run keyring --list-backends and it should output something like this:

keyrings.gauth.GooglePythonAuth (priority: 9)
keyring.backends.chainer.ChainerBackend (priority: -1)
keyring.backends.fail.Keyring (priority: 0)

And this means we are all set up. If you're not authenticated already, use gcloud auth login to get your credentials. Now, if you run pip install, it will be able to authenticate with your repository in Artifact Registry. That is, if pip has the configuration to know where your repository is. We'll see how that's done in a moment, but first, let's step back and reflect.

A custom container with Everything We Need

If we’re going to have to download 22 packages and possibly build a custom container, so that we can install packages from Artifact Registry… Why don’t we just create a custom container that has all the packages we need, so that we don’t need to install anything?

Building a custom container with all the code required is recommended practice from a performance point of view, and for reproducibility. Performance, because there is no installation work required when a container starts up. And reproducibility because the software versions on the container are fixed, whereas a requirements.txt file often leaves (some) package versions unspecified.
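
A minimal sketch of such a container, assuming the wheels for everything in requirements.txt have already been downloaded into a local directory (called pydist here, as in the Dataflow example above); the base image is an assumption:

FROM python:3.10-slim
ADD pydist pydist
COPY requirements.txt requirements.txt
# Install everything from the local wheels, without contacting any package index
RUN pip install --no-index --find-links=pydist -r requirements.txt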

But to be able to do this, we need to know the complete list of packages we're going to use. When we need to upgrade a package, we need to rebuild the container, and the same goes for adding a package we weren't using before. If you have a Dataflow pipeline, or similar, that is working in production, then we would still recommend creating a custom container for it. But if your team is developing new code, the management overhead of building a new container is going to be very high. For example, a notebook instance would have to be recreated on top of a newly built container every time you want to install a Python package. So Artifact Registry makes managing your Python packages a lot easier.

When you need to use a package that no one in your project has used before, your administrator just downloads it from PyPI, verifies that it has no vulnerabilities (!), and uploads it to Artifact Registry.
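
That workflow reuses the commands from earlier in this post. A hedged sketch, with the package name and version as placeholders:

# Download the new package and its dependencies from PyPI
pip download some-package==1.2.3 -d dist

# Check it for known vulnerabilities before publishing
echo "some-package==1.2.3" > new-requirements.txt
pip-audit --requirement new-requirements.txt

# Upload to the private repository (replace my_project with your project)
twine upload --repository-url https://europe-west1-python.pkg.dev/my_project/medium/ dist/* --skip-existing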

Configure pip!

You can configure pip by passing the right command-line arguments on every call, but this is rather error-prone. The Artifact Registry UI has a "Setup Instructions" button.

Google Cloud Console showing the UI for Artifact Registry, which has a button labeled “Setup Instructions”.

If you click it, it will tell you to run this command (again, replace "my_project" with your project name):

gcloud artifacts print-settings python \
--project=my_project --repository=medium --location=europe-west1

The output of this command is some configuration that you can paste into your pip configuration file. It will tell pip to add an “extra index URL”. This will cause pip to look at the public PyPI and also at your local repository. This is useful, so you can publish your own packages internally and you don’t lose access to PyPI for everything else. Unfortunately, it doesn’t work for us.

If you don't have access to the internet, pip will hang trying to reach PyPI. We don't need to add an extra index URL; we need to set the index URL and tell pip not to look anywhere else. This would be the configuration for "my_project", which also disables pip's update check:

[global]
index-url = https://europe-west1-python.pkg.dev/my_project/medium/simple/
disable-pip-version-check = True

Store this information in a file called pip.conf and tell pip to use it:

export PIP_CONFIG_FILE=pip.conf

Use an absolute path if you want this to work from any directory. If you are building a custom container, then you will also want to include this configuration in your Dockerfile:

COPY pip.conf pip.conf
ENV PIP_CONFIG_FILE=pip.conf

If you're not building a custom container with this configuration baked in, then these steps have to be repeated every time the container starts up. If you're using a Vertex AI User-Managed Notebook, you can automate this with a "post-startup script" for your notebook instance. You need to store the script in Cloud Storage and pass the Cloud Storage URL to the --post-startup-script argument, for instance:

gcloud notebooks instances create example-instance \
--vm-image-project=deeplearning-platform-release \
--vm-image-family=caffe1-latest-cpu-experimental \
--machine-type=n1-standard-4 \
--location=us-central1-b \
--post-startup-script=gs://…
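
What goes into the post-startup script is up to you. A minimal sketch, assuming you have uploaded the pip.conf shown above to a bucket of your own, and writing it to /etc/pip.conf so that pip picks it up globally without needing the PIP_CONFIG_FILE variable:

#!/bin/bash
# Hypothetical post-startup script: fetch the pip configuration from Cloud Storage
# (the bucket name is a placeholder) and install it as the global pip config.
gsutil cp gs://my-bucket/pip.conf /etc/pip.conf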

You can now install packages using pip install as you would normally:

pip install packagename

Look for the first line of the output to verify that pip is looking at the correct repository:

Looking in indexes: https://europe-west1-python.pkg.dev/my_project/medium/simple/

Conclusion

If you have followed along this far, then you are done. You have configured pip to query your private Python repository in Artifact Registry, and you have configured it to authenticate with Artifact Registry, all without ever needing to go out to the public internet!

Awareness of the security implications of relying on public software repositories is growing. We hope this guide helps you configure a secure environment on Google Cloud.

Cartoon: The sun comes up over a green field.

References

  1. Delivering software securely, Google Cloud whitepaper
  2. Artifact Registry documentation
  3. Creating virtual repositories on Artifact Registry, Artifact Registry documentation. Feature in private preview as of August 2022.
  4. Managing Python packages, Artifact Registry documentation
  5. How to find third-party vulnerabilities in your Python code, Red Hat. Instructions for pip-audit.

Image credit: Denise Balm

Paul Balm
Google Cloud - Community

I’m a Strategic Cloud Engineer with Google Cloud, based in Madrid, Spain. I focus on Data & Analytics and ML Ops.