DevOps with Valentine

DevOps can help you deliver more business value. It is not always easy but certainly possible.

How to Cache Python Dependencies in GitLab CI/CD

7 min read · Jan 29, 2025

This blog post discusses handling Python dependencies with pip. It DOES NOT address other tools such as Poetry (with pyproject.toml) or pip-tools.

💡 If you are looking for the QUICK SOLUTION, scroll all the way to the end. If you want to UNDERSTAND, read the whole thing.

The status quo

Let’s start with a basic project with some dependencies stored in the requirements.txt file:

Flask==3.1.0
langchain==0.3.16
scipy==1.15.1

The following pipeline uses caches but fails to deliver the performance boost that caches promise:

default:
  image: python:3-slim
variables:
  PIP_CACHE_DIR: ".cache/pip/"
build:
  stage: build
  cache:
    key: mycache
    paths:
      - .cache/pip/
  script:
    - pip install -r requirements.txt
test:
  stage: test
  cache:
    key: mycache
    paths:
      - .cache/pip
  script:
    - pip install -r requirements.txt

But why? Let’s troubleshoot this.

1. Are caches being created? A quick inspection of the logs in the build job will confirm that they are indeed created.

2. Are the caches being downloaded and used? A quick inspection of the logs in the test job will confirm that they are indeed downloaded.

Not only that, but the logs also make it clear that the cache is being used.

3. Is there an improvement in the job execution speed? If we compare the duration of the script step in each job, we will notice that it is practically identical to the duration before introducing caches. So no improvement!

Even worse, we have actually made the pipeline slower, as we now have the overhead of uploading and downloading the cache, which also takes a few seconds.

If you want to learn how to build pipelines in GitLab CI/CD, I have created an online course that starts with the basics and helps you build a CI/CD pipeline that takes a project and deploys it to the cloud. Learn more about the course.

Understanding where pip stores files

Knowing where the downloaded packages are stored is key to configuring the caches correctly.

Let’s run the following commands as part of a job. And yes, we are running pip install twice (you will see why).

- pip cache dir
- pip install -r requirements.txt
- pip install -r requirements.txt

The first command prints the location of pip’s cache (/root/.cache/pip in this image), and the second one downloads and installs the dependencies.

And finally, re-running pip install should confirm that all dependencies have already been installed. Our goal with caches in GitLab is to get the message “Requirement already satisfied” in the logs.

What is the pip cache directory?

The pip cache dir command shows where pip stores its cache.

So what is inside the /root/.cache/pip directory? During installation, pip stores what it downloads here to speed up future installations. What the directory actually contains are cached HTTP responses from PyPI and some metadata, not installed packages that Python can import and use.

What is the site-packages directory?

This directory contains the actual Python modules and their dependencies. This is where Python loads installed packages from when you import them in a Python script.
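To see both locations side by side, a throwaway job along the following lines could print them. This is only a sketch: the job name inspect_paths is hypothetical, and the exact paths printed depend on the image and Python version.

```yaml
# Sketch: print where pip keeps its cache and where Python loads packages from.
inspect_paths:   # hypothetical job name
  image: python:3-slim
  script:
    # Location of pip's download cache
    - pip cache dir
    # Location(s) of site-packages for this interpreter
    - 'python -c "import site; print(site.getsitepackages())"'
```

In the image used in this article, the second command is expected to point somewhere under /usr/local/lib, which is outside the project directory.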

Which directory do we actually need to cache?

Let’s try breaking things. I will delete both directories (one after the other) before running the application to see which one is really needed. Here is a snippet with the commands used:

- rm -rf /root/.cache/pip
- python app.py # This should still work
- rm -rf /usr/local/lib/python3.13/site-packages
- python app.py # This should fail

My conclusion from this small experiment is that, once the installation is done, the application only needs site-packages to run. So caching the .cache/pip directory makes no sense (from the perspective of pipeline speed).

The only problem is that site-packages is buried in the filesystem, outside the project directory, where GitLab can’t cache it. Trying to do so results in the following error:

WARNING: processPath: artifact path is not a subpath of project directory
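For illustration, a cache definition along these lines is the kind that would be expected to trigger the warning, because the path sits outside the project directory. The job name is hypothetical, and the path is the one from the experiment above.

```yaml
# Sketch only: this cache path lies outside $CI_PROJECT_DIR,
# so the runner is expected to warn and not archive it.
broken_cache_job:   # hypothetical job name
  image: python:3-slim
  cache:
    key: mycache
    paths:
      - /usr/local/lib/python3.13/site-packages   # absolute path outside the project directory
  script:
    - pip install -r requirements.txt
```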

How to cache the site-packages directory?

We need to bring the site-packages to the project directory. Here are the options:

Using the PYTHONUSERBASE environment variable

The PYTHONUSERBASE environment variable tells Python/pip where to install user-specific Python packages when using the --user flag with pip install. By setting PYTHONUSERBASE, we can redirect the installation to a custom location, like the project directory.

some_build_job:
  stage: build
  cache:
    key: your-cache-key
    paths:
      - $PYTHONUSERBASE/lib/
  variables:
    PYTHONUSERBASE: $CI_PROJECT_DIR
  script:
    - pip install -r requirements.txt --user
    - python app.py

The dependencies will be stored in the lib directory within the GitLab project directory. You just need to remember to append --user to all pip install commands!
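As a rough sketch of how two jobs could share this setup: the variable is declared once at the top level so both jobs resolve the same user base, and it points at a subdirectory instead of the project root. The .pyuser directory name and the job names are assumptions made for illustration, not something the approach requires.

```yaml
variables:
  # Assumption: a dedicated subdirectory (hypothetical name) inside the project directory
  PYTHONUSERBASE: "$CI_PROJECT_DIR/.pyuser"

build:
  image: python:3-slim
  stage: build
  cache:
    key: your-cache-key
    paths:
      - .pyuser/lib/   # where --user installs end up under PYTHONUSERBASE
  script:
    - pip install --user -r requirements.txt

test:
  image: python:3-slim
  stage: test
  cache:
    key: your-cache-key
    paths:
      - .pyuser/lib/
  script:
    # With the cache restored, pip should find the requirements already installed
    - pip install --user -r requirements.txt
    - python app.py
```

Keeping the user base in its own subdirectory means the cache path cannot collide with a lib directory that is part of the repository.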

Using a virtual environment (venv)

A virtual environment (venv) is an isolated environment for Python projects. For our particular needs, using a virtual environment allows us to manage dependencies separately from the system-wide Python installation and store them in the project directory.

some_build_job:
  stage: build
  cache:
    key: your-cache-key
    paths:
      - .venv/lib/
  before_script:
    - python -m venv .venv # Create virtual environment inside the .venv directory
    - source .venv/bin/activate # Activate the environment
  script:
    - pip install -r requirements.txt

Creating helpful content is a demanding task that consumes significant time and energy. Putting this article together took more than 10 hours of work, which you can digest in 10 minutes. If you found this valuable and wish to show your support, I’d appreciate it if you could leave a comment and share this article. Don’t forget to hit that 👏 button a few times (up to 50, in fact)!

Pip will still store its cache outside of the .venv directory, but we really don’t care since we have established earlier that this cache has no positive impact on the pipeline execution time.
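One possible refinement, sketched here under the assumption that requirements.txt is the only file that defines the dependencies: derive the cache key from that file with cache:key:files, so that a changed requirements file starts a fresh cache rather than reusing a stale one. Calling pip as python -m pip is an extra precaution of mine, so that the pip module of the activated interpreter is the one that runs.

```yaml
some_build_job:
  stage: build
  cache:
    key:
      files:
        - requirements.txt   # the key changes whenever this file changes
    paths:
      - .venv/lib/
  before_script:
    - python -m venv .venv          # create (or reuse) the virtual environment
    - source .venv/bin/activate     # activate it for the rest of the job
  script:
    - python -m pip install -r requirements.txt
```

With a static key, packages removed from requirements.txt would stay in the cached directory. A file-based key avoids that at the cost of one full install after every dependency change.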

Personally, I find the solution with virtual environments to be the cleanest.

I think the official GitLab documentation is WRONG!

At the time of this writing, the official GitLab documentation does mention using virtual environments (but fails to explain why).

The example given just uses the PIP_CACHE_DIR environment variable to change where pip stores its cache. It then caches only this path, .cache/pip, which does not lead to an improvement in performance.

Global cache configuration

You might be tempted to use a global cache configuration, like this:

default:
  cache:
    key: your-cache-key
    paths:
      - .venv/lib/

By default, each job uses a pull-push policy, meaning that EVERY job will download the cache (which is something we do want to happen) and EVERY job will upload a new cache (which is typically NOT what we want).

You can override this behavior in a job as shown in this example:

default:
  cache: &global_cache
    key: your-cache-key
    paths:
      - .venv/lib/

some_job:
  cache:
    <<: *global_cache # Inherit the global cache settings
    policy: pull # Override the policy from pull-push (default) to pull
  ...
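Putting the pieces together, a complete pipeline could look roughly like the sketch below. It assumes the virtual environment approach, the python:3-slim image and the app.py entry point used earlier. The job names are placeholders, and python -m pip is my own precaution to make sure the pip of the activated interpreter is used.

```yaml
default:
  image: python:3-slim
  before_script:
    - python -m venv .venv          # create (or reuse) the virtual environment
    - source .venv/bin/activate     # activate it in every job
  cache: &global_cache
    key: your-cache-key
    paths:
      - .venv/lib/

build:
  stage: build
  # Inherits the default cache with the default pull-push policy,
  # so this job is the one that uploads the cache.
  script:
    - python -m pip install -r requirements.txt

test:
  stage: test
  cache:
    <<: *global_cache   # inherit the global cache settings
    policy: pull        # download only, never upload
  script:
    - python -m pip install -r requirements.txt
    - python app.py
```

If the test job finds no cache, pip simply installs everything again, so the pipeline still works; it is just slower.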

Conclusion

I hope this article helped you better understand how to configure Python caches in GitLab CI/CD so that you can actually get a faster pipeline. Leave a comment in the section below if you have any questions. I would love to hear from you!

Thank you for sticking with this article until the end. If you enjoyed it, please leave a comment, share, and press that 👏 a few times (up to 50 times). It will help others discover this information.

Want to learn more about GitLab CI/CD?

I am not only writing here on Medium, but I am also an online instructor. So if you are interested in learning more about GitLab CI/CD, maybe you want to take a look at my Gitlab CI/CD: Pipelines and DevOps for Beginners online course.

Published in DevOps with Valentine

Written by Valentin Despa

Software developer, educator & overlander • GitLab Hero • AWS Community Builder • Postman Supernova • Imprint: http://vdespa.com/imprint