Creating Reusable CI/CD Pipelines with GitLab
CI/CD is vital to any project. It enables developers to spend more time building new features and less time worrying about integrating those features into the main version of the code. However, CI/CD pipelines vary depending on the type of project in development.
In this article, I’ll walk through how we built reusable CI/CD pipelines for PyPI projects with GitLab CI. This article assumes that you have some knowledge of Docker, CI/CD, and Git.
What is PyPI?
PyPI, the Python Package Index, is a repository of publicly available Python packages that can be installed via pip. Commonly used libraries downloaded from PyPI include pandas, NumPy, and pytest.
What is GitLab?
GitLab is primarily a web-based version control tool that provides many additional features, including CI/CD support and a Docker registry.
CI/CD in GitLab
The only requirements to implement GitLab CI/CD for a project are:
- A project hosted in a Git repository. As long as the project has version control configured, you can set up your CI/CD in GitLab.
- A .gitlab-ci.yml file at the root of your project directory. There can only be one file called .gitlab-ci.yml in a project. It contains the project-specific test and deployment scripts that you need in your pipeline. An individual script used in a .gitlab-ci.yml is called a job, and there can be multiple jobs in a CI/CD pipeline.
- A runner configured in GitLab. A GitLab runner is a VM instance that runs the jobs in .gitlab-ci.yml.
Reusable Pipelines
GitLab CI allows developers to use external YAML configuration files in their own project’s .gitlab-ci.yml via the include keyword. This feature allows projects to pull multiple external and/or local YAML files into their own YAML configuration.
I plan to implement this feature. I will build a pipeline in two separate repositories; then any developer who wants to implement pipelines in their PyPI project can do so by adding an include statement to their project’s .gitlab-ci.yml.
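Such a project configuration might look something like this (the group and repository paths here are placeholders, not the actual repositories):

```yaml
# Hypothetical project .gitlab-ci.yml -- the template repository
# paths and file names below are illustrative placeholders.
include:
  - project: 'ci-templates/docker-build-pipeline'
    file: 'docker-build.gitlab-ci.yml'
  - project: 'ci-templates/pypi-pipeline'
    file: 'pypi-pipeline.gitlab-ci.yml'
```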
In the above example, a repository contains a PyPI project that has a .gitlab-ci.yml file at the root. Instead of writing a custom CI/CD pipeline, this project uses CI/CD pipelines from two external repositories. The first pipeline contains a Docker build job that creates a Docker image. The second pipeline runs the project’s tests and deploys the project’s Python library to PyPI. These external CI/CD pipelines can be used for any PyPI project.
I have a PyPI project configured to use external CI/CD pipelines. It’s now time to create those external pipelines.
Creating the Docker Build Pipeline
This pipeline will contain a build job that creates a Docker image. Every repository in GitLab comes with a private Docker container registry, which is where the user’s custom Docker image will be stored. This build job will create the Docker image used in the second pipeline to run tests and deploy the project’s PyPI package.
In order to use the Docker build job, the user needs a Dockerfile at the root of their PyPI project.
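A minimal Dockerfile along these lines does the trick (the base image version and the exact set of pipeline tools here are assumptions, not a definitive list):

```dockerfile
# Sketch of a Dockerfile for the PyPI pipeline; base image and
# tool list are illustrative.
FROM python:3.8

# Install the project's own dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Install the additional tools the pipeline's test and deploy jobs need
RUN pip install pytest coverage pylint bandit pydocstyle twine
```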
This sample Dockerfile has a RUN command that installs all dependencies from a project’s requirements.txt. It also includes a command that installs the additional dependencies that are required in the PyPI pipeline.
With this Dockerfile, a user can now run the job that builds a Docker image.
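A sketch of such a build job follows; the job name and the Docker-in-Docker setup are one possible arrangement (your runner must support dind), and only the keywords described below are taken from the original:

```yaml
# Sketch of the Docker build job; job name and dind setup are assumptions.
docker_build:
  stage: .pre                       # always runs first in the workflow
  image: docker:stable
  services:
    - docker:dind
  variables:
    IMAGE_TAG: $CI_REGISTRY_IMAGE:latest
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build -t $IMAGE_TAG .
    - docker push $IMAGE_TAG
  rules:
    - changes:                      # only run when these files change
        - Dockerfile
        - requirements.txt
```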
This job builds a Docker image and pushes the image to the GitLab Docker container registry. Notice the rules: changes keywords at the bottom of the file. This rule ensures that the Docker build job only runs when there are changes to the Dockerfile or requirements.txt of the user’s project. The stage of the job is defined as .pre. The .pre stage ensures that when this job is triggered, it is the first job to run in any workflow.
There is also an important IMAGE_TAG variable that is used as the name of the created Docker image. $CI_REGISTRY_IMAGE is a predefined environment variable in GitLab CI; its value is the address of the project’s container registry. This job therefore creates a Docker image with the naming convention <path/to/your/project>:latest, where <path/to/your/project> comes from the predefined $CI_REGISTRY_IMAGE environment variable.
Now, all a user has to do is add this job to their .gitlab-ci.yml. With the Docker build job ready to go, it’s time to create a pipeline that runs tests and deploys a PyPI package.
The PyPI CI/CD Pipeline
I want the following jobs to run when the PyPI CI/CD pipeline is triggered:
- Unit Tests
- Static Code Analysis
A. Lint Testing
B. Security Testing
C. Documentation Testing
Testing is a critical component of continuous integration in any project. Any time a feature branch is merged into the master branch, there is always a chance that the new updates will break the existing code. To ensure the new features won’t break the master branch, I am going to include both unit test and static code analysis jobs in my CI/CD pipeline. I want these tests to run every time someone commits code to any branch or if a merge request is opened.
- Dev PyPI Deploy Package
- Prod PyPI Deploy Package
These last two jobs are specific to my PyPI package pipeline. After the test jobs run and pass, these jobs execute and deploy the project’s PyPI package, depending on the context of the commit or merge request.
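A condensed pypi-pipeline.gitlab-ci.yml capturing this structure might look like the following. The job scripts are illustrative, the static code analysis jobs (lint, security, documentation) are omitted for brevity, and the use of TestPyPI for the dev deploy is an assumption:

```yaml
# Condensed sketch of pypi-pipeline.gitlab-ci.yml; scripts are illustrative
# and the static analysis stages are omitted for brevity.
stages:
  - unittest
  - coverage_test
  - dev_deploy
  - prod_deploy

unittest:
  stage: unittest
  image: $IMAGE_NAME               # custom image from the Docker build pipeline
  script:
    - python -m pytest
  rules:
    - if: $CI_COMMIT_BRANCH || $CI_MERGE_REQUEST_ID

coverage_test:
  stage: coverage_test
  image: $IMAGE_NAME
  script:
    - coverage run -m pytest
    - coverage report
  rules:
    - if: $CI_COMMIT_BRANCH || $CI_MERGE_REQUEST_ID

dev_deploy:
  stage: dev_deploy
  image: $IMAGE_NAME
  script:
    - python setup.py sdist bdist_wheel
    - twine upload --repository testpypi dist/*   # dev index is an assumption
  rules:
    - if: $CI_MERGE_REQUEST_ID     # only on open merge requests

prod_deploy:
  stage: prod_deploy
  image: $IMAGE_NAME
  script:
    - python setup.py sdist bdist_wheel
    - twine upload dist/*
  rules:
    - if: $CI_COMMIT_BRANCH == "master"   # only on commits to master
```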
The above pypi-pipeline.gitlab-ci.yml file comprises multiple jobs. Each individual job completes one task.
Notice that every job has four keywords:
- stage: defines when in the workflow the job runs
- image: defines the Docker image to use
- script: the shell script that is executed
- rules: determine whether the job runs
The stages keyword is at the top of the file. It defines the workflow of the jobs. In this YAML, the unittest stage runs first. The unittest stage is configured to run the unittest job. If the unittest job fails, the pipeline fails and no further stages in the workflow are run. However, if the unittest job passes, the coverage_test stage runs the coverage_test job. This workflow continues until all jobs pass or one job fails.
This YAML file only scratches the surface of what is possible with GitLab CI. Check out the GitLab CI Docs for more examples and keywords.
The main features of this PyPI-specific pipeline:
The pipeline uses the custom Docker image that is built in the separate Docker build pipeline. This Docker image comes preinstalled with the necessary dependencies. If the jobs in this pipeline used a base Python Docker image, dependencies would have to be installed for each job. This would significantly increase the time it takes for the pipeline to run. Because the jobs use a custom Docker image with the dependencies already installed, the jobs don’t need to be configured to install the project dependencies.
The pipeline jobs have rules attached to them. This ensures that jobs do not all run every time the pipeline triggers. Instead, the specific rules attached to each job determine when it runs.
The test jobs run anytime there is a commit or merge request, regardless of the branches involved. The predefined $CI_COMMIT_BRANCH and $CI_MERGE_REQUEST_ID environment variables ensure this happens.
The dev_deploy job only runs when there is an open merge request. This dev version of the PyPI package is completely separate from, and in no way impacts, the prod version.
The prod_deploy job only runs when there is a commit to the master branch. I have this rule in place because users will only want to deploy PyPI packages that are ready for public use.
Now that the pipeline is complete, it’s time to use it in a PyPI project.
Using the Pipelines in the Dataframez Project
The .gitlab-ci.yml configuration file shown earlier is the one for the dataframez project. Dataframez is an extension of the pandas library and is deployed to PyPI. The dataframez project is a perfect fit for the pipelines I created.
For the external PyPI pipeline to work, dataframez needs:
- a requirements.txt
- a Dockerfile (this can be a copy of the one I created earlier)
- a .gitlab-ci.yml that uses include statements in order to run the external pipelines
- a GitLab runner configured for the project
- project environment variables configured to connect to PyPI via twine
The above files are placed at the root of the project.
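The dataframez configuration might then look like the following (the template repository paths and file names here are placeholders, as before):

```yaml
# Sketch of the dataframez .gitlab-ci.yml; template paths are placeholders.
variables:
  IMAGE_NAME: $CI_REGISTRY_IMAGE   # image built by the Docker build pipeline

include:
  - project: 'ci-templates/docker-build-pipeline'
    file: 'docker-build.gitlab-ci.yml'
  - project: 'ci-templates/pypi-pipeline'
    file: 'pypi-pipeline.gitlab-ci.yml'
```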
This .gitlab-ci.yml includes YAML files from two repositories.
The first is the Docker build YAML that builds a Docker image from the dataframez project’s Dockerfile. The second is the PyPI pipeline that runs the dataframez unit tests and static code analysis tests, and then deploys the dataframez package to PyPI if the rules are met (a commit to the master branch).
The dataframez .gitlab-ci.yml combines the two external pipelines into one. Notice that this file defines a variable called IMAGE_NAME, which equals $CI_REGISTRY_IMAGE. The Docker build job uses $CI_REGISTRY_IMAGE to name the created Docker image. IMAGE_NAME is used in the PyPI pipeline as the Docker image that runs the jobs in the workflow. The workflow of the pipeline is configured so the Docker build job runs first if the Dockerfile or requirements.txt are updated. This ensures that the PyPI pipeline is always using the most up-to-date Docker image.
Now, whenever someone commits to any branch in the dataframez project, the CI/CD pipeline is triggered.
Closing Thoughts
While this PyPI pipeline is just one of many reusable pipelines that can be created for nearly any type of project, the baseline process of building reusable pipelines is much the same. GitLab CI holds many more avenues to explore. Hopefully, I was able to give you a solid overview of how CI/CD pipelines work in GitLab.
Sam Kohlleffel is in Hashmap’s RTE Internship program developing data and cloud applications and is also a student at Texas A&M University studying Economics, Math, and Statistics.