Creating Reusable CI/CD Pipelines with GitLab
CI/CD is vital to any project. It enables developers to spend more time building new features and less time worrying about integrating those features into the main version of the code. However, CI/CD pipelines vary depending on the type of project in development.
In this article, I’ll walk through how we built reusable CI/CD pipelines for PyPI projects with GitLab CI. This article assumes that you have some knowledge of Docker, CI/CD, and Git.
What is PyPI?
PyPI, the Python Package Index, is a repository of publicly available Python packages that can be installed via pip. Commonly used libraries downloaded from PyPI include pandas, NumPy, and pytest.
What is GitLab?
GitLab is primarily a web-based version control tool that provides many additional features, including CI/CD support and a Docker registry.
CI/CD in GitLab
The only requirements to implement GitLab CI/CD for a project are:
- A project hosted in a Git repository. As long as the project has version control configured, you can set up your CI/CD in GitLab.
- A .gitlab-ci.yml file at the root of your project directory. There can only be one file called .gitlab-ci.yml in a project. It contains the project-specific test and deployment scripts that you need in your pipeline. An individual script used in a .gitlab-ci.yml is called a job, and there can be multiple jobs in a CI/CD pipeline.
- A runner configured in GitLab. A GitLab runner is a VM instance that runs the jobs in .gitlab-ci.yml.
Reusable Pipelines
GitLab CI allows developers to use external YAML configuration files in their own project’s .gitlab-ci.yml via the include keyword. This feature allows projects to pull multiple external and/or local YAML files into their own YAML configuration.
I plan to implement this feature. I will build a pipeline in two separate repositories; then any developer who wants to implement pipelines in their PyPI project can do so by adding an include statement to their project’s .gitlab-ci.yml.
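Such a project configuration might look something like this (the group and repository paths here are placeholders, not the actual repositories):

```yaml
# Hypothetical project .gitlab-ci.yml -- the template repository
# paths and file names below are illustrative placeholders.
include:
  - project: 'ci-templates/docker-build-pipeline'
    file: 'docker-build.gitlab-ci.yml'
  - project: 'ci-templates/pypi-pipeline'
    file: 'pypi-pipeline.gitlab-ci.yml'
```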
In the above example, a repository contains a PyPI project that has a .gitlab-ci.yml file at the root. Instead of writing a custom CI/CD pipeline, this project uses CI/CD pipelines from two external repositories. The first pipeline contains a Docker build job that creates a Docker image. The second pipeline runs the project’s tests and deploys the project’s Python library to PyPI. These external CI/CD pipelines can be used for any PyPI project.
I have a PyPI project configured to use external CI/CD pipelines. It’s now time to create those external pipelines.
Creating the Docker Build Pipeline
This pipeline will contain a build job that creates a Docker image. Every repository in GitLab comes with a private Docker container registry, which is where the user’s custom Docker image will be stored. This build job will create the Docker image used in the second pipeline to run tests and deploy the project’s PyPI package.
In order to use the Docker build job, the user needs a Dockerfile at the root of their PyPI project.
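A minimal Dockerfile along these lines does the trick (the base image version and the exact set of pipeline tools here are assumptions, not a definitive list):

```dockerfile
# Sketch of a Dockerfile for the PyPI pipeline; base image and
# tool list are illustrative.
FROM python:3.8

# Install the project's own dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Install the additional tools the pipeline's test and deploy jobs need
RUN pip install pytest coverage pylint bandit pydocstyle twine
```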
This sample Dockerfile has a RUN command that installs all dependencies from a project’s requirements.txt. It also includes a command that installs the additional dependencies that are required in the PyPI pipeline.
With this Dockerfile, a user can now run the job that builds a Docker image.
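A sketch of such a build job follows; the job name and the Docker-in-Docker setup are one possible arrangement (your runner must support dind), and only the keywords described below are taken from the original:

```yaml
# Sketch of the Docker build job; job name and dind setup are assumptions.
docker_build:
  stage: .pre                       # always runs first in the workflow
  image: docker:stable
  services:
    - docker:dind
  variables:
    IMAGE_TAG: $CI_REGISTRY_IMAGE:latest
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build -t $IMAGE_TAG .
    - docker push $IMAGE_TAG
  rules:
    - changes:                      # only run when these files change
        - Dockerfile
        - requirements.txt
```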
This job builds a Docker image and pushes the image to the GitLab Docker container registry. Notice the rules: changes keywords at the bottom of the file. This rule ensures that the Docker build job only runs when there are changes to the Dockerfile or requirements.txt of the user’s project. The stage of the job is defined as .pre. The .pre stage ensures that when this job is triggered, it is the first job to run in any workflow.
There is also an important IMAGE_TAG variable that is used as the name of the created Docker image. $CI_REGISTRY_IMAGE is a predefined environment variable in GitLab CI; its value is the address of the project’s container registry. This job therefore creates a Docker image with the naming convention <path/to/your/project>:latest, where <path/to/your/project> comes from the predefined $CI_REGISTRY_IMAGE environment variable.
Now, all a user has to do is add this job to their .gitlab-ci.yml. With the Docker build job ready to go, it’s time to create a pipeline that runs tests and deploys a PyPI package.
The PyPI CI/CD Pipeline
I want the following jobs to run when the PyPI CI/CD pipeline is triggered:
- Unit Tests
- Static Code Analysis
A. Lint Testing
B. Security Testing
C. Documentation Testing
Testing is a critical component of continuous integration in any project. Any time a feature branch is merged into the master branch, there is always a chance that the new updates will break the existing code. To ensure the new features won’t break the master branch, I am going to include both unit test and static code analysis jobs in my CI/CD pipeline. I want these tests to run every time someone commits code to any branch or if a merge request is opened.
- Dev PyPI Deploy Package
- Prod PyPI Deploy Package
These last two jobs are specific to my PyPI package pipeline. After the test jobs run and pass, these jobs execute and deploy the project’s PyPI package, depending on the context of the commit or merge request.
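A condensed pypi-pipeline.gitlab-ci.yml capturing this structure might look like the following. The job scripts are illustrative, the static code analysis jobs (lint, security, documentation) are omitted for brevity, and the use of TestPyPI for the dev deploy is an assumption:

```yaml
# Condensed sketch of pypi-pipeline.gitlab-ci.yml; scripts are illustrative
# and the static analysis stages are omitted for brevity.
stages:
  - unittest
  - coverage_test
  - dev_deploy
  - prod_deploy

unittest:
  stage: unittest
  image: $IMAGE_NAME               # custom image from the Docker build pipeline
  script:
    - python -m pytest
  rules:
    - if: $CI_COMMIT_BRANCH || $CI_MERGE_REQUEST_ID

coverage_test:
  stage: coverage_test
  image: $IMAGE_NAME
  script:
    - coverage run -m pytest
    - coverage report
  rules:
    - if: $CI_COMMIT_BRANCH || $CI_MERGE_REQUEST_ID

dev_deploy:
  stage: dev_deploy
  image: $IMAGE_NAME
  script:
    - python setup.py sdist bdist_wheel
    - twine upload --repository testpypi dist/*   # dev index is an assumption
  rules:
    - if: $CI_MERGE_REQUEST_ID     # only on open merge requests

prod_deploy:
  stage: prod_deploy
  image: $IMAGE_NAME
  script:
    - python setup.py sdist bdist_wheel
    - twine upload dist/*
  rules:
    - if: $CI_COMMIT_BRANCH == "master"   # only on commits to master
```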
The above pypi-pipeline.gitlab-ci.yml file comprises multiple jobs. Each individual job completes one task.
Notice that every job has four keywords:
- stage: defines when in the workflow the job runs
- image: defines the Docker image to use
- script: the shell script that is executed
- rules: determine whether the job runs
The stages keyword is at the top of the file. It defines the workflow of the jobs. In this YAML, the unittest stage runs first. The unittest stage is configured to run the unittest job. If the unittest job fails, the pipeline fails and no further stages in the workflow are run. However, if the unittest job passes, the coverage_test stage runs the coverage_test job. This workflow continues until all jobs pass or one job fails.
This YAML file only scratches the surface of what is possible with GitLab CI. Check out the GitLab CI Docs for more examples and keywords.
The main features of this PyPI-specific pipeline:
The pipeline uses the custom Docker image that is built in the separate Docker build pipeline. This Docker image comes preinstalled with the necessary dependencies. If the jobs in this pipeline used a base Python Docker image, dependencies would have to be installed for each job. This would significantly increase the time it takes for the pipeline to run. Because the jobs use a custom Docker image with the dependencies already installed, the jobs don’t need to be configured to install the project dependencies.
The pipeline jobs have rules attached to them. This ensures that jobs do not all run every time the pipeline triggers. Instead, the specific rules attached to each job determine when it runs.
The test jobs run anytime there is a commit or merge request, regardless of the branches involved. The predefined $CI_COMMIT_BRANCH and $CI_MERGE_REQUEST_ID environment variables ensure this happens.
The dev_deploy job only runs when there is an open merge request. This dev version of the PyPI package is completely separate from, and in no way impacts, the prod version.
The prod_deploy job only runs when there is a commit to the master branch. I have this rule in place because users will only want to deploy PyPI packages that are ready for public use.
Now that the pipeline is complete, it’s time to use it in a PyPI project.
Using the Pipelines in the Dataframez Project
The .gitlab-ci.yml configuration file shown earlier is the one for the dataframez project. Dataframez is an extension of the pandas library and is deployed to PyPI. The dataframez project is a perfect fit for the pipelines I created.
For the external PyPI pipeline to work, dataframez needs:
- a requirements.txt
- a Dockerfile (this can be a copy of the one I created earlier)
- a .gitlab-ci.yml that uses include statements in order to run the external pipelines
- a GitLab runner configured for the project
- project environment variables configured to connect to PyPI via twine
The above files are placed at the root of the project.
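The dataframez configuration might then look like the following (the template repository paths and file names here are placeholders, as before):

```yaml
# Sketch of the dataframez .gitlab-ci.yml; template paths are placeholders.
variables:
  IMAGE_NAME: $CI_REGISTRY_IMAGE   # image built by the Docker build pipeline

include:
  - project: 'ci-templates/docker-build-pipeline'
    file: 'docker-build.gitlab-ci.yml'
  - project: 'ci-templates/pypi-pipeline'
    file: 'pypi-pipeline.gitlab-ci.yml'
```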
This .gitlab-ci.yml includes YAML files from two repositories.
The first is the Docker build YAML that builds a Docker image from the dataframez project’s Dockerfile. The second is the PyPI pipeline that runs the dataframez unit tests and static code analysis tests, and then deploys the dataframez package to PyPI if the rules are met (a commit to the master branch).
The dataframez .gitlab-ci.yml combines the two external pipelines into one. Notice that this file defines a variable called IMAGE_NAME, which equals $CI_REGISTRY_IMAGE. The Docker build job uses $CI_REGISTRY_IMAGE to name the created Docker image. IMAGE_NAME is used in the PyPI pipeline as the Docker image that runs the jobs in the workflow. The workflow of the pipeline is configured so the Docker build job runs first if the Dockerfile or requirements.txt are updated. This ensures that the PyPI pipeline is always using the most up-to-date Docker image.
Now, whenever someone commits to any branch in the dataframez project, the CI/CD pipeline is triggered.
Closing Thoughts
While this PyPI pipeline is just one of many reusable pipelines that can be created for nearly any type of project, the baseline process of building reusable pipelines is much the same. GitLab CI holds many more avenues to explore. Hopefully, I was able to give you a solid overview of how CI/CD pipelines work in GitLab.
Sam Kohlleffel is in Hashmap’s RTE Internship program developing data and cloud applications and is also a student at Texas A&M University studying Economics, Math, and Statistics.