Building a data engineering project. Part 4 — exploring CI/CD

Kirill Zaitsev
7 min read · Mar 21, 2022


Credit: https://itglobal.com/wp-content/uploads/2019/11/devops-manage-it.jpg

We’ve come a long way in building our data engineering project (see Part 3), and we are now ready to take the final step of the series.

I want to add a final touch: something that is used in almost every industry project and that software engineers of any specialty should be familiar with. Welcome to CI/CD.

Intro

In this part of the series, I want to show a sample CI/CD pipeline for our inverted index project. We will use GitHub Actions as the automation tool. Additionally, I will demonstrate several popular CI/CD technologies by replicating the original pipeline in each of them. We will finish with a discussion of an emerging DevOps paradigm hugely influenced by the adoption of machine learning across industries.

What is CI/CD and why do we care

Pipeline automation is about ensuring that code satisfies specific requirements, so that its developers can say, ‘This new code can replace the one in production.’ These requirements include, but are certainly not limited to:

  • adherence to the chosen code style: does your Python code satisfy PEP-8? (note that there are many language standards, and developers choose one at the early stages of a project)
  • code integrity: how the changes we made interact with the rest of the code, according to the tests we have
  • sufficient documentation and test coverage: a project must not degrade in either of these for new changes to be accepted
  • following the release strategy: when the linter is happy and all tests pass, we are safe to update production

Continuous integration (CI) is about automating the validation of the first three items on the list. Continuous delivery (CD) covers the last one, where ‘updating production’ means updating the production code in version control. Finally, continuous deployment (also CD) means updating your running production application with the recent code changes (see this lecture by Martin Fowler).

As a follow-up, here are additional great articles on CI/CD by Red Hat and GitHub.

Extending our project with a GitHub workflow

One of the most accessible CI/CD options is offered by GitHub itself: GitHub Actions.

Our app has components in Java and Python, and I suggest making two different workflows that run based on what part of the system has changed. Here is what a standalone pipeline with a linter and test runner looks like for Python:

name: Python CI
on:
  pull_request:
    branches: ["*"]
    paths: ["**.py"]
  push:
    branches: ["*"]
    paths: ["**.py"]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.8"
      - name: Install essentials
        run: |
          python -m pip install --upgrade pip pipenv
      - id: cache-pipenv
        uses: actions/cache@v1
        with:
          path: ~/.local/share/virtualenvs
          key: ${{ runner.os }}-pipenv-${{ hashFiles('**/Pipfile.lock') }}
      - name: Install other dependencies
        if: steps.cache-pipenv.outputs.cache-hit != 'true'
        run: |
          pipenv sync
          pipenv install flake8
      - name: Lint
        run: |
          # stop the build if there are Python syntax errors or undefined names
          pipenv run flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
          # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
          pipenv run flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
      - name: Test with pytest
        run: |
          pipenv run python -m nltk.downloader wordnet omw-1.4 punkt stopwords
          pipenv run python -m pytest -m "not skip" tests/test_python

Note a couple of things here:

  1. The workflow runs only if *.py files change.
  2. There is only one job — build — containing dependency setup, linting, and testing steps.
  3. Python dependencies are cached! This way, we save around 1 minute for each workflow run.

The first thing to do when building a workflow is to look for a suitable template. Many templates can be found under the Actions -> New workflow tab of a GitHub repository, and the resulting workflow files live in the repository’s .github/workflows/ directory. To build this workflow, I started with the official GitHub Actions template, ‘Python application,’ and extended it with caching, pipenv, and pytest.

Having seen CI for Python, note how much the Java workflow has in common with it:

name: Java CI with Gradle
on:
  pull_request:
    branches:
      - '*'
  push:
    branches:
      - '*'
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up JDK 8
        uses: actions/setup-java@v2
        with:
          java-version: '8'
          distribution: 'adopt'
      - name: Grant execute permission for gradlew
        run: chmod +x gradlew
      - name: Cache Gradle packages
        uses: actions/cache@v2
        with:
          path: |
            ~/.gradle/caches
            ~/.gradle/wrapper
          key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle*', '**/gradle-wrapper.properties') }}
          restore-keys: |
            ${{ runner.os }}-gradle-
      - name: Build with Gradle
        run: ./gradlew clean build
      - name: Test Coverage
        run: ./gradlew test

Note that by specifying

uses: actions/setup-java@v2
with:
  java-version: '8'
  distribution: 'adopt'

we use a GitHub runner with Java 8 preinstalled. The rest of the steps translate into ‘take the cache from the previous Gradle build and run the tests.’
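
As an aside, recent versions of the setup-java action can manage the Gradle cache by themselves via a cache input, making the separate actions/cache step unnecessary. A minimal sketch, assuming setup-java v2.1 or newer:

- name: Set up JDK 8
  uses: actions/setup-java@v2
  with:
    java-version: '8'
    distribution: 'adopt'
    # built-in dependency caching; replaces the explicit actions/cache step above
    cache: 'gradle'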

These two workflows will save us significant time in the future by automatically running our tests, keeping code quality high, and making sure builds pass. But what about the alternatives to GitHub Actions?

Alternative automation tools: Jenkins, GitLab, and Azure Pipelines

GitHub Actions is certainly not the only automation tool; you can find a list of popular alternatives here. In this section, I want to show by example how different tools compare. The differences run deeper than syntax: GitLab has rich support for external services, Azure Pipelines can be templated, Jenkins is self-hosted on a separate server, etc. Still, we can get a sense of how the actual pipelines look.

One of them, Jenkins, is a very popular CI/CD tool. Here, you define pipelines in the Groovy language (with all its pros and cons), typically in a Jenkinsfile at the repository root. Let’s have a look at the pipeline for the text tokenization service:

pipeline {
    agent { docker { image 'python:3.8.12-alpine' } }
    stages {
        stage('build') {
            steps {
                sh 'pip install --upgrade pip pipenv'
                sh 'pipenv sync'
                sh 'pipenv install flake8'
            }
        }
        stage('lint') {
            steps {
                sh 'pipenv run flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics'
            }
        }
        stage('test') {
            steps {
                sh 'pipenv run python -m nltk.downloader wordnet omw-1.4 punkt stopwords'
                sh 'pipenv run python -m pytest -m "not skip" tests/test_python'
            }
        }
    }
}

GitLab

GitLab CI/CD is another tool that is popular for its simplicity (YAML-based pipelines) and flexibility. You can find many sample pipelines here. As an example, let’s continue reimplementing our text tokenization pipeline.

image: python:3.8.12-alpine
variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
  PIPENV_CACHE_DIR: $CI_PROJECT_DIR/.cache/pipenv
cache:
  paths:
    - $PIP_CACHE_DIR
    - $PIPENV_CACHE_DIR
before_script:
  - pip install --upgrade pip pipenv
  - pipenv sync
  - pipenv install flake8
lint:
  script:
    - pipenv run flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
test:
  script:
    - pipenv run python -m nltk.downloader wordnet omw-1.4 punkt stopwords
    - pipenv run python -m pytest -m "not skip" tests/test_python

Azure Pipelines

The tool of choice if you are using the Azure DevOps platform. This link leads to various templates.

trigger:
  - {{ branch }}
jobs:
  - job: "PythonCI"
    pool: {{ pool }}
    strategy:
      matrix:
        Python37:
          python.version: "3.7"
        Python38:
          python.version: "3.8"
      maxParallel: 2
    steps:
      - task: UsePythonVersion@0
        inputs:
          versionSpec: "$(python.version)"
        displayName: "Use Python $(python.version)"
      - script: |
          python -m pip install --upgrade pip pipenv
          pipenv sync
          pipenv install flake8 pytest-cov
        displayName: "Install dependencies"
      - script: |
          pipenv run flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        displayName: "lint"
      - script: |
          pipenv run python -m nltk.downloader wordnet omw-1.4 punkt stopwords
          pipenv run python -m pytest -m "not skip" \
            --doctest-modules --junitxml=junit/test-results.xml --cov-report=xml \
            tests/test_python
        displayName: "test"

Beyond CI/CD: a new paradigm that enters DevOps practices

With machine learning (ML) applications now ubiquitous, ML models have become a new class of artifacts to manage, and they are here to stay.

This article by Google Cloud goes into DevOps for ML in detail. Notably, with the continuous training (CT) practice, we automate model management in production. Among others, this new practice covers the following use cases:

  • model retraining if a monitoring tool detects a specific type of drift;
  • model training and deployment as a release step;
  • automated sanity checks of the model for data scientists who work on new model features;
  • and others.

When you need CT in your pipeline, some tools start to shine thanks to their flexible pipeline definitions and extensive plugin support. Here I mean Jenkins in particular, whose Groovy language allows for conditional execution, looping, and more, as the sketch below shows.
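
To make this concrete, here is a minimal sketch of a conditional retraining stage in a declarative Jenkinsfile. Note that the DRIFT_DETECTED parameter and the retrain.py / deploy_model.py scripts are hypothetical placeholders for whatever your monitoring tool and training code provide:

pipeline {
    agent { docker { image 'python:3.8.12-alpine' } }
    // hypothetical flag, e.g. set by a monitoring tool that detected data drift
    parameters {
        booleanParam(name: 'DRIFT_DETECTED', defaultValue: false)
    }
    stages {
        stage('retrain') {
            // the stage runs only when retraining was requested
            when { expression { params.DRIFT_DETECTED } }
            steps {
                sh 'pipenv run python retrain.py' // placeholder training script
            }
        }
        stage('deploy model') {
            when { expression { params.DRIFT_DETECTED } }
            steps {
                sh 'pipenv run python deploy_model.py' // placeholder deployment step
            }
        }
    }
}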

To learn more about using Jenkins for continuous training, you can take a look at these two articles: article 1 and article 2.

Summary and takeaways

In this part of the series, we explored how to extend our project with the CI/CD practice. We implemented a pipeline for our project using GitHub Actions and compared it with alternatives on other popular platforms: Jenkins, GitLab, and Azure Pipelines.

Finally, we discussed a novel DevOps paradigm that extends CI/CD with the continuous training stage for machine learning models.

Here are some takeaways from this article:

  • CI/CD is essential for modern software development and is hugely valued beyond the data engineering field.
  • Each automation tool has its pros and cons that may not align with your needs at some project stage. Therefore, choose the tool wisely before the project starts.
  • VCS platforms allow you to use third-party solutions for your CI/CD. For example, you may use CircleCI with GitHub instead of GitHub Actions (see the sketch after this list).
  • Automation paradigms evolve. With machine learning getting more and more popular, continuous training is gaining adoption among data science teams.
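
To illustrate that third-party option, here is a minimal sketch of a CircleCI configuration (.circleci/config.yml) mirroring the lint-and-test part of our GitHub Actions workflow; the job name and the cimg/python image tag are my assumptions, not something taken from the project:

version: 2.1
jobs:
  lint-and-test:
    docker:
      - image: cimg/python:3.8  # assumed image; any Python 3.8 image would do
    steps:
      - checkout
      - run: pip install --upgrade pip pipenv
      - run: pipenv sync && pipenv install flake8
      # same checks as in the GitHub Actions workflow
      - run: pipenv run flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
      - run: |
          pipenv run python -m nltk.downloader wordnet omw-1.4 punkt stopwords
          pipenv run python -m pytest -m "not skip" tests/test_python
workflows:
  ci:
    jobs:
      - lint-and-test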

Final words

I want to thank you for being part of the ‘Building a data engineering project’ series. We have covered quite a lot, and I hope this experience will help you become a better engineer and broaden your horizons in the programming world!
