Setup Gitlab CI/CD for Machine Learning Project

AC
Data Folks Indonesia
7 min read · Jun 24, 2023

This article demonstrates a basic CI/CD configuration on GitLab. Continuous Integration and Continuous Deployment (CI/CD) describe an end-to-end process that carries changes from development all the way to the production environment. CI/CD automates the work of code integration, such as integration tests, unit tests, and regression tests, as well as the deployment process, against a set of predefined criteria. Hence, CI/CD reduces the manual effort needed to maintain the quality of the software.

Photo by Quinten de Graaf on Unsplash

This article focuses on the inference part; you may need to add more layers for the model development part, but the process is almost the same.

Prerequisites

Please read these pages to grasp the concept of CI/CD

and yep, this is me writing a technical tutorial while thinking about going to the beach

Let’s continue:

  • Create a GitLab account at gitlab.com
  • Create a fly.io account for web API deployment
  • Create a repository on gitlab.com
  • Clone the repository

run git clone git@gitlab.com:<your-username>/iris-api.git

  • Build a simple RESTful API

This is the file tree for this project

Create src/main.py

"""Iris Web API Service."""
from fastapi import FastAPI
from pydantic import BaseModel
import numpy as np
from src import distance, iris

dataset = iris.get_iris_data()

app = FastAPI()


class Item(BaseModel):
    """Input class for the predict endpoint.

    Args:
        BaseModel (BaseModel): Inherited from pydantic
    """

    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


@app.get("/")
def homepage():
    """Homepage for the web.

    Returns:
        str: Homepage
    """
    return "Homepage Iris Flower - tags 0.0.2"


@app.post("/predict/")
async def predict(item: Item):
    """Predict function for inference.

    Args:
        item (Item): dictionary of sepal and petal data

    Returns:
        str: the predicted target
    """
    sepal_length = item.sepal_length
    sepal_width = item.sepal_width
    petal_length = item.petal_length
    petal_width = item.petal_width

    data_input = np.array([[sepal_length, sepal_width, petal_length, petal_width]])

    result = distance.calculate_manhattan(dataset, data_input)
    return result
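Once the files below are in place, you can try the service locally before touching CI at all. A minimal smoke test might look like this (port and field values are illustrative; the field names must match the Item model above):

```shell
# Start the API from the project root:
#   uvicorn src.main:app --port 8000
# Then, from another terminal, send a sample measurement:
curl -X POST http://localhost:8000/predict/ \
  -H "Content-Type: application/json" \
  -d '{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}'
```

The response should be the predicted class name, e.g. "setosa".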

Create src/iris.py

"""Load iris dataset from scikit-learn."""
from sklearn import datasets


def get_iris_data():
    """Load iris dataset.

    Returns:
        tuple: consists of X, y, feature names, and target names
    """
    iris = datasets.load_iris()
    x_data = iris.data
    y_label = iris.target
    features_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    target_names = iris.target_names

    return x_data, y_label, features_names, target_names


if __name__ == "__main__":
    x_data, y_label, features, target_names = get_iris_data()

    print("X", x_data)
    print("y", y_label)
    print("features", features)
    print("target_names", target_names)

Create src/distance.py

"""Distance module for calculating distance between data input and dataset."""
import numpy as np


def calculate_manhattan(iris_data: np.ndarray, input_data: np.ndarray):
    """Calculate the distance between 2 vectors using Manhattan distance.

    Args:
        iris_data (np.ndarray): Iris dataset
        input_data (np.ndarray): 1x4 matrix data input

    Returns:
        str: the predicted target name
    """
    x_data, y_label, _, target_names = iris_data

    # Manhattan distance is the sum of absolute differences (no square root).
    distance = np.sum(np.abs(x_data - input_data), axis=1)
    distance_index = np.argsort(distance)
    y_pred = target_names[y_label[distance_index[0]]]

    return y_pred


if __name__ == "__main__":
    dataset = [
        np.array([[4.9, 3.0, 1.4, 0.2], [4.9, 3.0, 1.4, 0.9]]),
        [0, 0],
        ["sepal_length", "sepal_width", "petal_length", "petal_width"],
        ["setosa", "versicolor", "virginica"],
    ]
    sample_data = np.array([[4.9, 3.0, 1.4, 0.2]])
    print(calculate_manhattan(dataset, sample_data))
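If the NumPy version looks dense, the same nearest-neighbour idea can be sketched in plain Python (toy data, illustrative names only): Manhattan distance is the sum of absolute coordinate differences, and the prediction is simply the label of the closest training row.

```python
# Pure-Python sketch of the Manhattan nearest-neighbour logic above.
def manhattan(row, query):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(row, query))


def predict(x_data, y_label, target_names, query):
    # Find the index of the training row closest to the query.
    distances = [manhattan(row, query) for row in x_data]
    nearest = min(range(len(distances)), key=distances.__getitem__)
    return target_names[y_label[nearest]]


# One sample per class, mimicking the shape of the iris dataset.
x_data = [[4.9, 3.0, 1.4, 0.2], [6.4, 3.2, 4.5, 1.5], [5.9, 3.0, 5.1, 1.8]]
y_label = [0, 1, 2]
target_names = ["setosa", "versicolor", "virginica"]

print(predict(x_data, y_label, target_names, [5.0, 3.1, 1.5, 0.2]))  # setosa
```

Because `argsort`/`min` only cares about ordering, applying any monotonic function (such as a square root) to the distances would not change the prediction; still, a function called calculate_manhattan should compute the actual Manhattan distance.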

Create test/test_distance.py

"""Test for the distance module."""
import numpy as np

from src.iris import get_iris_data
from src.distance import calculate_manhattan


def test_calculate_manhattan():
    dataset = get_iris_data()
    input_data = np.array([[4.9, 3.0, 1.4, 0.2]])
    result = calculate_manhattan(dataset, input_data)
    assert result == "setosa"

Create Dockerfile

FROM python:3.10

EXPOSE 8000

WORKDIR /app

COPY . .

RUN pip install -r requirements.txt

ENTRYPOINT ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
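Before handing the image to CI, it can help to build and run it locally (the image tag here is illustrative):

```shell
# Build the image from the project root, then run it,
# mapping the exposed port to localhost:
docker build -t iris-api .
docker run --rm -p 8000:8000 iris-api
```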

Create requirements.txt

# python
pydoclint>=0.0.10
pylint>=2.17.0
black>=22.6.0
pydocstyle>=6.1.1
pytest>=7.1.2

# web app
fastapi>=0.98.0
uvicorn>=0.22.0

# models
numpy>=1.21.6
scikit-learn>=1.2.2

Create fly.toml

app = "iris-api-demo-stg"
primary_region = "sin"

[build]
dockerfile = "Dockerfile"

[http_service]
internal_port = 8000
force_https = true
auto_stop_machines = true
auto_start_machines = true
min_machines_running = 0

Setup Web App on Fly.io

First things first, please install flyctl on your computer by following this link https://fly.io/docs/hands-on/install-flyctl/

Then, authenticate with flyctl auth login

Then create a fly.io personal access token here https://fly.io/user/personal_access_tokens and save it somewhere safe; later we will add the token to the GitLab environment.

Now, you need to create 2 apps: staging and production.

Staging app

flyctl launch --auto-confirm --copy-config --dockerfile Dockerfile --name iris-api-demo-stg --now --org personal --region sin

Production app

flyctl launch --auto-confirm --copy-config --dockerfile Dockerfile --name iris-api-demo --now --org personal --region sin

Eventually, your fly.io dashboard will look something like this

Don’t forget to add the fly.io access token to the GitLab CI/CD variables for deployment purposes. Add a variable and name it FLY_TOKEN.

Setup CI/CD

*drum-roll*

Now, let’s focus on the main content here: configuring the CI/CD pipeline.

  • Let’s create a new file named .gitlab-ci.yml and call this v1
image: python:latest

docker-build:
  stage: build
  script:
    - echo "Build Docker"

code-test:
  stage: test
  script:
    - echo "Run Code Test"

production:
  stage: deploy
  environment: production
  script:
    - echo "Deploy to fly.io"

This is a simple gitlab-ci file that runs on every single push you make to the remote repo. When you push a change, 3 jobs are triggered: docker-build, code-test, and production.

Let’s dive in on how the things work.

image: python:latest means that all these jobs run on top of the latest Python Docker image, which you can find on Docker Hub.

docker-build is the name of the job. The name of the job can be anything and you can create numerous jobs in a single .yml file.

stage means which stage this job falls into. There are 3 common stages in a CI/CD pipeline: build, test, and deploy.

environment is used to specify which environment this job runs in. You will get a list of jobs grouped by environment, which lets you choose which commit you want to redeploy. This makes recovery easier if something goes south in the staging or production environment.

script allows you to write shell commands to run in the container. Think of it as a set of commands run in the terminal.

Once you are done:

git add .gitlab-ci.yml

git commit -m "add .gitlab-ci.yml"

git push

Then, you can navigate to the pipeline tab

As you can see, there are 3 green check marks showing that the jobs ran successfully. If a job fails, the icon will be a red cross.

Pipeline detailed page

Now, you have created a simple pipeline.

Let’s create a pipeline like one typically used for ML API development.

image: python:latest

code-check:
  stage: build
  only:
    - merge_requests
  script:
    - echo "Build Docker"
    - pip install -r requirements.txt
    - pylint src --rcfile=.pylintrc
    - black src --check
    - pydocstyle src

code-test:
  stage: test
  only:
    - merge_requests
  script:
    - echo "Run Code Test"
    - pip install -r requirements.txt
    - pytest

staging:
  stage: deploy
  environment: staging
  only:
    - staging
  script:
    - echo "Deploy to fly.io in staging environment"
    - curl -L https://fly.io/install.sh | sh
    - /root/.fly/bin/flyctl deploy --app iris-api-demo-stg --access-token $FLY_TOKEN

production:
  stage: deploy
  environment: production
  only:
    - tags
  script:
    - echo "Deploy to fly.io in production environment"
    - curl -L https://fly.io/install.sh | sh
    - /root/.fly/bin/flyctl deploy --app iris-api-demo --access-token $FLY_TOKEN

We have 4 jobs:

  • code-check — runs code-quality checks: linting with pylint, formatting with black, and docstring checks with pydocstyle. This makes sure the written code follows the guidelines. The job only runs on merge requests; if you just push to a remote branch, it won’t be triggered.
  • code-test — runs the simple unit test we created above in test/test_distance.py, to ensure the modules we created run as expected.
  • staging — runs once a merge request has been merged into the staging branch. The code is then automatically deployed to fly.io using the staging application, which allows you to do user acceptance testing.
  • production — finally, we have the production job. Its purpose is quite similar to the staging one, but it is triggered when you create a tag in the repository.
Create tag for deploying into production web app

Once you create a merge request and merge it into the staging branch, it deploys to the staging app. If everything works as expected, you can open a merge request to the main branch and approve it. Once that is done, create a tag to deploy to the production web app.
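The tagging step can be sketched like this (run in a throwaway repo here so the example is self-contained; in your project you would tag the merge commit on main and run git push origin v0.0.2, since pushing the tag is what triggers the production job):

```shell
set -e
# Throwaway repo so the demo is self-contained.
git init -q tag-demo && cd tag-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "merge staging into main"
# Any tag matches `only: tags`; pushing it to GitLab
# (git push origin v0.0.2) triggers the production deploy.
git -c user.name=demo -c user.email=demo@example.com \
    tag -a v0.0.2 -m "Release 0.0.2"
git tag --list
```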

Conclusion

That’s more or less how to set up CI/CD on GitLab. This may seem simplified; I will create more complex pipelines that involve MLOps concerns such as model tracking, data versioning, model registry, model monitoring, etc. Hit the follow button and please connect on LinkedIn at https://www.linkedin.com/in/chandraandreas/
