Machine Learning CI/CD Pipeline with GitHub Actions and Amazon SageMaker

Haythem Tellili
Feb 24, 2022 · 7 min read



When you begin working on an ML project for a real business case, you should consider two essential questions:

  • How long would it take your company to implement a modification involving a single line of code?
  • Can you do this on a consistent and repeatable basis?

If you are not satisfied with the answers you have, read on: this article describes an end-to-end machine learning pipeline that uses GitHub as the source control system, its CI/CD tool GitHub Actions, and Amazon SageMaker to train and deploy our models.

Table of Contents

  1. What is CI/CD?
  2. Overview of the Project
  3. Overview of the GitHub repository
  4. CI/CD pipeline using Github Actions
  5. Alternative solutions
  6. Conclusion
  7. References / Additional Reading

What is CI/CD?

Continuous integration (CI) and continuous deployment (CD) are standard methods in modern software development teams.

  • CI is concerned with how a project should be built and tested in different runtimes, automatically and continuously.
  • CD is required so that any new piece of code that passes automated testing may be pushed into production with minimal effort.

Overview of the Project

Technically, this project is a GitHub repository with two branches, main and development. The development branch is where you experiment with alternative ways to solve your machine learning problem, whereas the main branch is used only for deploying the chosen model. When you want to check how your development code works, you open a pull request (also known as a merge request) from the development branch to the main branch.

GitHub Actions will build your training image from your Dockerfile and push it to ECR. Following that, it will start a training job using the built image, your training script, and input data from an S3 bucket.

On the pull request page, you can view the training job description, and once the training job is completed, you can also check the performance metrics.

If you and your teammates are pleased with the results, merge the pull request into the main branch. This will trigger the deploy stage, which creates a real-time inference endpoint from the most recent training job.

The diagram below depicts an overview of the pipeline:

Overview of the pipeline

Overview of the GitHub repository

GitHub repo (Photo by Author)

The initial repository is composed of these files:

  • .gitignore
  • README.md
  • Dockerfile: configuration of the Docker image used for training and serving.
  • training-job.py: the Python script that starts the training job.
  • training-script.py: used by training-job.py as the training job entry point.
  • serve-script.py: a simple Flask app used to serve the endpoint.
  • deploy.py: the Python script that, by default, creates an endpoint from the latest training job.

Now, we will focus on the CI/CD pipeline using GitHub Actions and Amazon SageMaker.

CI/CD pipeline using Github Actions

GitHub, the most popular hosted repository service, offers an integrated way to build and run our workflows by automating processes with GitHub Actions. Events that occur in our GitHub repository, such as pushes, pull requests, and releases, are used as triggers to start these workflows.

These workflows are defined in YAML files. In our project, one workflow automates building the Docker image and pushing it to ECR, as sketched below.
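Here is a minimal sketch of what such a workflow might look like. It assumes the aws-actions/configure-aws-credentials and aws-actions/amazon-ecr-login helper actions and a "latest" image tag; the exact file in the repository may differ.

```yaml
name: build-and-train

# Run when a pull request targets the main branch
on:
  pull_request:
    branches: [ main ]

jobs:
  build_image_push_to_ECR:
    runs-on: ubuntu-18.04
    steps:
      # Step 1: check out the repository
      - name: Checkout
        uses: actions/checkout@v2

      # Make the AWS credentials stored as repository secrets available
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_DEFAULT_REGION }}

      # Step 2: log in to Amazon ECR
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      # Step 3: build the training image and push it to ECR
      - name: Build and push image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          REPO_NAME: ${{ secrets.REPO_NAME }}
        run: |
          docker build -t $ECR_REGISTRY/$REPO_NAME:latest .
          docker push $ECR_REGISTRY/$REPO_NAME:latest
```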

The basic attributes we used are:

  • name — The name of our workflow (optional)
  • on — the GitHub event that triggers the workflow, such as a repository event (push, pull request, release).
  • jobs — Workflows must have at least one job. Each job must have an identifier; in our case it is build_image_push_to_ECR.
  • runs-on — The type of machine needed to run the job. These environments are called runners. In our case, we used ubuntu-18.04.
  • steps — a list of all the commands or actions. Each step runs in its own process.
    In our case, we have three steps: check out the repository, log in to Amazon ECR, and finally build and push the image to ECR.
  • uses — identifies an action to use and defines where that action lives. An action can upload or download artifacts, check out the repository, or configure a cloud account.
  • env — environment variables that store information you want to reference in your workflow. We reference them within a workflow step or an action, and they are interpolated on the runner machine that runs the workflow.
  • run — runs the given command in the runner's shell.
  • name — an optional name identifier for the step.

Our workflow has three stages:

  1. Build a Docker image and push it to Amazon Elastic Container Registry (ECR).
  2. Start a training job using this image and Amazon SageMaker.
  3. Deploy the model by creating an endpoint from the latest training job.

1. Build Docker Image

Let’s build a Docker Image. Create a Dockerfile as shown below.

Dockerfile (Photo by Author)

Using this Dockerfile, we build the Docker image that we will use for our training job. We then push it to Elastic Container Registry (ECR), where all of its versions are stored.

As you can see in the YAML file above, we need to set up the AWS environment, which requires sensitive information such as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. GitHub allows us to store secrets and access them as variables.

To create a secret, go to Settings and select Secrets.
In the secrets section, click “New secret”, give the secret a name and a value, and click “Add secret”.

Creating a Secret (Photo by Author)

In our case, we defined:

  1. AWS_ACCESS_KEY_ID
  2. AWS_SECRET_ACCESS_KEY
  3. AWS_DEFAULT_REGION
  4. BUCKET_NAME: name of our S3 bucket
  5. PREFIX: name of our project, used as a subfolder (prefix) of the S3 bucket
  6. IAM_ROLE_NAME: ARN of an IAM role with sufficient permissions (e.g. the AmazonSageMakerFullAccess policy)
  7. REPO_NAME: name of our Elastic Container Registry (ECR) repository

After updating all of the required secrets:

Secrets (Photo by Author)

Once the environment is set up, our job will build the Docker image and push it to Elastic Container Registry (ECR).

2. Start Training Job

After the above stage, we have a Docker image stored in the container registry. In this stage, we are going to use this image to start a training job; a sketch of the corresponding workflow job follows the list below.

In this step, we need:

  • Training script.
Sample from the script to start the training job
  • Data from S3.
Data in S3 (Photo by Author)
  • Image from ECR.
Amazon ECR repo containing the image (Photo by Author)
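Assuming training-job.py reads its configuration from environment variables populated from the repository secrets, the job that submits the training job might look roughly like the following sketch; the actual step in the repository may differ.

```yaml
  # Additional job under the "jobs:" key of the workflow above
  start_training_job:
    # Run only after the image has been built and pushed
    needs: build_image_push_to_ECR
    runs-on: ubuntu-18.04
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'

      # Submit the SageMaker training job with the image from ECR,
      # the training script, and the input data in S3
      - name: Start training job
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
          BUCKET_NAME: ${{ secrets.BUCKET_NAME }}
          PREFIX: ${{ secrets.PREFIX }}
          IAM_ROLE_NAME: ${{ secrets.IAM_ROLE_NAME }}
          REPO_NAME: ${{ secrets.REPO_NAME }}
        run: |
          pip install boto3 sagemaker
          python training-job.py
```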

The pipeline ends once the training job is submitted, and a comment appears on the pull request page with information about the submitted training job, such as its name, artifact location, the hyperparameters used, and a CloudWatch link to see what’s happening in the training job in real time.

Training job performance report (Photo by author)

3. Deployment

As you can see in the image above, we have a training job performance report. If you are happy with the results, merge the development branch into the main branch, and the deployment stage will be triggered.
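A minimal sketch of what the deployment workflow might look like is shown below. It assumes deploy.py reads its configuration from environment variables; the workflow, job, and step names are illustrative and may differ from the repository.

```yaml
name: deploy

# Run when the development branch is merged (pushed) into main
on:
  push:
    branches: [ main ]

jobs:
  deploy_endpoint:
    runs-on: ubuntu-18.04
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.8'

      # Create a real-time endpoint from the latest training job
      - name: Deploy latest model
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
          IAM_ROLE_NAME: ${{ secrets.IAM_ROLE_NAME }}
        run: |
          pip install boto3 sagemaker
          python deploy.py
```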

Deployment Stage (Photo by Author)

That’s all! Soon after this stage runs, the endpoint will be available.

Endpoint (photo by Author)

Alternative solutions

Here we used GitHub Actions to build and push our Docker image, submit the training job, and deploy the final model.
The same objectives can be met in a variety of other ways. One that is more AWS-integrated is to use Step Functions, AWS SAM, CodePipeline, and CodeBuild to build, tag, and upload the Docker image to Amazon ECR, and then start a Step Functions workflow to train and deploy the custom ML model on SageMaker.

The following diagram describes the general overview of the MLOps CI/CD pipeline.

Source: https://aws.amazon.com/blogs/machine-learning/build-a-ci-cd-pipeline-for-deploying-custom-machine-learning-models-using-aws-services/

Conclusion

We have seen how to use GitHub Actions to set up a workflow that automates the training and deployment of a machine learning model, with Amazon Web Services used for storing data, training, and real-time prediction.
The project is available in my GitHub repository.

If you would like to get in touch, connect with me on LinkedIn.

References / Additional Reading

  1. https://towardsdatascience.com/github-actions-makes-one-click-to-deploy-feasible-for-ml-ci-cd-pipeline-61470ed3edbc
  2. https://damienaicheh.github.io/github/actions/2021/04/15/environment-variables-secrets-github-actions-en.html
  3. https://aws.amazon.com/blogs/machine-learning/build-a-ci-cd-pipeline-for-deploying-custom-machine-learning-models-using-aws-services/
  4. https://www.simplilearn.com/tutorials/aws-tutorial/aws-sagemaker
  5. https://aws.amazon.com/blogs/machine-learning/build-mlops-workflows-with-amazon-sagemaker-projects-gitlab-and-gitlab-pipelines/
  6. https://aws.plainenglish.io/build-a-docker-image-and-publish-it-to-aws-ecr-using-github-actions-f20accd774c3


Haythem Tellili

Machine learning engineer obsessed with automation and reproducibility.