Machine Learning CI/CD with CircleCI and AWS Sagemaker

Timothy Cheung
Jul 4, 2023

There are many benefits of incorporating CI/CD into your ML pipeline, such as automating the deployment of ML models to production at scale.

The focus of this article is to illustrate how to integrate AWS Sagemaker model training and deployment into CircleCI CI/CD pipelines. The project is structured as a monorepo containing multiple models. The monorepo approach has advantages over a polyrepo approach, including simplified dependency versioning and easier management of security vulnerabilities.

You can find the code for this tutorial in this GitHub repository.

What is CI/CD?

CI/CD stands for Continuous Integration/Continuous Delivery. Its goal is to maximize developer efficiency by automating the process of shipping code from commit to production. Applied to an ML pipeline, it helps data scientists focus their time on working with data and building models, rather than on putting models into production and deployment infrastructure. Additionally, the value of CI/CD becomes more apparent as the system’s complexity increases. A team with limited resources managing multiple models being served to various parts of the organization can save a lot of time through the automation inherent in CI/CD.

Environment Variables

The first step is to set up AWS credentials in your project on CircleCI. You can do so by going to your project’s settings, clicking on Environment Variables, then clicking the Add Variable button to enter a name and value of the new environment variable. Once set, you can pick up these environment variables in your Python script using os.environ.

In this sample project, we store AWS access keys and a Sagemaker execution role ARN as environment variables. Note that boto3 automatically picks up the environment variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY when we create a boto3 session, so they should not be renamed.

Secrets saved on CircleCI
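
For reference, here is a minimal sketch of how these credentials might be picked up when creating the boto3 and Sagemaker sessions. The region shown is an assumption for illustration; the repository's actual session setup may differ.

import os

import boto3
import sagemaker

# boto3 reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment
# automatically when a session is created; the region here is illustrative.
boto_session = boto3.Session(region_name="us-east-1")
sagemaker_client = boto_session.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=boto_session)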

Additionally, we store environment variables specific to a CI/CD job by declaring them with the environment key in the config file. This is not strictly necessary in our case, but we do so for the sake of demonstration.

environment:
  MODEL_NAME: abalone-model
  MODEL_DESC: abalone model description text

In our Python scripts, we retrieve those environment variables as follows:

model_name = os.environ["MODEL_NAME"]
model_description = os.environ["MODEL_DESC"]
role_arn = os.environ["SAGEMAKER_EXECUTION_ROLE_ARN"]

Models

For the sake of demonstration, we’ve taken two models commonly found in AWS documentation, Abalone and Churn. Both of these are simple XGBoost models, with Abalone being a linear regressor and Churn being a binary classifier. Each model is contained in its own folder, and each folder contains the following files:

gather_data.py

This file downloads and preprocesses the data for its model, then uploads the data to S3. We upload the train and validation datasets in separate folders, as is required by Sagemaker.
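
As a rough sketch of the download-and-split step that precedes the upload (the source URL and split ratio below are placeholders, not the repository's exact code):

import io

import boto3
import pandas as pd

s3_client = boto3.client("s3")

# Placeholder source URL: the real script fetches and preprocesses the Abalone data
raw_data = pd.read_csv("https://example.com/abalone.csv")

# Simple 80/20 train/validation split (ratio is illustrative)
train_data = raw_data.sample(frac=0.8, random_state=42)
validation_data = raw_data.drop(train_data.index)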

# Upload training and validation data to S3
csv_buffer = io.BytesIO()
train_data.to_csv(csv_buffer, index=False)
s3_client.put_object(Bucket=bucket, Body=csv_buffer.getvalue(), Key=f"{model_name}/train/train.csv")

csv_buffer = io.BytesIO()
validation_data.to_csv(csv_buffer, index=False)
s3_client.put_object(Bucket=bucket, Body=csv_buffer.getvalue(), Key=f"{model_name}/validation/validation.csv")

train_register.py

This file trains our model and then registers it with the model registry; it is the first place in our CI/CD pipeline where we actually make use of Sagemaker.

When configuring the Sagemaker XGBoost Estimator, we can specify the S3 path for the model artifacts using output_path. We must also provide the Sagemaker session and a Sagemaker execution role ARN: because this code runs outside of a Sagemaker notebook, that information will not be picked up automatically.

# Configure training estimator
xgb_estimator = Estimator(
    base_job_name = model_name,
    image_uri = image_uri,
    instance_type = "ml.m5.large",
    instance_count = 1,
    output_path = model_location,
    sagemaker_session = sagemaker_session,
    role = role_arn,
    hyperparameters = {
        "objective": "reg:linear",
        "max_depth": 5,
        "eta": 0.2,
        "gamma": 4,
        "min_child_weight": 6,
        "subsample": 0.7,
        "verbosity": 2,
        "num_round": 50,
    }
)
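
For context, the pieces around this estimator (resolving the container image before constructing it, and calling fit afterwards) might be wired up roughly as follows. The container version and S3 paths are assumptions, not the repository's exact values.

import sagemaker
from sagemaker.inputs import TrainingInput

# Resolve the AWS-managed XGBoost container image (the version is an assumption)
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=sagemaker_session.boto_region_name,
    version="1.5-1",
)

# Point the train/validation channels at the folders uploaded by gather_data.py
train_input = TrainingInput(f"s3://{bucket}/{model_name}/train/", content_type="text/csv")
validation_input = TrainingInput(f"s3://{bucket}/{model_name}/validation/", content_type="text/csv")

# Launch the Sagemaker training job
xgb_estimator.fit({"train": train_input, "validation": validation_input})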

After training the model, we push the model package to the Sagemaker model registry. Sagemaker differentiates between a model and a model package. A model is just the object we deploy to an endpoint to run inference, whereas a model package contains all the artifacts associated with that model, such as model weights, evaluation results, and configuration files. We push model packages to the model registry, not models.

We want to make use of a model registry so that we can easily refer to trained models in subsequent steps, such as deployment, or when we want to roll back models to previous versions. Notice that we pre-approve the model package, as we will make use of CircleCI approval jobs to manage model approval.

# Retrieve model artifacts from training job
model_artifacts = xgb_estimator.model_data

# Create pre-approved cross-account model package
create_model_package_input_dict = {
    "ModelPackageGroupName": model_name,
    "ModelPackageDescription": "",
    "ModelApprovalStatus": "Approved",
    "InferenceSpecification": {
        "Containers": [
            {
                "Image": image_uri,
                "ModelDataUrl": model_artifacts
            }
        ],
        "SupportedContentTypes": [ "text/csv" ],
        "SupportedResponseMIMETypes": [ "text/csv" ]
    }
}

create_model_package_response = sagemaker_client.create_model_package(**create_model_package_input_dict)
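
One prerequisite worth noting: create_model_package expects the model package group to already exist. A sketch of ensuring that (whether the repository creates the group in this script or elsewhere is an assumption):

import botocore.exceptions

# Create the model package group if it does not already exist
try:
    sagemaker_client.create_model_package_group(
        ModelPackageGroupName=model_name,
        ModelPackageGroupDescription=model_description,
    )
except botocore.exceptions.ClientError as error:
    # Ignore the error if the group already exists; re-raise anything else
    if "exist" not in str(error):
        raise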

deploy.py

This file deploys the latest approved model package to the model endpoint, either creating the endpoint if it does not already exist or updating the existing one.

To get the latest approved model package, we use the Sagemaker client to list existing model packages sorted by creation time in descending order, and take the ARN of the first result:

# Get the latest approved model package of the model group in question
model_package_arn = sagemaker_client.list_model_packages(
    ModelPackageGroupName = model_name,
    ModelApprovalStatus = "Approved",
    SortBy = "CreationTime",
    SortOrder = "Descending"
)['ModelPackageSummaryList'][0]['ModelPackageArn']

Then we create a model out of the model package:

# Create the model
timed_model_name = f"{model_name}-{current_time}"
container_list = [{"ModelPackageName": model_package_arn}]

create_model_response = sagemaker_client.create_model(
    ModelName = timed_model_name,
    ExecutionRoleArn = role_arn,
    Containers = container_list
)

And create an endpoint config using that model:

# Create endpoint config
create_endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName = timed_model_name,
    ProductionVariants = [
        {
            "InstanceType": endpoint_instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": endpoint_instance_count,
            "ModelName": timed_model_name,
            "VariantName": "AllTraffic",
        }
    ]
)

Finally, we update the endpoint with the new config:

create_update_endpoint_response = sagemaker_client.update_endpoint(
    EndpointName = model_name,
    EndpointConfigName = timed_model_name
)
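
When the endpoint does not exist yet, create_endpoint is used instead. A minimal sketch of that create-or-update branch (the exact existence check used in the repository may differ):

# Check whether an endpoint with this name already exists
existing_endpoints = sagemaker_client.list_endpoints(NameContains=model_name)["Endpoints"]

if any(ep["EndpointName"] == model_name for ep in existing_endpoints):
    # Endpoint exists: roll it over to the new endpoint config
    sagemaker_client.update_endpoint(
        EndpointName=model_name,
        EndpointConfigName=timed_model_name,
    )
else:
    # Endpoint does not exist yet: create it with the new config
    sagemaker_client.create_endpoint(
        EndpointName=model_name,
        EndpointConfigName=timed_model_name,
    )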

Dynamic Configuration

As we’ve taken a monorepo approach, we need a way to only run our CI/CD for the model that has been changed. Otherwise, when we merge changes to the Abalone model, the Churn model will also retrain and redeploy! This is where CircleCI’s dynamic configurations come in handy. This feature allows us to detect whether changes have been made to a particular folder, and if so, set the value of a pipeline parameter. In turn, the pipeline parameter will determine which workflows will run in our CI/CD pipeline.

Setup configuration

The first step in making use of dynamic configs is the setup config. In our example repository, it is named config.yml. We employ the path-filtering orb to identify which folders contain code changes.

Note that we compare files to those on the main branch. Furthermore, we map changes in specific folders to parameter values. For example, if there are changes detected in the abalone_model folder, then the pipeline parameter deploy-abalone will be set to true. Additionally, we specify the path of the configuration file to trigger once path filtering and pipeline parameter value updates are complete.

base-revision: main
mapping: |
  abalone_model/.* deploy-abalone true
  churn_model/.* deploy-churn true
config-path: ".circleci/dynamic_config.yml"

Continue configuration

With the pipeline parameter values updated from the setup config, we now run the continue config, which in our example repository is named dynamic_config.yml. To make it easier to understand what the config file is doing, let’s focus on the abalone-model workflow.

workflows:
  abalone-model:
    when: << pipeline.parameters.deploy-abalone >>
    jobs:
      - abalone-model-train:
          filters:
            branches:
              ignore:
                - main
      - request-deployment:
          type: approval
          filters:
            branches:
              ignore:
                - main
          requires:
            - abalone-model-train
      - abalone-model-deploy:
          filters:
            branches:
              only:
                - main

Firstly, the workflow will only run when the pipeline parameter deploy-abalone is true. Next, we run the job abalone-model-train, which executes the train_register.py file. Then we trigger the request-deployment job, which is an approval job that requires the user to manually approve on CircleCI in order for the workflow to proceed. This would be a point at which a reviewer would check the model evaluation metrics on Sagemaker before allowing the model to be deployed to the endpoint. Finally, if approval is given, the abalone-model-deploy job executes deploy.py.

Note that the training and approval jobs ignore the main branch, whereas the deploy job happens only on the main branch. This allows new model versions to be trained when the developer is working on updates to the model on a developer branch without triggering any sort of deployment. Then, once the code changes are accepted and merged into main, the deployment job gets triggered without triggering any further retraining of the model.

Pipelines on CircleCI

Here is what we see on CircleCI when code changes to the Abalone model are pushed on a developer branch. The dynamic configuration has selectively run only the abalone training pipeline. The request-deployment approval job acts as a gate on the code changes: once it is approved, the PR on GitHub can be merged into main.

Only the training pipeline is run when on a developer branch

Here is what we see once code changes are merged to the main branch. This time, since the code changes are on main, the dynamic configuration selectively runs only the abalone deployment pipeline.

Only the deployment pipeline is run when on the main branch

Conclusion

We’ve demonstrated the use of CircleCI along with AWS Sagemaker to create an end-to-end ML pipeline. It automates the process of training a model and deploying it to an endpoint for real-time inference.

It uses a monorepo setup where each model is contained in its own folder. Furthermore, it uses CircleCI's dynamic configs so that each pipeline runs only for the model whose code has changed.
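
As a closing note, once an endpoint is live, real-time inference is a single API call. A minimal sketch (the endpoint name and CSV payload below are illustrative):

import boto3

runtime_client = boto3.client("sagemaker-runtime")

# Send one CSV-formatted feature row to the deployed endpoint (payload is illustrative)
response = runtime_client.invoke_endpoint(
    EndpointName="abalone-model",
    ContentType="text/csv",
    Body="0.455,0.365,0.095,0.514,0.2245,0.101,0.15",
)

print(response["Body"].read().decode("utf-8"))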
