Step-by-step CI/CD configuration for a data science project with Azure DevOps and Azure Machine Learning

There are a lot of tools and services for different parts of a data science project, but the best way to manage all project steps efficiently is to work in one workspace. I will show step by step how to do it in Azure.

Andrii Shchur
Geek Culture
May 24, 2022


Photo by Pietro Jeng on Unsplash

Any data science project involves two kinds of work: data scientists build the ML model, and developers build the app or web service and deliver it to end users. In this article, I will describe how to implement a continuous integration and continuous delivery (CI/CD) pipeline for a data science project that embeds the ML model into the app source code.

Create a new project in Azure DevOps:

Step 1 New project (Image by Author)

We can also use the Azure DevOps Demo Generator to provision the project in your Azure DevOps organization. This URL automatically selects the Azure Machine Learning template in the demo generator. The template contains code and pipeline definitions for a machine learning project and demonstrates how to automate an end-to-end ML/AI project.

Step 1 New project from Azure demo (Image by Author)

Then we need to create a service connection. Click the gear icon: Settings > Service connections > Create service connection.

Service connection (Image by Author)

Let’s select Azure Resource Manager; you can read more about Azure service connections in the documentation.

Service connection types (Image by Author)

Then select the Authentication method:

Authentication method (Image by Author)

Finally, fill in all the required information; to get it, you can generate publish settings in the Azure portal.

Azure service connection form (Image by Author)

As a result, you will see the new connection in your service connections list.

Service connections list (Image by Author)

Now we are ready to configure CI/CD pipelines for our data science project. First, I would like to briefly define CI and CD in the context of a data science project.

Most data science projects consist of the following steps:

  • Data collection
  • Feature extraction
  • Data cleaning and pre-processing
  • Data validation
  • Model building
  • Model testing
  • Model deployment

All these steps are performed by different team members: data engineers, data/business analysts, data scientists, machine learning engineers, MLOps engineers, and so on.

Any changes to these steps could affect the entire process flow. That is why we need to use CI/CD to avoid any issues for end-users.

In a data science project, continuous integration is the stage where we retrain our model. First, we create a branch, train a model, and commit the changes to that branch; then an automated process builds the code in a specific environment and runs tests. Successful CI means new code changes are regularly built and tested.

The goal of continuous delivery is to have a codebase that is always ready for deployment in a production environment. This means the new code/model is ready and the rest of the team can access it.

Continuous deployment refers to automatically releasing a developer’s changes from the repository to production, where they are usable by customers.

Let’s start with continuous integration. Navigate to Pipelines.

Step 2 Select Pipeline (Image by Author)

Select DevOps-for-AI-CI and click Edit, or create a new pipeline.

Step 2 Edit Pipeline (Image by Author)

The pipeline will look like this:

Pipeline steps (Image by Author)

First of all, we need to define where our codebase is stored. In my case, it is Azure DevOps.

The first step of this pipeline prepares the Python environment that will run all of the following scripts.

The second step creates a workspace in Azure or reuses an existing one (a local environment can also be used, but this is optional). We need to select the Azure service connection we created earlier and the script from our codebase that is responsible for creating and configuring the service in Azure.

Create and config Workspace (Image by Author)
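To give an idea of what such a workspace script does, here is a minimal get-or-create sketch, assuming the v1 azureml-core SDK; the workspace name, resource group, and region are placeholder assumptions, not the template’s actual values.

```python
# Minimal sketch of a workspace get-or-create script (assumes the v1 azureml-core SDK;
# workspace name, resource group, and region are placeholders).
from azureml.core import Workspace
from azureml.core.authentication import AzureCliAuthentication

cli_auth = AzureCliAuthentication()          # reuses the Azure CLI login on the pipeline agent

ws = Workspace.create(
    name="mlops-demo-ws",                    # placeholder workspace name
    subscription_id="<subscription-id>",     # placeholder subscription
    resource_group="mlops-demo-rg",          # placeholder resource group
    location="westeurope",
    auth=cli_auth,
    exist_ok=True,                           # reuse the workspace if it already exists
)
ws.write_config()                            # write config.json so later steps can call Workspace.from_config()
```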

The third step in this pipeline is model training. Here, too, we only need to define the Azure service connection and the script to execute in the workspace created in the previous step.

Model training step (Image by Author)
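As a rough illustration of the training step (not the template’s exact script), here is a minimal sketch that submits a training run, again assuming the v1 azureml-core SDK; the experiment name, script path, conda file, and compute target are placeholder assumptions.

```python
# Minimal sketch of submitting a training run (assumes the v1 azureml-core SDK;
# script name, conda file, and compute target are illustrative placeholders).
from azureml.core import Workspace, Experiment, Environment, ScriptRunConfig

ws = Workspace.from_config()                           # reads config.json written in the previous step
env = Environment.from_conda_specification(            # hypothetical conda spec for the training job
    name="training-env", file_path="conda_dependencies.yml"
)

src = ScriptRunConfig(
    source_directory="training",                       # placeholder folder containing train.py
    script="train.py",
    compute_target="cpu-cluster",                      # placeholder remote compute (or run locally)
    environment=env,
)

run = Experiment(workspace=ws, name="devops-for-ai").submit(src)
run.wait_for_completion(show_output=True)
```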

The fourth step evaluates the trained models and selects the best one. As before, we define the Azure service connection and the script to execute.

Models evaluation step (Image by Author)
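A minimal sketch of such evaluation logic, assuming the v1 azureml-core SDK and that the training script logs an accuracy metric (the experiment and metric names are assumptions):

```python
# Minimal sketch of picking the best run by a logged metric (assumes the v1 azureml-core
# SDK and that train.py logs an "accuracy" metric; names are placeholders).
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="devops-for-ai")

best_run, best_acc = None, float("-inf")
for run in experiment.get_runs():                # past runs, newest first
    acc = run.get_metrics().get("accuracy")
    if isinstance(acc, list):                    # metrics logged repeatedly come back as a list
        acc = acc[-1]
    if acc is not None and acc > best_acc:
        best_run, best_acc = run, acc

if best_run is not None:
    print(f"Best run: {best_run.id} (accuracy={best_acc:.4f})")
```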

The fifth step is to register (save) our trained model. Again, we define the Azure service connection and the script to execute.

Register Model (Image by Author)
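A minimal sketch of the registration step, assuming the v1 azureml-core SDK; the model path and name are placeholders, and the tag simply reads the Azure Pipelines BUILD_BUILDID environment variable so the model can be traced back to the CI run.

```python
# Minimal sketch of registering the trained model (assumes the v1 azureml-core SDK;
# model path and name are placeholders).
import os

from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

model = Model.register(
    workspace=ws,
    model_path="outputs/sklearn_model.pkl",               # placeholder path to the serialized model
    model_name="devops-for-ai-model",                     # placeholder registry name
    tags={"build_id": os.environ.get("BUILD_BUILDID", "local")},
    description="Model registered by the CI pipeline",
)
print(model.name, model.version)
```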

And the last step we need to edit is Create Scoring Docker Image, which builds the image that the CD stage will use. Once more, we define the Azure service connection and the script to execute.

Create Scoring Docker Image (Image by Author)
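One way to build such an image with the v1 azureml-core SDK is Model.package; this is a hedged sketch rather than the template’s exact script, and score.py plus the conda file are assumed to exist in the codebase.

```python
# Minimal sketch of building a scoring image for the registered model (assumes the
# v1 azureml-core SDK; score.py and the conda spec are illustrative placeholders).
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig

ws = Workspace.from_config()
model = Model(ws, name="devops-for-ai-model")        # latest registered version

env = Environment.from_conda_specification(
    name="scoring-env", file_path="conda_dependencies.yml"
)
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# Package the model and entry script into a Docker image stored in the workspace registry.
package = Model.package(ws, [model], inference_config)
package.wait_for_creation(show_output=True)
print(package.location)                              # image location consumed by the CD pipeline
```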

So in CI, we will perform:

  • Prepare the Python environment
  • Get or create the workspace for the AML service
  • Submit the training job on a remote DSVM or a local Python environment
  • Compare the performance of the trained models and select the best one
  • Register the model in the workspace
  • Create the Docker image for the scoring web service
  • Copy and publish the artifacts to the release pipeline

It is time to configure the CD part of our pipeline. In this exercise, we will configure the release pipeline, which deploys the image created by the build pipeline to Azure Container Instances and Azure Kubernetes Service.

To configure it, click Releases > Deploy Webservice > Edit.

Config CD (Image by Author)
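For reference, the QA stage deploys the model as a web service on Azure Container Instances. Below is a minimal sketch of what such a deployment script might look like, assuming the v1 azureml-core SDK; the service name, entry script, conda file, and sizing are placeholder assumptions, not the template’s exact values.

```python
# Minimal sketch of a QA-stage deployment to Azure Container Instances
# (assumes the v1 azureml-core SDK; all names and sizes are placeholders).
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="devops-for-ai-model")

env = Environment.from_conda_specification(
    name="scoring-env", file_path="conda_dependencies.yml"
)
inference_config = InferenceConfig(entry_script="score.py", environment=env)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "devops-for-ai-aci", [model], inference_config,
                       aci_config, overwrite=True)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)                     # endpoint used for QA testing
```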

We need to make a few changes in the QA and Prod stages.

Pipeline editing (Image by Author)

All we need to do is:

  1. Select the appropriate stage
  2. In each stage, configure:

2.1 The Azure subscription

2.2 The Python script we would like to run (a sketch for the Prod stage follows below)

QA stage tasks example (Image by Author)
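For the Prod stage, the same registered model is deployed to Azure Kubernetes Service instead of ACI. As a rough illustration only (not the template’s exact script), here is a minimal sketch assuming the v1 azureml-core SDK and an AKS cluster already attached to the workspace; the cluster name, service name, and entry script are placeholder assumptions.

```python
# Minimal sketch of a Prod-stage deployment to Azure Kubernetes Service
# (assumes the v1 azureml-core SDK and an existing attached AKS cluster; names are placeholders).
from azureml.core import Workspace, Environment
from azureml.core.compute import AksCompute
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()
model = Model(ws, name="devops-for-ai-model")

env = Environment.from_conda_specification(
    name="scoring-env", file_path="conda_dependencies.yml"
)
inference_config = InferenceConfig(entry_script="score.py", environment=env)

aks_target = AksCompute(ws, "aks-prod")        # placeholder AKS cluster attached to the workspace
aks_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, auth_enabled=True)

service = Model.deploy(ws, "devops-for-ai-aks", [model], inference_config,
                       aks_config, deployment_target=aks_target, overwrite=True)
service.wait_for_deployment(show_output=True)
```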

And that is all: we are ready to run and check the whole CI/CD pipeline. In this example, a trigger runs it after any new changes are committed to the master branch. So, for example, we can change two things:

  1. Change the Azure CLI version

Azure CLI version change (Image by Author)

2. Add subscription_id

Add subscription_id (Image by Author)

Then we need to save all changes in the current branch and create a PR to the master branch. The CI pipeline will run automatically after the PR is merged.

CI pipeline (Image by Author)

Then CD will automatically run the QA stage; the Prod stage needs approval. All of this can easily be configured as we want.

CD pipeline (Image by Author)

As a result, we will have all the services in Azure.

Azure services (Image by Author)

Conclusion

Azure DevOps and the Azure Machine Learning service are really good tools for managing the whole cycle of data science steps. With them, we can create a CI/CD process and manage every detail of it.

References

You can also find one more example in my previous article, CI/CD Pipeline with Azure DevOps for Data Science project.
