How to create a Machine Learning Pipeline with AMLS (Azure Machine Learning Services)

Svetlana Smagina
TotalEnergies Digital Factory
13 min read · Oct 18, 2021

with Najate Ochbouk and Syrine Bensalah


Motivation behind this article

Data science projects have changed a lot over the last few years. Not long ago, most of them were proofs of concept (POCs): experiments run by a data scientist on a subset of data. The idea was to apply existing theoretical concepts to business use cases, based on a sample of existing data, in order to prove that specific pain points in people’s daily tasks could be solved. When a POC succeeded, the project could move towards the production stage. The problem was that the gap between POC and production was so big that many projects never made it to the next step, for several reasons:

  • the POC used a subset of data, sometimes not representative of the production data; a model that performed well during the POC therefore often does not generalize in production and needs manual rework.
  • the POC was often a quick-and-dirty development in a local environment, without code versioning, while a model in a production environment must meet the requirements of any IT project (security, versioning, testing, continuous integration, etc.)
  • the model created in a POC was designed with performance in mind, without consideration for the real user needs.

Drawing on this accumulated experience, tech giants like Microsoft, Google and Amazon (among others) have realized the need for tools that reduce the gap between the data science work done during a POC and production.

In this series of articles, we aim to give you an introduction to two tools, part of the Microsoft Azure platform, that can be used both during experimentation and in production. The series is composed of three articles in which you will learn how to create a machine learning pipeline to manage data, automate the training of machine learning models, and version and consume them, using AMLS (Azure Machine Learning Services, part 1) or Databricks coupled with MLFlow (part 2). We will end with a comparison of both in part 3 (AMLS vs Databricks/MLFlow).

What you will discover in this article 🧐:

  • What ML pipelines are, their main advantages, and why they are useful in production.
  • The specific tools which will help us construct and manage those pipelines in AMLS.
  • An example of a real-time inference ML pipeline that could be used as a web service.

1 Overview and advantages of ML pipelines

1.1 What are ML pipelines and how do they help?

At a high level, a machine learning workflow will have three major phases: data preparation, training and deployment.

The data preparation phase includes data ingestion, cleaning, validation and transformation.

The training phase consists of automated learning, hyper-parameter tuning and building custom models.

In the deployment phase you will serve and monitor your model.

These steps are grouped together as a set of computational tasks to form what we call an ML pipeline. We can put everything in one script or break it down into modules. With an all-in-one script, however, every error forces you to restart the whole execution from the beginning.

(learn more here 👉 https://ml-ops.org/content/end-to-end-ml-workflow)

1.2 Advantages of ML pipelines

  • Modularity

You start to see the advantages of pipelines when you need to iterate many times. If you play video games 🎮, you know the relief of reaching a save point: whatever happens in the rest of the game, you can come back to that point without starting over. This is also a pipeline principle: in case of a crash during a run, we need to be able to get back to the last valid state without recomputing the entire pipeline. Modularity is therefore a key advantage of pipelines, and it leads to savings in compute costs.

  • Easy to build and run

We can easily build up the individual steps using Python scripts (Section 3). Once the pipeline is created, it is ready to be published, and you can access and rerun it any time you want.

  • Data dependency & Orchestration

Pipelines automatically figure out the data dependencies between steps and make sure a step does not run before the steps it depends on have finished. Steps with no dependencies between them can run in parallel on different nodes.

  • Heterogeneous compute options

Different stages can use different compute options and frameworks. In Azure, Spark can be used for data processing, CPU/GPU clusters for training, and ACI or AKS for deployment. A pipeline can combine these heterogeneous compute options and frameworks.

  • Collaboration

Different engineers working on different areas of the machine learning workflow (data engineers, data scientists) can develop at the same time on different stages of the pipeline.

1.3 Training and inference ML pipelines

A machine learning problem involves building a model that predicts the target accurately. This is the job of the training pipeline, where you get the data, preprocess it, (re)train your model, evaluate it and choose the best one according to your evaluation criteria. Once you have a trained model, the next step is to use it against real, unseen data. That is when you use an inference pipeline. There are two scenarios to focus on:

  • Real-time inference, where you want the response to come back immediately to the endpoint.
  • Offline inference, where instantaneous response is not mandatory, or you have too much data to process.

Inference pipelines combine preprocessing, prediction, and post-processing tasks. We use this pipeline to define and deploy any combination of pretrained or custom algorithms.

You’ve learned about pipelines and their benefits 🥳; in the following sections you will discover how to create training and inference pipelines in Azure Machine Learning Services (Section 2) and how to use these pipelines to create a real-time inference diabetes model (Section 3).

We will start with an overview of AMLS and its components to give the whole picture of AMLS and how it could be used for that specific machine learning problem.

2 AMLS Overview

AMLS is a workspace that brings together your compute machines, your data, and your models in one place, so you can provision and manage these assets (data, models, etc.). You can create a training pipeline and automate it easily, managing data and model versions to ensure model reproducibility.

Model reproducibility refers to the ability to go back to a previous version of your model whenever needed. This involves versioning the code (through a version control system), versioning the data used to train the model (being able to track datasets at a given point in time), and recording all ML-specific information (e.g. model hyperparameters, library versions) that was used to generate the model.
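As a minimal sketch of data versioning with the azureml SDK (assuming ws is the workspace connection created in Section 3.1; the path on the default datastore is illustrative):

from azureml.core import Dataset

# Register the training data as a versioned dataset; the path is illustrative
default_ds = ws.get_default_datastore()
tab_ds = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/diabetes.csv'))
tab_ds = tab_ds.register(workspace=ws,
                         name='diabetes dataset',
                         create_new_version=True)
print(tab_ds.name, 'version', tab_ds.version)

Registering the same name again with create_new_version=True produces a new version, which is what allows a trained model to be traced back to the exact data it was trained on.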

The figure below shows the whole process of interaction with AMLS. The left part corresponds to the local or virtual machine from which you launch your pipeline. The right part is the AML compute where the calculations are executed. The source data, training results and model (.pkl) are stored on Azure Blob Storage, and the model along with its serving code image on an Azure Container Registry (ACR).

Figure 1. AMLS https://techcommunity.microsoft.com/t5/azure-global/new-reference-architecture-training-of-python-scikit-learn/ba-p/377113

AMLS Components​

Let us take a closer look at each component of Figure 1 and see how to access it in Azure Workspace (Figure 2 — Figure 6).

First, let us talk about data stores. A data store is a resource for storing data: for example, Azure Blob Storage (where we store the sources and training results, Figure 1), Azure Data Lake Storage or a SQL database. You can use the AMLS UI to visualize the datastore assets, as shown in Figure 2.

Figure 2. Data stores
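Datastores can also be managed from code. A small sketch with the azureml SDK (the datastore, container and account names below are illustrative):

from azureml.core import Datastore

# Attach an existing Blob container as a datastore and list what the workspace knows about
blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                  datastore_name='diabetes_blob',
                                                  container_name='data',
                                                  account_name='mystorageaccount',
                                                  account_key='<storage-access-key>')
for name, ds in ws.datastores.items():
    print(name, '-', ds.datastore_type)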

A compute target (Azure Machine Learning compute, Figure 1) is a machine (e.g. a DSVM, Data Science Virtual Machine) or a set of machines (e.g. Databricks clusters) dedicated to script execution. You can use interactive UIs like Jupyter notebooks or RStudio to go through all the steps of the ML pipeline and create your flow, which will then run on the compute target.

There are also automatically scalable GPU or CPU clusters: when you are not using them, they automatically scale down to zero machines. You can find your compute details in the AMLS Compute tab (Figure 3).

Figure 3. Compute target
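As a sketch, such an auto-scaling cluster can be provisioned with the azureml SDK (the cluster name and VM size are illustrative):

from azureml.core.compute import ComputeTarget, AmlCompute

# min_nodes=0 lets the cluster scale down to zero machines when idle
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS2_V2',
                                                       min_nodes=0,
                                                       max_nodes=4)
compute_target = ComputeTarget.create(ws, 'cpu-cluster', compute_config)
compute_target.wait_for_completion(show_output=True)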

Azure Container Registry (ACR) is the place where we can build, store and manage container images and artifacts.

Once models are trained, you can store them in Azure Blob Storage as .pkl files and execute them in containers instantiated from ACR.

AML Experiment 🧪

Moreover, AMLS lets you see the versioned models, with full lineage to check which experiment was used to train each model (Figure 4).

Figure 4. Model List

When you drill into one of these experiments, you can see a whole set of details about the metrics, the child runs (related iteration runs grouped together) with links to the individual models, and the input datasets (Figure 5).

Figure 5. Experiments details

In the Datasets tab you can see all the versioned data used to train our models (Figure 6).

Figure 6. Datasets

This way, once deployed, a model can be traced back to where it came from: the experiment that led to its creation, and from there the data that was used to train it.
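The same registry can be browsed from code; a quick sketch listing the registered models together with the tags and properties we attach to them in Section 3.2:

from azureml.core import Model

# Each registered model keeps its version, tags and properties (e.g. AUC, Accuracy)
for m in Model.list(ws):
    print(m.name, 'version:', m.version, m.tags, m.properties)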

Now that you are familiar with the components of AMLS, we will finally get to create a real-time service 😎.

3 Creating real-time diabetes diagnosis service

The real-time service can be used by its users to get predictions from new data instantaneously.

To be able to get these predictions, the model has to be trained and served. By training we mean the process that includes the following steps:

  • connecting to the workspace & creating an experiment (in Azure)
  • loading the data
  • training & registering a model
  • monitoring logs, metrics, runs, …

Once a model has been trained and registered, we can serve it and use it in an inference pipeline to get the requested predictions. The real-time inference pipeline then adds web service inputs and outputs to handle requests, skipping the training modules.

Below, the creation of the training and inference pipelines in AMLS is illustrated on the diabetes dataset. The goal is to classify patients who need to be subjected to a clinical test for diabetes. In a production environment, this model could be used as a real-time service that predicts whether a patient has the disease and gives an immediate response, based on their clinical data.

3.1 Connect to your workspace

As mentioned before, AMLS is an Azure workspace you can access from your local machine or from any other environment (a virtual machine, for example). How can you get access to an AMLS workspace from Python 🤔? Thanks to the azureml SDK you can create a connection to the remote AMLS workspace, so that you will be able to train and deploy your model:

import azureml.core
from azureml.core import Workspace
# Load the workspace from the saved config file
ws = Workspace.from_config()

3.2 Train and register a model

  • Import all libraries we will need in order to train and register a model:
from azureml.core import Experiment
from azureml.core import Model
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
  • Create an Azure ML experiment in your workspace in order to log metrics and save all information under the run:
# Create an Azure ML experiment in your workspace
experiment = Experiment(workspace = ws, name = "diabetes-training")
run = experiment.start_logging()
  • Load data and train your model:
# load the diabetes dataset
diabetes = pd.read_csv('data/diabetes.csv')
# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values
# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
# Train a decision tree model
model = DecisionTreeClassifier().fit(X_train, y_train)
  • Calculate metrics and log them into your experiment in order to evaluate your model:
# Calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
run.log('Accuracy', float(acc))
# Calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test, y_scores[:, 1])
run.log('AUC', float(auc))
  • Save your model and register it in order to be able to compare the different versions and choose the one that will be deployed to production:
# Save the trained model
model_file = 'diabetes_model.pkl'
joblib.dump(value=model, filename=model_file)
run.upload_file(name = 'outputs/' + model_file, path_or_stream = './' + model_file)
# Complete the run
run.complete()
# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl',
                   model_name='diabetes_model',
                   tags={'Training context': 'Inline Training'},
                   properties={'AUC': run.get_metrics()['AUC'],
                               'Accuracy': run.get_metrics()['Accuracy']})

Once you have trained and registered your model you can see this experiment and its details in your AMLS workspace (Figure 4, 5).

NB: In AMLS we can publish a pipeline and see all the steps above separately (Figure 7). Here is an example that contains a data gathering step to get the data, an estimator step to train a model and a register step to save the model. This view can be created automatically with a Python script (for more details see https://github.com/MicrosoftDocs/mslearn-aml-labs/blob/master/05-Creating_a_Pipeline.ipynb).

Figure 7. Pipeline Designer
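As a minimal sketch (the step script names are hypothetical and compute_target is an AML compute cluster as described in Section 2), such a pipeline could be assembled, submitted and published with the azureml SDK:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Each step wraps a Python script; allow_reuse lets AMLS skip a step whose
# inputs and code have not changed (the "save point" behaviour of Section 1.2)
prep_step = PythonScriptStep(name='Gather and prepare data',
                             source_directory='scripts',
                             script_name='prep_diabetes.py',
                             compute_target=compute_target,
                             allow_reuse=True)
train_step = PythonScriptStep(name='Train and register model',
                              source_directory='scripts',
                              script_name='train_diabetes.py',
                              compute_target=compute_target,
                              allow_reuse=True)
train_step.run_after(prep_step)  # explicit ordering when no data dependency is declared

pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
pipeline_run = Experiment(ws, 'diabetes-training-pipeline').submit(pipeline)
pipeline_run.wait_for_completion()

# Publishing exposes a REST endpoint so the pipeline can be rerun on demand
published = pipeline_run.publish_pipeline(name='diabetes-training-pipeline',
                                          description='Train and register the diabetes model',
                                          version='1.0')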

3.3 Deploy a model as a web service

The deployment includes the following steps:

  1. Define an inference configuration (scoring script and environment files).
  2. Define a deployment configuration (the execution environment in which the service will be hosted, e.g. ACI, Azure Container Instances).
  3. Deploy the model as a web service.
  4. Check the status of the service (healthy in case of successful deployment).

You can choose the version of the model you want to deploy from your workspace (for example, it could be the last version or the model with the best accuracy/AUC) and then create a web service to host this model.

The scoring script below loads the registered model when the service starts, reads the input data from each request, and generates and returns predictions.

%%writefile $script_file
import json
import joblib
import numpy as np
from azureml.core.model import Model

# Called when the service is loaded
def init():
    global model
    # Get the path to the deployed model file and load it
    model_path = Model.get_model_path('diabetes_model')
    model = joblib.load(model_path)

# Called when a request is received
def run(raw_data):
    # Get the input data as a numpy array
    data = np.array(json.loads(raw_data)['data'])
    # Get a prediction from the model
    predictions = model.predict(data)
    # Get the corresponding classname for each prediction (0 or 1)
    classnames = ['not-diabetic', 'diabetic']
    predicted_classes = []
    for prediction in predictions:
        predicted_classes.append(classnames[prediction])
    # Return the predictions as JSON
    return json.dumps(predicted_classes)

This scoring script will be deployed into the created service, which will be hosted in a container. The dependencies required to run the service (a .yml file) will be installed as well:

import os
from azureml.core.conda_dependencies import CondaDependencies

# Add the dependencies for our model (AzureML defaults are already included)
myenv = CondaDependencies()
myenv.add_conda_package('scikit-learn')

# Save the environment config as a .yml file
# (experiment_folder is assumed to be defined earlier: the folder holding the scoring script)
env_file = os.path.join(experiment_folder, "diabetes_env.yml")
with open(env_file, "w") as f:
    f.write(myenv.serialize_to_string())
print("Saved dependency info in", env_file)

# Print the .yml file
with open(env_file, "r") as f:
    print(f.read())

The saved .yml file lists the Conda and pip dependencies (scikit-learn plus the AzureML defaults) that will be installed in the service container.

And now it is time to create a web service:

from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig

# Configure the scoring environment
inference_config = InferenceConfig(runtime="python",
                                   entry_script=script_file,
                                   conda_file=env_file)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

# Retrieve the registered model from the workspace and deploy it as an ACI web service
model = ws.models['diabetes_model']
service_name = "diabetes-service"
service = Model.deploy(ws, service_name, [model], inference_config, deployment_config)
service.wait_for_deployment(True)

After the model is deployed successfully you can see the Endpoints with the deployed services in your AMLS workspace:

Figure 8. Endpoints

NB: The ACI (Azure Container Instances) service requires no authentication and is widely used in development environments. For production, you should consider AKS (Azure Kubernetes Service) for deployment.
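A sketch of what the production-oriented variant could look like, assuming an AKS inference cluster named 'aks-cluster' (an illustrative name) is already attached to the workspace:

from azureml.core.webservice import AksWebservice

# Key-based authentication is enabled, so callers must present a key with each request
aks_target = ws.compute_targets['aks-cluster']
aks_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=2,
                                                auth_enabled=True)
service = Model.deploy(ws, 'diabetes-service-aks', [model],
                       inference_config, aks_config,
                       deployment_target=aks_target)
service.wait_for_deployment(show_output=True)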

Hint: To check the status of a deployed service and get its logs, use the service.state and service.get_logs() commands.

4 Use the web service

Now the deployed diabetes service can be used by doctors from their own software. We call the service by sending HTTP requests to its endpoint:

import requests
import json

endpoint = service.scoring_uri
x_new = [[2,180,74,24,21,23.9091702,1.488172308,22],
         [0,148,58,11,179,39.19207553,0.160829008,45]]

# Convert the array to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})
# Set the content type
headers = {'Content-Type': 'application/json'}

predictions = requests.post(endpoint, input_json, headers=headers)
predicted_classes = json.loads(predictions.json())
for i in range(len(x_new)):
    print("Patient {}".format(x_new[i]), predicted_classes[i])

As you can see from the code above, by sending observations as a JSON document we get back the predicted classes (whether each patient is diabetic or not). These predictions help the doctor decide whether a patient needs treatment.

Conclusion

We have seen in this article what a pipeline is, why it is useful, especially in production, and the difference between training and inference pipelines.

We have learned how to train and deploy those pipelines in Azure ML, and seen the visual tools that the Azure workspace offers to manage your assets 😎.

In the next articles of this series, you’ll see how to create and deploy the ML pipelines with Azure Databricks and MLFlow (👉part 2), and the global comparison between AMLS and Azure Databricks (👉part 3).

Useful links 🔗

https://techcommunity.microsoft.com/t5/azure-global/new-reference-architecture-training-of-python-scikit-learn/ba-p/377113

https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace

https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction

https://markheath.net/post/build-container-images-with-acr

https://github.com/MicrosoftDocs/mslearn-aml-labs/blob/master/06-Deploying_a_model.ipynb

https://github.com/MicrosoftDocs/mslearn-aml-labs/blob/master/05-Creating_a_Pipeline.ipynb
