Real-time inferencing using Azure ML

Pradosh Thakur
Brillio Data Science
7 min read · Sep 27, 2022

Azure is Microsoft's cloud platform for accessing a wide range of services at low cost and with no dependency on on-premises infrastructure. The only requirement is a stable internet connection to the user-friendly Azure Portal. With managed ML services and infrastructure, it is now much easier to develop and deploy everything from the simplest to the most complex ML models and consume them from different endpoints.

Microsoft describes Azure ML as follows: “Azure Machine Learning is a cloud service for accelerating and managing the machine learning project lifecycle. Machine learning professionals, data scientists, and engineers can use it in their day-to-day workflows: Train and deploy models, and manage ML Ops”. We can create a model in Azure Machine Learning or bring a model built with an open-source framework such as PyTorch, TensorFlow, or scikit-learn. MLOps tools help you monitor, retrain, and redeploy models.

Below are the Azure components, with their definitions (as per the Azure documentation), that are created by default or are necessary while working with Azure ML.

Resource Group: A container that holds related resources for an Azure solution. The resource group includes those resources that you want to manage as a group.

Virtual Network: A virtual network connects virtual machines and devices to one another.

Storage Accounts: An Azure storage account contains all of your storage objects, for example blobs, data lakes, files, and tables. It is highly scalable and easy to access, and it can be secured so that only users with the right credentials are authorized.

Machine Learning Workspace: The workspace is the top-level resource for Azure Machine Learning. It stores all the components of Azure ML: experiments, run history, models, datasets, and datastore information.

Compute Instances: A compute instance is a fully configured and managed development environment in the cloud. Compute instances are used to develop ML models; they come in different configurations and are scalable in nature.

Compute Cluster: Azure Machine Learning Compute (AmlCompute) is a managed compute infrastructure that allows us to easily create single- or multi-node compute. The compute is created in the workspace and can be shared among users. It can also be used to run batch inferencing.

Kubernetes Cluster: Azure Kubernetes Service (AKS) provides a managed cluster for real-time inferencing. It costs more than other compute options because of its low latency and real-time features.

The components involved in Machine Learning on the Azure platform are as follows:

Workspace: The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure ML.

Experiment: An experiment is a grouping of many runs from a script. It always belongs to a workspace.

Run: A run is a single execution of a training script. Azure ML records every run and stores the following information (a logging sketch follows this list):

· Metadata, i.e., the timestamp and duration of the run

· Metrics logged by the script

· Output files
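
To make this concrete, here is a minimal sketch of starting a run and logging a metric with the Azure ML Python SDK (v1). The experiment name and the logged metric are placeholders, not part of the original article.

```python
from azureml.core import Experiment, Workspace

ws = Workspace.from_config()                    # reads the config.json downloaded from the portal
experiment = Experiment(workspace=ws, name="demo-experiment")

run = experiment.start_logging()                # starts an interactive run under the experiment
run.log("accuracy", 0.91)                       # a metric logged by the script
run.complete()                                  # finalizes the run; timestamp and duration are recorded
```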

Run Configuration: A set of instructions that define how a script should be run in a specified compute target. The configuration includes a wide set of behavior definitions, such as whether to use an existing Python environment or a Conda environment that’s built from a specification.

Snapshots: When you submit a run, Azure Machine Learning compresses the directory that contains the script as a zip file and sends it to the compute target. The zip file is then extracted, and the script is run there. Azure Machine Learning also stores the zip file as a snapshot as part of the run record. Anyone with access to the workspace can browse a run record and download the snapshot.

I will now detail how to deploy and infer from a trained ML model in two different ways:

· Real-time inferencing

· Batch inferencing

The focus of this blog is on the inferencing mechanism rather than on training the ML model. Hence, I will not cover how to train an ML model, and I assume the reader is familiar with basic ML concepts, has a basic understanding of how the cloud works, and knows their way around the Azure portal and interface.

REAL-TIME INFERENCING:

Real-time, or interactive, inference is an architecture where model inference can be triggered at any time and an immediate response is expected.

Azure Machine Learning allows us to split the deployment into two components, so that we can keep the same code and merely update the model.

To do this, we first have to register the model. When we register a model, we upload it to the cloud (in the workspace’s default storage account), and it is then mounted to the compute where the web service is running. As there might be some pre-processing (a Min-Max scaler or Standard scaler, for example) that needs to be applied to the test set as well, we need to save those fitted objects too.

The below code demonstrates how to register a model.
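
This is a minimal sketch using the Azure ML Python SDK (v1); the file paths and model names ("outputs/model.pkl", "my-classifier", and so on) are placeholders for your own artifacts.

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()  # reads the workspace config downloaded from the portal

# Register the trained model (a hypothetical local pickle file)
model = Model.register(workspace=ws,
                       model_path="outputs/model.pkl",   # local path to the serialized model
                       model_name="my-classifier")

# Register the fitted pre-processing object (e.g. a scaler) alongside it
scaler_model = Model.register(workspace=ws,
                              model_path="outputs/scaler.pkl",
                              model_name="my-scaler")
```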

Once the model is registered, it will show up under the Models section of the Azure ML page.

Next, we need to create an environment for the inferencing service. The Conda environment file specifies the dependencies for the service, including those required by both the model and the entry script. We have to specify each required Python package with a pinned version number so that all dependencies resolve correctly; this also ensures the code does not break due to future changes in those packages. The service will run in the specified environment, and any new dependency has to be added here.
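
A minimal sketch of defining that environment with the SDK; the environment name and the exact package versions are assumptions you should adapt to your model.

```python
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment(name="inference-env")

conda = CondaDependencies()
conda.add_conda_package("scikit-learn==1.0.2")  # pin versions so future releases don't break the service
conda.add_pip_package("azureml-defaults")       # required for Azure ML web-service deployments
conda.add_pip_package("joblib==1.1.0")

env.python.conda_dependencies = conda
env.register(workspace=ws)                      # registered environments can be reused across deployments
```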

Next, we need to create an AKS (Azure Kubernetes Service) cluster. Azure Kubernetes Service offers serverless Kubernetes, an integrated continuous integration and continuous delivery (CI/CD) experience, and enterprise-grade security and governance. It helps rapidly build, deliver, and scale applications with confidence. It is fast and has low latency, and is therefore preferred for real-time inferencing. It can also scale up whenever required and provide a seamless experience. The below code will create a Kubernetes cluster with the provided name if one does not exist; otherwise it will use the existing AKS cluster.
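
A minimal sketch of that create-or-reuse logic; the cluster name, VM size, and node count are placeholders.

```python
from azureml.core.compute import AksCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

aks_name = "aks-inference"

try:
    # Reuse the cluster if it already exists in the workspace
    aks_target = ComputeTarget(workspace=ws, name=aks_name)
    print("Found existing AKS cluster; using it.")
except ComputeTargetException:
    # Otherwise provision a new one
    prov_config = AksCompute.provisioning_configuration(vm_size="Standard_D3_v2",
                                                        agent_count=3)
    aks_target = ComputeTarget.create(workspace=ws,
                                      name=aks_name,
                                      provisioning_configuration=prov_config)
    aks_target.wait_for_completion(show_output=True)
```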

Once the AKS cluster is successfully deployed, it will appear in the Compute section under the inference clusters tab, as shown in the image below.

Next, we need to create a scoring script to process the unseen data. This script imports all the required dependencies and has two functions, init() and run(), described below (a sketch of the full script follows the descriptions).

Init: Initializes the global variables and loads the model and required objects from the workspace. These objects are then used in the run function. If there is an error in the init() function, inference will not happen, and the error logs have to be checked in that case.

Run: Mirrors the steps from training: the data is pre-processed in the same way, cleaned, and normalized, so it is ready for prediction. The model loaded in init() is used to predict on the unseen values, and the predictions are returned at the end of the script.
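
Here is a minimal sketch of such a scoring script. The model and scaler names refer to the registration snippet above, and the expected JSON shape (a "data" field holding a list of feature rows) is an assumption.

```python
# score.py
import json

import joblib
import numpy as np
from azureml.core.model import Model


def init():
    # Runs once when the service container starts: load the registered
    # model and pre-processing object into global variables.
    global model, scaler
    model = joblib.load(Model.get_model_path("my-classifier"))
    scaler = joblib.load(Model.get_model_path("my-scaler"))


def run(raw_data):
    # Runs on every request: apply the same pre-processing as in training,
    # then predict and return the result.
    try:
        data = np.array(json.loads(raw_data)["data"])
        data = scaler.transform(data)        # same scaling as at training time
        predictions = model.predict(data)
        return predictions.tolist()
    except Exception as e:
        return {"error": str(e)}
```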

Next, we need to create the inference and deployment configurations for the web service. The inference configuration contains the software-related information: the name of the scoring script, where it is located, and the Python environment and dependencies to be used by the service and the scoring script. The deployment configuration contains the hardware-related information, for example how much memory and how many CPU cores the web service needs.
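
A minimal sketch of the two configuration objects; the script and environment names refer to the earlier snippets, and the resource sizes are placeholders.

```python
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AksWebservice

# Software side: which script to run and in which environment
inference_config = InferenceConfig(entry_script="score.py",
                                   environment=env)

# Hardware side: resources allocated to the service on the AKS cluster
deployment_config = AksWebservice.deploy_configuration(cpu_cores=1,
                                                       memory_gb=1,
                                                       auth_enabled=True)
```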

Finally, we have to deploy the web service, providing the inference configuration, the deployment configuration, the registered models, the cluster, and so on. Once everything is correctly provided, the web service will be created within a few minutes and can be seen in the Azure portal.
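
A minimal sketch of the deployment call, tying together the objects from the earlier snippets; the service name is a placeholder.

```python
from azureml.core.model import Model

service = Model.deploy(workspace=ws,
                       name="realtime-scoring-service",
                       models=[model, scaler_model],
                       inference_config=inference_config,
                       deployment_config=deployment_config,
                       deployment_target=aks_target)
service.wait_for_deployment(show_output=True)

print(service.scoring_uri)   # the REST endpoint URL
print(service.get_keys())    # the primary and secondary authentication keys
```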

Below is a screenshot of the web service and its endpoint. The service can be consumed as a REST API for real-time inferencing from anywhere in the world, provided the caller is authorized. Inside the web service, the endpoint URL and key can be found, which can then be used by applications to call the model. The screenshot below shows the masked URL and key.

The REST API can be called from any web application (for example, a Flask/HTML/CSS app) or from any program that needs real-time inference. This is how real-time inferencing can be done in Azure ML.
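
For example, here is a minimal sketch of calling the deployed endpoint with the requests library; the URL, key, and feature values are placeholders you would copy from the portal (or from the deployment snippet above).

```python
import json

import requests

scoring_uri = "https://<your-endpoint>/score"  # copied from the portal
key = "<your-primary-key>"

headers = {"Content-Type": "application/json",
           "Authorization": f"Bearer {key}"}

payload = json.dumps({"data": [[5.1, 3.5, 1.4, 0.2]]})  # hypothetical feature vector

response = requests.post(scoring_uri, data=payload, headers=headers)
print(response.json())
```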

In the next article, we will look at how batch inferencing is done with Azure ML.
