Azure Machine Learning Service for Kubernetes Architects: Deploy Your First Model on AKS with AZ CLI v2

Joseph Masengesho
Feb 29, 2024


In a previous post, I provided a lengthy write-up about my understanding of using Kubernetes as a compute target in Azure ML from a Kubernetes architect’s perspective. In this post, I will offer a step-by-step tutorial that teaches you how to deploy your first ML model on AKS. As a disclaimer, I am not a data scientist; however, I work with customers who deploy ML workloads on Kubernetes.

In this tutorial, we will deploy a trained regression model based on the MNIST Dataset, which consists of 60K handwritten digits for training and 10K for testing. The model was created using the scikit-learn framework. You can learn more about how to use it in Azure ML here. We will use Azure CLI v2 (az ml), but you can also use Python SDK v2.

MNIST Dataset (Courtesy, Internet)

Prerequisites

  • A machine learning workspace. You can learn how to create one here.
  • A Kubernetes cluster. You can learn how to create one here. At minimum, you need a system node pool. We will optionally create a dedicated node pool for this lab.
  • az ml CLI (v2). Learn how to upgrade or install it here.
  • GitHub repository: All the scripts used in this lab are available in this repo.

Cloning the repository

The files used in this tutorial are available on my GitHub in this repo.

# change directories into your workspace folder
git clone https://github.com/jmasengeshomsft/azure-ml-on-aks-cli-step-by-step.git
# all subsequent commands are executed from the repository's root directory

Explore the files and understand how everything is tied together.

Folder Structure
  • model/conda.yml: Dependency files for the container image
  • model/sklearn_mnist_model.pkl: The actual sklearn format model file
  • script/score.py: The model scoring file
  • cli-scripts.sh: A list of the az cli commands that will be used (file not executable, just a list)
  • kubernetes-deployment.yml: A schema for the ML online deployment
  • kubernetes-endpoint.yml: A schema for the ML online endpoint
  • sample-request.json: Sample request body to be used to test the model

Setting Up Variables

COMPUTE_NAME="demo-k8s-compute"
RESOURCE_GROUP_NAME="aks-demos"
WORKSPACE_NAME="jm-ml"
CLUSTER_NAME="ml-sklearn-demo"
NAMESPACE="azureml-workloads"
CLUSTER_RESOURCE_ID=$(az aks show -n $CLUSTER_NAME -g $RESOURCE_GROUP_NAME --query id --output tsv)
NODE_POOL_NAME="sklearnpool"
NODE_POOL_LABEL="purpose=ml-sklearn-demo"
NODE_COUNT=2 # Change as needed
VM_SIZE="Standard_D4ds_v5"
MAX_PODS=110 # Change as needed
ENDPOINT_NAME="demo-sklearn-endpoint"
ENDPOINT_YAML_FILE="kubernetes-endpoint.yml"
DEPLOYMENT_NAME="demo-sklearn-deployment"
DEPLOYMENT_YAML_FILE="kubernetes-deployment.yml"

Creating a new node pool (optional)

If you prefer a dedicated node pool for this lab, create one now. The model we will deploy is very small; two Standard_D4ds_v5 instances are more than enough. If you want to use your existing node pools, remember to remove the nodeSelector on the extension and the instance type.


az aks nodepool add \
--resource-group $RESOURCE_GROUP_NAME \
--cluster-name $CLUSTER_NAME \
--name $NODE_POOL_NAME \
--node-count $NODE_COUNT \
--node-vm-size $VM_SIZE \
--max-pods $MAX_PODS \
--labels $NODE_POOL_LABEL

Installing the Azure ML extension on AKS

If you are using an existing node pool, remove or update the nodeSelector attribute in the config parameter (nodeSelector.purpose=ml-sklearn-demo).


az k8s-extension create --name azureml \
--extension-type Microsoft.AzureML.Kubernetes \
--cluster-type managedClusters \
--cluster-name $CLUSTER_NAME \
--resource-group $RESOURCE_GROUP_NAME \
--scope cluster \
--config nodeSelector.purpose=ml-sklearn-demo installPromOp=False enableTraining=True enableInference=True inferenceRouterServiceType=loadBalancer internalLoadBalancerProvider=azure allowInsecureConnections=True inferenceRouterHA=False nginxIngress.controller="k8s.io/aml-ingress-nginx"

After several minutes, confirm that the extension was successfully installed. You should see the following deployments and services.

Resources installed by the ML extension

A few things to note:

  • We chose to deploy an internal load balancer for the inference router by setting inferenceRouterServiceType=loadBalancer and internalLoadBalancerProvider=azure. Confirm that the azureml-fe service has an internal IP address (see the check after this list). Omit internalLoadBalancerProvider if you want a publicly available IP for the inference router.
  • We installed the nginx-ingress controller with a different class name (nginxIngress.controller="k8s.io/aml-ingress-nginx") to avoid conflicts with any existing installation. In this lab, NGINX is only used to communicate with Azure Resource Manager. We could also use it to set up ingress for the inference router.
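
A quick way to verify both points from the command line (a hedged sketch; the extension installs its resources into the azureml namespace):

# confirm the extension reached a Succeeded state
az k8s-extension show --name azureml \
  --cluster-type managedClusters \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --query provisioningState -o tsv

# list the deployments and services created by the extension,
# and check that azureml-fe received an internal (private) IP
kubectl get deployments,services -n azureml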

Read more about extension installation in the Azure ML documentation.

Attaching the Kubernetes compute target to the Machine Learning workspace

The next step is to add our AKS cluster as a compute target so that we can run the model inference server.


az ml compute attach --resource-group $RESOURCE_GROUP_NAME \
--workspace-name $WORKSPACE_NAME \
--type Kubernetes \
--name $COMPUTE_NAME \
--resource-id $CLUSTER_RESOURCE_ID \
--identity-type SystemAssigned \
--namespace $NAMESPACE \
--no-wait

Once this command succeeds, we should see a new compute target in the ML workspace.

Kubernetes Compute Target in ML Workspace
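
You can also confirm the attachment from the CLI (a hedged check; the compute should report a Succeeded provisioning state):

az ml compute show --name $COMPUTE_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $WORKSPACE_NAME \
  --query provisioning_state -o tsv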

Learn more about attaching Kubernetes compute targets in the Azure ML documentation.

Creating a compute InstanceType (optional)

Instance types help Azure ML schedule workloads on predefined K8s node pools. They serve a similar purpose to nodeSelectors. You can read more about them here. When you deployed the extension, a default instance type called "defaultinstancetype" was created. If you prefer to use it, you can skip this section, but remember to update the instanceType in the deployment YAML schema.

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: demo-sklearn-instance-type
spec:
  nodeSelector:
    purpose: ml-sklearn-demo
  resources:
    limits:
      cpu: "1"
      memory: "1Gi"
    requests:
      cpu: "500m"
      memory: "500Mi"

Apply it with kubectl:

kubectl apply -f sklearn-instance-type.yaml

Creating the ML workload namespace

The extension resources were created in a dedicated namespace called azureml. We do not want all the ML workloads to end up there, so let's create a new namespace for the model deployment.

kubectl create namespace azureml-workloads

Deploying Azure ML Kubernetes Online Endpoint

The next step is to create a Kubernetes online endpoint, which is an abstraction of the model inference server. Under one endpoint, we can deploy different versions of our model as "deployments".

Endpoint YAML schema (kubernetes-endpoint.yml)

$schema: https://azuremlschemas.azureedge.net/latest/kubernetesOnlineEndpoint.schema.json
# To serve the online endpoint in Kubernetes, set the compute to your Kubernetes compute target. The legacy AKS compute is not supported. Learn more about Kubernetes compute here: aka.ms/amlarc/doc.
name: demo-sklearn-endpoint
description: Sklearn Kubernetes realtime endpoint.
compute: azureml:demo-k8s-compute
auth_mode: key
identity:
  type: system_assigned
tags:
  modelName: sklearn-mnist

Deploy the endpoint with the Azure CLI using the schema above:


az ml online-endpoint create --resource-group $RESOURCE_GROUP_NAME \
--workspace-name $WORKSPACE_NAME \
--file $ENDPOINT_YAML_FILE \
--local false \
--no-wait

Once this script succeeds, we should observe the following:

In Azure Portal

  • A new Azure resource will be created in the workspace resource group. Notice that it has already created URL endpoints for scoring and the Swagger definition before we even deploy the model. Very cool.
ML Online endpoint resource in Azure Portal

In ML Workspace

  • Under Endpoints, we will have a new entry. Explore it to see more details.

In Kubernetes

  • Confirm that the new online endpoint CRD was created in our namespace:

kubectl get onlineEndpoint -n azureml-workloads

# You should get something like this
NAME                    AGE
demo-sklearn-endpoint   2m46s
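
You can also pull the endpoint details, including the scoring URI, from the CLI (a hedged check):

# show the endpoint's provisioning state and scoring URI
az ml online-endpoint show --name $ENDPOINT_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $WORKSPACE_NAME \
  --query "{provisioning_state: provisioning_state, scoring_uri: scoring_uri}"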

Deploying the Azure ML Endpoint Deployment

Now that we have created an endpoint, it’s time to deploy the model inference server. The deployment will require model files and a container base environment before creation. We have two options:

  • We can pre-create both the model and environment in the Azure ML workspace or through SDKs.
  • Alternatively, we can upload model files and specify the base container environment in the deployment schema.

For simplicity, let’s create the model and environment during deployment creation.

Online Deployment YAML Schema (kubernetes-deployment.yml)

name: demo-sklearn-deployment
type: kubernetes
endpoint_name: demo-sklearn-endpoint
app_insights_enabled: true
model:
  path: model/sklearn_mnist_model.pkl
code_configuration:
  code: script/
  scoring_script: score.py
instance_type: demo-sklearn-instance-type
environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
  conda_file: model/conda.yml
request_settings:
  request_timeout_ms: 3000
  max_queue_wait_ms: 3000
resources:
  requests:
    cpu: "0.1"
    memory: "0.1Gi"
  limits:
    cpu: "0.2"
    memory: "0.2Gi"
readiness_probe:
  failure_threshold: 30
  initial_delay: 10
  period: 10
  success_threshold: 1
  timeout: 2
liveness_probe:
  failure_threshold: 30
  initial_delay: 10
  period: 10
  success_threshold: 1
  timeout: 2
scale_settings:
  type: target_utilization
  min_instances: 1
  max_instances: 3
  polling_interval: 10
  target_utilization_percentage: 70
tags:
  endpointName: demo-sklearn-endpoint
  modelName: sklearn-mnist

Deploy the OnlineDeployment with the CLI. To upload the model and other related files, make sure the identity running the az CLI has the Storage Blob Data Contributor role on the ML workspace's storage account.



az ml online-deployment create --file $DEPLOYMENT_YAML_FILE \
--resource-group $RESOURCE_GROUP_NAME \
--workspace-name $WORKSPACE_NAME \
--all-traffic \
--endpoint-name $ENDPOINT_NAME \
--local false \
--name $DEPLOYMENT_NAME \
--no-wait

This script will take several minutes to finish. If the environment does not yet exist in your workspace's container registry, most of the time will be spent building the container image and pushing it to the ACR. You can verify this by checking the Jobs section in the ML workspace, where you will notice a new experiment named "prepare_image".
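
While waiting, you can poll the deployment's provisioning state from the CLI (a hedged check; it should move from Creating to Succeeded once the image is built and the pod is ready):

az ml online-deployment show --name $DEPLOYMENT_NAME \
  --endpoint-name $ENDPOINT_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $WORKSPACE_NAME \
  --query provisioning_state -o tsv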

When we run this script successfully, we should observe the following:

In Kubernetes

  • An OnlineDeployment CRD is created under our workload namespace:

kubectl get onlinedeployment -n azureml-workloads

# will return something like this
NAME                                            AGE
demo-sklearn-deployment-demo-sklearn-endpoint   2m15s

  • A Kubernetes deployment, pod(s), service, and configmap:

kubectl get all -n azureml-workloads

K8s native objects created by the ML online deployment

At this point, everything we need on AKS to test the model has been deployed. The diagram below shows the relationship between the Azure ML CRDs and the K8s objects that represent our deployment.

On the left, the ML CRDs create the K8s deployment. On the right, inference traffic is routed through the inference router.

Adjusting the Traffic Allocation to the newly created OnlineDeployment

When the onlineDeployment was created, its traffic allocation was set to 0%. We need to update the online endpoint to send 100% of the traffic to the deployment. If we have more than one deployment under the same endpoint, we can allocate traffic appropriately, such as in a blue-green setup.

az ml online-endpoint update --resource-group $RESOURCE_GROUP_NAME \
--workspace-name $WORKSPACE_NAME \
--name $ENDPOINT_NAME \
--traffic "$DEPLOYMENT_NAME=100"

Once this script succeeds, you can check that the deployment is receiving 100% of the traffic. You can verify this in the ML workspace or through kubectl: explore the trafficRules section under Spec (.spec.trafficRules).

kubectl describe onlineendpoint demo-sklearn-endpoint -n azureml-workloads
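
The same information is available from the CLI (a hedged check; the traffic map should show the deployment at 100):

az ml online-endpoint show --name $ENDPOINT_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $WORKSPACE_NAME \
  --query traffic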

Testing The Model with Port-Forwarding

Ingress traffic to our ML models routes through the inference router, which was created during the extension installation. The inference router's service is called "azureml-fe". Typically, we would access the onlineDeployment workload through the endpoint's scoring URL or an ingress controller. However, for simplicity, we will use port-forwarding to access the model locally. Since our inference router is private, it's easier to test with port-forwarding anyway.

The scoring URL looks like this: http://<ip address>/api/v1/endpoint/demo-sklearn-endpoint/score. We will replace the IP address with localhost and the forwarded local port; the rest of the URL stays the same.

We also need the authentication key, which is available in the ML workspace under Endpoints > select the endpoint > Consume. You can also get the auth key from kubectl when you describe the endpoint (.spec.authKeys).
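
Alternatively, you can retrieve the keys with the CLI (a hedged sketch using the endpoint credentials command):

az ml online-endpoint get-credentials --name $ENDPOINT_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $WORKSPACE_NAME

With the key in hand, start the port-forward to the azureml-fe service: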

kubectl port-forward svc/azureml-fe 50491:80 --namespace azureml

Once port-forwarding is in place, you can test the endpoint with Postman.

Here is how it looks in Postman:

As you can see, the sample input is an 8.
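
If you prefer the terminal, the same request can be sent with curl (a hedged sketch assuming the port-forward above and the key stored in a hypothetical AUTH_KEY variable):

# AUTH_KEY holds the primary key retrieved earlier
curl -X POST "http://localhost:50491/api/v1/endpoint/demo-sklearn-endpoint/score" \
  -H "Authorization: Bearer $AUTH_KEY" \
  -H "Content-Type: application/json" \
  -d @sample-request.json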

That's it, you are inferencing with a model deployed on AKS. The next sections explain how we went from a YAML schema for the online deployment to a Kubernetes pod serving the model. Read on!

How is the Inference Server Pod Created?

If we describe the pod that was created from the online deployment, we notice that it has three containers: inference-server, identity-sidecar, and storageinitializer-modeldata (an init container).
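
To inspect it yourself (a hedged sketch; your pod name will differ from the one shown below):

# find the pod created by the online deployment
kubectl get pods -n azureml-workloads

# dump its full manifest; the snippet below is a trimmed version of this output
kubectl get pod <pod-name> -n azureml-workloads -o yaml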

# not actual manifest. Modified to focus on containers and volumes
apiVersion: v1
kind: Pod
metadata:
  name: demo-sklearn-deployment-demo-sklearn-endpoint-67997c4fc5-z8zpp
  namespace: azureml-workloads
spec:
  containers:
  - image: <acr name>.azurecr.io/azureml/azureml_<guid>
    name: inference-server
    ports:
    - containerPort: 5001
      protocol: TCP
    resources:
      limits:
        cpu: 200m
        memory: 214748364800m
      requests:
        cpu: 100m
        memory: 107374182400m
    volumeMounts:
    - mountPath: /var/azureml-app
      name: model-mount-0
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-shfl2
      readOnly: true
  - image: mcr.microsoft.com/azureml/amlarc/docker/identity-sidecar:1.1.44
    name: identity-sidecar
    resources:
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 100m
        memory: 50Mi
    volumeMounts:
    - mountPath: /identity
      name: identity-secret-volume
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-shfl2
      readOnly: true
  initContainers:
  - image: mcr.microsoft.com/mir/mir-storageinitializer:46571814.1631244300887
    imagePullPolicy: IfNotPresent
    name: storageinitializer-modeldata
    resources:
      limits:
        cpu: 100m
        memory: 500Mi
      requests:
        cpu: 100m
        memory: 500Mi
    volumeMounts:
    - mountPath: /var/azureml-app
      name: model-mount-0
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-shfl2
      readOnly: true
    - mountPath: /identity
      name: identity-secret-volume
      readOnly: true

This diagram shows how the inference server container and model files are deployed in Kubernetes.

Inference Pod startup flow:

  1. When the model and deployment are created, the files are stored in the storage account attached to the workspace.
  2. When the environment is specified (curated or custom), the inference server container is built and pushed to the container registry attached to the workspace.
  3. When the deployment is created, the identity controller injects an identity-sidecar container into the model deployment to pull the credentials needed to connect to the storage account.
  4. The storageinitializer init container connects to the storage account and pulls the model and scoring files into an emptyDir volume for the inference server to consume.
  5. When the inference pod is scheduled, the inference server container image is pulled from the container registry.
  6. The inference server loads the model files into the file system at /var/azureml-app.
  7. Once the inference server is ready, its traffic is served by the inference router through its service.
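
You can observe steps 4 and 5 directly by checking the init container's logs on the pod (a hedged sketch; the pod name will differ):

# watch the storage initializer pull the model and scoring files
kubectl logs <pod-name> -c storageinitializer-modeldata -n azureml-workloads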

Building the Inference Server Container from Environment

When you specify a curated or custom environment, you are specifying which base image to use in the Dockerfile that builds the inference server container. The conda file is used to install dependencies while building the image. See the Dockerfile sample below.

FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
USER root
RUN mkdir -p $HOME/.cache
WORKDIR /
COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/
RUN if dpkg --compare-versions `conda --version | grep -oE '[^ ]+$'` lt 4.4.11; then conda install conda==4.4.11; fi
COPY azureml-environment-setup/mutated_conda_dependencies.yml azureml-environment-setup/mutated_conda_dependencies.yml
RUN ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_<guid> -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig
ENV PATH /azureml-envs/azureml_<guid>/bin:$PATH
COPY azureml-environment-setup/send_conda_dependencies.py azureml-environment-setup/send_conda_dependencies.py
RUN echo "Copying environment context"
COPY azureml-environment-setup/environment_context.json azureml-environment-setup/environment_context.json
RUN python /azureml-environment-setup/send_conda_dependencies.py -p /azureml-envs/azureml_<guid>
ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/azureml_<guid>
ENV LD_LIBRARY_PATH /azureml-envs/azureml_<guid>/lib:$LD_LIBRARY_PATH
ENV CONDA_DEFAULT_ENV=azureml_<guid> CONDA_PREFIX=/azureml-envs/azureml_<guid>
COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/
RUN if [ $SPARK_HOME ]; then /bin/bash -c '$SPARK_HOME/bin/spark-submit /azureml-environment-setup/spark_cache.py'; fi
RUN rm -rf azureml-environment-setup
ENV AZUREML_ENVIRONMENT_IMAGE True
CMD ["bash"]

Once the environment is created, an inference server container image is built and pushed to the container registry attached to the ML workspace.
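
You can list the resulting environment from the CLI (a hedged check; Azure ML registers the environment used by the deployment in the workspace):

az ml environment list --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $WORKSPACE_NAME -o table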

Where do we build the inference server container?

Depending on the network configuration, there are three scenarios for where the inference server container is built:

  • ACR Tasks: If the container registry is public, the inference server is built using ACR Tasks on managed public runners.
  • Private Docker build compute: If the container registry is only accessible through a private endpoint, ACR Tasks won't be able to reach it. You must specify a compute cluster to run the Docker build jobs with --image-build-compute when creating or updating the workspace. The compute cluster must be on a subnet that has access to the container registry (see the example after this list).
  • Serverless compute on the same VNET: The private Docker build compute will be replaced by serverless compute deployed on the same VNET. Read more.
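
A hedged sketch of the second option, assuming a hypothetical existing compute cluster named docker-build-cluster:

# point the workspace at a compute cluster that can reach the private ACR
az ml workspace update --name $WORKSPACE_NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --image-build-compute docker-build-cluster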

Storage Initialization and the Model Files Volume

The storage initializer downloads the model and scoring scripts from the storage account and stores them in an emptyDir volume. The volume is mounted in the inference server container at /var/azureml-app.

kubectl exec blue-sklearn-regression-do-644cd59847-zwsg7 -it -n azureml-arc -- sh

Once inside the pod shell:

# change directories into the model files directory
cd /var/azureml-app/azureml-models/<some string>/1
# once inside, notice the file you uploaded when you created the model

# change directories into the scoring files directory
cd /var/azureml-app/onlinescoring
# once inside, notice the files you uploaded when creating the online deployment

Model and scoring files in the inference server file system

Customizing the Inference Server

Resource Allocations and Probes

As we saw earlier, we do not create inference deployments directly through Kubernetes YAML. The ML controller creates the K8s deployment from the ML OnlineDeployment CRD. We configure resource allocation on the OnlineDeployment object, which lets you specify the instanceType, liveness and readiness probes, and resource requests.

# onlineDeployment yaml schema
# other properties skipped
resources:
  requests:
    cpu: "0.1"
    memory: "0.1Gi"
  limits:
    cpu: "0.2"
    memory: "0.2Gi"
readiness_probe:
  failure_threshold: 30
  initial_delay: 10
  period: 10
  success_threshold: 1
  timeout: 2
liveness_probe:
  failure_threshold: 30
  initial_delay: 10
  period: 10
  success_threshold: 1
  timeout: 2

Taints and Tolerations

Azure ML allows us to specify tolerations for the built-in taints, if they are being used. The tolerations below were added automatically, but you can customize them.

# onlineDeployment yaml
# other properties skipped
tolerations:
- key: ml.azure.com/amlarc
  operator: Equal
  value: "true"
- key: ml.azure.com/amlarc-workload
  operator: Equal
  value: "true"
- key: ml.azure.com/resource-group
  operator: Equal
  value: <rg-name>
- key: ml.azure.com/workspace
  operator: Equal
  value: jm-ml
- key: ml.azure.com/compute
  operator: Equal
  value: do-aks

Scaling The Inference Server

The OnlineDeployment CRD allows us to specify scaleSettings for the inference pods, but we must not enable autoscaling through the Horizontal Pod Autoscaler (HPA) or KEDA, because the inference router discussed earlier scales the model inference server based on incoming traffic. Read more in the Azure ML documentation.

# onlineDeployment yaml
# other properties skipped
scaleSettings:
  maximumInstanceCount: 1
  minimumInstanceCount: 1
  refreshPeriodInSec: 1
  scaleType: Auto
  targetUtilization: 70
