Azure Machine Learning Service for Kubernetes Architects

Joseph Masengesho
19 min read · Feb 26, 2024


In the dynamic field of data science and artificial intelligence (AI), the integration of Kubernetes with machine learning (ML) technologies presents promising opportunities. However, a notable gap persists between Kubernetes experts and the nuanced operations of ML platforms like Azure ML. Data scientists and AI practitioners frequently encounter challenges navigating Kubernetes’ complexities, whereas Kubernetes specialists may struggle with understanding Azure ML’s procedures for model creation, packaging, and deployment. Closing this gap is essential to fully leverage the capabilities of these technologies.

In this article, I’ll discuss insights gained from my exploration of the integration between Azure Machine Learning and Kubernetes, focusing specifically on the perspective of a Kubernetes architect.

About Azure Machine Learning and Kubernetes

  • Azure Machine Learning simplifies and accelerates the machine learning project lifecycle by providing a cloud-based platform for training, deploying, and managing models, compatible with open-source frameworks like PyTorch, TensorFlow, and scikit-learn, supported by MLOps tools for monitoring, retraining, and redeployment. Learn more.
  • Azure Kubernetes Service (AKS) streamlines Kubernetes cluster deployment in Azure by managing operational tasks like health monitoring and maintenance, providing a no-cost, automatically configured control plane abstracted from users, who solely manage and pay for attached nodes. Learn more.
  • Azure Arc-enabled Kubernetes enables attaching Kubernetes clusters from any location for centralized management and configuration in Azure, facilitating consistent development and operational experiences across diverse Kubernetes platforms, with SSL-secured outbound connections to Azure and representation as distinct resources in Azure Resource Manager for easy organization. Learn more.


Why do ML/AI workloads love Kubernetes?

Reasons why people are using Kubernetes to train and deploy AI workloads:

  • Scalability: Kubernetes provides seamless scaling capabilities, allowing AI workloads to dynamically adapt to changing demand.
  • Resource Efficiency: Kubernetes efficiently allocates resources, ensuring optimal utilization for AI training and inference tasks.
  • Portability: Kubernetes offers portability across various environments, enabling AI workloads to run consistently across on-premises, cloud, and hybrid environments.
  • Fault Tolerance: Kubernetes offers robust fault tolerance features, ensuring high availability and reliability for AI workloads.
  • Automation: Kubernetes automates deployment, scaling, and management of AI workloads, reducing manual intervention and enhancing productivity.
  • Separation of concerns: The IT operations/Kubernetes architecture team is responsible for provisioning AKS or Arc Kubernetes clusters, installing the Azure Machine Learning extension, and managing network and security configuration, instance types, and troubleshooting, using tools like Azure CLI or kubectl. The data science team then consumes the IT-provisioned compute for training or inference, discovering and selecting the available compute targets and instance types in the Azure Machine Learning workspace with their preferred tools or APIs, such as Azure Machine Learning CLI v2, Python SDK v2, or the Studio UI.

Other Popular AI/ML Frameworks:

Besides Azure ML on Kubernetes, here are other popular AI/ML open-source frameworks:

  • Kubeflow: Simplifies ML workflows on Kubernetes with support for various frameworks and components, though installation and configuration can pose challenges and not all ML libraries may be supported.
  • Feast: Offers consistent feature storage and serving for ML models on Kubernetes, integrating with popular frameworks, yet it may be complex to use and maintain, and support for all data sources and formats may be lacking.
  • KServe: Provides standardized API endpoints for ML model deployment and management on Kubernetes, supporting multiple serving platforms and offering features like model fetching and observability, though it may have limitations in functionality and framework support.
  • OpenML: Facilitates ML experiment sharing and collaboration on Kubernetes, supporting AutoML tools and frameworks, yet integration challenges and incomplete support for certain tasks and datasets may arise.
  • Volcano: Enables high-performance workload execution on Kubernetes, featuring powerful batch scheduling capabilities, scalability, and usability, but compatibility issues with some Kubernetes resources and incomplete support for certain workloads may be encountered.
  • Ray: An open-source unified framework for scaling AI and Python applications like machine learning. It provides the compute layer for parallel processing so that you don’t need to be a distributed systems expert.
  • Kaito: An operator that automates AI/ML inference model deployment in a Kubernetes cluster. The target models are popular open-source large models such as Falcon and Llama 2.

Kubernetes Compute Target in Azure ML

With Azure Machine Learning CLI/Python SDK v2, Azure Machine Learning introduced a new compute target: the Kubernetes compute target. You can easily enable an existing Azure Kubernetes Service (AKS) cluster or Azure Arc-enabled Kubernetes (Arc Kubernetes) cluster to become a Kubernetes compute target in Azure Machine Learning, and use it to train or deploy models. (Microsoft Learn)

Azure ML Compute Targets

Azure Machine Learning Kubernetes compute supports two types of Kubernetes clusters:

  • AKS clusters within Azure, offering security and compliance controls along with flexibility for managing ML workloads.
  • Arc Kubernetes clusters outside of Azure, enabling model training or deployment across diverse infrastructures.

Azure Machine Learning Kubernetes Extension and Inference Router

To use Kubernetes as a compute target, we need to install the Azure Machine Learning extension. The extension installs the following components and CRDs:

  • ML controllers and operators: aml-operator, volcano, inference-operator-controller-manager
  • Networking: gateway, azureml-fe-v2, nginx-ingress controller, relayserver
  • Monitoring: metric-controller, prometheus, and fluent-bit
  • Identity Management: identity-controller and identity-proxy
  • Miscellaneous: nvidia-plugin DaemonSet, NVIDIA DCGM (Data Center GPU Manager)
  • Custom Resource Definitions (CRDs): amlJob, Identity, InstanceType, Metrics, OnlineEndpoint, OnlineDeployment.

Resources created for an Arc-Enabled Kubernetes cluster

Azure ML extension deployments and services with an Arc-enabled Kubernetes cluster (Digital Ocean)

Resources created for Azure Kubernetes Service (AKS)

Azure ML extension deployments and services with AKS. Notice the private IP for the inference router (azureml-fe) service. Also notice that we enabled the ingress controller for this installation.

Below are notable custom resource definitions that get installed. Others, like Volcano's, are installed too.

Azure ML CRDs.
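A quick way to verify these pieces on your own cluster is to list what the extension deployed and which CRDs it registered. A minimal sketch, assuming the default azureml release namespace (names can vary by extension version):

# Extension workloads (the default release namespace is azureml)
kubectl get pods -n azureml

# CRDs registered by the extension (API group amlarc.azureml.com)
kubectl get crd | grep -i azureml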

Read more about these components on Microsoft Learn:

Installing the Azure ML Extension

The Azure ML extension is deployed like any other cluster extension for AKS and Arc-enabled Kubernetes.

With Azure CLI:

az k8s-extension create --name azureml \
  --extension-type Microsoft.AzureML.Kubernetes \
  --cluster-type connectedClusters \
  --cluster-name <cluster name> \
  --resource-group <cluster resource group> \
  --scope cluster \
  --config installPromOp=false \
           enableTraining=True \
           enableInference=True \
           inferenceRouterServiceType=LoadBalancer \
           allowInsecureConnections=True \
           inferenceRouterHA=False
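Once the command completes, it is worth confirming that the extension reached a successful provisioning state before moving on. A small follow-up check using the same cluster parameters:

# Verify the extension's provisioning state
az k8s-extension show --name azureml \
  --cluster-type connectedClusters \
  --cluster-name <cluster name> \
  --resource-group <cluster resource group> \
  --query provisioningState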

With Bicep/ARM:

resource azureml 'Microsoft.KubernetesConfiguration/extensions@2022-11-01' = {
  name: 'string'
  scope: resourceSymbolicName
  identity: {
    type: 'SystemAssigned'
  }
  plan: {
    name: 'string'
    product: 'string'
    promotionCode: 'string'
    publisher: 'string'
    version: 'string'
  }
  properties: {
    aksAssignedIdentity: {
      type: 'string'
    }
    autoUpgradeMinorVersion: bool
    configurationProtectedSettings: {}
    configurationSettings: {}
    extensionType: 'string'
    releaseTrain: 'string'
    scope: {
      cluster: {
        releaseNamespace: 'string'
      }
      namespace: {
        targetNamespace: 'string'
      }
    }
    statuses: [
      {
        code: 'string'
        displayStatus: 'string'
        level: 'string'
        message: 'string'
        time: 'string'
      }
    ]
    version: 'string'
  }
}

To understand all the configurations and their purposes, check out this documentation:

Azure Machine Learning Inference Router and Connectivity Configurations

Azure Machine Learning inference router is the front-end component (azureml-fe) that is deployed on the AKS or Arc Kubernetes cluster at Azure Machine Learning extension deployment time (Microsoft Learn).

What does the Inference Router do?

  • Routes incoming inference requests from the cluster load balancer or ingress controller to the corresponding model pods.
  • Load-balances all incoming inference requests with smart coordinated routing.
  • Manages model pod auto-scaling.
  • Provides fault tolerance and failover, ensuring inference requests are always served for critical business applications.

The following steps are how requests are processed by the front-end:

  1. Client sends request to the load balancer.
  2. Load balancer sends to one of the front-ends.
  3. The front-end locates the service router (the front-end instance acting as coordinator) for the service.
  4. The service router selects a back-end and returns it to the front-end.
  5. The front-end forwards the request to the back-end.
  6. After the request has been processed, the back-end sends a response to the front-end component.
  7. The front-end propagates the response back to the client.
  8. The front-end informs the service router that the back-end has finished processing and is available for other requests.
Traffic flow for Azure ML inference with the inference router (Microsoft Learn)
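From the client's side, step 1 is just an HTTP call to the endpoint's scoring URI. A minimal sketch with key authentication, reusing the scoring URI and key shown later in the OnlineEndpoint resource (the request body depends entirely on your score.py):

# Hypothetical scoring request; replace the IP, endpoint path, key, and payload with your own
curl -X POST "http://<IP Address>/api/v1/endpoint/sklearn-regression-do/score" \
  -H "Authorization: Bearer <key 1 value>" \
  -H "Content-Type: application/json" \
  -d '{"data": [[1.0, 2.0, 3.0, 4.0]]}'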

Connectivity Requirement for AKS Inference Cluster:

The following diagram shows the connectivity requirements for AKS inferencing. Black arrows represent actual communication, and blue arrows represent the domain names. You may need to add entries for these hosts to your firewall or to your custom DNS server. Learn More.

AKS Inference Router Connectivity Requirements (Microsoft Learn)

Right after azureml-fe is deployed, it attempts to start, which requires it to:

  • Resolve DNS for AKS API server
  • Query AKS API server to discover other instances of itself (it’s a multi-pod service)
  • Connect to other instances of itself

Once azureml-fe is started, it requires the following connectivity to function properly:

  • Connect to Azure Storage to download dynamic configuration
  • Resolve DNS for Microsoft Entra authentication server api.azureml.ms and communicate with it when the deployed service uses Microsoft Entra authentication.
  • Query AKS API server to discover deployed models
  • Communicate to deployed model PODs

At model deployment time, for a successful model deployment, the AKS nodes should be able to (a quick in-cluster check is sketched after this list):

  • Resolve DNS for the customer's ACR
  • Download images from the customer's ACR
  • Resolve DNS for the Azure Blob storage where the model is stored
  • Download models from Azure Blob storage
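A quick way to sanity-check these requirements from inside the cluster is to run a throwaway pod and resolve the relevant hostnames. A sketch, with placeholder names:

# In-cluster DNS check against the workspace ACR (repeat for the blob storage endpoint)
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup <acr name>.azurecr.io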

After the model is deployed and the service starts, azureml-fe automatically discovers it using the AKS API and is ready to route requests to it. It must be able to communicate with the model pods. Follow the links below to explore advanced configurations for the inference router:

Attaching a Kubernetes Cluster to an Azure ML Workspace

Attaching a Kubernetes cluster to an Azure Machine Learning workspace can flexibly support many different scenarios: for example, sharing one cluster across multiple attachments, letting model training scripts access Azure resources, and controlling the authentication configuration of the workspace (Microsoft Learn).

One cluster to one workspace, creating multiple compute targets

  • For the same Kubernetes cluster, you can attach it to the same workspace multiple times and create multiple compute targets for different projects/teams/workloads.

One cluster to multiple workspaces

  • For the same Kubernetes cluster, you can also attach it to multiple workspaces, and the multiple workspaces can share the same Kubernetes cluster.

If you plan to have different compute targets for different projects/teams, you can specify an existing Kubernetes namespace in your cluster for the compute target, to isolate workloads among different teams/projects.

One cluster can be attached to one or more workspaces

Cluster and Workspaces in different Subscriptions

There is a known limitation for ML on AKS: the K8s cluster and the Azure ML workspace must live in the same subscription. If your AKS cluster is in a different subscription, you can use Azure Arc to create a connected cluster in the same subscription as the workspace. The architecture diagram below shows three workspaces connected to one Arc cluster.

If the cluster is in a different subscription, use Azure Arc to create a connected cluster in the workspace’s subscription
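Arc-enabling the cluster boils down to connecting it with the connectedk8s CLI extension while pointed at the cluster's kubeconfig. A minimal sketch, with placeholder names:

# Create the Arc connected-cluster resource in the workspace's subscription
az account set --subscription <workspace subscription id>
az connectedk8s connect --name <connected cluster name> --resource-group <resource group>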

Learn more about how to Arc-enable a cluster:

Attaching a cluster to a workspace:

With Azure CLI

# Set up variables
NAME="demo-sklearn"
RESOURCE_GROUP_NAME="value"
WORKSPACE_NAME="value"
CLUSTER_NAME="value"
NAMESPACE="value"
CLUSTER_RESOURCE_ID=$(az aks show -n $CLUSTER_NAME -g $RESOURCE_GROUP_NAME --query id --output tsv)

# Attach the cluster with az ml
az ml compute attach --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $WORKSPACE_NAME \
  --type Kubernetes \
  --name $NAME \
  --resource-id $CLUSTER_RESOURCE_ID \
  --identity-type SystemAssigned \
  --namespace $NAMESPACE \
  --no-wait
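Once the attach operation completes, you can confirm that the compute target shows up in the workspace:

# Verify the Kubernetes compute target was attached
az ml compute show --name $NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $WORKSPACE_NAME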

With Python SDK v2

from azure.ai.ml import MLClient, load_compute
from azure.identity import DefaultAzureCredential

# Authenticated client; subscription_id, resource_group, and workspace hold your workspace coordinates
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)

compute_name = "demo-compute"
aks_resource_id = "<cluster resource id>"

# for an Arc-connected cluster, the resource_id looks like
# '/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Kubernetes/connectedClusters/<CLUSTER_NAME>'
compute_params = [
    {"name": compute_name},
    {"type": "kubernetes"},
    {"resource_id": aks_resource_id},
    {"namespace": "azureml-workloads"},
    {"description": "Demo compute for AML"},
]
k8s_compute = load_compute(source=None, params_override=compute_params)

compute_list = {c.name: c.type for c in ml_client.compute.list()}

if compute_name not in compute_list or compute_list[compute_name] != "kubernetes":
    ml_client.begin_create_or_update(k8s_compute).result()
else:
    print("Compute already exists")

Or, with the Python SDK v1 (azureml.core):

from azureml.core import Workspace
from azureml.core.compute import KubernetesCompute
from azureml.core.compute import ComputeTarget
import os

# Workspace handle (assumes a config.json downloaded from the workspace)
ws = Workspace.from_config()

# choose a name for your Azure Arc-enabled Kubernetes compute
amlarc_compute_name = os.environ.get("AMLARC_COMPUTE_NAME", "demo-compute")
training_namespace = os.environ.get("AMLARC_COMPUTE_NAMESPACE", "azureml-workloads")
cluster_resource_id = os.environ.get("AMLARC_CLUSTER_RESOURCE_ID", "<AKS/ARC CLUSTER RESOURCE ID>")

# resource ID for your Azure Arc-enabled Kubernetes cluster
resource_id = os.environ.get("ARC_CLUSTER_RESOURCE_ID", cluster_resource_id)

if amlarc_compute_name in ws.compute_targets:
    amlarc_compute = ws.compute_targets[amlarc_compute_name]
    if amlarc_compute and type(amlarc_compute) is KubernetesCompute:
        print("found compute target: " + amlarc_compute_name)
else:
    print("creating new compute target...")
    amlarc_attach_configuration = KubernetesCompute.attach_configuration(resource_id, training_namespace, "SystemAssigned")
    amlarc_compute = ComputeTarget.attach(ws, amlarc_compute_name, amlarc_attach_configuration)

amlarc_compute.wait_for_completion(show_output=True)

# For a more detailed view of current KubernetesCompute status, use get_status()
print(amlarc_compute.get_status().serialize())

Through the Azure Portal

Instance Types

Instance types are an Azure Machine Learning concept that allows targeting certain types of compute nodes for training and inference workloads. For an Azure virtual machine, an example of an instance type is STANDARD_D2_V3. Microsoft Learn explains how to create and manage instance types for your computation requirements. (Microsoft Learn)

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: myinstancetypename
spec:
  nodeSelector:
    mylabel: mylabelvalue
  resources:
    limits:
      cpu: "1"
      nvidia.com/gpu: 1
      memory: "2Gi"
    requests:
      cpu: "700m"
      memory: "1500Mi"

Deploy Your First ML Model on AKS

To deploy your first model on AKS, visit these sample projects:

Inference:

Training:

Private AKS & Private ML Workspace:

Understanding Azure ML Kubernetes CRDs

amlarc.azureml.com/v1alpha1/OnlineEndpoint

apiVersion: amlarc.azureml.com/v1alpha1
kind: OnlineEndpoint
metadata:
  name: sklearn-regression-do
  namespace: azureml-arc
spec:
  authKeys:
    primaryKey: <key 1 value>
    secondaryKey: <key 2 value>
  authMode: Key
  computeTarget: /subscriptions/subId/resourceGroups/rgName/workspaces/workspaceName/computes/computeName
  location: centralus
  managedIdentity:
    clientId: <client id>
    type: SystemAssigned
  resourceId: <azure rm resource id>
  trafficRules:
  - deploymentName: blue-sklearn-regression-do
    percent: 100
  workspaceId: <workspace id>
status:
  scoringUri: http://<IP Address>/api/v1/endpoint/sklearn-regression-do/score
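Because these are ordinary custom resources, a Kubernetes architect can inspect what the extension created directly with kubectl (resource names below follow the example above; kubectl api-resources shows the exact names on your cluster):

# Discover the resource names exposed by the amlarc API group
kubectl api-resources --api-group=amlarc.azureml.com

# Inspect the endpoint object in the compute namespace
kubectl get onlineendpoint -n azureml-arc
kubectl describe onlineendpoint sklearn-regression-do -n azureml-arc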

How to create the endpoint resource:

Python SDK V2


from azure.ai.ml import MLClient
from azure.ai.ml.entities import KubernetesOnlineEndpoint
from azure.identity import AzureCliCredential

# Set up the client; subscription_id, resource_group, workspace, and compute_name are assumed to be defined
credential = AzureCliCredential()
ml_client = MLClient(credential, subscription_id, resource_group, workspace)

# endpoint name
online_endpoint_name = "sklearn-regression"

# define the online endpoint on the Kubernetes compute target
endpoint = KubernetesOnlineEndpoint(
    name=online_endpoint_name,
    compute=compute_name,
    description="this is a sample online endpoint",
    auth_mode="key",
    tags={"foo": "bar"},
)

# create the endpoint
ml_client.begin_create_or_update(endpoint).result()

CLI SDK v2

# Sample file
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-uai-endpoint
auth_mode: key
identity:
  type: user_assigned
  user_assigned_identities:
  - resource_id: user_identity_ARM_id_place_holder


az ml online-endpoint create --resource-group
--workspace-name
[--auth-mode]
[--file] # the name of the file in the current directory or a full path
[--local {false, true}]
[--name]
[--no-wait]
[--set]
[--web]

Bicep

https://learn.microsoft.com/en-us/azure/templates/microsoft.machinelearningservices/workspaces/onlineendpoints?pivots=deployment-language-bicep

amlarc.azureml.com/v1alpha1/OnlineDeployment

apiVersion: amlarc.azureml.com/v1alpha1
kind: OnlineDeployment
metadata:
  name: blue-sklearn-regression-do
  namespace: azureml-arc
spec:
  inferenceServerConfiguration:
    servingContainer:
      artifacts:
      - format: manifest
        mountPath: /var/azureml-app
        name: modeldata
        storageUri: <blob storage url>
      command:
      - runsvdir
      - /var/runit
      environmentVariables:
        AML_APP_ROOT: /var/azureml-app/onlinescoring
        AZUREML_ENTRY_SCRIPT: score.py
        AZUREML_MODEL_DIR: <value>
        SERVICE_NAME: sklearn-regression-do
        SERVICE_PATH_PREFIX: api/v1/endpoint/sklearn-regression-do
      image: <acr name>.azurecr.io/azureml/azureml_<image name>
      livenessProbe:
        failureThreshold: 30
        httpMethod: GET
        initialDelaySeconds: 10
        path: /
        periodSeconds: 10
        port: 5001
        scheme: HTTP
        successThreshold: 1
        timeoutSeconds: 2
      maxConcurrentRequestsPerContainer: 1
      name: inference-server
      ports:
      - portNumber: 5001
        protocol: TCP
      readinessProbe:
        failureThreshold: 30
        httpMethod: GET
        initialDelaySeconds: 10
        path: /
        periodSeconds: 10
        port: 5001
        scheme: HTTP
        successThreshold: 1
        timeoutSeconds: 2
      resourceRequests:
        cpu: 100m
        cpuLimit: "2"
        memory: 0.5Gi
        memoryLimit: 8Gi
  maxQueueWaitMs: 500
  resourceId: <azure rm resource id for the deployment>
  scaleSettings:
    maximumInstanceCount: 1
    minimumInstanceCount: 1
    refreshPeriodInSec: 1
    scaleType: Auto
    targetUtilization: 70
status:
  desiredReplicas: 1
  observedGeneration: 1
  upToDateReplicas: 1

Creating Online Deployment:

Python SDK V2

from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    KubernetesOnlineDeployment,
    Model,
    ResourceRequirementsSettings,
    ResourceSettings,
)

# ml_client and online_endpoint_name come from the endpoint example above
model = Model(path="../model-1/model/sklearn_regression_model.pkl")
env = Environment(
    conda_file="../model-1/environment/conda.yaml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
)

blue_deployment = KubernetesOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="../model-1/onlinescoring", scoring_script="score.py"
    ),
    instance_count=1,
    resources=ResourceRequirementsSettings(
        requests=ResourceSettings(
            cpu="100m",
            memory="0.5Gi",
        ),
    ),
)

ml_client.begin_create_or_update(blue_deployment)

CLI SDK V2


# sample file
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
model:
  path: ../../model-1/model/
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score_managedidentity.py
environment:
  conda_file: ../../model-1/environment/conda-managedidentity.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1
environment_variables:
  STORAGE_ACCOUNT_NAME: "storage_place_holder"
  STORAGE_CONTAINER_NAME: "container_place_holder"
  FILE_NAME: "file_place_holder"

az ml online-deployment create --file
--resource-group
--workspace-name
[--all-traffic]
[--endpoint-name]
[--local {false, true}]
[--local-enable-gpu {false, true}]
[--name]
[--no-wait]
[--package-model]
[--set]
[--skip-script-validation]
[--vscode-debug {false, true}]
[--web]

CLI SDK Yaml Schema: https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-deployment-managed-online?view=azureml-api-2
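Once the endpoint and a deployment exist, the CLI can also retrieve the auth keys and send a test request. A minimal sketch; sample-request.json is a placeholder payload matching your scoring script:

# Fetch the endpoint keys
az ml online-endpoint get-credentials --name sklearn-regression \
  --resource-group <rg name> --workspace-name <workspace name>

# Send a test request to the blue deployment
az ml online-endpoint invoke --name sklearn-regression \
  --deployment-name blue \
  --request-file sample-request.json \
  --resource-group <rg name> --workspace-name <workspace name>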

amlarc.azureml.com/v1alpha1/Identity

apiVersion: amlarc.azureml.com/v1alpha1
kind: Identity
metadata:
  generation: 1
  name: blue-sklearn-regression
  namespace: sklearn-aks-ns
spec:
  computeID: /subscriptions/subId/resourceGroups/rgName/workspaces/workspaceName/computes/computeName
  identities:   # key name reconstructed; the original nesting was lost in formatting
  - acrServers:
    - <acr name>.azurecr.io
    clientID: <client id>
    primary: true
    resourceID: ""
  serviceAccount: blue-sklearn-regression
  workloadID: <resource id>
  workspaceAPIHost: https://<guid>.workspace.eastus.api.azureml.ms
status:
  acrSecretStatuses:
  - acrSecret:
      acrServer: <acr name>.azurecr.io
      clientID: <client id>
      name: blue-sklearn-regression-<acr name>.azurecr.io
    expiration: "2024-02-20T15:50:17Z"
  conditions:
  - lastTransitionTime: "2024-02-20T12:50:17Z"
    status: "True"
    type: ACRSecretsReady
  - lastTransitionTime: "2024-02-20T12:50:17Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-02-20T12:50:17Z"
    status: "True"
    type: ServiceAccountReady
  - lastTransitionTime: "2024-02-20T12:50:16Z"
    status: "True"
    type: SidecarSecretReady
  observedGeneration: 1
  serviceAccountStatus:
    serviceAccount: blue-sklearn-regression
  sidecarSecretStatus:
    expiration: "2024-02-21T12:50:16Z"
    sidecarSecret: blue-sklearn-regression-sidecar

The identity is automatically created when we create a deployment.

ML Model Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    autoscale.enabled: "true"
    autoscale.max_replicas: "1"
    autoscale.min_replicas: "1"
    autoscale.refresh_period_in_sec: "1"
    autoscale.target_utilization: "70"
    azuremlappname: blue-sklearn-regression-do
    isazuremlapp: "true"
    ml.azure.com/compute: do-aks
    ml.azure.com/deployment-name: blue
    ml.azure.com/endpoint-name: sklearn-regression-do
    ml.azure.com/identity: blue-sklearn-regression-do
    ml.azure.com/resource-group: <rg name>
    ml.azure.com/scrape-metrics: "true"
    ml.azure.com/subscription-id: <sub id>
    ml.azure.com/workspace: jm-ml
  name: blue-sklearn-regression-do
  namespace: azureml-arc
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 600
  replicas: 1
  selector:
    matchLabels:
      azuremlappname: blue-sklearn-regression-do
      isazuremlapp: "true"
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        azuremlappname: blue-sklearn-regression-do
        isazuremlapp: "true"
        ml.azure.com/compute: do-aks
        ml.azure.com/deployment-name: blue
        ml.azure.com/endpoint-name: sklearn-regression-do
        ml.azure.com/identity: blue-sklearn-regression-do
        ml.azure.com/resource-group: <rg name>
        ml.azure.com/scrape-metrics: "true"
        ml.azure.com/subscription-id: <sub id>
        ml.azure.com/workspace: <workspace name>
    spec:
      automountServiceAccountToken: true
      containers:
      - command:
        - runsvdir
        - /var/runit
        env:
        - name: AML_APP_ROOT
          value: /var/azureml-app/onlinescoring
        - name: AZUREML_ENTRY_SCRIPT
          value: score.py
        - name: AZUREML_MODEL_DIR
          value: /var/azureml-app/azureml-models/<guid>/1
        - name: SERVICE_NAME
          value: sklearn-regression-do
        - name: SERVICE_PATH_PREFIX
          value: api/v1/endpoint/sklearn-regression-do
        image: <acr name>.azurecr.io/azureml/<image repo name>
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 30
          httpGet:
            path: /
            port: 5001
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
        name: inference-server
        ports:
        - containerPort: 5001
          protocol: TCP
        readinessProbe:
          failureThreshold: 30
          httpGet:
            path: /
            port: 5001
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
        resources:
          limits:
            cpu: "2"
            memory: 8Gi
          requests:
            cpu: 100m
            memory: 512Mi
        securityContext:
          privileged: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/azureml-app
          name: model-mount-0
          readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
      - env:
        - name: STORAGE_MANIFEST_URL
          value: <url>
        - name: STORAGE_DOWNLOAD_PATH
          value: /var/azureml-app
        - name: STORAGE_CREDENTIAL_CLIENTID
          value: <value>
        image: mcr.microsoft.com/mir/mir-storageinitializer:46571814.1631244300887
        imagePullPolicy: IfNotPresent
        name: storageinitializer-modeldata
        resources:
          limits:
            cpu: 100m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/azureml-app
          name: model-mount-0
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: blue-sklearn-regression-do
      serviceAccountName: blue-sklearn-regression-do
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: ml.azure.com/amlarc
        operator: Equal
        value: "true"
      - key: ml.azure.com/amlarc-workload
        operator: Equal
        value: "true"
      - key: ml.azure.com/resource-group
        operator: Equal
        value: <rg-name>
      - key: ml.azure.com/workspace
        operator: Equal
        value: jm-ml
      - key: ml.azure.com/compute
        operator: Equal
        value: do-aks
      volumes:
      - emptyDir: {}
        name: model-mount-0

Model Kubernetes Service

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2024-02-16T16:10:00Z"
  labels:
    autoscale.enabled: "true"
    autoscale.max_replicas: "1"
    autoscale.min_replicas: "1"
    autoscale.refresh_period_in_sec: "1"
    autoscale.target_utilization: "70"
    azuremlappname: blue-sklearn-regression-do
    isazuremlapp: "true"
    max_concurrent_requests_per_container: "1"
    max_queue_wait_ms: "500"
    ml.azure.com/compute: do-aks
    ml.azure.com/deployment-name: blue
    ml.azure.com/endpoint-name: sklearn-regression-do
    ml.azure.com/identity: blue-sklearn-regression-do
    ml.azure.com/resource-group: <rg-name>
    ml.azure.com/scrape-metrics: "true"
    ml.azure.com/subscription-id: <sub id>
    ml.azure.com/workspace: jm-ml
    request_timeout_ms: "5000"
    routing_algorithm: CoordinatedLeastLoaded
  name: blue-sklearn-regression-do
  namespace: azureml-arc
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 80
    protocol: TCP
    targetPort: 5001
  selector:
    azuremlappname: blue-sklearn-regression-do
  sessionAffinity: None
  type: ClusterIP

Model Endpoint Kubernetes ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    ABTest: "true"
  name: sklearn-regression-do
  namespace: azureml-arc
data:
  aadAuthEnabled: "false"
  endpoint: sklearn-regression-do
  keyAuthEnabled: "true"
  primaryKey: <key 1 value>
  secondaryKey: <key 2 value>
  versions: '[{"name":"blue-sklearn-regression-do","trafficPercentile":100}]'
  workspace.id: <workspace id>
  workspace.region: centralus
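The Deployment, Service, and pods all carry the ml.azure.com/* labels shown above, so a label selector is enough to find everything that belongs to a given endpoint:

# Find the Kubernetes objects created for the sklearn-regression-do endpoint
kubectl get deploy,svc,pods -n azureml-arc \
  -l ml.azure.com/endpoint-name=sklearn-regression-do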

Environments And Inference Server Containers in Azure Machine Learning Service

By investigating the online deployment pod, we notice that the container that runs our model, also known as the inference server, comes from the container registry we attached to the ML workspace. How did it get created?

Creating the Deployment Inference Server

  • Model and scoring files: When we register the model in the workspace studio, the model files are stored in the workspace storage account. You can find them under Data > Datastores > workspaceblobstore > WebUpload, or simply browse the storage account containers.
  • Environment and base images: When you specify a curated or custom environment, you are specifying which base image to use in the Dockerfile that builds the inference server container. Conda files are used to install dependencies while the image is built. See the Dockerfile sample below.
FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
USER root
RUN mkdir -p $HOME/.cache
WORKDIR /
COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/
RUN if dpkg --compare-versions `conda --version | grep -oE '[^ ]+$'` lt 4.4.11; then conda install conda==4.4.11; fi
COPY azureml-environment-setup/mutated_conda_dependencies.yml azureml-environment-setup/mutated_conda_dependencies.yml
RUN ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_<guid> -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig
ENV PATH /azureml-envs/azureml_<guid>/bin:$PATH
COPY azureml-environment-setup/send_conda_dependencies.py azureml-environment-setup/send_conda_dependencies.py
RUN echo "Copying environment context"
COPY azureml-environment-setup/environment_context.json azureml-environment-setup/environment_context.json
RUN python /azureml-environment-setup/send_conda_dependencies.py -p /azureml-envs/azureml_<guid>
ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/azureml_<guid>
ENV LD_LIBRARY_PATH /azureml-envs/azureml_<guid>/lib:$LD_LIBRARY_PATH
ENV CONDA_DEFAULT_ENV=azureml_<guid> CONDA_PREFIX=/azureml-envs/azureml_<guid>
COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/
RUN if [ $SPARK_HOME ]; then /bin/bash -c '$SPARK_HOME/bin/spark-submit /azureml-environment-setup/spark_cache.py'; fi
RUN rm -rf azureml-environment-setup
ENV AZUREML_ENVIRONMENT_IMAGE True
CMD ["bash"]
  • Once the environment is created, an inference server container is created and pushed to the container registry attached to the ML workspace.

Where do we build the inference server container?

Depending on the network configuration, three scenarios determine where the inference server container is built:

  • ACR Tasks: If the container registry is public, the inference server image is built using ACR Tasks on managed public runners.
  • Private Docker build compute: If the container registry is only accessible through a private endpoint, ACR Tasks won't be able to reach it. You must specify a compute cluster to run the Docker build job with --image-build-compute when creating/updating the workspace (see the sketch after this list). The compute cluster must be on a subnet that has access to the container registry.
  • Serverless compute on the same VNET: The private Docker build compute is being replaced by serverless compute deployed on the same VNET. Read more.
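For the private-ACR case, the build compute is configured on the workspace itself. A sketch with the Azure CLI, using placeholder names:

# Point the workspace at a compute cluster that can reach the private container registry
az ml workspace update --name <workspace name> \
  --resource-group <rg name> \
  --image-build-compute <compute cluster name>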

Storage Initialization and Model files volume

  • ML OnlineEndpoint deployment: When we create an online endpoint deployment, we are asking the ML platform to create a Kubernetes Deployment with an inference server (main container), an identity server (sidecar), and a storage initializer (init container).
  • The storage initializer downloads the model and scoring scripts from the storage account and stores them in an emptyDir volume. The volume is mounted in the inference server at /var/azureml-app.
kubectl exec <pod name> -it -n azureml-arc -- sh

Once inside the pod shell:

# change directory into the model files directory
cd /var/azureml-app/azureml-models/<some string>/1
# once inside, notice the file you uploaded when you created the model

# change directory into the scoring files directory
cd /var/azureml-app/onlinescoring
# once inside, notice the files you uploaded when creating the online deployment
Model and scoring files in the inference server file system

Inference pod startup flow:

  1. When the model and deployment are created, the files are stored in the storage account attached to the workspace.
  2. When the environment is specified (curated or custom), the inference server container is built and pushed to the container registry attached to the workspace.
  3. When the deployment is created, the identity controller injects an identity-server sidecar container into the model deployment to pull the credentials needed to connect to the storage account.
  4. The storageinitializer init container connects to the storage account and pulls the model and scoring files into an emptyDir volume for the inference server to consume.
  5. When the inference pod is scheduled, the inference server container image is pulled from the container registry.
  6. The inference server loads the model files from the file system at /var/azureml-app.
  7. Once the inference server is ready, its traffic is served by the inference router through its service.
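You can watch this flow on the cluster itself, for example by tailing the init container and reviewing the pod events. A sketch using the names from the earlier examples:

# Follow the storage initializer while it downloads model and scoring files
kubectl logs <pod name> -c storageinitializer-modeldata -n azureml-arc

# Review scheduling, image pulls, and probe results for the inference pod
kubectl describe pod <pod name> -n azureml-arc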

Customizing the Inference Server

Resource Allocations and Probes

As we saw earlier, we do not create inference deployments directly through YAML. The ML controller creates the K8s deployment from the ML OnlineDeployment CRD. We configure the resource allocation on the OnlineDeployment Object. This CRD allows you to specify InstanceType, liveness and readiness probes, and resourceRequests.


# onlineDeployment yaml
# other properties skipped

livenessProbe:
  failureThreshold: 30
  httpMethod: GET
  initialDelaySeconds: 10
  path: /
  periodSeconds: 10
  port: 5001
  scheme: HTTP
  successThreshold: 1
  timeoutSeconds: 2
readinessProbe:
  failureThreshold: 30
  httpMethod: GET
  initialDelaySeconds: 10
  path: /
  periodSeconds: 10
  port: 5001
  scheme: HTTP
  successThreshold: 1
  timeoutSeconds: 2
resourceRequests:
  cpu: 100m
  cpuLimit: "2"
  memory: 0.5Gi
  memoryLimit: 8Gi

Taints and Tolerations

Azure ML allows us to specify tolerations for its built-in taints, if they are being used.

    
# onlineDeployment yaml
# other properties skipped
tolerations:
- key: ml.azure.com/amlarc
  operator: Equal
  value: "true"
- key: ml.azure.com/amlarc-workload
  operator: Equal
  value: "true"
- key: ml.azure.com/resource-group
  operator: Equal
  value: <rg-name>
- key: ml.azure.com/workspace
  operator: Equal
  value: jm-ml
- key: ml.azure.com/compute
  operator: Equal
  value: do-aks
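If you dedicate an AKS node pool to Azure ML workloads, you would typically taint it and let these built-in tolerations do the matching. A sketch with the Azure CLI; the pool name and the NoSchedule effect are my assumptions, so check the Azure ML docs for the recommended taint:

# Add a dedicated, tainted node pool for Azure ML workloads (taint effect assumed)
az aks nodepool add --cluster-name <cluster name> \
  --resource-group <rg name> \
  --name amlpool \
  --node-count 2 \
  --node-taints "ml.azure.com/amlarc=true:NoSchedule"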

Scaling The Inference Server

The OnlineDeployment CRD allows us to specify scaleSettings for the inference pods. However, we must not enable autoscaling through the Horizontal Pod Autoscaler (HPA) or KEDA, because the inference router discussed earlier takes care of scaling the model inference servers based on incoming traffic. Read more on this link:


# onlineDeployment yaml
# other properties skipped
scaleSettings:
  maximumInstanceCount: 1
  minimumInstanceCount: 1
  refreshPeriodInSec: 1
  scaleType: Auto
  targetUtilization: 70
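Since the inference router owns scaling, it is also worth confirming that nothing else, such as an HPA or a KEDA ScaledObject, is targeting the model deployments. A quick check:

# Confirm no other autoscaler is acting on the compute namespace
kubectl get hpa -n azureml-arc
kubectl get scaledobjects -n azureml-arc 2>/dev/null || echo "KEDA is not installed"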
