Azure Machine Learning Service for Kubernetes Architects

Joseph Masengesho
19 min read · Feb 26, 2024


In the dynamic field of data science and artificial intelligence (AI), the integration of Kubernetes with machine learning (ML) technologies presents promising opportunities. However, a notable gap persists between Kubernetes experts and the nuanced operations of ML platforms like Azure ML. Data scientists and AI practitioners frequently encounter challenges navigating Kubernetes’ complexities, whereas Kubernetes specialists may struggle with understanding Azure ML’s procedures for model creation, packaging, and deployment. Closing this gap is essential to fully leverage the capabilities of these technologies.

In this article, I’ll discuss insights gained from my exploration of the integration between Azure Machine Learning and Kubernetes, focusing specifically on the perspective of a Kubernetes architect.

About Azure Machine Learning and Kubernetes

  • Azure Machine Learning simplifies and accelerates the machine learning project lifecycle by providing a cloud-based platform for training, deploying, and managing models, compatible with open-source frameworks like PyTorch, TensorFlow, and scikit-learn, supported by MLOps tools for monitoring, retraining, and redeployment. Learn more.
  • Azure Kubernetes Service (AKS) streamlines Kubernetes cluster deployment in Azure by managing operational tasks like health monitoring and maintenance, providing a no-cost, automatically configured control plane abstracted from users, who solely manage and pay for attached nodes. Learn more.
  • Azure Arc-enabled Kubernetes enables attaching Kubernetes clusters from any location for centralized management and configuration in Azure, facilitating consistent development and operational experiences across diverse Kubernetes platforms, with SSL-secured outbound connections to Azure and representation as distinct resources in Azure Resource Manager for easy organization. Learn more.


Why do ML/AI workloads love Kubernetes?

Reasons why people are using Kubernetes to train and deploy AI workloads:

  • Scalability: Kubernetes provides seamless scaling capabilities, allowing AI workloads to dynamically adapt to changing demand.
  • Resource Efficiency: Kubernetes efficiently allocates resources, ensuring optimal utilization for AI training and inference tasks.
  • Portability: Kubernetes offers portability across various environments, enabling AI workloads to run consistently across on-premises, cloud, and hybrid environments.
  • Fault Tolerance: Kubernetes offers robust fault tolerance features, ensuring high availability and reliability for AI workloads.
  • Automation: Kubernetes automates deployment, scaling, and management of AI workloads, reducing manual intervention and enhancing productivity.
  • Separation of concerns: The IT operations/Kubernetes architecture team is responsible for provisioning AKS or Arc Kubernetes clusters, installing the Azure Machine Learning extension, and managing network and security configuration, instance types, and troubleshooting, using tools like Azure CLI or kubectl. The data science team then consumes the IT-provisioned compute for training or inference, discovering and selecting the available compute targets and instance types in the Azure Machine Learning workspace with their preferred tools or APIs, such as Azure Machine Learning CLI v2, Python SDK v2, or the Studio UI.

Other Popular AI/ML Frameworks:

Besides Azure ML on Kubernetes, here are other popular AI/ML open-source frameworks:

  • Kubeflow: Simplifies ML workflows on Kubernetes with support for various frameworks and components, though installation and configuration can pose challenges and not all ML libraries may be supported.
  • Feast: Offers consistent feature storage and serving for ML models on Kubernetes, integrating with popular frameworks, yet it may be complex to use and maintain, and support for all data sources and formats may be lacking.
  • KServe: Provides standardized API endpoints for ML model deployment and management on Kubernetes, supporting multiple serving platforms and offering features like model fetching and observability, though it may have limitations in functionality and framework support.
  • OpenML: Facilitates ML experiment sharing and collaboration on Kubernetes, supporting AutoML tools and frameworks, yet integration challenges and incomplete support for certain tasks and datasets may arise.
  • Volcano: Enables high-performance workload execution on Kubernetes, featuring powerful batch scheduling capabilities, scalability, and usability, but compatibility issues with some Kubernetes resources and incomplete support for certain workloads may be encountered.
  • Ray: An open-source unified framework for scaling AI and Python applications like machine learning. It provides the compute layer for parallel processing so that you don’t need to be a distributed systems expert.
  • Kaito: An operator that automates AI/ML inference model deployment in a Kubernetes cluster. The target models are popular open-source large models such as Falcon and Llama 2.

Kubernetes Compute Target in Azure ML

With Azure Machine Learning CLI/Python SDK v2, Azure Machine Learning introduced a new compute target: the Kubernetes compute target. You can easily enable an existing Azure Kubernetes Service (AKS) cluster or Azure Arc-enabled Kubernetes (Arc Kubernetes) cluster to become a Kubernetes compute target in Azure Machine Learning, and use it to train or deploy models. (Microsoft Learn)

Azure ML Compute Targets

Azure Machine Learning Kubernetes compute supports two types of Kubernetes clusters:

  • AKS clusters within Azure, offering security and compliance controls along with flexibility for managing ML workloads.
  • Arc Kubernetes clusters outside of Azure, enabling model training or deployment across diverse infrastructures.

Azure Machine Learning Kubernetes Extension and Inference Router

To use Kubernetes as a compute target, we need to install the Azure Machine Learning extension. The extension installs the following components and CRDs:

  • ML controllers and operators: aml-operator, volcano, inference-operator-controller-manager
  • Networking: gateway, azureml-fe-v2, nginx-ingress controller, relayserver
  • Monitoring: metric-controller, prometheus, and fluent-bit
  • Identity Management: identity-controller and identity-proxy
  • Miscellaneous: nvidia-plugin DaemonSet, NVIDIA DCGM (Data Center GPU Manager)
  • Custom Resource Definitions (CRDs): amlJob, Identity, InstanceType, Metrics, OnlineEndpoint, OnlineDeployment.

Resources created for an Arc-Enabled Kubernetes cluster

Azure ML extension deployments and services with an Arc-enabled Kubernetes cluster (Digital Ocean)

Resources created for Azure Kubernetes Service (AKS)

Azure ML extension deployments and services with AKS. Notice the private IP for the inference router (azureml-fe) service. Also notice that we enabled the ingress controller for this installation.

Below are notable custom resource definitions that get installed. Others, like Volcano's, are installed too.

Azure ML CRDs.
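A quick way to verify these pieces on your own cluster is to list what the extension deployed and which CRDs it registered. A minimal sketch, assuming the default azureml release namespace (names can vary by extension version):

# Extension workloads (the default release namespace is azureml)
kubectl get pods -n azureml

# CRDs registered by the extension (API group amlarc.azureml.com)
kubectl get crd | grep -i azureml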

Read more about these components on Microsoft Learn:

Installing the Azure ML Extension

The Azure ML extension is deployed like any other cluster extension for AKS and Arc-enabled Kubernetes.

With Azure CLI:

az k8s-extension create --name azureml \
  --extension-type Microsoft.AzureML.Kubernetes \
  --cluster-type connectedClusters \
  --cluster-name <cluster name> \
  --resource-group <cluster resource group> \
  --scope cluster \
  --config installPromOp=false \
           enableTraining=True \
           enableInference=True \
           inferenceRouterServiceType=LoadBalancer \
           allowInsecureConnections=True \
           inferenceRouterHA=False
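Once the command completes, it is worth confirming that the extension reached a successful provisioning state before moving on. A small follow-up check using the same cluster parameters:

# Verify the extension's provisioning state
az k8s-extension show --name azureml \
  --cluster-type connectedClusters \
  --cluster-name <cluster name> \
  --resource-group <cluster resource group> \
  --query provisioningState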

With Bicep/ARM:

resource azureml 'Microsoft.KubernetesConfiguration/extensions@2022-11-01' = {
  name: 'string'
  scope: resourceSymbolicName
  identity: {
    type: 'SystemAssigned'
  }
  plan: {
    name: 'string'
    product: 'string'
    promotionCode: 'string'
    publisher: 'string'
    version: 'string'
  }
  properties: {
    aksAssignedIdentity: {
      type: 'string'
    }
    autoUpgradeMinorVersion: bool
    configurationProtectedSettings: {}
    configurationSettings: {}
    extensionType: 'string'
    releaseTrain: 'string'
    scope: {
      cluster: {
        releaseNamespace: 'string'
      }
      namespace: {
        targetNamespace: 'string'
      }
    }
    statuses: [
      {
        code: 'string'
        displayStatus: 'string'
        level: 'string'
        message: 'string'
        time: 'string'
      }
    ]
    version: 'string'
  }
}

To understand all the configurations and their purposes, check out this documentation:

Azure Machine Learning Inference Router and Connectivity Configurations

Azure Machine Learning inference router is the front-end component (azureml-fe) that is deployed on the AKS or Arc Kubernetes cluster at Azure Machine Learning extension deployment time (Microsoft Learn).

What does the Inference Router do?

  • Routes incoming inference requests from the cluster load balancer or ingress controller to the corresponding model pods.
  • Load-balances all incoming inference requests with smart coordinated routing.
  • Manages model pod auto-scaling.
  • Provides fault tolerance and failover, ensuring inference requests are always served for critical business applications.

The following steps are how requests are processed by the front-end:

  1. Client sends request to the load balancer.
  2. Load balancer sends to one of the front-ends.
  3. The front-end locates the service router (the front-end instance acting as coordinator) for the service.
  4. The service router selects a back-end and returns it to the front-end.
  5. The front-end forwards the request to the back-end.
  6. After the request has been processed, the back-end sends a response to the front-end component.
  7. The front-end propagates the response back to the client.
  8. The front-end informs the service router that the back-end has finished processing and is available for other requests.
Traffic flow for Azure ML inference with the inference router (Microsoft Learn)
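From the client's side, step 1 is just an HTTP call to the endpoint's scoring URI. A minimal sketch with key authentication, reusing the scoring URI and key shown later in the OnlineEndpoint resource (the request body depends entirely on your score.py):

# Hypothetical scoring request; replace the IP, endpoint path, key, and payload with your own
curl -X POST "http://<IP Address>/api/v1/endpoint/sklearn-regression-do/score" \
  -H "Authorization: Bearer <key 1 value>" \
  -H "Content-Type: application/json" \
  -d '{"data": [[1.0, 2.0, 3.0, 4.0]]}'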

Connectivity Requirement for AKS Inference Cluster:

The following diagram shows the connectivity requirements for AKS inferencing. Black arrows represent actual communication, and blue arrows represent the domain names. You may need to add entries for these hosts to your firewall or to your custom DNS server. Learn More.

AKS Inference Router Connectivity Requirements (Microsoft Learn)

Right after azureml-fe is deployed, it attempts to start, which requires it to:

  • Resolve DNS for AKS API server
  • Query AKS API server to discover other instances of itself (it’s a multi-pod service)
  • Connect to other instances of itself

Once azureml-fe is started, it requires the following connectivity to function properly:

  • Connect to Azure Storage to download dynamic configuration
  • Resolve DNS for Microsoft Entra authentication server api.azureml.ms and communicate with it when the deployed service uses Microsoft Entra authentication.
  • Query AKS API server to discover deployed models
  • Communicate to deployed model PODs

At model deployment time, for a successful model deployment, the AKS nodes should be able to (a quick in-cluster check is sketched after this list):

  • Resolve DNS for the customer's ACR
  • Download images from the customer's ACR
  • Resolve DNS for the Azure Blob storage where the model is stored
  • Download models from Azure Blob storage
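A quick way to sanity-check these requirements from inside the cluster is to run a throwaway pod and resolve the relevant hostnames. A sketch, with placeholder names:

# In-cluster DNS check against the workspace ACR (repeat for the blob storage endpoint)
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup <acr name>.azurecr.io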

After the model is deployed and the service starts, azureml-fe automatically discovers it using the AKS API and is ready to route requests to it. It must be able to communicate with the model pods. Follow the links below to explore advanced configurations for the inference router:

Attaching a Kubernetes Cluster to an Azure ML Workspace

Attaching a Kubernetes cluster to an Azure Machine Learning workspace can flexibly support many different scenarios: for example, sharing one cluster across multiple attachments, letting model training scripts access Azure resources, and controlling the authentication configuration of the workspace (Microsoft Learn).

One cluster to one workspace, creating multiple compute targets

  • For the same Kubernetes cluster, you can attach it to the same workspace multiple times and create multiple compute targets for different projects/teams/workloads.

One cluster to multiple workspaces

  • For the same Kubernetes cluster, you can also attach it to multiple workspaces, and the multiple workspaces can share the same Kubernetes cluster.

If you plan to have different compute targets for different projects/teams, you can specify an existing Kubernetes namespace in your cluster for the compute target, to isolate workloads among different teams/projects.

One cluster can be attached to one or more workspaces

Cluster and Workspaces in different Subscriptions

There is a known limitation for ML on AKS: the K8s cluster and the Azure ML workspace must live in the same subscription. If your AKS cluster is in a different subscription, you can use Azure Arc to create a connected cluster in the same subscription as the workspace. The architecture diagram below shows three workspaces connected to one Arc cluster.

If the cluster is in a different subscription, use Azure Arc to create a connected cluster in the workspace’s subscription
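Arc-enabling the cluster boils down to connecting it with the connectedk8s CLI extension while pointed at the cluster's kubeconfig. A minimal sketch, with placeholder names:

# Create the Arc connected-cluster resource in the workspace's subscription
az account set --subscription <workspace subscription id>
az connectedk8s connect --name <connected cluster name> --resource-group <resource group>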

Learn more about how to Arc-enable a cluster:

Attaching a cluster to a workspace:

With Azure CLI

# Set up variables
NAME="demo-sklearn"
RESOURCE_GROUP_NAME="value"
WORKSPACE_NAME="value"
CLUSTER_NAME="value"
NAMESPACE="value"
CLUSTER_RESOURCE_ID=$(az aks show -n $CLUSTER_NAME -g $RESOURCE_GROUP_NAME --query id --output tsv)

# Attach the cluster with az ml
az ml compute attach --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $WORKSPACE_NAME \
  --type Kubernetes \
  --name $NAME \
  --resource-id $CLUSTER_RESOURCE_ID \
  --identity-type SystemAssigned \
  --namespace $NAMESPACE \
  --no-wait
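Once the attach operation completes, you can confirm that the compute target shows up in the workspace:

# Verify the Kubernetes compute target was attached
az ml compute show --name $NAME \
  --resource-group $RESOURCE_GROUP_NAME \
  --workspace-name $WORKSPACE_NAME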

With Python SDK v2

from azure.ai.ml import MLClient, load_compute
from azure.identity import DefaultAzureCredential

# Authenticated client; subscription_id, resource_group, and workspace hold your workspace coordinates
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)

compute_name = "demo-compute"
aks_resource_id = "<cluster resource id>"

# for an Arc-connected cluster, the resource_id looks like
# '/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Kubernetes/connectedClusters/<CLUSTER_NAME>'
compute_params = [
    {"name": compute_name},
    {"type": "kubernetes"},
    {"resource_id": aks_resource_id},
    {"namespace": "azureml-workloads"},
    {"description": "Demo compute for AML"},
]
k8s_compute = load_compute(source=None, params_override=compute_params)

compute_list = {c.name: c.type for c in ml_client.compute.list()}

if compute_name not in compute_list or compute_list[compute_name] != "kubernetes":
    ml_client.begin_create_or_update(k8s_compute).result()
else:
    print("Compute already exists")

Or, with the Python SDK v1 (azureml.core):

from azureml.core import Workspace
from azureml.core.compute import KubernetesCompute
from azureml.core.compute import ComputeTarget
import os

# Workspace handle (assumes a config.json downloaded from the workspace)
ws = Workspace.from_config()

# choose a name for your Azure Arc-enabled Kubernetes compute
amlarc_compute_name = os.environ.get("AMLARC_COMPUTE_NAME", "demo-compute")
training_namespace = os.environ.get("AMLARC_COMPUTE_NAMESPACE", "azureml-workloads")
cluster_resource_id = os.environ.get("AMLARC_CLUSTER_RESOURCE_ID", "<AKS/ARC CLUSTER RESOURCE ID>")

# resource ID for your Azure Arc-enabled Kubernetes cluster
resource_id = os.environ.get("ARC_CLUSTER_RESOURCE_ID", cluster_resource_id)

if amlarc_compute_name in ws.compute_targets:
    amlarc_compute = ws.compute_targets[amlarc_compute_name]
    if amlarc_compute and type(amlarc_compute) is KubernetesCompute:
        print("found compute target: " + amlarc_compute_name)
else:
    print("creating new compute target...")
    amlarc_attach_configuration = KubernetesCompute.attach_configuration(resource_id, training_namespace, "SystemAssigned")
    amlarc_compute = ComputeTarget.attach(ws, amlarc_compute_name, amlarc_attach_configuration)

amlarc_compute.wait_for_completion(show_output=True)

# For a more detailed view of current KubernetesCompute status, use get_status()
print(amlarc_compute.get_status().serialize())

Through the Azure Portal

Instance Types

Instance types are an Azure Machine Learning concept that allows targeting certain types of compute nodes for training and inference workloads. For an Azure virtual machine, an example of an instance type is STANDARD_D2_V3. Microsoft Learn explains how to create and manage instance types for your computation requirements. (Microsoft Learn)

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: myinstancetypename
spec:
  nodeSelector:
    mylabel: mylabelvalue
  resources:
    limits:
      cpu: "1"
      nvidia.com/gpu: 1
      memory: "2Gi"
    requests:
      cpu: "700m"
      memory: "1500Mi"

Deploy Your First ML Model on AKS

To deploy your first model on AKS, visit these sample projects:

Inference:

Training:

Private AKS & Private ML Workspace:

Understanding Azure ML Kubernetes CRDs

amlarc.azureml.com/v1alpha1/OnlineEndpoint

apiVersion: amlarc.azureml.com/v1alpha1
kind: OnlineEndpoint
metadata:
  name: sklearn-regression-do
  namespace: azureml-arc
spec:
  authKeys:
    primaryKey: <key 1 value>
    secondaryKey: <key 2 value>
  authMode: Key
  computeTarget: /subscriptions/subId/resourceGroups/rgName/workspaces/workspaceName/computes/computeName
  location: centralus
  managedIdentity:
    clientId: <client id>
    type: SystemAssigned
  resourceId: <azure rm resource id>
  trafficRules:
  - deploymentName: blue-sklearn-regression-do
    percent: 100
  workspaceId: <workspace id>
status:
  scoringUri: http://<IP Address>/api/v1/endpoint/sklearn-regression-do/score
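Because these are ordinary custom resources, a Kubernetes architect can inspect what the extension created directly with kubectl (resource names below follow the example above; kubectl api-resources shows the exact names on your cluster):

# Discover the resource names exposed by the amlarc API group
kubectl api-resources --api-group=amlarc.azureml.com

# Inspect the endpoint object in the compute namespace
kubectl get onlineendpoint -n azureml-arc
kubectl describe onlineendpoint sklearn-regression-do -n azureml-arc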

How to create the endpoint resource:

Python SDK V2


from azure.ai.ml import MLClient
from azure.ai.ml.entities import KubernetesOnlineEndpoint
from azure.identity import AzureCliCredential

# Set up the client; subscription_id, resource_group, workspace, and compute_name are assumed to be defined
credential = AzureCliCredential()
ml_client = MLClient(credential, subscription_id, resource_group, workspace)

# endpoint name
online_endpoint_name = "sklearn-regression"

# define the online endpoint on the Kubernetes compute target
endpoint = KubernetesOnlineEndpoint(
    name=online_endpoint_name,
    compute=compute_name,
    description="this is a sample online endpoint",
    auth_mode="key",
    tags={"foo": "bar"},
)

# create the endpoint
ml_client.begin_create_or_update(endpoint).result()

CLI SDK v2

# Sample file
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-uai-endpoint
auth_mode: key
identity:
  type: user_assigned
  user_assigned_identities:
  - resource_id: user_identity_ARM_id_place_holder


az ml online-endpoint create --resource-group
--workspace-name
[--auth-mode]
[--file] # the name of the file in the current directory or a full path
[--local {false, true}]
[--name]
[--no-wait]
[--set]
[--web]

Bicep

https://learn.microsoft.com/en-us/azure/templates/microsoft.machinelearningservices/workspaces/onlineendpoints?pivots=deployment-language-bicep

amlarc.azureml.com/v1alpha1/OnlineDeployment

apiVersion: amlarc.azureml.com/v1alpha1
kind: OnlineDeployment
metadata:
  name: blue-sklearn-regression-do
  namespace: azureml-arc
spec:
  inferenceServerConfiguration:
    servingContainer:
      artifacts:
      - format: manifest
        mountPath: /var/azureml-app
        name: modeldata
        storageUri: <blob storage url>
      command:
      - runsvdir
      - /var/runit
      environmentVariables:
        AML_APP_ROOT: /var/azureml-app/onlinescoring
        AZUREML_ENTRY_SCRIPT: score.py
        AZUREML_MODEL_DIR: <value>
        SERVICE_NAME: sklearn-regression-do
        SERVICE_PATH_PREFIX: api/v1/endpoint/sklearn-regression-do
      image: <acr name>.azurecr.io/azureml/azureml_<image name>
      livenessProbe:
        failureThreshold: 30
        httpMethod: GET
        initialDelaySeconds: 10
        path: /
        periodSeconds: 10
        port: 5001
        scheme: HTTP
        successThreshold: 1
        timeoutSeconds: 2
      maxConcurrentRequestsPerContainer: 1
      name: inference-server
      ports:
      - portNumber: 5001
        protocol: TCP
      readinessProbe:
        failureThreshold: 30
        httpMethod: GET
        initialDelaySeconds: 10
        path: /
        periodSeconds: 10
        port: 5001
        scheme: HTTP
        successThreshold: 1
        timeoutSeconds: 2
      resourceRequests:
        cpu: 100m
        cpuLimit: "2"
        memory: 0.5Gi
        memoryLimit: 8Gi
  maxQueueWaitMs: 500
  resourceId: <azure rm resource id for the deployment>
  scaleSettings:
    maximumInstanceCount: 1
    minimumInstanceCount: 1
    refreshPeriodInSec: 1
    scaleType: Auto
    targetUtilization: 70
status:
  desiredReplicas: 1
  observedGeneration: 1
  upToDateReplicas: 1

Creating Online Deployment:

Python SDK V2

from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    KubernetesOnlineDeployment,
    Model,
    ResourceRequirementsSettings,
    ResourceSettings,
)

# ml_client and online_endpoint_name come from the endpoint example above
model = Model(path="../model-1/model/sklearn_regression_model.pkl")
env = Environment(
    conda_file="../model-1/environment/conda.yaml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
)

blue_deployment = KubernetesOnlineDeployment(
    name="blue",
    endpoint_name=online_endpoint_name,
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="../model-1/onlinescoring", scoring_script="score.py"
    ),
    instance_count=1,
    resources=ResourceRequirementsSettings(
        requests=ResourceSettings(
            cpu="100m",
            memory="0.5Gi",
        ),
    ),
)

ml_client.begin_create_or_update(blue_deployment)

CLI SDK V2


# sample file
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
model:
  path: ../../model-1/model/
code_configuration:
  code: ../../model-1/onlinescoring/
  scoring_script: score_managedidentity.py
environment:
  conda_file: ../../model-1/environment/conda-managedidentity.yaml
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
instance_type: Standard_DS3_v2
instance_count: 1
environment_variables:
  STORAGE_ACCOUNT_NAME: "storage_place_holder"
  STORAGE_CONTAINER_NAME: "container_place_holder"
  FILE_NAME: "file_place_holder"

az ml online-deployment create --file
--resource-group
--workspace-name
[--all-traffic]
[--endpoint-name]
[--local {false, true}]
[--local-enable-gpu {false, true}]
[--name]
[--no-wait]
[--package-model]
[--set]
[--skip-script-validation]
[--vscode-debug {false, true}]
[--web]

CLI SDK Yaml Schema: https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-deployment-managed-online?view=azureml-api-2
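Once the endpoint and a deployment exist, the CLI can also retrieve the auth keys and send a test request. A minimal sketch; sample-request.json is a placeholder payload matching your scoring script:

# Fetch the endpoint keys
az ml online-endpoint get-credentials --name sklearn-regression \
  --resource-group <rg name> --workspace-name <workspace name>

# Send a test request to the blue deployment
az ml online-endpoint invoke --name sklearn-regression \
  --deployment-name blue \
  --request-file sample-request.json \
  --resource-group <rg name> --workspace-name <workspace name>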

amlarc.azureml.com/v1alpha1/Identity

apiVersion: amlarc.azureml.com/v1alpha1
kind: Identity
metadata:
  generation: 1
  name: blue-sklearn-regression
  namespace: sklearn-aks-ns
spec:
  computeID: /subscriptions/subId/resourceGroups/rgName/workspaces/workspaceName/computes/computeName
  identities:   # key name reconstructed; the original nesting was lost in formatting
  - acrServers:
    - <acr name>.azurecr.io
    clientID: <client id>
    primary: true
    resourceID: ""
  serviceAccount: blue-sklearn-regression
  workloadID: <resource id>
  workspaceAPIHost: https://<guid>.workspace.eastus.api.azureml.ms
status:
  acrSecretStatuses:
  - acrSecret:
      acrServer: <acr name>.azurecr.io
      clientID: <client id>
      name: blue-sklearn-regression-<acr name>.azurecr.io
    expiration: "2024-02-20T15:50:17Z"
  conditions:
  - lastTransitionTime: "2024-02-20T12:50:17Z"
    status: "True"
    type: ACRSecretsReady
  - lastTransitionTime: "2024-02-20T12:50:17Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-02-20T12:50:17Z"
    status: "True"
    type: ServiceAccountReady
  - lastTransitionTime: "2024-02-20T12:50:16Z"
    status: "True"
    type: SidecarSecretReady
  observedGeneration: 1
  serviceAccountStatus:
    serviceAccount: blue-sklearn-regression
  sidecarSecretStatus:
    expiration: "2024-02-21T12:50:16Z"
    sidecarSecret: blue-sklearn-regression-sidecar

The identity is automatically created when we create a deployment.

ML Model Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    autoscale.enabled: "true"
    autoscale.max_replicas: "1"
    autoscale.min_replicas: "1"
    autoscale.refresh_period_in_sec: "1"
    autoscale.target_utilization: "70"
    azuremlappname: blue-sklearn-regression-do
    isazuremlapp: "true"
    ml.azure.com/compute: do-aks
    ml.azure.com/deployment-name: blue
    ml.azure.com/endpoint-name: sklearn-regression-do
    ml.azure.com/identity: blue-sklearn-regression-do
    ml.azure.com/resource-group: <rg name>
    ml.azure.com/scrape-metrics: "true"
    ml.azure.com/subscription-id: <sub id>
    ml.azure.com/workspace: jm-ml
  name: blue-sklearn-regression-do
  namespace: azureml-arc
spec:
  minReadySeconds: 10
  progressDeadlineSeconds: 600
  replicas: 1
  selector:
    matchLabels:
      azuremlappname: blue-sklearn-regression-do
      isazuremlapp: "true"
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        azuremlappname: blue-sklearn-regression-do
        isazuremlapp: "true"
        ml.azure.com/compute: do-aks
        ml.azure.com/deployment-name: blue
        ml.azure.com/endpoint-name: sklearn-regression-do
        ml.azure.com/identity: blue-sklearn-regression-do
        ml.azure.com/resource-group: <rg name>
        ml.azure.com/scrape-metrics: "true"
        ml.azure.com/subscription-id: <sub id>
        ml.azure.com/workspace: <workspace name>
    spec:
      automountServiceAccountToken: true
      containers:
      - command:
        - runsvdir
        - /var/runit
        env:
        - name: AML_APP_ROOT
          value: /var/azureml-app/onlinescoring
        - name: AZUREML_ENTRY_SCRIPT
          value: score.py
        - name: AZUREML_MODEL_DIR
          value: /var/azureml-app/azureml-models/<guid>/1
        - name: SERVICE_NAME
          value: sklearn-regression-do
        - name: SERVICE_PATH_PREFIX
          value: api/v1/endpoint/sklearn-regression-do
        image: <acr name>.azurecr.io/azureml/<image repo name>
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 30
          httpGet:
            path: /
            port: 5001
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
        name: inference-server
        ports:
        - containerPort: 5001
          protocol: TCP
        readinessProbe:
          failureThreshold: 30
          httpGet:
            path: /
            port: 5001
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 2
        resources:
          limits:
            cpu: "2"
            memory: 8Gi
          requests:
            cpu: 100m
            memory: 512Mi
        securityContext:
          privileged: false
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/azureml-app
          name: model-mount-0
          readOnly: true
      dnsPolicy: ClusterFirst
      initContainers:
      - env:
        - name: STORAGE_MANIFEST_URL
          value: <url>
        - name: STORAGE_DOWNLOAD_PATH
          value: /var/azureml-app
        - name: STORAGE_CREDENTIAL_CLIENTID
          value: <value>
        image: mcr.microsoft.com/mir/mir-storageinitializer:46571814.1631244300887
        imagePullPolicy: IfNotPresent
        name: storageinitializer-modeldata
        resources:
          limits:
            cpu: 100m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/azureml-app
          name: model-mount-0
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: blue-sklearn-regression-do
      serviceAccountName: blue-sklearn-regression-do
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: ml.azure.com/amlarc
        operator: Equal
        value: "true"
      - key: ml.azure.com/amlarc-workload
        operator: Equal
        value: "true"
      - key: ml.azure.com/resource-group
        operator: Equal
        value: <rg-name>
      - key: ml.azure.com/workspace
        operator: Equal
        value: jm-ml
      - key: ml.azure.com/compute
        operator: Equal
        value: do-aks
      volumes:
      - emptyDir: {}
        name: model-mount-0

Model Kubernetes Service

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2024-02-16T16:10:00Z"
  labels:
    autoscale.enabled: "true"
    autoscale.max_replicas: "1"
    autoscale.min_replicas: "1"
    autoscale.refresh_period_in_sec: "1"
    autoscale.target_utilization: "70"
    azuremlappname: blue-sklearn-regression-do
    isazuremlapp: "true"
    max_concurrent_requests_per_container: "1"
    max_queue_wait_ms: "500"
    ml.azure.com/compute: do-aks
    ml.azure.com/deployment-name: blue
    ml.azure.com/endpoint-name: sklearn-regression-do
    ml.azure.com/identity: blue-sklearn-regression-do
    ml.azure.com/resource-group: <rg-name>
    ml.azure.com/scrape-metrics: "true"
    ml.azure.com/subscription-id: <sub id>
    ml.azure.com/workspace: jm-ml
    request_timeout_ms: "5000"
    routing_algorithm: CoordinatedLeastLoaded
  name: blue-sklearn-regression-do
  namespace: azureml-arc
spec:
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - port: 80
    protocol: TCP
    targetPort: 5001
  selector:
    azuremlappname: blue-sklearn-regression-do
  sessionAffinity: None
  type: ClusterIP

Model Endpoint Kubernetes ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    ABTest: "true"
  name: sklearn-regression-do
  namespace: azureml-arc
data:
  aadAuthEnabled: "false"
  endpoint: sklearn-regression-do
  keyAuthEnabled: "true"
  primaryKey: <key 1 value>
  secondaryKey: <key 2 value>
  versions: '[{"name":"blue-sklearn-regression-do","trafficPercentile":100}]'
  workspace.id: <workspace id>
  workspace.region: centralus
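The Deployment, Service, and pods all carry the ml.azure.com/* labels shown above, so a label selector is enough to find everything that belongs to a given endpoint:

# Find the Kubernetes objects created for the sklearn-regression-do endpoint
kubectl get deploy,svc,pods -n azureml-arc \
  -l ml.azure.com/endpoint-name=sklearn-regression-do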

Environments And Inference Server Containers in Azure Machine Learning Service

By investigating the online deployment pod, we notice that the container that runs our model, also known as the inference server, comes from the container registry we attached to the ML workspace. How did it get created?

Creating the Deployment Inference Server

  • Model and scoring files: When we register the model in the workspace studio, the model files are stored in the workspace storage account. You can find them under Data > Datastores > workspaceblobstore > WebUpload, or simply browse the storage account containers.
  • Environment and base images: When you specify a curated or custom environment, you are specifying which base image to use in the Dockerfile that builds the inference server container. Conda files are used to install dependencies while the image is built. See the Dockerfile sample below.
FROM mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
USER root
RUN mkdir -p $HOME/.cache
WORKDIR /
COPY azureml-environment-setup/99brokenproxy /etc/apt/apt.conf.d/
RUN if dpkg --compare-versions `conda --version | grep -oE '[^ ]+$'` lt 4.4.11; then conda install conda==4.4.11; fi
COPY azureml-environment-setup/mutated_conda_dependencies.yml azureml-environment-setup/mutated_conda_dependencies.yml
RUN ldconfig /usr/local/cuda/lib64/stubs && conda env create -p /azureml-envs/azureml_<guid> -f azureml-environment-setup/mutated_conda_dependencies.yml && rm -rf "$HOME/.cache/pip" && conda clean -aqy && CONDA_ROOT_DIR=$(conda info --root) && rm -rf "$CONDA_ROOT_DIR/pkgs" && find "$CONDA_ROOT_DIR" -type d -name __pycache__ -exec rm -rf {} + && ldconfig
ENV PATH /azureml-envs/azureml_<guid>/bin:$PATH
COPY azureml-environment-setup/send_conda_dependencies.py azureml-environment-setup/send_conda_dependencies.py
RUN echo "Copying environment context"
COPY azureml-environment-setup/environment_context.json azureml-environment-setup/environment_context.json
RUN python /azureml-environment-setup/send_conda_dependencies.py -p /azureml-envs/azureml_<guid>
ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/azureml_<guid>
ENV LD_LIBRARY_PATH /azureml-envs/azureml_<guid>/lib:$LD_LIBRARY_PATH
ENV CONDA_DEFAULT_ENV=azureml_<guid> CONDA_PREFIX=/azureml-envs/azureml_<guid>
COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/
RUN if [ $SPARK_HOME ]; then /bin/bash -c '$SPARK_HOME/bin/spark-submit /azureml-environment-setup/spark_cache.py'; fi
RUN rm -rf azureml-environment-setup
ENV AZUREML_ENVIRONMENT_IMAGE True
CMD ["bash"]
  • Once the environment is created, an inference server container is created and pushed to the container registry attached to the ML workspace.

Where do we build the inference server container?

Depending on the network configuration, three scenarios determine where the inference server container is built:

  • ACR Tasks: If the container registry is public, the inference server image is built using ACR Tasks on managed public runners.
  • Private Docker build compute: If the container registry is only accessible through a private endpoint, ACR Tasks won't be able to reach it. You must specify a compute cluster to run the Docker build job with --image-build-compute when creating/updating the workspace (see the sketch after this list). The compute cluster must be on a subnet that has access to the container registry.
  • Serverless compute on the same VNET: The private Docker build compute is being replaced by serverless compute deployed on the same VNET. Read more.
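For the private-ACR case, the build compute is configured on the workspace itself. A sketch with the Azure CLI, using placeholder names:

# Point the workspace at a compute cluster that can reach the private container registry
az ml workspace update --name <workspace name> \
  --resource-group <rg name> \
  --image-build-compute <compute cluster name>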

Storage Initialization and Model files volume

  • ML OnlineEndpoint deployment: When we create an online endpoint deployment, we are asking the ML platform to create a Kubernetes Deployment with an inference server (main container), an identity server (sidecar), and a storage initializer (init container).
  • The storage initializer downloads the model and scoring scripts from the storage account and stores them in an emptyDir volume. The volume is mounted in the inference server at /var/azureml-app.
kubectl exec <pod name> -it -n azureml-arc -- sh

Once inside the pod shell:

# change directory into the model files directory
cd /var/azureml-app/azureml-models/<some string>/1
# once inside, notice the file you uploaded when you created the model

# change directory into the scoring files directory
cd /var/azureml-app/onlinescoring
# once inside, notice the files you uploaded when creating the online deployment
Model and scoring files in the inference server file system

Inference pod startup flow:

  1. When the model and deployment are created, the files are stored in the storage account attached to the workspace.
  2. When the environment is specified (curated or custom), the inference server container is built and pushed to the container registry attached to the workspace.
  3. When the deployment is created, the identity controller injects an identity-server sidecar container into the model deployment to pull the credentials needed to connect to the storage account.
  4. The storageinitializer init container connects to the storage account and pulls the model and scoring files into an emptyDir volume for the inference server to consume.
  5. When the inference pod is scheduled, the inference server container image is pulled from the container registry.
  6. The inference server loads the model files from the file system at /var/azureml-app.
  7. Once the inference server is ready, its traffic is served by the inference router through its service.
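You can watch this flow on the cluster itself, for example by tailing the init container and reviewing the pod events. A sketch using the names from the earlier examples:

# Follow the storage initializer while it downloads model and scoring files
kubectl logs <pod name> -c storageinitializer-modeldata -n azureml-arc

# Review scheduling, image pulls, and probe results for the inference pod
kubectl describe pod <pod name> -n azureml-arc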

Customizing the Inference Server

Resource Allocations and Probes

As we saw earlier, we do not create inference deployments directly through YAML. The ML controller creates the K8s deployment from the ML OnlineDeployment CRD. We configure the resource allocation on the OnlineDeployment Object. This CRD allows you to specify InstanceType, liveness and readiness probes, and resourceRequests.


# onlineDeployment yaml
# other properties skipped

livenessProbe:
  failureThreshold: 30
  httpMethod: GET
  initialDelaySeconds: 10
  path: /
  periodSeconds: 10
  port: 5001
  scheme: HTTP
  successThreshold: 1
  timeoutSeconds: 2
readinessProbe:
  failureThreshold: 30
  httpMethod: GET
  initialDelaySeconds: 10
  path: /
  periodSeconds: 10
  port: 5001
  scheme: HTTP
  successThreshold: 1
  timeoutSeconds: 2
resourceRequests:
  cpu: 100m
  cpuLimit: "2"
  memory: 0.5Gi
  memoryLimit: 8Gi

Taints and Tolerations

Azure ML allows us to specify tolerations for its built-in taints, if they are being used.

    
# onlineDeployment yaml
# other properties skipped
tolerations:
- key: ml.azure.com/amlarc
  operator: Equal
  value: "true"
- key: ml.azure.com/amlarc-workload
  operator: Equal
  value: "true"
- key: ml.azure.com/resource-group
  operator: Equal
  value: <rg-name>
- key: ml.azure.com/workspace
  operator: Equal
  value: jm-ml
- key: ml.azure.com/compute
  operator: Equal
  value: do-aks
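If you dedicate an AKS node pool to Azure ML workloads, you would typically taint it and let these built-in tolerations do the matching. A sketch with the Azure CLI; the pool name and the NoSchedule effect are my assumptions, so check the Azure ML docs for the recommended taint:

# Add a dedicated, tainted node pool for Azure ML workloads (taint effect assumed)
az aks nodepool add --cluster-name <cluster name> \
  --resource-group <rg name> \
  --name amlpool \
  --node-count 2 \
  --node-taints "ml.azure.com/amlarc=true:NoSchedule"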

Scaling The Inference Server

The OnlineDeployment CRD allows us to specify scaleSettings for the inference pods. However, we must not enable autoscaling through the Horizontal Pod Autoscaler (HPA) or KEDA, because the inference router discussed earlier takes care of scaling the model inference servers based on incoming traffic. Read more on this link:


# onlineDeployment yaml
# other properties skipped
scaleSettings:
  maximumInstanceCount: 1
  minimumInstanceCount: 1
  refreshPeriodInSec: 1
  scaleType: Auto
  targetUtilization: 70
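Since the inference router owns scaling, it is also worth confirming that nothing else, such as an HPA or a KEDA ScaledObject, is targeting the model deployments. A quick check:

# Confirm no other autoscaler is acting on the compute namespace
kubectl get hpa -n azureml-arc
kubectl get scaledobjects -n azureml-arc 2>/dev/null || echo "KEDA is not installed"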
