Vulnerability Remediation for Azure Machine Learning Computes

Henkel Data & Analytics
Henkel Data & Analytics Blog
8 min read · Dec 19, 2023


By Gajanana Hegde.

Vulnerability management plays a pivotal role in securing an organization’s IT infrastructure. Effective vulnerability management involves cyber threat identification, prioritization, remediation, and reporting. In this article, we discuss how we address the remediation of vulnerabilities in Azure Machine Learning computes. We explain how we use centrally-managed infrastructure automation to provide a secure space for Data Scientists to experiment and develop machine learning models.

Azure Machine Learning (AML) is a machine-learning-as-a-service (MLaaS) platform that allows Data Scientists and Machine Learning Engineers to build, deploy, and manage machine learning models. In our data science team, we use AML to implement MLOps best practices and bring ML models into production. An AML service involves different components, including Azure Storage Account, Azure Container Registry, Azure Key Vault, Azure Application Insights, and compute resources. In AML, there are two Microsoft-managed compute resources: AML compute clusters and compute instances. A compute instance is a dedicated single-node workstation for a Data Scientist to run notebooks and experiments to train ML models. An AML compute cluster can have one or more compute nodes that can autoscale and be used in production. Unlike compute instances, compute clusters can be shared with other Data Scientists in the AML workspace.

In this blog article, we discuss the remediation of vulnerabilities in AML compute instances. We use Microsoft Defender for Cloud to continuously monitor and identify potential security vulnerabilities in Azure resources. One of the most frequent security vulnerabilities is:

Azure Machine Learning compute instances should be recreated to get the latest software updates.

Every month, Microsoft releases new VM images. Once deployed, an AML compute instance does not receive active updates, nor is it recreated automatically. As we have many compute instances, automation is required to get the latest software updates and security patches. Note that a compute cluster automatically upgrades to the latest VM image if its minimum node count is set to zero and no jobs are in the running state.

Creation of AML computes with the Python SDK

Data Scientists can create AML computes (both compute instances and compute clusters) in three different ways: the AML studio, the AML SDK, or the Azure CLI. To ensure computes are created with the required configuration, i.e., attached to a virtual network and with local authentication disabled, we automate the creation of computes using the AML SDK. This automation ensures a secure and compliant setup of computes, while at the same time offering Data Scientists a simple way to create their computes by hiding the technical complexity.

As part of the self-service tool to provision infrastructure, we use an Azure DevOps pipeline to create AML computes for a specific environment. The Azure DevOps pipeline runs a Python script written with the AML SDK v2. Running the pipeline requires an Azure DevOps service connection with a service principal that has access to the Microsoft Graph API and the Azure Resource Manager. Further, the pipeline takes the following parameters:

parameters:
  - name: computeType
    type: string
    values:
      - "instance"
      - "cluster"
    displayName: AML compute type
  - name: environment
    type: string
    values:
      - "dev"
      - "qua"
    displayName: Test environment
  - name: userId
    type: string
    default: "userId"
    displayName: User ID
  - name: userEmail
    type: string
    default: ""
    displayName: User email address
  - name: computeSize
    type: string
    default: ""
    displayName: Size of the compute, e.g. Standard_DS11_v2, STANDARD_DS3_V2

The computeType parameter allows Data Scientists to choose between a compute instance and a compute cluster for training a machine learning model or deploying the model to an endpoint. To ensure a consistent naming convention, the pipeline parameters environment and userId as well as the last four characters of computeSize become part of the compute name. Note that compute instance names must be unique across the Azure region. Including the last four characters of computeSize also allows Data Scientists to create multiple computes of different sizes without name collisions.
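As an illustration, a minimal sketch of such a naming scheme could look as follows; the helper function and the exact format are hypothetical and only meant to convey the idea:

def build_compute_name(environment: str, user_id: str, compute_size: str) -> str:
    # e.g. ("dev", "jdoe", "Standard_DS3_v2") -> "dev-jdoe-3-v2" (illustrative format)
    suffix = compute_size[-4:].replace("_", "-")
    return f"{environment}-{user_id}-{suffix}".lower()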

As of today, the SDK v2 only allows assigning a compute instance to a specific user via the user's object ID. Therefore, we resolve the object ID from the supplied userEmail parameter using the Microsoft Graph API.

import requests

from azure.ai.ml import MLClient
from azure.identity import ClientSecretCredential

# Authenticate with the service principal of the Azure DevOps service connection.
credential = ClientSecretCredential(
    tenant_id=tenant_id, client_id=client_id, client_secret=client_secret)
ml_client = MLClient(credential=credential,
                     subscription_id=subscription_id,
                     resource_group_name=resource_group_name,
                     workspace_name=workspace_name)

# Resolve the user object ID from the email address via the Microsoft Graph API.
graph_token = credential.get_token("https://graph.microsoft.com/.default").token
url = f"https://graph.microsoft.com/v1.0/users/{user_email}"
header = {'Authorization': f'Bearer {graph_token}'}
response = requests.get(url, headers=header)
result = response.json()
user_object_id = result['id']

Compute creation with the SDK v2 currently works only when the virtual network is in the same resource group as the AML workspace. One of our security requirements is to keep the virtual network in a different resource group. Hence, we have modified the ComputeInstance and AmlCompute classes of the SDK to accept a virtual network configuration even if the virtual network resides in a different resource group. Further, we configure compute instances to shut down automatically after 120 minutes of inactivity.

In the case of AML clusters, we set the minimum node count to zero so the cluster receives the latest VM image and security patches automatically. This configuration also allows AML to dynamically de-allocate nodes when they aren't in use; a sketch of the cluster creation follows the compute instance example below.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import AssignedUserConfiguration, ComputeInstance
from azure.identity import ClientSecretCredential

credential = ClientSecretCredential(
    tenant_id=tenant_id, client_id=client_id, client_secret=client_secret)
ml_client = MLClient(credential=credential,
                     subscription_id=subscription_id,
                     resource_group_name=resource_group_name,
                     workspace_name=workspace_name)

# Assign the compute instance to the requesting Data Scientist.
ci_user = AssignedUserConfiguration(
    user_tenant_id=tenant_id,
    user_object_id=user_object_id)

# network_settings holds the virtual network configuration; rg_vnet is a parameter
# of our modified ComputeInstance class and contains the vnet's resource group.
compute_instance = ComputeInstance(name=compute_name,
                                   size=compute_size,
                                   create_on_behalf_of=ci_user,
                                   network_settings=network_settings,
                                   rg_vnet=rg_vnet,
                                   ssh_public_access_enabled=False,
                                   enable_node_public_ip=False,
                                   idle_time_before_shutdown_minutes=120)
ml_client.begin_create_or_update(compute_instance).result()
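For compute clusters, the call looks similar. The following is a minimal sketch using the standard SDK v2 AmlCompute class with illustrative parameter values (our modified class additionally accepts the virtual network's resource group); the key setting is min_instances=0, which lets the cluster scale to zero so that new nodes always start from the latest VM image:

from azure.ai.ml.entities import AmlCompute, NetworkSettings

# Sketch only: a cluster that scales down to zero nodes when idle.
network_settings = NetworkSettings(vnet_name=vnet_name, subnet=subnet_name)
compute_cluster = AmlCompute(name=compute_name,
                             size=compute_size,
                             min_instances=0,   # scale to zero to pick up new VM images
                             max_instances=4,   # illustrative upper bound
                             idle_time_before_scale_down=1800,  # seconds of idle time
                             network_settings=network_settings,
                             enable_node_public_ip=False)
ml_client.begin_create_or_update(compute_cluster).result()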

An overview of compute creation using the Azure DevOps pipeline is depicted in the diagram below. With our automation, we provide an easy, secure, and streamlined way of creating compute instances or clusters with the necessary computing capacity for Data Scientists.

Creation of AML computes via the Azure DevOps pipeline (compute instance creation process flow)

Re-creation of AML compute instances to get the latest software updates

Microsoft recommends recreating compute instances to get the latest software updates onto their underlying VMs. Manually re-creating compute instances could introduce errors or simply be missed, compromising the security of our computes. As part of our MLOps practices, we aim to streamline this process and avoid manual re-creation. Thus, we decided to automate the re-creation of AML compute instances and, with that, provide Data Scientists with a safe AML workspace to operate in. We implement the re-creation of compute instances as a two-step process consisting of the deletion of vulnerable resources and the creation of new ones.

An overview of the scheduled automation of compute instance re-creation is shown in the diagram below. We utilize the Azure Resource Graph and Azure DevOps for the automation.

Overview of the re-creation of AML compute instances (compute instance re-creation process flow)

We use an Azure DevOps pipeline that runs nightly to automate the re-creation of compute instances that have security vulnerabilities. This Azure DevOps pipeline runs a Python script using the AML SDK v2. We query the Azure Resource Graph per subscription to get the list of vulnerable computes across all our Azure subscriptions. The result of the query is a list of Python dictionaries containing the compute name, subscription ID, resource group, and workspace name. The script then uses these details as input parameters for an Azure Resource Manager API request. The API response contains the properties of the compute instance, such as its network settings and assigned user configuration. The script also checks whether the compute instance is stopped or still running; only compute instances in the stopped state are finally re-created.

import requests
import azure.mgmt.resourcegraph as arg

# Resource Graph query for compute instances flagged by Microsoft Defender for Cloud.
query = '''SecurityResources
| where type == 'microsoft.security/assessments'
| where properties.displayName contains 'Azure Machine Learning compute instances should be recreated to get the latest software updates'
| extend displayName = properties.displayName
| extend resourceName = properties.resourceDetails.ResourceName
| extend status = properties.status.code
| extend resourceGroup = resourceGroup
| where status == "Unhealthy"'''

# Create the Azure Resource Graph client and set options
# (credential is the ClientSecretCredential created earlier).
arg_client = arg.ResourceGraphClient(credential)
arg_query_option = arg.models.QueryRequestOptions(result_format="objectArray")

# Run the query across all subscriptions in subs_list.
arg_query = arg.models.QueryRequest(subscriptions=subs_list, query=query, options=arg_query_option)
arg_results = arg_client.resources(arg_query)

if len(arg_results.data):
    token = credential.get_token("https://management.azure.com/.default").token
    querystring = {"api-version": "2023-04-01-preview"}
    headers = {'Authorization': f'Bearer {token}', 'Content-Type': 'application/json'}
    for resource in arg_results.data:
        # subscription_id, resource_group_name, workspace_name, and compute_name
        # are derived from each query result.
        url = f"https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{resource_group_name}/providers/Microsoft.MachineLearningServices/workspaces/{workspace_name}/computes/{compute_name}"
        response = requests.request("GET", url, params=querystring, headers=headers)
        data = response.json()
        # Only re-create compute instances that are in the stopped state.
        if data['properties']['properties']['state'] == "Stopped":
            compute_recreate(workspace_name, resource_group_name, subscription_id, compute_name, data)
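The compute_recreate step itself is not shown above. Below is a minimal sketch of what it could look like, reusing the querystring and headers from the surrounding script and the data captured from the GET request; the exact payload handling in our implementation may differ:

def compute_recreate(workspace_name, resource_group_name, subscription_id,
                     compute_name, data):
    # Sketch only: delete the vulnerable compute instance and recreate it
    # with the location and properties captured from the ARM GET response.
    url = (f"https://management.azure.com/subscriptions/{subscription_id}"
           f"/resourceGroups/{resource_group_name}"
           f"/providers/Microsoft.MachineLearningServices"
           f"/workspaces/{workspace_name}/computes/{compute_name}")
    # The ARM delete call for AML computes also expects an underlyingResourceAction parameter.
    delete_params = {**querystring, "underlyingResourceAction": "Delete"}
    requests.request("DELETE", url, params=delete_params, headers=headers)
    # ... wait until the delete operation has completed ...
    # Recreate the compute with the same configuration; read-only fields
    # (e.g. state, createdBy) may need to be stripped from the properties first.
    body = {"location": data["location"], "properties": data["properties"]}
    requests.request("PUT", url, params=querystring, headers=headers, json=body)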

Known limitations

Data and installed libraries that are stored on the OS or temporary disks of the compute instance are lost upon re-creation. Aside from libraries, the contents most frequently stored on a compute instance's OS disk are SSH keys, scripts or notebooks, and personal Git configurations such as emails, passwords, or tokens.

Data Scientists need to set up their SSH keys again after re-creation of their compute instances, and they need to re-enter their Git credentials. To mitigate these limitations, one can use setup scripts for computes. In our case, however, the setup scripts cannot be included in the automation as they involve user credentials.

As files (e.g., notebooks) shouldn't be stored on the compute, the best practice is to keep them in Git. Data Scientists can also use the User folder, which persists files in the AML workspace. Python environments are lost as well, but they can simply be recreated from a conda.yml or requirements.txt file. We prefer, however, to register and update Python environments as AML environments as part of our MLOps practices, which also ensures reproducibility.
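For illustration, registering such an environment with the SDK v2 could look roughly like the following sketch; the environment name, conda file, and base image are placeholders:

from azure.ai.ml.entities import Environment

# Sketch only: register a reusable AML environment from a conda specification.
env = Environment(
    name="training-env",      # placeholder name
    conda_file="conda.yml",   # conda specification kept in Git
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",  # placeholder base image
)
ml_client.environments.create_or_update(env)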

Conclusion

One can leverage the Azure Resource Manager and the Microsoft Graph API for automation, as they allow querying, creating, deleting, and updating various Microsoft services. In this article, we have shown how vulnerability remediation for AML computes can be automated using Azure DevOps, the Microsoft Graph API, and the Azure Resource Manager.

Vulnerability management helps to improve an organization's overall cybersecurity posture. Ensuring security in machine learning systems is an integral part of MLOps best practices. With our MLOps practices, we constantly strive to provide a secure environment for ML development and operations.

Whether shampoo, detergent, or industrial adhesive — Henkel stands for strong brands, innovations, and technologies. In our data science, engineering, and analytics teams we solve modern data challenges for the benefit of our customers. Learn more at henkel.com/digitalization.

