Orchestrating an Azure Machine Learning AutoML pipeline (V1) using Data Factory

Divye Sheth
4 min read · Feb 7, 2023


Azure Machine Learning (AML) empowers data scientists and developers to build, deploy, and manage high-quality models faster and with confidence. It accelerates time to value with industry-leading machine learning operations (MLOps), open-source interoperability, and integrated tools.

Azure Data Factory (ADF) is Azure’s cloud ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management.

Machine learning training/inference is usually done after the data has been cleansed/enriched by ETL pipelines. In this article, we’ll explore how to orchestrate an AML training pipeline using ADF. We will focus only on the integration of ADF with AML and will not go into the ETL side. An example flow looks like this:

Azure machine learning flow
  • Azure Storage is used for storing data coming in from various systems
  • Azure Databricks is used for data processing/engineering
  • Azure Cosmos DB for serving the data
  • Azure Machine Learning for model training
  • Azure Kubernetes service for model serving or inferencing

We are going to focus on how to call the AML training pipeline from ADF.

Prerequisites:

  • Azure subscription
  • Resource group — create a RG
  • Azure Machine Learning
  • Azure Data Factory
  • Azure Databricks (optional, this is for data engineering/preparation. Out of scope for this article)

Setup:

For Data Factory to be able to trigger AML pipelines, you need to grant ADF’s managed identity access to AML. The steps below show you how:

  • Navigate to your AML service and in the left-hand pane select Access control (IAM)
  • Click on Add -> Role Assignment
Add role assignment
  • Search for machine learning in the search bar, select the AzureML Data Scientist role, and click Next
Role selection
  • Under “Assign access to”, select Managed Identity
  • Select your subscription in the form that opens
  • Under managed identity, select Data Factory (V2)
  • Select your Data Factory name and press Select
  • Click Review + assign
Assign ADF access to AML
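The role assignment above can also be scripted. A minimal sketch that builds the equivalent Azure CLI command; the principal id and resource id are placeholders you would look up for your own ADF and AML instances:

```python
# Sketch: build the Azure CLI command that grants ADF's managed identity
# the AzureML Data Scientist role on the AML workspace. The two ids are
# placeholders; running the command needs a logged-in az CLI.
def role_assignment_cmd(adf_principal_id, aml_workspace_resource_id):
    return [
        "az", "role", "assignment", "create",
        "--assignee", adf_principal_id,           # ADF managed identity object id
        "--role", "AzureML Data Scientist",       # role selected in the portal steps
        "--scope", aml_workspace_resource_id,     # AML workspace resource id
    ]

cmd = role_assignment_cmd("<adf-principal-id>", "<aml-workspace-resource-id>")
print(" ".join(cmd))
```

You could execute it with `subprocess.run(cmd, check=True)` once the placeholders are filled in; the portal route above achieves the same result.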

Creating Azure Machine Learning Pipeline:

The second step is to create a training pipeline. Use this git repo to get the code into your AML workspace and execute it. The notebook uses the Hierarchical Timeseries forecasting capabilities of AML V1. Pay particular attention to step 2 in the notebook: you need to upload the data to the BLOB store. We could have done this automatically with code, reading from the data folder and uploading to BLOB, but the idea here is that the data would be prepared by an ETL tool, which would then be called by ADF.
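For completeness, the automated alternative mentioned above could look like the sketch below. `files_to_upload` is a helper of mine, and the commented-out upload call assumes the AML v1 SDK (azureml-core) and a config.json in the working directory:

```python
# Sketch: collect the prepared CSVs from ./data and push them to the
# hts-sample folder on the workspace's default blob datastore.
from pathlib import Path

def files_to_upload(data_dir="data"):
    """List the CSV files sitting under the local data folder."""
    return sorted(p.name for p in Path(data_dir).glob("*.csv"))

# With the AML v1 SDK (assumption: azureml-core installed, config.json present):
#   from azureml.core import Workspace
#   ds = Workspace.from_config().get_default_datastore()
#   ds.upload(src_dir="data", target_path="hts-sample", overwrite=True)
```

In this article, though, the upload stays a manual step so the data-preparation side remains the ETL tool’s job.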

Note: you will need a compute instance to execute the notebook (instructions here)

  • Navigate to the AML service you created
  • Click on Launch Studio
  • In the left-hand pane, click on Notebooks under Authoring
  • In the file explorer pane, click on Terminal (you will be asked to create a compute instance if you haven’t already)
Terminal in AML
git clone https://github.com/divssheth/forecasting-hierarchical-timeseries.git
  • Upload the data to the BLOB store (in my example, I’ve uploaded it to the hts-sample folder)
  • Execute the notebook and make a note of the Id in the output of the last command, i.e. pipeline.publish()
Id column of pipeline publish
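Since that Id gets pasted into ADF by hand, a quick sanity check that it is a well-formed GUID can catch copy mistakes. The helper below is mine; the example GUID is the one that appears later in the ADF JSON:

```python
# Sketch: verify a published-pipeline Id is a well-formed GUID before
# wiring it into the Data Factory activity.
import uuid

def is_valid_pipeline_id(pipeline_id: str) -> bool:
    try:
        return str(uuid.UUID(pipeline_id)) == pipeline_id.lower()
    except ValueError:
        return False

print(is_valid_pipeline_id("88e2ff33-14e9-463e-893c-2deed4652e66"))  # True
```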

Create Data Factory Pipeline:

Setup Linked Service

In a new tab, navigate to Data Factory and click on Launch Studio.

  • Click on the Manage icon in the left pane (the little briefcase with a spanner)
  • Select Linked Services
  • Click New or Create Linked Service
  • Select Compute, then Azure Machine Learning, and click Continue
Azure Machine Learning Linked Service
  • Select your Subscription and Machine Learning workspace and click on Test Connection
  • Upon success, click on Create
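For reference, the linked service the UI creates can also be expressed as JSON. Below is a sketch built with placeholder values; the property names follow the ADF AzureMLService linked service schema as I understand it, and `aml_linked_service` is a helper of mine, not an ADF API:

```python
# Sketch: the linked service definition the UI steps above produce,
# authenticating with ADF's managed identity. All values are placeholders.
import json

def aml_linked_service(name, subscription_id, resource_group, workspace):
    return {
        "name": name,
        "properties": {
            "type": "AzureMLService",
            "typeProperties": {
                "subscriptionId": subscription_id,
                "resourceGroupName": resource_group,
                "mlWorkspaceName": workspace,
            },
        },
    }

print(json.dumps(
    aml_linked_service("AzureMLService1", "<subscription-id>",
                       "<resource-group>", "<aml-workspace>"),
    indent=2))
```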

Create Pipeline

  • In the Data Factory studio, click on Create pipeline
  • In the top-right corner, you will see the JSON editor next to the Properties button
Data factory pipeline
  • Paste the JSON below into the editor
{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Machine Learning Execute Pipeline1",
                "type": "AzureMLExecutePipeline",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "mlPipelineId": "88e2ff33-14e9-463e-893c-2deed4652e66"
                },
                "linkedServiceName": {
                    "referenceName": "AzureMLService1",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "annotations": []
    }
}
  • Replace “name” with your pipeline name
  • Replace “referenceName” in linkedServiceName with the name you gave the AML linked service
  • Replace “mlPipelineId” with the Id output from the AML pipeline
  • Click on Publish All

You can now schedule/trigger the pipeline.
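Besides the studio triggers, a run can be started through the Data Factory REST API by POSTing to the ARM createRun endpoint. A sketch of the URL, with placeholders for your own names (2018-06-01 is the Data Factory REST API version):

```python
# Sketch: the ARM endpoint that starts a Data Factory pipeline run.
# POST to this URL with an Azure AD bearer token to trigger the pipeline.
def create_run_url(subscription_id, resource_group, factory, pipeline):
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.DataFactory"
        f"/factories/{factory}"
        f"/pipelines/{pipeline}/createRun"
        "?api-version=2018-06-01"
    )

print(create_run_url("<sub-id>", "<rg>", "<factory>", "pipeline1"))
```

The azure-mgmt-datafactory SDK’s `pipelines.create_run` wraps this same call if you would rather not handle tokens yourself.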

I’ve added a test.ipynb file in my git repo; follow the same steps as above to run the inference pipeline.

References:

Code based on — https://github.com/Azure/azureml-examples/tree/main/v1/python-sdk/tutorials/automl-with-azureml/forecasting-hierarchical-timeseries

Architecture diagram from — https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/ai/real-time-recommendation
