Orchestrating an Azure Machine Learning AutoML pipeline (V1) using Data Factory

Divye Sheth
4 min read · Feb 7, 2023


Azure Machine Learning (AML) empowers data scientists and developers to build, deploy, and manage high-quality models faster and with confidence. It accelerates time to value with industry-leading machine learning operations (MLOps), open-source interoperability, and integrated tools.

Azure Data Factory (ADF) is Azure’s cloud ETL service for scale-out serverless data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management.

Machine learning training/inference is usually done after the data has been cleansed/enriched by ETL pipelines. In this article, we’ll explore how to orchestrate an AML training pipeline using ADF. We will focus only on the integration of ADF with AML and will not go into the ETL side. An example flow looks like this:

Azure machine learning flow
  • Azure Storage is used for storing data coming in from various systems
  • Azure Databricks is used for data processing/engineering
  • Azure Cosmos DB for serving the data
  • Azure Machine Learning for model training
  • Azure Kubernetes service for model serving or inferencing

We are going to focus on how to call the AML training pipeline from ADF.

Prerequisites:

  • Azure subscription
  • Resource group — create a RG
  • Azure Machine Learning
  • Azure Data Factory
  • Azure Databricks (optional, this is for data engineering/preparation. Out of scope for this article)

Setup:

For Data Factory to be able to trigger AML pipelines, you need to grant ADF’s managed identity access to AML. The steps below show you how:

  • Navigate to your AML service and in the left-hand pane select Access control (IAM)
  • Click on Add -> Role Assignment
Add role assignment
  • Search for machine learning in the search bar, select the AzureML Data Scientist role, and click Next
Role selection
  • Under “Assign access to”, select Managed Identity
  • Select your subscription in the form that opens
  • Under managed identity, select Data Factory (V2)
  • Select your Data Factory name and press Select
  • Click Review + assign
Assign ADF access to AML
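The role assignment above can also be scripted. A minimal sketch that builds the equivalent Azure CLI command; the principal id and resource id are placeholders you would look up for your own ADF and AML instances:

```python
# Sketch: build the Azure CLI command that grants ADF's managed identity
# the AzureML Data Scientist role on the AML workspace. The two ids are
# placeholders; running the command needs a logged-in az CLI.
def role_assignment_cmd(adf_principal_id, aml_workspace_resource_id):
    return [
        "az", "role", "assignment", "create",
        "--assignee", adf_principal_id,           # ADF managed identity object id
        "--role", "AzureML Data Scientist",       # role selected in the portal steps
        "--scope", aml_workspace_resource_id,     # AML workspace resource id
    ]

cmd = role_assignment_cmd("<adf-principal-id>", "<aml-workspace-resource-id>")
print(" ".join(cmd))
```

You could execute it with `subprocess.run(cmd, check=True)` once the placeholders are filled in; the portal route above achieves the same result.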

Creating Azure Machine Learning Pipeline:

The second step is to create a training pipeline. Use this git repo to get the code into your AML workspace and execute it. The notebook uses the Hierarchical Timeseries forecasting capabilities of AML V1. Pay particular attention to step 2 in the notebook: you need to upload the data to the BLOB store. We could have done this automatically with code, reading from the data folder and uploading to BLOB, but the idea here is that the data would be prepared by an ETL tool, which would then be called by ADF.
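For completeness, the automated alternative mentioned above could look like the sketch below. `files_to_upload` is a helper of mine, and the commented-out upload call assumes the AML v1 SDK (azureml-core) and a config.json in the working directory:

```python
# Sketch: collect the prepared CSVs from ./data and push them to the
# hts-sample folder on the workspace's default blob datastore.
from pathlib import Path

def files_to_upload(data_dir="data"):
    """List the CSV files sitting under the local data folder."""
    return sorted(p.name for p in Path(data_dir).glob("*.csv"))

# With the AML v1 SDK (assumption: azureml-core installed, config.json present):
#   from azureml.core import Workspace
#   ds = Workspace.from_config().get_default_datastore()
#   ds.upload(src_dir="data", target_path="hts-sample", overwrite=True)
```

In this article, though, the upload stays a manual step so the data-preparation side remains the ETL tool’s job.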

Note: you will need a compute instance to execute the notebook (instructions here)

  • Navigate to the AML service you created
  • Click on Launch Studio
  • In the left-hand pane, click on Notebooks under Authoring
  • In the file explorer pane, click on Terminal (you will be asked to create a compute instance if you haven’t already)
Terminal in AML
git clone https://github.com/divssheth/forecasting-hierarchical-timeseries.git
  • Upload the data to the BLOB store (in my example, I’ve uploaded it to the hts-sample folder)
  • Execute the notebook and make a note of the Id in the output of the last command, i.e. pipeline.publish()
Id column of pipeline publish
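Since that Id gets pasted into ADF by hand, a quick sanity check that it is a well-formed GUID can catch copy mistakes. The helper below is mine; the example GUID is the one that appears later in the ADF JSON:

```python
# Sketch: verify a published-pipeline Id is a well-formed GUID before
# wiring it into the Data Factory activity.
import uuid

def is_valid_pipeline_id(pipeline_id: str) -> bool:
    try:
        return str(uuid.UUID(pipeline_id)) == pipeline_id.lower()
    except ValueError:
        return False

print(is_valid_pipeline_id("88e2ff33-14e9-463e-893c-2deed4652e66"))  # True
```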

Create Data Factory Pipeline:

Setup Linked Service

In a new tab, navigate to Data Factory and click on Launch Studio.

  • Click on the Manage icon in the left pane (the little briefcase with a spanner)
  • Select Linked Services
  • Click New or Create Linked Service
  • Select Compute, then Azure Machine Learning, and click Continue
Azure Machine Learning Linked Service
  • Select your Subscription and Machine Learning workspace and click on Test Connection
  • Upon success, click on Create
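For reference, the linked service the UI creates can also be expressed as JSON. Below is a sketch built with placeholder values; the property names follow the ADF AzureMLService linked service schema as I understand it, and `aml_linked_service` is a helper of mine, not an ADF API:

```python
# Sketch: the linked service definition the UI steps above produce,
# authenticating with ADF's managed identity. All values are placeholders.
import json

def aml_linked_service(name, subscription_id, resource_group, workspace):
    return {
        "name": name,
        "properties": {
            "type": "AzureMLService",
            "typeProperties": {
                "subscriptionId": subscription_id,
                "resourceGroupName": resource_group,
                "mlWorkspaceName": workspace,
            },
        },
    }

print(json.dumps(
    aml_linked_service("AzureMLService1", "<subscription-id>",
                       "<resource-group>", "<aml-workspace>"),
    indent=2))
```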

Create Pipeline

  • In the Data Factory studio, click on Create pipeline
  • In the top-right corner, you will see the JSON editor next to the Properties button
Data factory pipeline
  • Paste the JSON below into the editor
{
    "name": "pipeline1",
    "properties": {
        "activities": [
            {
                "name": "Machine Learning Execute Pipeline1",
                "type": "AzureMLExecutePipeline",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "mlPipelineId": "88e2ff33-14e9-463e-893c-2deed4652e66"
                },
                "linkedServiceName": {
                    "referenceName": "AzureMLService1",
                    "type": "LinkedServiceReference"
                }
            }
        ],
        "annotations": []
    }
}
  • Replace “name” with your pipeline name
  • Replace “referenceName” in linkedServiceName with the name you gave the AML linked service
  • Replace “mlPipelineId” with the Id output from the AML pipeline
  • Click on Publish All

You can now schedule/trigger the pipeline.
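Besides the studio triggers, a run can be started through the Data Factory REST API by POSTing to the ARM createRun endpoint. A sketch of the URL, with placeholders for your own names (2018-06-01 is the Data Factory REST API version):

```python
# Sketch: the ARM endpoint that starts a Data Factory pipeline run.
# POST to this URL with an Azure AD bearer token to trigger the pipeline.
def create_run_url(subscription_id, resource_group, factory, pipeline):
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.DataFactory"
        f"/factories/{factory}"
        f"/pipelines/{pipeline}/createRun"
        "?api-version=2018-06-01"
    )

print(create_run_url("<sub-id>", "<rg>", "<factory>", "pipeline1"))
```

The azure-mgmt-datafactory SDK’s `pipelines.create_run` wraps this same call if you would rather not handle tokens yourself.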

I’ve added a test.ipynb file in my git repo; follow the same steps as above to run the inference pipeline.

References:

Code based on — https://github.com/Azure/azureml-examples/tree/main/v1/python-sdk/tutorials/automl-with-azureml/forecasting-hierarchical-timeseries

Architecture diagram from — https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/ai/real-time-recommendation
