
How to use Azure ML Studio: An Eye-Opening Model Training Tutorial for Beginners from Henkel’s Data & Analytics Experts

Henkel Data & Analytics
Henkel Data & Analytics Blog
13 min read · Feb 27, 2024


By Florian Roscheck.

In this article, we use code examples and simple explanations to dive into the features of Azure Machine Learning. The cloud-based machine learning platform is an efficiency booster for our data scientists at Henkel who use it for rapid experimentation and scaling of data science workloads. If you are looking to get an overview of how Azure ML works and how you can get started, then read on. We will be training a simple machine learning model on the all-time classic Titanic dataset while using some of the most powerful features of Azure ML Studio.

To understand this tutorial, you should have basic data science knowledge. If you would like to run the code in this article, you also need an Azure ML workspace, which you can create within an Azure subscription.

Overview

This tutorial will give you insights into using the Compute, Data, Jobs, and Models components of Azure ML. The diagram below gives an overview of how we will make these components work together. Using a Compute, we will download and create a data asset and a training script. In combination with an Environment, we will then run a training job on a Compute Cluster. As the result of the training job, we will receive a model and training metrics which we will then inspect. Feel free to use the diagram as a reference as you follow along with the explanations and code below.

A diagram showing how a compute instance creates data asset and training script. Together with the environment, these are fed into a job running on a compute cluster. The outputs of the job are a model and metrics.
Overview of the components used in this tutorial and their interactions

Compute

To interact with Azure ML, you need some kind of computer. It can be your laptop. But, most probably, it is a computer in the cloud. This is known as Compute in Azure ML. Some computes in Azure ML are intended to be interactive. For example, you can log in and use Jupyter on them. This is great for developing code for training a machine learning model. These computes are called Compute Instances. In Azure ML, Compute Instances come with predefined conda environments that help you get started with data science quickly. Compute Instances are like personal “data science computers” in the cloud.

But when you train a model, you might need powerful and expensive computers. You don’t want to leave these computers running without purpose and they should shut down whenever model training has completed or in case of an error. For model training in Azure ML, you should use a component called Compute Cluster. Compute clusters are collections of computers in the cloud that are tightly controlled by Azure ML. Azure ML starts and stops computers in a compute cluster as required for model training so that you only pay for them when you need them.

As we go through this article, we will be executing code in a Jupyter Notebook on a Compute Instance. For model training, we have defined a Compute Cluster. You can set up both in your own Azure ML workspace following Microsoft’s guide to create a Compute Instance and guide to create a Compute Cluster.
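If you prefer to set up the Compute Cluster from code instead of the Studio user interface, below is a minimal sketch using the Azure ML Python SDK v2. It assumes the ml_client connection we establish in the Data section further down; the VM size and scaling limits are assumptions you should adapt to your workspace and budget.

# A sketch: define a small CPU Compute Cluster (VM size and limits are assumptions)
from azure.ai.ml.entities import AmlCompute

cluster = AmlCompute(
    name="all-cpu-small",             # the cluster name we reference later in the training command
    size="STANDARD_DS3_V2",           # assumed VM size
    min_instances=0,                  # scale down to zero nodes when idle, so you only pay while training
    max_instances=2,                  # assumed upper limit for parallel nodes
    idle_time_before_scale_down=120,  # seconds of idle time before nodes are released
)

ml_client.compute.begin_create_or_update(cluster)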

Data

To train a machine learning model, you need data. Azure ML has the Data component for this. Once you are logged into the platform, you can click on this symbol in the sidebar to see what data is available to you:

A screenshot of Microsoft Azure Machine Learning, showing the icon for the Data component

Azure ML can use data from many diverse sources, like files stored on cloud storage, at a web address, or even in databases. Something that I found confusing when starting with Azure ML is that the data you find in the Data component does not necessarily have to be stored on the Azure ML platform itself. A “data asset” in Azure ML is a reference or link to data that is stored either on cloud storage attached to Azure ML or in another remote system. Microsoft explains data assets in its documentation.

Let’s create a data asset using Python in Azure ML. We will be downloading the Titanic dataset and using the Azure ML Python library to upload it to Azure ML and register it as a data asset.

To start, open Jupyter on the Compute Instance. Then, in a new notebook with the Python 3.10 - SDK v2 kernel, add and execute the following code in the first cell to download the Titanic dataset to a new folder on the compute instance called “data”:

!wget https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3.csv -P data/

You can learn more about the Titanic dataset on hbiostat.org. Please also note the following reference about the Titanic dataset we use here:

Data obtained from http://hbiostat.org/data courtesy of the Vanderbilt University Department of Biostatistics.

Before we can upload data to Azure ML, we first need to set up a connection from the Jupyter notebook to the Azure ML workspace. This is the purpose of the following block of code. In this code, we obtain the Workspace details and the necessary credentials for connecting to Azure ML. Azure ML-managed computes automatically provide credentials in their environment — you would have to set this up manually if you were trying to run this code on your computer. You can find more information about setting up the connection in Microsoft’s documentation.

# Connect Azure ML client to Azure ML workspace

# First, we import libraries we need for establishing the connection
from azureml.core import Workspace
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ws = Workspace.from_config()

ml_client = MLClient(
    DefaultAzureCredential(),
    ws.subscription_id,
    ws.resource_group,
    ws.name
)

Now, we are ready! Let’s use the following code to define the details of the data asset and upload it to Azure ML. You can learn more about how to upload data to Azure ML in this tutorial from Microsoft.

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Path to the data we just downloaded
data_path = "./data/titanic3.csv"

# Version for dataset we are creating
data_version = "1"

offline_data_asset = Data(
    name="titanic",
    version=data_version,
    description="Titanic dataset, from here: "
    "https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic3.csv",
    path=data_path,
    type=AssetTypes.URI_FILE,
)

# Create data asset on Azure ML
online_data_asset = ml_client.data.create_or_update(offline_data_asset)

What exactly is happening here? First, we create a data asset offline — in the memory of the Python interpreter of the Jupyter notebook. Then, we create the data asset in the Azure ML cloud environment.

Let’s check out the data asset in the Azure ML workspace. Here is what we get:

A screenshot of Microsoft Azure Machine Learning, showing the details page of the dataset

A particularly useful feature of Azure ML is that you can version datasets. Versioning helps make your data science efforts reproducible. In the code, we assigned the simple version name “1”.
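As a small sketch of how versioning pays off, this is how you could retrieve exactly this version of the registered data asset later, from any notebook connected to the same workspace:

# Retrieve a specific version of the registered data asset
titanic_v1 = ml_client.data.get(name="titanic", version="1")
print(titanic_v1.id)    # Azure ML resource ID, usable as a job input
print(titanic_v1.path)  # storage location the asset points to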

Data assets are easy to share and explore. Your colleagues who have access to the same Azure ML workspace can also access this data asset. Clicking on “Explore” lets you take a quick glance at the data. Looking into the data this way, without downloading it, can save you time when deciding which dataset to use in your machine learning task.

Training a Machine Learning Model

Let’s get to the most exciting part of the journey, at least for a data scientist: Training a machine learning model. Azure ML has many practical tools available for modeling like automated machine learning and modeling pipelines. But in this article, we want to stick to the basics. So, let’s train a simple model for a classification task on the Titanic dataset in the most traditional way.

Our classification task is as follows: Based on some of the data we know about passengers on the Titanic, we would like to predict whether these passengers survived the terrible Titanic accident. This is an immensely popular prediction problem, and since we are primarily focused on learning about Azure ML, we will not go into any depth on the problem itself.

As much as we want to keep it simple, the one complexity we will bring in for your benefit is that we will train our survival prediction model remotely. We will have it trained on a Compute Cluster. This is a very scalable approach because, with the same code scaffold presented here, you could train a complex model on GPUs in such a Compute Cluster.

So, what do we need to train a model on a Compute Cluster? In this tutorial, we save our training code to a file so that we can send it from the Jupyter Notebook to the cluster. To use the Titanic dataset we uploaded to Azure ML earlier, we will have to connect it to the code running in the cluster as well. Finally, we also want to take advantage of the tight integration of Azure ML with the MLflow machine learning framework. This will enable us to explore model training and evaluation results from the Azure ML user interface.

You can create the training code file directly from the Jupyter Notebook. First, create a new directory to store our model training code by pasting the following code into a new cell:

!mkdir modeling

Then, create another new cell and paste the following code into it:

%%writefile modeling/train.py

# Import dependencies
import argparse

import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument("--data", type=str, required=True)
parser.add_argument("--modeldir", type=str, required=True)
args = parser.parse_args()

# Initialize MLflow auto logging
mlflow.sklearn.autolog(
    pos_label=True,
    log_datasets=False,
    log_models=False
)

Putting the %%writefile “magic” at the beginning of the cell tells Jupyter to write the entire content of the cell into the specified file modeling/train.py instead of executing it.

What else is happening in the code? We are importing some libraries we will use later, and we are parsing arguments. Which arguments? When Azure ML executes the train.py file as a script on the Compute Cluster, it will call it with arguments, like python train.py --data="titanic.csv" --modeldir="model_dir". These arguments help Azure ML supply the Azure ML-managed Titanic data asset to the script and point it to a specific storage location for the trained model.

At the end of this first part of the script, we initialize MLflow automatic logging. With automatic logging enabled, MLflow will automatically log some training metrics, like the accuracy and F1 score on the training data. It will also produce a confusion matrix. Through the keyword arguments log_datasets=False and log_models=False, we avoid logging the input dataset and the trained model twice. Azure ML already takes care of logging these for us since both are supplied as arguments to the script (see previous paragraph). The pos_label is specific to binary classification. We use it to inform MLflow about which label we perceive as positive (in this case: survived the Titanic catastrophe!). Learn more about MLflow automatic logging for Scikit-learn in the MLflow documentation.

The next code snippet should be in the same Jupyter cell as the code snippet above.

# (Same cell as above)
# Load the dataset
df = pd.read_csv(args.data)

# Do some basic feature selection and preprocessing
df = df[["pclass", "survived", "sex", "age", "sibsp", "parch", "fare"]]
df = pd.get_dummies(df)
df = df.fillna(df.median())

# Split the dataset into features (X) and target (y)
X = df.drop("survived", axis=1)
y = df["survived"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

In this part, we load data and do some basic feature selection and preprocessing. We also split the dataset into train and test sets. Finally, we initialize a simple random forest model.

Now we are ready to train and evaluate the model. The next code snippet should again be added to the same cell as the code above.

# (Same cell as above)
with mlflow.start_run() as run:
    # Train the model
    model.fit(X_train, y_train)

    # Save the trained model to the output directory provided by Azure ML
    mlflow.sklearn.save_model(
        sk_model=model, path=args.modeldir, code_paths=["train.py"]
    )

    # Predict on the held-out test set
    y_pred = model.predict(X_test)

    # Import here so metrics can be logged
    from sklearn.metrics import (
        accuracy_score,
        precision_score,
        recall_score,
        f1_score,
        roc_auc_score,
        confusion_matrix,
    )

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred)

Let’s explain what is happening here. First, we start an MLflow run. A run is the entity through which MLflow collects data about model training. Everything within the with statement is logged to this specific run.

After we fit the model, we use mlflow.sklearn.save_model to save it. The save_model function can take many different arguments, and you can check them out in the documentation. Here, we save the model we just trained to the path defined via args.modeldir. Why this dynamic path? Azure ML automatically defines this path when the script runs on the Compute Cluster. Once the model is saved to the path, Azure ML can extract it and make it available as a model on the Azure ML platform. Through code_paths, we make sure that the saved model includes the training script. This can be helpful for traceability and repeatability.

Finally, we make a prediction on the test data and score it using different metrics. This completes our training script! Now, it is time to instruct Azure ML to run it on a Compute Cluster.
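If you like, you can smoke-test the script on the Compute Instance before submitting it to the cluster. This is only a sketch under a few assumptions: it reuses the data file we downloaded earlier, writes the model to a hypothetical local folder, and requires pandas, scikit-learn, and mlflow in the notebook kernel. We run it from the modeling folder so that the code_paths reference to train.py resolves.

# Optional: local smoke test of the training script, in a new cell
!cd modeling && python train.py --data ../data/titanic3.csv --modeldir ../model_local_test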

Running Remote Machine Learning Model Training

To execute machine learning model training on a Compute Cluster in Azure ML, we first need to prepare a so-called “command” with instructions about what Azure ML should do and then send it to Azure ML. Here is how to do this, in a new cell:

# Prepare the command for training the model

from azure.ai.ml import command, Input, UserIdentityConfiguration, Output
from azure.ai.ml.constants import AssetTypes

MODEL_NAME = "titanic-rf"

training_component = command(
    code="./modeling",
    command="python train.py --data ${{inputs.data}} --modeldir ${{outputs.modeldir}}",
    inputs={
        "data": Input(type=AssetTypes.URI_FILE, path=online_data_asset.id),
    },
    outputs={"modeldir": Output(type=AssetTypes.MLFLOW_MODEL, name=MODEL_NAME)},
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/21",
    compute="all-cpu-small",
    description="Training for Titanic Random Forest",
    experiment_name=MODEL_NAME,
    identity=UserIdentityConfiguration(),
)

There are three parts of this command I would like to direct your attention to: the inputs and outputs, the environment, and the compute.

You see that we import the Input and Output classes from the azure.ai.ml module — these are truly special entities. Through the Input, we signal to Azure ML that we will use an Azure ML-managed resource as an input to the command. Note that the command argument of the command definition references the “data” field to which we assigned the input (--data ${{inputs.data}}). In our case, the input is the online_data_asset we defined earlier, the Titanic data asset managed by Azure ML. The Output is defined as an MLflow model. Azure ML will feed in a file path where it expects MLflow to store the model — this carries forward to the model storage location args.modeldir in the training code, as explained above.

The environment defines the dependencies which are available to the training script. In Azure ML, you can build environments yourself, for example from conda environment.yml files or pip requirements.txt files. In the case at hand, we use a pre-built environment provided by Azure ML. Using a pre-built environment saves time since we do not need to wait until Azure ML has built a container with the desired dependencies before the model training can run. I selected the Azure ML-managed sklearn-1.1 environment because it includes all Python libraries we import in the training script.
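For reference, here is a minimal sketch of how you could define such a custom environment from a conda file. The environment name, the conda file, and the base image are assumptions and not part of this tutorial's setup:

# A sketch: build a custom environment from a conda specification
from azure.ai.ml.entities import Environment

custom_env = Environment(
    name="titanic-train-env",                                    # hypothetical name
    description="Custom training environment for the Titanic model",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",  # assumed Azure ML base image
    conda_file="environment.yml",                                # hypothetical conda file listing pandas, scikit-learn, mlflow
)
ml_client.environments.create_or_update(custom_env)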

Finally, we request “all-cpu-small” as a compute. This is the name of a Compute Cluster we have defined in our Azure ML workspace. You would have to replace this with the name of the compute cluster in your workspace on which you would like to run the training script.

Are we done yet? Almost! In the very last step, in a new cell, we send the command to Azure ML so it can execute it:

# Submit the training command to the cluster and
# show a URL where we can track the run

returned_job = ml_client.jobs.create_or_update(training_component)
returned_job.studio_url
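If you would rather follow the job from the notebook than from the browser, you can also stream its logs into the cell output. Here is a small sketch using the job object returned above:

# Optional: block the notebook and stream the job's logs until it finishes
ml_client.jobs.stream(returned_job.name)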

Evaluating Training Results

Once Azure ML has finished running the training job, we can inspect the outcomes in the Azure ML user interface. You can access it through the URL in returned_job.studio_url. Let’s focus on the job Overview, Metrics, and Images tabs.

A screenshot of Microsoft Azure Machine Learning, showing the overview page for the completed job

It should immediately strike you how much interesting information is presented here. We see clickable links to the Titanic data asset which we used as an input and to the trained machine learning model which is now also registered in Azure ML. On the bottom right, we see a lot of information about the parameters used in the model — we got this for free through MLflow’s automatic logging functionality.
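You can also reach the registered model programmatically. Below is a minimal sketch, assuming the model was registered under the name we passed to the Output of the training command:

# List the registered versions of the model and fetch the latest one
for m in ml_client.models.list(name="titanic-rf"):
    print(m.name, m.version)

latest_model = ml_client.models.get(name="titanic-rf", label="latest")
print(latest_model.id)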

A screenshot of Microsoft Azure Machine Learning, showing various metrics

Let’s have a look at the Metrics tab. It shows metrics for both test and training. MLflow quietly recorded these metrics as we were calculating them in the training script. It is important to mention here that the automatic recording of test metrics only works when we import metrics from sklearn.metrics after we have enabled autologging via mlflow.sklearn.autolog. Now that the metrics are stored in Azure ML, we can also use them to compare different jobs where we might have used different data or models.
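If you want to compare jobs in code rather than in the user interface, you could also pull the logged metrics into the notebook via MLflow. This is a sketch under assumptions: it requires the azureml-mlflow package in the kernel and assumes that the job name doubles as the MLflow run ID, which is the default behavior for Azure ML command jobs:

# A sketch: read the metrics of the submitted job through MLflow
import mlflow

# Point MLflow at the workspace's tracking server
tracking_uri = ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri
mlflow.set_tracking_uri(tracking_uri)

run = mlflow.get_run(returned_job.name)
print(run.data.metrics)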

A screenshot of Microsoft Azure Machine Learning, showing a confusion matrix under the “Images” tab; additional files beyond the one with the confusion matrix have also been created

In the Images tab, we can see insightful plots of important metrics. MLflow’s automatic logging delivered images of the confusion matrix, the precision-recall curve, and the ROC curve to the run results.

Conclusion

In this article, we have gone through a simple end-to-end machine learning workflow on Azure Machine Learning. The Data, Compute, and Job components have helped us weave the model training process into the cloud environment of Azure ML. We used MLflow’s automatic logging to save data science development work and leveraged its tight integration with Azure ML to share our training and testing metrics, as well as the model, with our team via the Azure ML workspace. Using the template presented in this article, you have a basis for developing your own machine learning projects on Azure ML.

Whether shampoo, detergent, or industrial adhesive — Henkel stands for strong brands, innovations, and technologies. In our data science, engineering, and analytics teams we solve modern data challenges for the benefit of our customers.
Learn more at henkel.com/digitalization.
