This Microsoft Build Session Blew My Mind

Farooq Mahmud
Published in Analytics Vidhya · May 25, 2020

I was happy to see that Microsoft held its Build conference despite the COVID pandemic gripping the world. While it’s no substitute for the in-person version, it was still pretty damn good. Machine Learning was an evident focus, which is good because that is on my list of things to learn this year. The session titled Azure Machine Learning in Action by Sarah Guthals and Francesca Lazzeri showcased the automated machine learning feature of Azure Machine Learning (AML). It’s hard for me to be blown away, but this session did just that. The ease with which models are trained and evaluated cannot be appreciated until you see automated machine learning in action.

The Scenario

The scenario calls for predicting the number of bike rentals on a given day. The steps in this article are identical to the ones in this Microsoft Learning module on automated machine learning (AutoML). There is a big difference, however. While the Microsoft module offers a no-code solution, I will be doing everything through code. Why? Because coding will help you understand AML better. It will also come in handy when you need to automate this stuff. It’s good to “automate all the things.”

You will learn the following by reading this article:

  1. How to set up an Azure Machine Learning (AML) workspace.
  2. How to store a data set in AML.
  3. How to use automated machine learning (AutoML) to train and evaluate models based on the data set.
  4. How to store the best performing model in Azure Machine Learning.
  5. How to expose the model as a web service.
  6. How to send data to the web service and get the predicted results.

Machine Learning in One Paragraph

This equation easily explains machine learning as a concept:

y=f(x)

Suppose you want to apply a linear regression algorithm to a data set. The variable x represents the attributes of the data set that you want the algorithm to consider when making a prediction. These attributes are called features in machine learning parlance. The variable y is the predicted value, called a label in the language of machine learning. Therefore, f is the model whose inputs are the features and whose output is the label. Machine Learning is about finding the best f so that, when given a new set of x values, it can predict accurate values for y.

Labels come in two broad types: numeric and categorical. A regression algorithm predicts numeric labels, while a classification algorithm predicts categorical ones. (Clustering, by contrast, groups similar rows without using predefined labels.)

“Best” and “accurate” are subjective, unscientific words, so how do we determine the best f? As we will see, AutoML lets you specify a metric that serves as the precise definition of best and accurate.
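To make y=f(x) concrete, here is a tiny sketch using scikit-learn’s LinearRegression. This is purely illustrative: scikit-learn is not needed for the rest of this article, and the numbers below are made up (loosely inspired by the bike-rental data we will use later).

from sklearn.linear_model import LinearRegression

# x: two features per day (normalized temperature, humidity); made-up values.
x = [[0.34, 0.81], [0.36, 0.70], [0.20, 0.44], [0.23, 0.59]]
# y: the label we want to predict (bike rentals that day); also made up.
y = [331, 131, 120, 108]

# Fit f so that f(x) approximates y as closely as possible.
f = LinearRegression().fit(x, y)

# Given a new x, predict y.
print(f.predict([[0.30, 0.65]]))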

Real-World Machine Learning

Machine learning in the real world is not easy. Do an Amazon search on machine learning books, and you get over 6,000 results. Many universities offer advanced degrees in machine learning. Finding that f is hard.

Automated Machine Learning

Imagine, if you will, a system that codifies the steps a machine learning engineer takes when trying to come up with the best machine learning model. The system can try out different algorithms. Each algorithm has many parameters, called hyperparameters, that affect the algorithm’s performance. The system can try out different combinations of hyperparameters, too. Because the system runs in the cloud, it can evaluate several algorithms in parallel. The bottom line is that if you have only a conceptual understanding of machine learning, AutoML can get you far.

What Is Codified?

A lot. The machine learning process is complex. Here are the usual tasks:

  • Collect the data from various sources, i.e., ingestion.
  • Analyze the data and figure out relationships between features and labels, i.e., exploration.
  • Get the data in a state suitable for modeling, i.e., preprocessing.
  • Model training and evaluation.
  • Model deployment.
  • Profit.

AutoML primarily automates the exploration, preprocessing, training, and evaluation tasks, which incidentally are the tasks that require the most machine learning knowledge and time.

Set Up an Azure Machine Learning Workspace

The AML workspace is your one-stop-shop for all things machine learning. It is where data sets, experiments, models, and web service metadata are stored. It is also the gateway to the compute resources that do the processing. Lastly, it hosts AzureML Studio, a portal where you can visually work with machine learning.

Creating a workspace is easy using the Azure CLI:

Start a PowerShell session and log in to your Azure account:

az login

Set the subscription you want to use:

az account set --subscription [your subscription name]

Create a resource group:

$rgName = "[your resource group name]"
az group create `
--name $rgName `
--location [your resource group location]

Install the AzureML Extension:

az extension add -n azure-cli-ml

Create an AML workspace:

$workspaceName = "[your workspace name]"
az ml workspace create `
--workspace-name $workspaceName `
--resource-group $rgName `
--sku enterprise

Note: AutoML only works with the Enterprise SKU of AML.

After the workspace is created, take a look at the resources in the resource group. Obviously, there is a machine learning resource. There is also a storage account, which serves as the workspace’s default datastore where datasets and run artifacts are kept. There are also Key Vault and Application Insights resources.

Click the machine learning workspace resource and click the Launch Now button to open the AzureML Studio.

Set Up a Compute Cluster

Let’s provision a compute cluster that will be used to train models:

az ml computetarget create amlcompute `
--name aml-cluster `
--min-nodes 2 `
--max-nodes 2 `
--vm-size Standard_DS2_v2 `
--workspace-name $workspaceName `
--resource-group $rgName
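As an aside, the same cluster can also be created from Python instead of the CLI. A minimal sketch, assuming a workspace reference obtained via Workspace.get() (covered in the next section):

from azureml.core.compute import AmlCompute, ComputeTarget

# Provisioning configuration mirroring the CLI flags above.
compute_config = AmlCompute.provisioning_configuration(
    vm_size="Standard_DS2_v2", min_nodes=2, max_nodes=2
)

# `workspace` is assumed to be a Workspace object (see the next section).
cluster = ComputeTarget.create(workspace, "aml-cluster", compute_config)
cluster.wait_for_completion(show_output=True)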

Ingest and Register a Dataset

Let’s perform the first step of the machine learning process, namely data ingestion. The AzureML Python SDK provides a wonderful way to work with AzureML.

When it comes to Python, I prefer to use PyCharm but the code examples will work in any IDE. Now let’s go over the developer environment setup.

Install Anaconda

The Python SDK seems to work best with Anaconda, so install the 64-bit version first from the Anaconda website.

Create a New Anaconda Environment

Per Python best practice, create and activate the new Anaconda environment which you will use to install subsequent packages. Create a folder to hold your Python code files and run the following shell commands in that folder:

conda create --name [environment name] python=3.7
conda activate [environment name]

Note: PyCharm can create an Anaconda environment for you as part of the Python project creation process.

Install the SDK

Install the AzureML Python SDK and the AzureML Dataprep package with its pandas extra:

pip install azureml-sdk "azureml-dataprep[pandas]"

Now that the preliminaries are addressed, create a new Python code file named automlworkspace.py with the following code:

from azureml.core import Workspace, Dataset

dataset = Dataset.Tabular.from_delimited_files(
    path="https://raw.githubusercontent.com/MicrosoftDocs/mslearn-aml-labs/master/"
    "data/daily-bike-share.csv"
)

workspace_config = dict(
    name="[your workspace name]",
    subscription_id="[your subscription id]",
    resource_group="[your resource group name]",
)

workspace = Workspace.get(**workspace_config)

bikes_dataset = dataset.register(
    workspace,
    "bikes_dataset",
    create_new_version=True
)

The Python SDK is surprisingly straightforward. First, the data set is downloaded. Next, we need to register the dataset in our AzureML workspace. When doing anything workspace-related, a reference to the workspace is required. This reference is obtained via the Workspace.get() function. Then we call the register() function on the Dataset object to save the dataset in the workspace.

After running this code, go to AzureML Studio and verify the dataset is present.
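If you prefer verifying from code rather than clicking through Studio, a quick check might look like this (assuming the workspace reference from the script above):

from azureml.core import Dataset

# Fetch the registered dataset by name; this raises an error if it does not exist.
ds = Dataset.get_by_name(workspace, "bikes_dataset")
print(ds.name, ds.version)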

Train a Model

In the context of AutoML, training a model means running an experiment that evaluates various algorithms against a metric of your choosing. At the conclusion of the experiment, AutoML will give you the best performing algorithm based on the metric.

We will use the AzureML Python SDK to create and run the experiment. Before writing that code, we should refactor the existing code. The complete automlworkspace.py is shown below:

from azureml.core import Workspace, Dataset, ComputeTarget
from azureml.data import TabularDataset


def ingest_from_url(
    url: str, dataset_name: str, ws: Workspace
) -> TabularDataset:
    dataset = Dataset.Tabular.from_delimited_files(path=url)
    return dataset.register(ws, dataset_name, create_new_version=False)


def get_workspace(name: str, subscription_id: str, resource_group: str) -> Workspace:
    workspace_config = dict(
        name=name, subscription_id=subscription_id, resource_group=resource_group,
    )
    return Workspace.get(**workspace_config)


def get_cluster(name: str, ws: Workspace) -> ComputeTarget:
    return ComputeTarget(ws, name)


if __name__ == "__main__":
    workspace = get_workspace(
        "[your workspace name]",
        "[your subscription id]",
        "[your resource group name]",
    )
    bikes_dataset = ingest_from_url(
        "https://raw.githubusercontent.com/MicrosoftDocs/mslearn-aml-labs/master/"
        "data/daily-bike-share.csv",
        "bikes_dataset",
        workspace,
    )

Add a function after the get_cluster() function that runs an AutoML experiment:

from azureml.train.automl import AutoMLConfig
from azureml.core import Experiment


def run_automl_experiment(name: str, config: AutoMLConfig, ws: Workspace):
    automl_experiment = Experiment(ws, name)
    run = automl_experiment.submit(config)
    run.wait_for_completion(True)

In the __main__ block, add the code that creates an AutoML configuration and passes it to the run_automl_experiment() function:

cluster = get_cluster("aml-cluster", workspace)

automl_config = AutoMLConfig(
    name="Automated ML Bike Training Experiment",
    task="regression",
    compute_target=cluster,
    training_data=bikes_dataset,
    label_column_name="rentals",
    primary_metric="normalized_root_mean_squared_error",
    max_concurrent_iterations=2,
    featurization="auto",
    model_explainability=True,
)

run_automl_experiment("bikes_automl_experiment", automl_config, workspace)

Since we want the experiment to run on the compute cluster, we need to obtain a reference to the cluster with the get_cluster() function. Next, the experiment's configuration is created. Here are explanations for the most important parameters:

  • label_column_name: This is the attribute we want the model to predict.
  • primary_metric: The metric by which the algorithm will be scored. In the case of root mean squared error, the lower the value, the more accurate the predictions.
  • max_concurrent_iterations: Set this to the number of nodes in the cluster for maximum concurrency.
  • featurization: This is best explained in the MSDN documentation, but in a nutshell this performs preprocessing that makes the data more suitable for certain algorithms.
  • model_explainability: Setting this to true will give you a breakdown of which features most impact the model.

Refer to MSDN for more information about AutoMLConfig parameters.
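Incidentally, if you are unsure which primary metrics are valid for a given task type, the SDK has a helper for that. A small sketch (assuming the azureml-train-automl package installed above):

from azureml.train.automl.utilities import get_primary_metrics

# Lists the metrics AutoML accepts as `primary_metric` for regression tasks.
print(get_primary_metrics("regression"))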

The above code will run for at least 30 minutes. You will see the experiment’s output in the Python console. When the experiment concludes, the best performing algorithm is most easily seen in AzureML Studio:

  1. Click Experiments in the left pane.
  2. Click the experiment run.
  3. The best model is shown. In my case it is the Voting Ensemble algorithm.
  4. Click the Models tab and behold all the algorithms AutoML tested for you.
  5. Click the VotingEnsemble link to get the algorithm’s particulars.
  6. Click the Visualizations tab to see the algorithm’s performance.
  7. Click the Explanations tab to see which features most impact the model.

Imagine doing all this work yourself! Are you blown away yet?
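If you prefer to stay in code, the best run and its fitted model can also be pulled down programmatically. A hedged sketch, assuming the run ID printed to the Python console and the workspace reference from earlier (deserializing the model locally requires the AutoML package dependencies to be installed):

from azureml.core import Experiment
from azureml.train.automl.run import AutoMLRun

experiment = Experiment(workspace, "bikes_automl_experiment")
run = AutoMLRun(experiment, "[your run ID]")

# get_output() returns the best child run and the model it produced.
best_run, fitted_model = run.get_output()
print(best_run.id)
print(fitted_model)  # e.g., a VotingEnsemble pipeline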

Where Are We?

At this point we have a trained regression model that we are happy with. Now we can perform the final step — model deployment. Once deployed, we can send data to it, get back predictions, and profit. Let’s go!

Deploy the Model

In AzureML, models can be deployed as a web service, i.e., a REST API. The API can be hosted in either Azure Container Instances (ACI) or Azure Kubernetes Service (AKS). We will use ACI because it is simpler and will suit our needs just fine.

Before we can deploy the model, we need to store it in our workspace, i.e. register the model. Create a Python file named registermodel.py and add the register_model() function:

def register_model(name: str, run: AutoMLRun):
    run.register_model(model_name=name)

The function needs a model name which can be anything you like. The function also needs a reference to the AutoML Run. Add the following function which gets the AutoMLRun reference:

def get_automl_run(experiment_name: str, run_id: str, ws: Workspace) -> AutoMLRun:
    experiment = Experiment(ws, experiment_name)
    return AutoMLRun(experiment, run_id)

The complete registermodel.py is shown below:

from azureml.core import Workspace, Experiment
from azureml.train.automl.run import AutoMLRun

from automlworkspace import get_workspace


def register_model(name: str, run: AutoMLRun):
    run.register_model(model_name=name)


def get_automl_run(experiment_name: str, run_id: str, ws: Workspace) -> AutoMLRun:
    experiment = Experiment(ws, experiment_name)
    return AutoMLRun(experiment, run_id)


workspace = get_workspace(
    "[your workspace name]",
    "[your subscription id]",
    "[your resource group name]",
)

r = get_automl_run(
    "bikes_automl_experiment", "[your run ID]", workspace
)

register_model("bike-rentals-automl", r)

Run the above code and go to AzureML Studio. Click Models and observe the newly registered model appears in the list.
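You can also confirm the registration from code; the Workspace object exposes its registered models as a dictionary:

# Print every registered model in the workspace along with its version.
for name, model in workspace.models.items():
    print(name, model.version)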

Now that the model is registered, it can be deployed. The deployment requires two files: the scoring file and the Anaconda environment file. AutoML created these files for us. Both of these files are in the algorithm’s output folder. This is important. There are two levels of output. The first is the output for the whole run. Subordinate to that are the outputs for each algorithm AutoML tested. We need to download the files from the output folder of the VotingEnsemble algorithm:

  1. Click Experiments.
  2. Click the experiment run.
  3. Click the VotingEnsemble link to get the algorithm’s particulars.
  4. Click the Outputs tab.
  5. Expand the outputs folder and you should see the scoring file and the Anaconda environment file.
  6. Download these files to the root of the directory containing your Python code files.
  7. Open the scoring file and change the model name on line 40 to bike-rentals-automl. Save the file.
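If you would rather script steps 4 through 6 than click through Studio, something like the following should work. A sketch that reuses the AutoMLRun reference r from registermodel.py; the file names match the ones AutoML generated for my run and may differ for yours:

# Grab the best child run (the VotingEnsemble run in my case).
best_run, _ = r.get_output()

# Download the scoring and Anaconda environment files from its outputs folder.
best_run.download_file("outputs/scoring_file_v_1_0_0.py", "scoring_file_v_1_0_0.py")
best_run.download_file("outputs/conda_env_v_1_0_0.yml", "conda_env_v_1_0_0.yml")

# Step 7 still applies: edit the model name in the scoring file afterward.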

Create a Python file named deploymodel.py and add the following code:

from azureml.core import Model
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

from automlworkspace import get_workspace

workspace = get_workspace(
    "[your workspace name]",
    "[your subscription id]",
    "[your resource group name]",
)

model = workspace.models["bike-rentals-automl"]

inference_config = InferenceConfig(
    "scoring_file_v_1_0_0.py", "python", "conda_env_v_1_0_0.yml"
)

deployment_config = AciWebservice.deploy_configuration(1, 1)

service = Model.deploy(
    workspace,
    "bike-rentals-prediction-service",
    [model],
    inference_config,
    deployment_config,
)

service.wait_for_deployment(True)

The code above creates an InferenceConfig, which specifies the scoring file and Anaconda environment file the service will use. Next, an ACI-hosted web service is configured with one CPU core and one GB of memory. Finally, the model is deployed.

The code above will run for at least 15 minutes. After the model is deployed go to AzureML Studio and click Endpoints. Observe the newly deployed service that appears in the list. Click the service and if the deployment went as expected, the Deployment State should be Healthy.

On the same page, the REST endpoint and Swagger URLs are displayed. Copy the REST endpoint URL as it will be needed when we call the web service.
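You can also grab the URL from code rather than copying it out of Studio. A small sketch using the service name from deploymodel.py:

from azureml.core.webservice import AciWebservice

# Look up the deployed service by name and read its scoring endpoint.
service = AciWebservice(workspace, "bike-rentals-prediction-service")
print(service.state)        # should print "Healthy"
print(service.scoring_uri)  # the REST endpoint URL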

Consume the Web Service

Alright, let’s send some data to the web service and get some predictions! Create a new Python file named consumewebservice.py and add the following code:

import json
import requests

features = [
    [1, 1, 2022, 1, 0, 6, 0, 2, 0.344167, 0.363625, 0.805833, 0.160446],
    [2, 1, 2022, 1, 0, 0, 0, 2, 0.363478, 0.353739, 0.696087, 0.248539],
    [3, 1, 2022, 1, 0, 1, 1, 1, 0.196364, 0.189405, 0.437273, 0.248309],
    [4, 1, 2022, 1, 0, 2, 1, 1, 0.2, 0.212122, 0.590435, 0.160296],
    [5, 1, 2022, 1, 0, 3, 1, 1, 0.226957, 0.22927, 0.436957, 0.1869],
]

features_json = json.dumps({"data": features})
headers = {"Content-Type": "application/json"}
response = requests.post(
    "[your REST endpoint url]",
    features_json,
    headers=headers
)

print(response.json())

The code is straightforward. The web service receives five rows of features, meaning we should get five predictions back. You can refer to the dataset’s schema in AzureML Studio to understand what the numbers mean.

Run the code and observe the JSON result (your values may be slightly different):

{
  "result": [
    462.3864581844958,
    429.3780786470909,
    109.77820192223943,
    134.91105938680312,
    99.22042685465749
  ]
}

As expected, we get five predictions back. The model predicts a high number of rentals on the first two days, followed by a severe drop-off. I wonder why? This is where referring back to the model explanation in AzureML Studio can shed some light.

Note: Since we called a REST API, the calling code does not have to be Python. Also note that the Python code above has no dependencies on the AzureML Python SDK.

Cleanup

Don’t forget to delete the resource group to avoid incurring additional costs:

az group delete --name [your resource group name] --yes

Conclusion

If you’ve made it this far, thank you for reading. I hope you are as blown away as I am over the power of AzureML!


I am a software engineer at Marel, an Icelandic company that makes machines for meat, fish, and poultry processing.