Using Azure ML Pipelines & AutoML to Classify Chicago AirBnb Listings

Ashley Peterson
Published in Slalom Data & AI · Jun 9, 2020 · 10 min read

Developing a reusable Azure ML Pipeline — by Ashley Peterson, Rachel Wiseley, and Alden Nail

Model results for classifying ‘Top Stays’ and ‘Fixer Uppers’.

Data scientists need numerous “tools in their toolbox” to successfully develop, train, and deploy a model. They need to be able to prepare data for modeling, experiment with many models and parameters, set up a retraining process, and deploy their model to gain insights from the predictions. Each step in the modeling process can be time consuming and require multiple tools and skills. With Azure’s Machine Learning Service, a data scientist can organize their experiments, track results, store and deploy models, and speed up their process. Two tools that enable a data scientist to experiment more quickly and set up repeatable processes are Automated Machine Learning (AutoML) and Machine Learning Pipelines (ML Pipelines). AutoML democratizes the machine learning model development process.[1] ML Pipelines allow data scientists to develop repeatable workflows.

In this post, we will leverage Azure’s AutoML and ML Pipelines to train a classification model.

The key tools we used to develop the model are:

  • Azure Machine Learning Service — A cloud-based environment for training, tracking, and deploying machine learning models
  • Azure ML Pipelines — A framework for creating reusable machine learning workflows
  • Azure AutoML — A capability that automatically trains and evaluates many models for a given problem type: regression, classification, or forecasting
  • Azure Blob Storage — Azure’s object storage service

First, we will explore the data that we used to train a classification model. Then, we will dive into the pipeline and AutoML capabilities in Azure.*

*Note we have already set up a Machine Learning Workspace for this. To set up your workspace you can follow these instructions.

Overview of Airbnb Dataset and Classification Use Case

If you are looking to predict between two categories, such as win/loss or yes/no, then you would want to use a classification model. Some classification use cases include predicting whether a customer will default on a financial loan, classifying emails as spam, or determining if an image contains a specific item, like a car or snow leopard. For this demo, we created an ML Pipeline to classify Airbnb properties from Chicago as “top stays” or “fixer uppers”. A “top stay” was defined as a listing where the average rating is greater than or equal to 90 (out of 100); everything below that threshold was a “fixer upper”.

The data contained host-specific as well as listing-specific information that was used in developing the model. Feature selection was used to remove correlated variables; feature engineering was performed to clean up the date and categorical features. This data cleanup is done in the data_prep.py script that is called in the first step of the pipeline.**

**Note: We will not go into the details of the model itself. Instead, we will focus on utilizing the AutoML and ML Pipeline features of Azure Machine Learning to quickly set up a machine learning workflow.

Setting up the Machine Learning Pipeline and AutoML Step

Machine Learning Pipelines are one piece of the larger MLOps framework in Azure. ML Pipelines allow you to create reusable workflows and track the results of each run. Within the pipeline, steps modularize the modeling process. The decoupled code enables easier debugging and development.

Our pipeline contains four steps: Data Preparation, Train-Test Split, AutoML, and Model Registration. A benefit of ML Pipelines is the ability to automate the pipeline to run again to retrain your model when new data is made available.

The steps of the ML Pipeline act like an outline for the process. Each step references a script. When the Pipeline is run, each step calls its referenced code and executes it. [2]

ML Pipeline Setup

Before creating the steps of a pipeline, a few items need to be set up or referenced (a minimal setup sketch follows this list):

  • Create the reference objects for the workspace and the datastore
  • Create a compute cluster (if one was not already created during workspace setup) to run the pipeline on
  • Create the directory for the scripts which will be referenced by the steps of the pipeline
  • (Optional) You can reference and read data located in Blob Storage and other Azure services. Additionally, you can register your data to the workspace to create a data reference. In our setup, we registered the raw data to our workspace.
  • Create the RunConfiguration for the scripts. This is where you set up the environment the scripts need to run; you can pip or conda install any dependencies. For pipelines and AutoML, we included azureml-sdk, azureml-dataprep, automl, and azureml-automl-runtime.
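
The sketch below shows what this setup could look like. The cluster name ('cpu-cluster') and registered dataset name ('airbnb') are assumptions for illustration; your names will differ.

# minimal setup sketch; 'cpu-cluster' and 'airbnb' names are assumptions
from azureml.core import Workspace, Dataset
from azureml.core.compute import ComputeTarget
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

ws = Workspace.from_config()                  # reads config.json
dstore = ws.get_default_datastore()
compute_target = ComputeTarget(workspace=ws, name='cpu-cluster')

# the raw data we registered to the workspace
airbnb = Dataset.get_by_name(ws, name='airbnb')

# run configuration with the dependencies the step scripts need
run_config = RunConfiguration()
run_config.environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=['azureml-sdk', 'azureml-dataprep', 'azureml-automl-runtime'])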

Step 1: Data Preparation

Now that our environment is set up, the first step in the Pipeline is the Data Preparation step. This step pulls the dataset from the workspace and performs transformations and updates to the features. We created a PythonScriptStep to perform the custom data preparation. The step calls a Python script, data_prep.py, that contains the code for the transformations and cleanup.

The key parameters to set in the step are the outputs and arguments. The output is a PipelineData object, which enables the next step to read in the cleaned data. The arguments parameter passes information into the script that the step calls. We are passing input_data and output_data as arguments to the data_prep.py script. Within data_prep.py, the passed arguments arrive as strings, so we passed the registered dataset name, airbnb, into the script.

Data Preparation Step

# imports used by this step
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

# define the output produced by the data prep step
cleansed_data = PipelineData("airbnb_cleaned", datastore=dstore)

# create the Python script step
data_prep_step = PythonScriptStep(name='Data Preparation',
                                  script_name='data_prep.py',
                                  arguments=["--input_data", "airbnb",
                                             "--output_data", cleansed_data],
                                  inputs=[airbnb.as_named_input('airbnb_dataset')],
                                  outputs=[cleansed_data],
                                  compute_target=compute_target,
                                  runconfig=run_config,
                                  allow_reuse=True)

data_prep.py

Our initial input data is registered to the Workspace and we gave it the name ‘airbnb_dataset’ in the input parameter for the data_prep_step. In the script, we can call the dataset by using the following code:

from azureml.core import Run, Dataset

run_context = Run.get_context()
airbnb_dataset = run_context.input_datasets['airbnb_dataset']
airbnb = airbnb_dataset.to_pandas_dataframe()

After the step runs, the output is saved in the cleansed_data PipelineData object we created.
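
For context, the tail of data_prep.py could look like the sketch below. The argparse handling and the write pattern are assumptions; when the step runs, the output_data argument resolves to a path backed by the cleansed_data PipelineData object.

# sketch of the end of data_prep.py (argument names match the step above)
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--input_data', type=str)
parser.add_argument('--output_data', type=str)
args = parser.parse_args()

# ... feature engineering on the airbnb dataframe happens here ...

# write the cleaned data to the PipelineData path for the next step
os.makedirs(os.path.dirname(args.output_data), exist_ok=True)
airbnb.to_csv(args.output_data, index=False)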

Step 2: Train Test Split

The second step is the Test Train Data Split step. The setup for this step is similar to the Data Preparation step: it is also a PythonScriptStep, calling a script that splits the data into training and test sets. The input for this script is the cleaned output from the Data Preparation step, and its outputs are the training and test datasets.

# test train split the data
output_train = PipelineData("output_train", datastore=dstore)
output_test = PipelineData("output_test", datastore=dstore)

test_train_step = PythonScriptStep(name="Test Train Data Split",
                                   script_name="train_test_split.py",
                                   arguments=["--input_data", cleansed_data,
                                              "--output_train", output_train,
                                              "--output_test", output_test],
                                   inputs=[cleansed_data],
                                   outputs=[output_train, output_test],
                                   compute_target=compute_target,
                                   runconfig=run_config,
                                   allow_reuse=True)
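
A minimal sketch of what train_test_split.py might contain is below; the 80/20 split ratio and the use of scikit-learn are assumptions.

# minimal sketch of train_test_split.py (split ratio is an assumption)
import argparse
import os
import pandas as pd
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument('--input_data', type=str)
parser.add_argument('--output_train', type=str)
parser.add_argument('--output_test', type=str)
args = parser.parse_args()

df = pd.read_csv(args.input_data)
train, test = train_test_split(df, test_size=0.2, random_state=42)

# write each split to its PipelineData path
for path, split in [(args.output_train, train), (args.output_test, test)]:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    split.to_csv(path, index=False)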

Step 3: AutoML

We used the AutoMLStep to train our model. This step is set up with parameters like the PythonScriptStep, and it also needs the AutoMLConfig settings. These settings control the parameters used for your models: you can set the primary_metric to score on, the maximum time for the experiment to run, the number of cross-validations, and so on. This is also where you set what type of modeling you would like to run. AutoML can perform classification, regression, or forecasting.

The configuration uses the data_script parameter to read in the PipelineData from the previous step. With the latest update, AutoML uses Datasets that are in the Workspace. In order to read in the PipelineData, you need to create a get_data.py file with a function called get_data() that pulls in the data from the previous step.***

***Note: get_data() will be deprecated in the future; the AutoML step will then need a Dataset as input, using the commented-out label_column_name and training_data parameters shown below.
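
A sketch of what get_data.py might contain is below. The environment-variable lookup for the mounted PipelineData path and the dict return shape are assumptions about the get_data() contract; the label column name comes from our use case.

# sketch of get_data.py; path lookup and return shape are assumptions
import os
import pandas as pd

def get_data():
    # PipelineData inputs are mounted into the run; here we assume the
    # path is exposed through an environment variable named for the input
    train_path = os.environ['AZUREML_DATAREFERENCE_output_train']
    df = pd.read_csv(train_path)
    label = 'review_scores_rating_binned'   # our binary target
    return {'X': df.drop(columns=[label]).values, 'y': df[label].values}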

AutoML Config

import logging
from azureml.train.automl import AutoMLConfig

# label = 'review_scores_rating_binned'
project_folder = './'

automl_settings = {
    "iteration_timeout_minutes": 2,
    "n_cross_validations": 3,
    "primary_metric": 'accuracy',
    # "preprocess": True,
    # "featurization": 'auto',
    "enable_early_stopping": True,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": -1,
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(task='classification',
                             experiment_timeout_minutes=15,
                             # whitelist_models=[],
                             # training_data=Dataset.get_by_name(ws, name='air_train'),
                             # label_column_name=label,
                             data_script='get_data.py',
                             compute_target=compute_target,
                             path=project_folder,
                             **automl_settings)
print("AutoML config created.")

Note: Using logging with the AutoMLConfig will provide more details around the errors to help with debugging. You can set the appropriate logging level. For more details, visit this site.

After setting up the AutoMLConfig, you can create the AutoMLStep. The input for the AutoMLStep is the PipelineData that the get_data.py script will use. Because AutoML runs through many combinations of models, this step can take some time to run. You can limit the time with the experiment_timeout_minutes and iteration_timeout_minutes parameters. Speed is also impacted by the size of the compute you have set up.

AutoML Step

from azureml.train.automl import AutoMLStep
from azureml.pipeline.core import PipelineData, TrainingOutput

metrics_output = 'metrics_output'
best_model_output = 'best_model_output'

metrics_data = PipelineData(name='metrics_data',
                            datastore=dstore,
                            pipeline_output_name=metrics_output,
                            training_output=TrainingOutput(type='Metrics'))
model_data = PipelineData(name='model_data',
                          datastore=dstore,
                          pipeline_output_name=best_model_output,
                          training_output=TrainingOutput(type='Model'))

automl_step = AutoMLStep(name='AutoML_Classification',
                         automl_config=automl_config,
                         inputs=[output_train],
                         outputs=[metrics_data, model_data],
                         allow_reuse=True)

This step outputs two PipelineData objects with the metrics and model results to read into the final step.

Step 4: Model Registration

The last step in the pipeline is the Model Registration step. In this step, we register the best model to the workspace for deployment and future use. The model will show in the workspace’s Model tab. The model output from the AutoML step is read in as the input for this step.

Model Registration Step

model_reg_step = PythonScriptStep(name="Model Registration",
                                  script_name="register_model.py",
                                  arguments=["--input_data", output_test,
                                             "--best_model", best_model_output,
                                             "--model_path", model_data,
                                             "--train_data", output_train],
                                  inputs=[output_test, model_data, output_train],
                                  compute_target=compute_target,
                                  runconfig=run_config)
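
For illustration, register_model.py might end with something like the sketch below. The model name and the use of Model.register on the serialized AutoML output are assumptions.

# sketch of register_model.py (the model name is an assumption)
import argparse
from azureml.core import Run
from azureml.core.model import Model

parser = argparse.ArgumentParser()
parser.add_argument('--model_path', type=str)
args, _ = parser.parse_known_args()   # other arguments omitted in this sketch

run = Run.get_context()
ws = run.experiment.workspace

# register the serialized best model produced by the AutoML step
model = Model.register(workspace=ws,
                       model_path=args.model_path,
                       model_name='airbnb_top_stay_classifier')
print('Registered', model.name, 'version', model.version)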

Putting It All Together

To connect the steps and run the workflow, you need to create a Pipeline. We used a StepSequence to ensure that our steps would execute in order. You can run the Pipeline in a Jupyter Notebook using experiment.submit(pipeline), or you can run it from the user interface in the Pipelines tab.

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline, StepSequence
from azureml.widgets import RunDetails

# build the pipeline (note: don't shadow the Pipeline class with the instance)
four_steps = StepSequence(steps=[data_prep_step, test_train_step,
                                 automl_step, model_reg_step])
pipeline = Pipeline(workspace=ws, steps=four_steps)

# run the pipeline (the experiment name is an assumption)
experiment = Experiment(ws, 'airbnb-automl-pipeline')
pipeline_run = experiment.submit(pipeline)
RunDetails(pipeline_run).show()   # optional: monitor progress in the notebook

Now that the Pipeline is created, we can run it and view the results of the AutoML step and confirm the model registered to the workspace.

You can track the progress of a run in the Experiments tab and see which steps have completed successfully. If a step fails, you can investigate the logs and output that it generated. You can also add your own notes and logs within the scripts for debugging; they will print in the log files of the step.
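
For example, a step script can log its own values through the Run object (the metric name and value here are illustrative):

# inside any step script: log a custom value to the step's run
from azureml.core import Run

run = Run.get_context()
run.log('rows_after_cleaning', 12345)   # shows up under the step's metrics
print('finished feature engineering')   # plain prints land in the step logs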

AutoML Output

The pipeline has one output, the model that is registered to the workspace, and it also produces details on the model runs. Once the AutoML step has completed, you can explore the models that were run and their scores. The best model is selected based on the primary metric you defined (accuracy, in our case). Under the Models tab you can explore the other models that ran, along with additional accuracy metrics.

Two additional resources are the Visualizations and Explanations. For classification, the Visualizations detail the model results in the precision-recall curve, ROC curve, calibration curve, lift chart, gain curve, and confusion matrix.

The Explanations tab allows you to run an additional experiment to return the details of the model, letting you explore feature importance. One AutoML option is automatic featurization, which we leveraged for our model development. The top eight engineered features can be seen below in the variable importance graph.

These two tools allow a data scientist to quickly explore and interpret the model results instead of having to spend time creating these charts after a model runs.

Next Steps

With the reusable Pipeline, you can set up your model retraining schedule or re-run the Pipeline on demand. The next steps would be to publish the pipeline for future retraining and deploy the model to an API endpoint to use for predictions.
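
As a sketch of those next steps, the pipeline can be published and put on a recurring schedule. The pipeline and experiment names and the weekly cadence below are assumptions:

# sketch: publish the pipeline and schedule a weekly retraining run
# (names and cadence are assumptions)
from azureml.pipeline.core import Schedule, ScheduleRecurrence

published = pipeline.publish(name='airbnb-retraining-pipeline',
                             description='Retrain the Top Stays classifier')

recurrence = ScheduleRecurrence(frequency='Week', interval=1)
schedule = Schedule.create(ws, name='weekly-retrain',
                           pipeline_id=published.id,
                           experiment_name='airbnb-automl-pipeline',
                           recurrence=recurrence)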

Machine Learning Pipelines and AutoML enable data scientists to quickly train and experiment with many models. Instead of having to write code for each algorithm they want to test, they can specify the modeling type or a specific list of models to try. Developing the Machine Learning Pipeline allows for integration into production and deployment of a model. Once you have a model framework set up, you can set a retraining schedule or reuse the framework with a new dataset, updating only pieces of the pipeline. Azure Machine Learning speeds up the modeling process and allows a data scientist to experiment more quickly.
