Moreover, you will learn how to launch an automated machine learning process to allow algorithm selection and hyperparameter tuning. Automated machine learning iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criterion.
Specifically, you will learn the following tasks:
- Auto-train a regression model.
- Run the model locally with custom parameters.
- Explore your model results.
1. Auto-train a regression model
In this article, I assume you have already downloaded the NYC Taxi data from Azure Open Datasets and run through the data preparation steps in that tutorial, so the data can be used to build our machine learning model.
Let’s start by creating our workspace object from the existing workspace.
A Workspace is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. Specifically, Workspace.from_config() reads the file config.json and loads the authentication details into an object named ws, which is used throughout the rest of the code in this tutorial:
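As a minimal sketch of this step (assuming the azureml-sdk v1 package is installed and a config.json for your workspace, downloaded from the Azure portal, sits in the working directory):

```python
# Minimal sketch assuming azureml-sdk v1 and a local config.json
# downloaded from the Azure portal for your workspace.
from azureml.core.workspace import Workspace

# Reads config.json and authenticates against your subscription.
ws = Workspace.from_config()
```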
Split the data into train and test sets
You can start by splitting the data into training and test sets by using the train_test_split function in the sklearn library. This function segregates the data into the x (features) dataset for model training and the y (values to predict) dataset for testing. The test_size parameter determines the percentage of data to allocate to testing, and the random_state parameter sets a seed for the random generator, so that your train-test splits are always deterministic:
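The split can be sketched as follows (the arrays below are random stand-ins for the prepared NYC Taxi features and fare values, and the test_size and random_state values are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins for the prepared NYC Taxi features (x) and
# fare values (y) from the data-preparation tutorial.
x_df = np.random.rand(100, 5)
y_df = np.random.rand(100)

# Hold out 20% of the rows for testing; random_state makes the
# split deterministic across runs.
x_train, x_test, y_train, y_test = train_test_split(
    x_df, y_df, test_size=0.2, random_state=223
)
```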
The purpose of this step is to have data points to test the finished model that haven’t been used to train the model, in order to measure true accuracy. In other words, a well-trained model should be able to accurately make predictions from data it hasn’t already seen.
You now have the necessary packages and data ready for auto-training your model.
2. Run the model locally with custom parameters
To automatically train a model, take the following steps:
- Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.
- Submit the experiment for model tuning. After submitting the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.
Define settings for autogeneration and tuning
In the image below, you can see the full list of settings:
Submitting the experiment with these default settings takes approximately 10–15 minutes; if you want a shorter run time, reduce either iterations or iteration_timeout_minutes.
Use your defined training settings as a parameter to an AutoMLConfig object. Additionally, specify your training data and the type of model, which is regression in this case:
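For illustration, a plausible settings dictionary might look like the following (the keys follow the azureml-sdk v1 AutoMLConfig parameter names; the values are examples only, not the tutorial's exact choices):

```python
# Illustrative AutoML settings; the keys follow the azureml-sdk v1
# AutoMLConfig parameter names, and the values are examples only.
automl_settings = {
    "iteration_timeout_minutes": 10,           # cap on each iteration's runtime
    "iterations": 30,                          # algorithm/hyperparameter combos to try
    "primary_metric": "spearman_correlation",  # metric the search optimizes
    "n_cross_validations": 5,                  # folds used to score each candidate
}

# The settings are then unpacked into an AutoMLConfig object, e.g.
# (requires the azureml-train-automl package; x_train / y_train come
# from the earlier split):
#
#   from azureml.train.automl import AutoMLConfig
#   automated_ml_config = AutoMLConfig(task="regression",
#                                      X=x_train,
#                                      y=y_train,
#                                      **automl_settings)
```

Unpacking a settings dictionary this way keeps the run configuration in one place, so shortening the run (fewer iterations, a tighter timeout) is a one-line change.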
Train the automatic regression model
Start the experiment to run locally:
- Pass the defined automated_ml_config object to the experiment.
- Set the output to True to view progress during the experiment:
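Under the azureml-sdk v1 API, the local run could be sketched as follows (the experiment name is illustrative, and ws and automated_ml_config come from the earlier steps):

```python
# Sketch assuming azureml-sdk v1; "taxi-experiment" is an illustrative
# name, and ws / automated_ml_config come from the earlier steps.
from azureml.core.experiment import Experiment

experiment = Experiment(ws, "taxi-experiment")

# show_output=True streams per-iteration progress to the console;
# the article later refers to this run object as automl_classifier.
automl_classifier = experiment.submit(automated_ml_config, show_output=True)
```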
The output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field BEST tracks the best running training score based on your metric type.
3. Explore your model results
At this point, you can explore the results of automatic training with a Jupyter widget or by examining the experiment history. If you work in a Jupyter notebook, use the notebook widget to see a graph and a table of all results:
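As a sketch (assuming the azureml-widgets package, where automl_classifier is the submitted AutoML run object):

```python
# Sketch assuming the azureml-widgets package; automl_classifier is
# the submitted AutoML run object.
from azureml.widgets import RunDetails

# Renders a live table and chart of all iterations in the notebook.
RunDetails(automl_classifier).show()
```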
Retrieve the best model
You can select the best pipeline from our iterations. The get_output method on automl_classifier returns the best run and the fitted model for the last fit invocation. By using the overloads on get_output, you can retrieve the best run and fitted model for any logged metric or a particular iteration:
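A sketch of the three call shapes, assuming azureml-sdk v1 (the metric name and iteration number below are illustrative):

```python
# Sketch assuming azureml-sdk v1; automl_classifier is the completed run.
# Best run and fitted model from the last fit invocation:
best_run, fitted_model = automl_classifier.get_output()

# Overloads: best run/model for a specific logged metric, or the
# run/model from a particular iteration (values are illustrative).
metric_run, metric_model = automl_classifier.get_output(metric="spearman_correlation")
iter_run, iter_model = automl_classifier.get_output(iteration=1)
```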
Test the best model accuracy
At this point, you can use the best model to run predictions on the test dataset to predict taxi fares. The predict function uses the best model and predicts the values of y (trip cost) from the x_test dataset. Print the first 10 predicted cost values:
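Because the fitted model exposes the standard sklearn predict interface, the step can be sketched with any regressor standing in for the AutoML winner (the data below is random placeholder data, not the NYC Taxi set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Random stand-ins for the NYC Taxi features and fares; in the
# tutorial, fitted_model is the best model from the AutoML run.
x_train = np.random.rand(100, 5)
y_train = np.random.rand(100) * 50
x_test = np.random.rand(20, 5)

fitted_model = LinearRegression().fit(x_train, y_train)

# Predict trip cost for the test set and print the first 10 values.
y_predict = fitted_model.predict(x_test)
print(y_predict[:10])
```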
To calculate the root mean squared error of the results, use the y_test dataframe and convert it to a list to compare to the predicted values. The mean_squared_error function takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable (cost); it indicates roughly how far the taxi fare predictions are from the actual fares:
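The computation can be sketched as follows (the fare values below are small made-up examples; in the tutorial they are the y_test list and the y_predict values from the best model):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual vs. predicted fares; in the tutorial these are the
# y_test values (converted to a list) and the best model's predictions.
y_actual = [10.0, 12.5, 7.0, 20.0]
y_predict = [11.0, 12.0, 8.0, 18.5]

# mean_squared_error averages the squared differences; the square
# root puts the error back in the same units as the fare.
rmse = np.sqrt(mean_squared_error(y_actual, y_predict))
print(rmse)
```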
The resulting RMSE is 3.2204936862688798. From the final prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the dataset's features.
The traditional machine learning model development process is highly resource-intensive, and requires significant domain knowledge and time investment to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario.