Francesca Lazzeri
Jun 24 · 5 min read

In this article you will learn how to automatically generate a regression model to predict taxi fare prices by using automated machine learning capabilities within Azure Machine Learning service.

Moreover, you will learn how to launch an automated machine learning process to allow algorithm selection and hyperparameter tuning. Automated machine learning iterates over many combinations of algorithms and hyperparameters until it finds the best model based on your criterion.

Specifically, you will learn the following tasks:

  1. Auto-train a regression model.
  2. Run the model locally with custom parameters.
  3. Explore your model results.

1. Auto-train a regression model

In this article, I assume you have already downloaded the data from Azure Open Datasets and run through the data preparation steps of the NYC Taxi data tutorial, so that the data can be used to build our machine learning model.

Configure workspace

Let’s start by creating our workspace object from the existing workspace.

Workspace is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. Specifically, Workspace.from_config() reads the file config.json and loads the authentication details into an object named ws.

ws is used throughout the rest of the code in this tutorial:
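As a minimal sketch of this step, assuming the azureml-core package (Azure Machine Learning SDK v1) is installed and a config.json file sits in your working directory, the workspace object could be created as shown here. The helper names missing_config_keys and load_workspace are hypothetical; the three keys checked are the ones a workspace config.json normally carries.

```python
import json
from pathlib import Path

# Keys that a workspace config.json is expected to provide.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(path="config.json"):
    """Return the required keys that are absent from the config file."""
    config = json.loads(Path(path).read_text())
    return [key for key in REQUIRED_KEYS if key not in config]


def load_workspace(path="config.json"):
    """Create the ws object from an existing workspace (hypothetical helper)."""
    # Imported lazily so the config check above also works without the SDK.
    from azureml.core import Workspace

    missing = missing_config_keys(path)
    if missing:
        raise ValueError(f"{path} is missing: {missing}")
    return Workspace.from_config(path=path)
```

With a valid config.json in place, ws = load_workspace() gives you the object used in the rest of the walkthrough.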

Split the data into train and test sets

You can start by splitting the data into training and test sets by using the train_test_split function in the sklearn library. This function splits both the x dataset (the features) and the y dataset (the values to predict) into one portion for model training and one portion for testing.

The test_size parameter determines the proportion of data to allocate to testing. The random_state parameter sets the seed of the random number generator, so that your train-test splits are reproducible:
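Here is a small sketch of the split, assuming scikit-learn is installed. The ten rows below are made-up stand-ins for the prepared taxi data, and the values test_size=0.2 and random_state=223 are illustrative choices rather than required ones.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the prepared taxi data (values are made up for illustration).
# Each feature row is [distance, passengers]; y holds the cost to predict.
x = [[1.2, 1], [3.4, 2], [0.8, 1], [5.1, 3], [2.2, 1],
     [7.9, 4], [4.4, 2], [1.9, 1], [6.3, 2], [2.8, 1]]
y = [6.5, 12.0, 5.0, 18.5, 9.0, 27.0, 15.5, 8.0, 21.0, 10.5]

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=223
)

# Repeating the call with the same seed reproduces the same split.
_, x_test_again, _, y_test_again = train_test_split(
    x, y, test_size=0.2, random_state=223
)

print(len(x_train), len(x_test))  # 8 2
print(x_test == x_test_again)     # True
```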

The purpose of this step is to have data points to test the finished model that haven’t been used to train the model, in order to measure true accuracy. In other words, a well-trained model should be able to accurately make predictions from data it hasn’t already seen.

You now have the necessary packages and data ready for autotraining your model.

2. Run the model locally with custom parameters

To automatically train a model, take the following steps:

  • Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.
  • Submit the experiment for model tuning. After submitting the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.

Define settings for autogeneration and tuning

In the image below, you can see the full list of settings:

Submitting the experiment with these default settings takes approximately 10–15 minutes. If you want a shorter run time, reduce either iterations or iteration_timeout_minutes.
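As a hedged sketch, training settings of this kind can be written as a plain Python dictionary. The keys below are parameter names from the Azure Machine Learning SDK v1; the values are assumptions chosen for illustration, not necessarily the ones used for the run described here, and shorter_run is a hypothetical helper.

```python
import logging

# Illustrative settings; the values are assumptions for this sketch.
automl_settings = {
    "iteration_timeout_minutes": 10,           # time limit for each iteration
    "iterations": 30,                          # algorithm/hyperparameter combinations to try
    "primary_metric": "spearman_correlation",  # metric to optimize
    "n_cross_validations": 5,                  # number of cross-validation folds
    "verbosity": logging.INFO,                 # logging level for the run
}


def shorter_run(settings, **overrides):
    """Return a copy of the settings with some values replaced."""
    return {**settings, **overrides}


# Fewer iterations means fewer models to train, so the run finishes sooner.
quick_settings = shorter_run(automl_settings, iterations=10)
print(quick_settings["iterations"])  # 10
```

Copying the dictionary instead of editing it in place keeps the original settings available if you want to compare a quick run with a full one.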

Use your defined training settings as a parameter to an AutoMLConfig object. Additionally, specify your training data and the type of model, which is regression in this case:
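The sketch below shows one way this could look, assuming the azureml-train-automl package from SDK v1 is installed. The helper names are hypothetical, and the X and y arguments follow early SDK v1 releases; later releases take the training data through different parameters, so check the version you have installed.

```python
def automl_config_kwargs(x_train, y_train, settings):
    """Collect the keyword arguments for a regression AutoMLConfig."""
    kwargs = {
        "task": "regression",                    # type of model to train
        "debug_log": "automated_ml_errors.log",  # local file for error details
        "X": x_train,                            # training features
        "y": y_train,                            # training target values
    }
    kwargs.update(settings)                      # your defined training settings
    return kwargs


def build_automl_config(x_train, y_train, settings):
    """Create the AutoMLConfig object (hypothetical helper)."""
    # Imported lazily; this line requires the Azure ML SDK to be installed.
    from azureml.train.automl import AutoMLConfig

    return AutoMLConfig(**automl_config_kwargs(x_train, y_train, settings))
```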

Train the automatic regression model

Start the experiment to run locally:

  1. Pass the defined automated_ml_config object to the experiment.
  2. Set show_output to True to view progress during the experiment:
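These two steps can be sketched as follows, assuming the SDK v1 Experiment class. The experiment name and both helper names are hypothetical.

```python
def create_experiment(ws, name="automated-ml-regression"):
    """Create an experiment in the workspace (the name is a hypothetical example)."""
    # Imported lazily; this line requires the Azure ML SDK to be installed.
    from azureml.core.experiment import Experiment

    return Experiment(ws, name)


def submit_local_run(experiment, automated_ml_config):
    """Submit the configuration and stream progress to the console."""
    # show_output=True prints a line for each iteration as it completes.
    return experiment.submit(automated_ml_config, show_output=True)
```

A typical call would be local_run = submit_local_run(create_experiment(ws), automated_ml_config), where the returned run object is what you inspect afterwards.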

The output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field BEST tracks the best running training score based on your metric type.

3. Explore your model results

At this point, you can explore the results of automatic training with a Jupyter widget or by examining the experiment history. If you work in a Jupyter notebook, the widget shows a graph and a table of all results:

Retrieve the best model

You can select the best pipeline from our iterations. The get_output method on automl_classifier returns the best run and the fitted model for the last fit invocation. By using the overloads on get_output, you can retrieve the best run and fitted model for any logged metric or a particular iteration:
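A short sketch of these retrieval options, assuming run is the object returned when the experiment was submitted. The helper name retrieve_model is hypothetical; get_output and its metric and iteration arguments come from the SDK v1 automated ML run class.

```python
def retrieve_model(run, metric=None, iteration=None):
    """Return (best_run, fitted_model) from an automated ML run.

    With no arguments, the best model for the primary metric is returned.
    If both arguments are given, this helper prefers the iteration.
    """
    if iteration is not None:
        # Model produced by one particular iteration.
        return run.get_output(iteration=iteration)
    if metric is not None:
        # Best model according to another logged metric.
        return run.get_output(metric=metric)
    return run.get_output()
```

For example, best_run, fitted_model = retrieve_model(local_run) returns the overall best pipeline, while retrieve_model(local_run, iteration=3) returns the one from the iteration numbered 3.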

Test the best model accuracy

At this point, you can use the best model to run predictions on the test dataset to predict taxi fares. The function predict uses the best model and predicts the values of y, trip cost, from the x_test dataset. Print the first 10 predicted cost values from y_predict:
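As a rough sketch, assuming fitted_model is the model returned by get_output and x_test is the held-out feature set, the prediction step might look like this. The helper name predict_fares is hypothetical.

```python
def predict_fares(fitted_model, x_test, preview=10):
    """Predict trip costs for x_test and print the first few values."""
    y_predict = fitted_model.predict(x_test)
    # Show only the first `preview` predictions to keep the output short.
    print(list(y_predict[:preview]))
    return y_predict
```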

To calculate the root mean squared error of the results, convert the y_test dataframe to a list so that it can be compared with the predicted values.

The function mean_squared_error takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, cost. It indicates roughly how far the taxi fare predictions are from the actual fares:
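To make the arithmetic concrete, here is a plain-Python version of the calculation. It computes the same quantity as taking the square root of sklearn's mean_squared_error, written out by hand for illustration; the four fares are made-up numbers, not results from the taxi model.

```python
from math import sqrt


def root_mean_squared_error(y_actual, y_predict):
    """Square root of the average squared difference between two lists."""
    if len(y_actual) != len(y_predict):
        raise ValueError("both lists must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(y_actual, y_predict)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fares: the errors are -3, 4, 0 and 0, so the squared errors sum to 25.
y_actual = [10.0, 12.5, 7.0, 20.0]
y_predict = [13.0, 8.5, 7.0, 20.0]

rmse = root_mean_squared_error(y_actual, y_predict)
print(rmse)  # 2.5
```

Because the errors are squared before averaging, a few large misses raise the result more than many small ones.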

The rmse result is 3.2204936862688798

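An RMSE of about 3.22 means that the predicted fares typically differ from the actual fares by roughly 3.22 in the units of the cost column, which suggests that the model is fairly good at predicting taxi fares from the data set’s features.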

The traditional machine learning model development process is highly resource-intensive, and requires significant domain knowledge and time investment to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario.

Microsoft Azure

Any language. Any platform. Our team is focused on making the world more amazing for developers and IT operations communities with the best that Microsoft Azure can provide. If you want to contribute in this journey with us, contact us at medium@microsoft.com

Written by Francesca Lazzeri

Senior Machine Learning Scientist at Microsoft. My goal is to make developers like you awesome at applied AI and Machine Learning. @frlazzeri on Twitter.
