Data Modeling with Google’s AutoML Tables

Katherine Ling
Strategic Marketing Intelligence
6 min read · Feb 19, 2021

Anyone who has ever built a machine learning model knows it can be a labor-intensive and time-consuming process. The demand for machine learning experts is higher than ever, and supply has not been keeping up.

With the goal of making machine learning more accessible, Google came up with the idea of having “machine-learning software take over some of the work of creating machine-learning software.”

In 2018, Google announced its machine learning cloud product: Cloud AutoML.

AutoML is a suite of machine learning products that allow those with limited machine learning experience to develop and train models more quickly and easily.

Since then, Google has launched:

  • AutoML Vision, to detect objects and classify images
  • AutoML Natural Language, to analyze and categorize texts
  • AutoML Video Intelligence, to detect and classify objects or segments in videos
  • AutoML Translate, to detect and translate between languages
  • and AutoML Tables, to build and deploy machine learning models on structured data

For this blog post, we will take a deep dive into Google’s AutoML Tables. We will build a regression model on the Ames Housing data set to determine which variables have the greatest impact on the sale price of a home. Variables include the year the home was sold, the square footage of the garage, whether or not the home has a fireplace or a pool, and much more. The full data set can be found on Kaggle.

The process was relatively straightforward and fast. With just a few clicks, it is possible to cut out a lot of the legwork that typically goes into training, running, and evaluating a machine learning model.

Keep in mind that even though AutoML simplifies much of the process that goes into building a model, it is still necessary to do data cleaning on your own before importing your data set.
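
To give a sense of what that pre-import cleanup might look like, here is a minimal pandas sketch. The file name, the 90% missing-value cutoff, and the LotFrontage example are just illustrative choices for the Kaggle version of the Ames data, not a prescribed recipe.

```python
import pandas as pd

# Illustrative: the Kaggle Ames Housing training file saved locally as train.csv.
df = pd.read_csv("train.csv")

# Drop columns that are almost entirely empty, plus any duplicate rows.
mostly_missing = [c for c in df.columns if df[c].isna().mean() > 0.9]
df = df.drop(columns=mostly_missing).drop_duplicates()

# Make sure numeric-looking columns really are numeric, so AutoML doesn't
# infer them as categorical (LotFrontage has missing values in this data set).
df["LotFrontage"] = pd.to_numeric(df["LotFrontage"], errors="coerce")

# Write out a clean CSV to upload directly or copy to Cloud Storage in Step 1.
df.to_csv("ames_clean.csv", index=False)
```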

Step 1: Import Data

The first step is to import the data set that will be used to build the model. AutoML offers three ways to import data:

  • Import through BigQuery
  • Select a CSV file from Cloud Storage
  • Upload a file directly from your computer

Note that if you choose to upload files directly from your computer, you will be required to select a location in Google Cloud Storage to store the file. Again, make sure you clean your data set before importing, as AutoML will not take care of this part for you.

Importing data can take up to 1 hour, depending on the size of your data set. In this example, our data set consisted of 81 columns and 1,460 rows, and took just under 5 minutes to import. On a different data set with 8 columns and 974,666 rows, the import process took about 20 minutes.
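
The console flow above is all we used for this post, but the same import can also be scripted. Here is a rough sketch using the beta Python client for AutoML Tables (google-cloud-automl’s automl_v1beta1.TablesClient); the project ID, region, and bucket path are placeholders, and the method names are taken from the beta client’s samples, so they may differ in other versions.

```python
from google.cloud import automl_v1beta1 as automl

# Placeholders: substitute your own project, region, and bucket.
client = automl.TablesClient(project="my-project", region="us-central1")

# Create an empty Tables dataset, then import the cleaned CSV from Cloud Storage.
client.create_dataset(dataset_display_name="ames_housing")
import_op = client.import_data(
    dataset_display_name="ames_housing",
    gcs_input_uris=["gs://my-bucket/ames_clean.csv"],
)
import_op.result()  # blocks until the import finishes
```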

Step 2: Set Parameters for Training

Once the data set has been imported, AutoML will display a summary of all of your columns.

Select the target variable for your model under “Target column.” For our example, we selected “SalePrice,” since we wanted to see which variables affect the sale price of a home the most.

One cool feature that comes along with this step is that AutoML will automatically calculate each column’s correlation with the selected target variable. It will also automatically determine whether the columns are numeric or categorical, and calculate missing value, invalid value, and distinct value counts.

You can also select whether or not you’d like missing values to be ignored.

What’s neat is that any encoding of categorical variables is unnecessary, as AutoML will automatically take care of this process for you — saving a lot of time. After verifying that all the information is correct, click on “train model”.
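
If you are scripting the workflow instead of clicking through the console, the same Step 2 settings can be made from the beta Python client. Continuing the sketch from Step 1 (same placeholder project, region, and dataset name; method names again come from the beta client samples and may vary by version), picking the target and marking a column as nullable looks roughly like this:

```python
from google.cloud import automl_v1beta1 as automl

client = automl.TablesClient(project="my-project", region="us-central1")

# Tell AutoML which column to predict.
client.set_target_column(
    dataset_display_name="ames_housing",
    column_spec_display_name="SalePrice",
)

# Optionally mark a feature as nullable so its missing values are tolerated
# rather than flagged as invalid (LotFrontage has gaps in the Kaggle data).
client.update_column_spec(
    dataset_display_name="ames_housing",
    column_spec_display_name="LotFrontage",
    nullable=True,
)
```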

Step 3: Train Your Model

Before you begin training your model, you will need to enter the maximum number of node hours. Note that AutoML suggests a node-hour budget based on the number of rows in your data set. By default, every column in your imported data set is included as a feature, though you may deselect any columns you don’t want in the model. Once all the parameters are chosen, you can click on “train model” to officially begin the training process.
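
Kicking off training from the beta client looks roughly like the sketch below. The budget is expressed in milli node hours (1,000 = 1 node hour), and excluding a column such as the data set’s Id field mirrors deselecting it in the console. As before, the model and dataset names are placeholders and the exact signature may vary by client version.

```python
from google.cloud import automl_v1beta1 as automl

client = automl.TablesClient(project="my-project", region="us-central1")

# train_budget_milli_node_hours: 1,000 milli node hours = 1 node hour.
create_op = client.create_model(
    "ames_price_model",
    dataset_display_name="ames_housing",
    train_budget_milli_node_hours=1000,
    # Equivalent to deselecting a column in the console (Id is just a row index).
    exclude_column_spec_names=["Id"],
)
model = create_op.result()  # training can take an hour or more
```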

Training this data set took a little more than 1 hour. For our second data set with 8 columns and 974,666 rows, the training process took 6 hours.

Google charges $19.52 per node hour of model training. The infrastructure setup, preprocessing, and teardown phases that surround training are not billed.
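
As a back-of-the-envelope example, a model that trains for 1 node hour costs about $20, while one that trains for 6 node hours works out to roughly 6 × $19.52 ≈ $117.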

Step 4: Evaluate Your Model

Since this is a regression model, Google AutoML will calculate the MAE, RMSE, R² and MAPE. It is not clear what regression models AutoML uses specifically, though it may be an ensemble of models.
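
As a refresher (and a handy way to sanity-check AutoML’s numbers against a model you built yourself), these four metrics have standard textbook definitions. The snippet below is not AutoML’s code, just the plain formulas in numpy, run on made-up sale prices:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Textbook versions of the metrics AutoML Tables reports for regression."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                    # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))             # root mean squared error
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    mape = np.mean(np.abs(err / y_true)) * 100    # mean absolute percentage error
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "MAPE": mape}

# Tiny illustrative example with made-up sale prices:
print(regression_metrics([200_000, 150_000, 300_000],
                         [190_000, 160_000, 290_000]))
```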

AutoML will also show the feature importance (after clicking on “see full evaluation”), indicating how important each feature is in predicting the target variable. This is one of the best parts of the whole process, as it saves a lot of time. Running recursive feature elimination manually can take several hours!

Step 5 (Optional): Test & Use Your Model

After training and evaluating your model, you will have the option of testing it by making predictions on other data sets. You may choose between a batch prediction and an online prediction. With a batch prediction, the model scores a data set stored in BigQuery or Google Cloud Storage. Online predictions, on the other hand, let you request predictions for individual rows in real time through a REST API.
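
Both options can also be driven from the beta Python client. The sketch below shows one plausible version of each: the bucket paths, model name, and input row are placeholders, a real online request would need every feature column, and the method names come from the beta client samples, so treat this as a rough guide rather than gospel.

```python
from google.cloud import automl_v1beta1 as automl

client = automl.TablesClient(project="my-project", region="us-central1")

# Batch prediction: score a CSV in Cloud Storage; results land back in the bucket.
batch_op = client.batch_predict(
    model_display_name="ames_price_model",
    gcs_input_uris=["gs://my-bucket/ames_test.csv"],
    gcs_output_uri_prefix="gs://my-bucket/predictions/",
)
batch_op.result()

# Online prediction: deploy the model first (deployment is billed), then send a row.
client.deploy_model(model_display_name="ames_price_model").result()
row = {"YrSold": 2010, "GarageArea": 500, "Fireplaces": 1}  # illustrative subset only
print(client.predict(model_display_name="ames_price_model", inputs=row))
```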

In our example, we did not test our model, but keep in mind that predictions do incur charges. Batch predictions cost $1.16 per hour of compute resources. Online predictions cost $0.21 per hour of compute resources and also require the model to be deployed, which incurs its own charges. For a more detailed explanation of costs, see Google’s AutoML Tables pricing page.

Final Thoughts

Before using AutoML Tables, we built and tested our models manually. The results we got were similar to those AutoML Tables produced, which both validates our results and speaks to the quality of the platform.

AutoML Tables will not help with data exploration or cleaning, which still needs to be done manually prior to importing the data set. For those on a low budget, this product can become costly, especially when used with larger models (as training alone costs almost $20 per node hour, and larger data sets can take more than 6 hours to train!). Overall, though, AutoML Tables is an impressive product that allows users to save time and build good-quality machine learning models in a way that is much cleaner, easier, and more time-efficient.

Source: https://www.technologyreview.com/2017/05/17/151652/why-googles-ceo-is-excited-about-automating-artificial-intelligence/
