Text Classification using AutoML Tables | Google Cloud Platform

Nishit Kamdar
Google Cloud - Community
Feb 15, 2022

What is AutoML Tables?

AutoML Tables is a supervised machine learning service that uses tabular data to build and deploy state-of-the-art machine learning models on structured data at massively increased speed and scale.

It is an extension to GCP’s core AutoML platform that abstracts away custom model-building tasks such as feature engineering, hyper-parameter tuning and ML model selection, and provides an automated way of building the right model from just the relevant data.

To start with, let’s take a simple example of how AutoML Tables can be used.

The image below shows a list of features (columns 1 to 6) that help determine the price of a house, which is shown in the final_price column on the right.

This house sales dataset can be configured as a source for the AutoML Tables service, which automatically builds and provisions a model that predicts the price of a house from the input feature column values. Because the target here is a numeric price, this is a regression model rather than a classification model.

Under the hood, the heavy lifting of input data analysis, feature engineering, model selection, hyper-parameter tuning and evaluation is fully managed by the AutoML platform without any manual intervention. Deploying the model is also a UI-driven, no-code step that creates a scalable REST-based microservice endpoint.

AutoML — under the hood.

One common misconception about AutoML Tables is that the cells of the table can only contain numbers. In reality, AutoML Tables also works with other data types such as timestamps, long text descriptions and arrays, alongside numbers. We will use long text description attributes in our example to build the AutoML classification model.

Why use AutoML and What are we solving for?

Problem Statement: The customer receives electronics inventory data from various vendors. The data is structured and tabular, covering electronic goods such as mobile phones, laptops, headphones and scanners, each with complex long-text product descriptions, models, versions, sizes, SKUs etc. Each of these line items needs to be classified into its correct category code. This code is then used for critical downstream processes such as planning, stock and inventory forecasting, billing and payment, so it needs to be very accurate.

Current Solution: The in-house data science team built a custom ML classification model starting with data from one vendor, but as they started adding data from more vendors, the accuracy dropped below 80% due to the complex and overlapping nature of product descriptions across vendors.

To summarise the key challenges:

  1. Cost and Scale: Limited data-science resources for custom model builds per vendor. Would require additional investments to ramp up and scale to add new vendors.
  2. Speed: Long custom build cycle of 3 to 6 months affecting the speed to empanel and roll-out new vendors.
  3. Accuracy: Low accuracy of the model built with data across all vendors.

GCP’s AutoML Tables to the rescue!

Having understood the key challenges and qualified this as a machine learning classification problem on well-formatted, structured tabular data, we proposed Google Cloud AutoML Tables to address the challenges highlighted above:

  1. Cost: It is a no-code ML platform and does not require data science skills.
  2. Speed: AutoML Tables can create models in a matter of hours to days, as opposed to months or years.
  3. Scale: The platform can be configured to scale for new vendor enrolments automatically by setting up model retraining MLOps pipelines.

So, how did we go about it?

Step 1. Creating the AutoML Tables Dataset:

The first step in the process is to create the dataset that will be used as input. This step requires exploring, cleansing and formatting the data, and transforming it into a structured tabular format.

Following is the sample dataset of electronics items and their features (all text!). The feature columns uniquely define each item and map it to the corresponding category code:

Dataset

Navigate to GCP Console → Vertex AI → Datasets → Create dataset

  1. Provide a name of the Dataset — the dataset will be referenced by this name throughout the process.
  2. Select Tabular as the data type and Classification as the Objective.
  3. Select region.
  4. Click Create.

3. Setting up the Data Source:

The AutoML Tables platform provides multiple options to configure the data source, e.g. upload the data file, select it from Cloud Storage, or point to data that is already stored in BigQuery tables.

Data Analysis: Once the data source is mapped, AutoML Tables shows a summary of your dataset, including the table columns, number of rows and location. You can also optionally click the Generate statistics link to see column-level statistics such as missing and distinct values.
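The same choice of data source can also be made programmatically. The following is a minimal sketch, assuming the Vertex AI Python SDK (`google-cloud-aiplatform`) is installed and authenticated; the helper names `dataset_source_kwargs` and `create_dataset` are hypothetical and only illustrate how a Cloud Storage or BigQuery URI could be passed to `TabularDataset.create`.

```python
from typing import Dict, List, Union


def dataset_source_kwargs(uri: str) -> Dict[str, Union[str, List[str]]]:
    """Map a data source URI to the keyword argument for TabularDataset.create.

    Hypothetical helper: a gs:// URI is treated as a CSV file in Cloud
    Storage, and a bq:// URI as a BigQuery table.
    """
    if uri.startswith("gs://"):
        return {"gcs_source": [uri]}
    if uri.startswith("bq://"):
        return {"bq_source": uri}
    raise ValueError(f"Unsupported data source URI: {uri}")


def create_dataset(project: str, location: str, display_name: str, uri: str):
    """Create a tabular dataset. Needs GCP credentials, so it is not run here."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    return aiplatform.TabularDataset.create(
        display_name=display_name,
        **dataset_source_kwargs(uri),
    )
```

A call such as `create_dataset("my-project", "us-central1", "electronics-items", "gs://my-bucket/items.csv")` would then mirror the console steps; all four values here are placeholders.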
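To get a feel for what missing and distinct value statistics mean before uploading anything, a similar check can be run locally. This is only an illustrative sketch using the Python standard library, not what the platform runs internally; the sample rows and the `Vendor` and `Description` column names are made up, while `Category_Code` is the target column used in this post.

```python
import csv
import io


def column_stats(csv_text: str) -> dict:
    """Count missing and distinct values per column of a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {name: set() for name in reader.fieldnames or []}
    missing = {name: 0 for name in seen}
    for row in reader:
        for name in seen:
            value = (row.get(name) or "").strip()
            if value:
                seen[name].add(value)
            else:
                # Empty or absent cells are counted as missing.
                missing[name] += 1
    return {
        name: {"missing": missing[name], "distinct": len(seen[name])}
        for name in seen
    }


# Hypothetical sample rows, for illustration only.
SAMPLE = (
    "Vendor,Description,Category_Code\n"
    "V1,Wireless over-ear headphones,AUD-01\n"
    "V2,,AUD-01\n"
    "V1,15 inch laptop 16GB RAM,CMP-02\n"
)

stats = column_stats(SAMPLE)
for column, values in stats.items():
    print(column, values)
```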

4. Training:

Click the ‘Train New Model’ button in the above screen to initiate the training.

The dataset will already be selected. Select Classification as the Objective and AutoML as the Training option, then click Continue.

Specify a model name and select the target column, which is the outcome the model will predict (Category_Code in our case). You can also explore the ‘Advanced Options’, which let you control how the data is split into training and test sets.

Next, AutoML Tables provides options to select the optimisation objective and a weight column. You can keep the defaults, and the platform will determine the weighting itself.

Click ‘Continue’.

The next page will allow you to input the number of node hours that the platform will use for training the model. This is critical from a cost standpoint as the higher the node hours, the more it will cost to train the model. A recommendation based on the number of rows is also provided as a best practice.

Input the number of training hours and click on ‘Start Training’.

This kicks off the training job, and the platform starts executing the underlying machine learning development lifecycle.
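For reference, the training steps above could be expressed with the Vertex AI Python SDK roughly as follows. This is a sketch under assumptions, not the method used in the project: it assumes `google-cloud-aiplatform` is installed and `aiplatform.init` has been called, and the helper names are hypothetical. The SDK takes the training budget in milli node hours, so 1 node hour is 1000.

```python
def node_hours_to_milli(node_hours: float) -> int:
    """Convert node hours (as entered in the console) to milli node hours."""
    if node_hours <= 0:
        raise ValueError("node_hours must be positive")
    return int(round(node_hours * 1000))


def train_classifier(dataset, model_name: str,
                     target_column: str = "Category_Code",
                     node_hours: float = 1.0):
    """Start an AutoML tabular classification job. Needs GCP access, not run here."""
    # Imported lazily so node_hours_to_milli works without the SDK installed.
    from google.cloud import aiplatform

    job = aiplatform.AutoMLTabularTrainingJob(
        display_name=f"{model_name}-training",
        optimization_prediction_type="classification",
    )
    # The optimisation objective and data split are left at their defaults,
    # matching the "keep it as default" choice made in the console.
    return job.run(
        dataset=dataset,
        target_column=target_column,
        budget_milli_node_hours=node_hours_to_milli(node_hours),
        model_display_name=model_name,
    )
```

Since the budget directly drives cost, keeping the conversion in one small function makes it easier to review what is being requested before a job is started.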

To know the status of the training job, click on Vertex AI → Training.

Each status change sends a notification email to let the user know whether the training has succeeded or failed. On successful completion of the training, the status changes to ‘Finished’ as shown above.

5. Model Evaluation and Test:

At this stage, the model creation is complete. Click on the model and it will take you to the Model details page.

The first tab ‘Evaluate’ provides various model performance evaluation statistics, confusion matrix and feature attribution details.

Model Metrics
Confusion Matrix
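To make the link between the confusion matrix and the model metrics concrete, here is a small self-contained sketch that derives accuracy, precision and recall from a matrix of counts. The labels and counts are invented for illustration and are not results from this project.

```python
def accuracy(matrix):
    """Share of all items that sit on the diagonal (predicted == actual)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_metrics(labels, matrix):
    """matrix[i][j] = number of items with true label i predicted as label j."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        actual = sum(matrix[i])                      # row total
        predicted = sum(row[i] for row in matrix)    # column total
        metrics[label] = {
            "precision": true_positives / predicted if predicted else 0.0,
            "recall": true_positives / actual if actual else 0.0,
        }
    return metrics


# Hypothetical two-class example.
LABELS = ["AUD-01", "CMP-02"]
MATRIX = [
    [8, 2],  # true AUD-01: 8 predicted correctly, 2 predicted as CMP-02
    [1, 9],  # true CMP-02: 1 predicted as AUD-01, 9 predicted correctly
]

print(accuracy(MATRIX))
print(per_class_metrics(LABELS, MATRIX))
```

Off-diagonal cells are where overlapping product descriptions show up, so they are usually the first place to look when a category is being confused with another.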

The Feature attribution chart provides a view of features that contribute to the predictions for each given instance.

Feature Attribution

6. Model Deployment and Test — Online Predictions

Click on the Deploy & Test tab.

The Deploy & Test page provides a “Deploy to endpoint” option that automatically creates an endpoint and deploys the model to it as an online REST microservice.

It also offers additional configuration options for traffic split, number of nodes, autoscaling, node specification, monitoring etc.

Once you have configured these, click ‘Deploy’ to create the endpoint.
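The same deployment settings map onto the Vertex AI Python SDK's `Model.deploy` call. The sketch below is an assumption-laden illustration: the helper names, the `n1-standard-4` machine type and the replica counts are example values chosen here, not recommendations from the project.

```python
def deployment_settings(machine_type: str = "n1-standard-4",
                        min_nodes: int = 1,
                        max_nodes: int = 2,
                        traffic_percentage: int = 100) -> dict:
    """Validate deployment choices and return keyword arguments for Model.deploy."""
    if min_nodes < 1:
        raise ValueError("min_nodes must be at least 1")
    if max_nodes < min_nodes:
        raise ValueError("max_nodes must be >= min_nodes")
    if not 0 <= traffic_percentage <= 100:
        raise ValueError("traffic_percentage must be between 0 and 100")
    return {
        "machine_type": machine_type,
        "min_replica_count": min_nodes,    # lower bound for autoscaling
        "max_replica_count": max_nodes,    # upper bound for autoscaling
        "traffic_percentage": traffic_percentage,
    }


def deploy_model(model, **choices):
    """Deploy a trained aiplatform.Model to a new endpoint. Needs GCP access."""
    return model.deploy(**deployment_settings(**choices))
```

With a trained model object in hand, `deploy_model(model, max_nodes=3)` would return the endpoint that serves online predictions.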

Click on Vertex AI → Endpoints and you will see the deployed model endpoint which is ready to be consumed.

Sample REST request/response:
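As a rough sketch of the shape such a request and response might take, the Python below builds the `:predict` URL and JSON body for a Vertex AI endpoint and picks the highest-scoring class from a response. The project, region, endpoint ID, feature column names and scores are all placeholders, and the `classes`/`scores` layout is the format I would expect from an AutoML tabular classification model, so verify it against your own endpoint.

```python
import json


def predict_url(project: str, location: str, endpoint_id: str) -> str:
    """Build the REST URL for online prediction on a Vertex AI endpoint."""
    return (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project}/locations/{location}/"
        f"endpoints/{endpoint_id}:predict"
    )


def predict_body(rows: list) -> str:
    """Wrap feature rows in the 'instances' envelope the predict method expects."""
    return json.dumps({"instances": rows})


def top_class(prediction: dict) -> tuple:
    """Return the (class, score) pair with the highest score."""
    classes, scores = prediction["classes"], prediction["scores"]
    best = max(range(len(scores)), key=scores.__getitem__)
    return classes[best], scores[best]


# Placeholder identifiers and hypothetical feature columns.
url = predict_url("my-project", "us-central1", "1234567890")
body = predict_body([
    {"Vendor": "V1", "Product_Description": "Wireless over-ear headphones"}
])

# Assumed response shape for one instance, with invented scores.
sample_response = {
    "predictions": [
        {"classes": ["AUD-01", "CMP-02"], "scores": [0.97, 0.03]}
    ]
}
print(url)
print(body)
print(top_class(sample_response["predictions"][0]))
```

The request itself would be sent as an HTTP POST with an OAuth bearer token in the `Authorization` header.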

7. Batch Predictions:

AutoML Tables also supports batch predictions, which can be configured to run the model's predictions over an entire input dataset.
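A batch prediction job can also be started from the Vertex AI Python SDK with `Model.batch_predict`. The sketch below is a simplified illustration: the helper names are hypothetical, and it assumes the input and output both live in Cloud Storage (CSV) or both in BigQuery, which is narrower than what the service allows.

```python
def batch_io_kwargs(source: str, destination: str) -> dict:
    """Choose batch_predict input/output arguments from the URI schemes."""
    if source.startswith("gs://") and destination.startswith("gs://"):
        return {
            "gcs_source": source,
            "gcs_destination_prefix": destination,
            "instances_format": "csv",
            "predictions_format": "csv",
        }
    if source.startswith("bq://") and destination.startswith("bq://"):
        return {
            "bigquery_source": source,
            "bigquery_destination_prefix": destination,
            "instances_format": "bigquery",
            "predictions_format": "bigquery",
        }
    raise ValueError("source and destination must both be gs:// or both be bq://")


def run_batch_prediction(model, job_name: str, source: str, destination: str):
    """Run a batch prediction job on a trained aiplatform.Model. Needs GCP access."""
    return model.batch_predict(
        job_display_name=job_name,
        **batch_io_kwargs(source, destination),
    )
```

Unlike online prediction, this does not require the model to be deployed to an endpoint first.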

What was the Outcome?

With GCP AutoML Tables, we were able to build and deliver a classification model covering all the vendors in a matter of days, without additional costs and with an accuracy of over 98%.

Summary and Conclusion:

Google Cloud AutoML Tables is a state-of-the-art machine learning platform to build and deploy models on structured data at massively increased speed, scale and accuracy.

It is, however, not a one-stop-shop machine learning solution for every need, so each use case needs to be qualified first to get the best outcomes for your customer.

For a complete view of the Google Cloud AI platform — please visit https://cloud.google.com/vertex-ai

Hope this was helpful, and all the best with your AutoML Tables workloads!

Data and Artificial Intelligence specialist at Google. This blog is based on “My experiences from the field”. Views are solely mine.