Auto Text Classification using Google’s AutoML

Gunjit Bedi
Jun 22 · 9 min read

Natural language processing, and within it automatic text classification into predefined labels/themes, has made a lot of business sense in the past few years. Some of its applications are: assigning news articles to labels such as politics, sports, etc.; routing tickets generated from an organization-level ticketing/incident management tool to the concerned departments such as IT, HR and Admin; and, through sentiment analysis, classifying customer feedback into positive or negative comments. To learn more about text classification implementation in Python in general, you can refer to my earlier article here.

Organizations looking to implement a production-ready NLP model generally face the following high-level issues:

  • Availability of expert machine learning developers
  • Limited in-house hardware to build/train complex models
  • Efficiently scaling trained models to production environment

This is where Google’s Cloud AutoML comes into picture. In this article, I will take you through a step-by-step process to build a custom text classification model by using just the GUI (graphical user interface) of this application.

Introduction

Google has leveraged its years of experience in building state-of-the-art applications to create a platform that allows users with limited exposure to machine learning to build models that not only give excellent prediction results but also require very little development time.

Google’s Cloud AutoML platform allows users to build efficient models in the following domains. In this article we will focus on its Natural Language application:

  • Natural Language — Reveal the structure and meaning of text through machine learning.
  • Language Translation — Dynamically detect and translate between languages.
  • Vision — Derive insights from images in the cloud or at the edge.
  • Video Intelligence — Enable powerful content discovery and engaging video experiences.
  • Tables — Automatically build and deploy state-of-the-art machine learning models on structured data.
Source: https://cloud.google.com/automl/

Setting it up

Before we can start using the tool on our data, we need to follow a simple process to set up the application:

  1. Sign in with your Gmail account on the link https://cloud.google.com/
  2. Click on ‘Get started for free’ and fill in the simple signup form. Google will give you $300 in credit to use any of its services for the next 12 months, which is pretty sweet.
  3. Once you have successfully signed up, you will land on the below home page of the Google Cloud Platform. From the options mentioned under ‘ARTIFICIAL INTELLIGENCE’, click on ‘Natural Language’.
Google Cloud Platform: Home page

4. Next, we need to enable the Natural Language API so that we can leverage it to build our own custom models.

5. In the following page, launch the AutoML Text Classification app.

6. Next, we need to create storage specific to our model; for this, just click on the ‘SET UP NOW’ button. Don’t worry about the billing settings; they will be configured automatically once you proceed with the above.

7. And voila! We have set up a project space on the Google Cloud Platform which we can use to train/build our custom text classification model.

Data Preparation

To build any model, the first ingredient required is the DATA. It should contain the text inputs that you would like AutoML to classify, and the categories or labels (the answers) you want the ML system to predict.

Consider the following while sourcing your data for your model:

  • It is recommended to provide at least 1,000 training documents per label. The minimum number of documents per label is 10; however, you can improve your model’s confidence scores by using more examples per label.
  • There should not be any duplicate text inputs; the model will give a warning if there are.
  • For text inputs which do not fall into any of your predefined themes, consider labelling them as ‘None_of_the_above’. This can improve the accuracy of your model.
  • AutoML also supports multi-label classification models. You can use this to predict all the labels that apply to a text, rather than just one label per input.

For this article, I will be using a news classification data set. It has news descriptions under the text column, which are classified into four categories — World, Business, Sports and SciTech. You can find this data in my GitHub repository here.
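AutoML expects the upload to be a simple two-column CSV (text first, label second) without a header row. As a rough sketch, assuming hypothetical ‘description’ and ‘category’ columns in your raw data, you could prepare such a file with pandas like this:

```python
import pandas as pd

# Hypothetical raw data; in practice, read your own file with pd.read_csv()
df = pd.DataFrame({
    "description": [
        "Stocks rallied after the quarterly earnings report.",
        "The striker scored twice in the cup final.",
        "The striker scored twice in the cup final.",  # duplicate on purpose
    ],
    "category": ["Business", "Sports", "Sports"],
})

# AutoML warns on duplicate text inputs, so drop them up front
df = df.drop_duplicates(subset="description")

# Write a headerless two-column CSV: text input first, label second
df.to_csv("training_data.csv", index=False, header=False)
```

The column names here are made up for illustration; only the final two-column, headerless layout matters to AutoML.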

Model Building

We now have everything in place to train an AutoML model for text classification purposes. For this we need to follow a simple process:

Upload the Data

1. Click on the below icon.

2. On the create data set page, enter your data set name -> select single label classification -> finally, browse and upload the CSV file. It will take some time to upload the file; once this is done you will get an email on your registered Gmail account.

Create Data set View

3. After the file is successfully uploaded, you will see a report on your data. The report shows the number of labelled and unlabelled texts and the number of items under each label. Any errors that occurred while uploading the data will also be shown here.

Below is the report of our news classification data.

Report on the uploaded data

Training

To start training your model, you need to click on the ‘Train’ button. This can take several hours based on the complexity and size of your training data set. Once the training is completed you will again be notified via email. In the meantime you can close this window and relax!

This model took about 3.5 hours to train.




Evaluate your model

AutoML uses 10% (by default) of the data for model evaluation. It calculates the following three scores, which tell us how well our model performs on unseen data:

  1. Avg. precision — The area under the precision-recall curve, also known as average precision. It tells us how well the model distinguishes between classes; the higher the value, the better the model’s prediction power.
  2. Precision — The proportion of data points our model labelled as relevant that actually were relevant.
  3. Recall — The model’s ability to find all relevant instances in the dataset.
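To make these definitions concrete, here is a toy calculation of precision and recall for a single label, using made-up counts (not the numbers from our AutoML model):

```python
# Made-up prediction counts for the "Sports" label
true_positives = 90    # Sports articles correctly labelled Sports
false_positives = 10   # non-Sports articles wrongly labelled Sports
false_negatives = 5    # Sports articles the model missed

# Precision: of everything labelled Sports, how much really was Sports?
precision = true_positives / (true_positives + false_positives)

# Recall: of all true Sports articles, how many did the model find?
recall = true_positives / (true_positives + false_negatives)

print(round(precision, 3), round(recall, 3))  # 0.9 0.947
```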

Now, coming back to our news classification model: AutoML predicted the correct label with 97.4% accuracy, which is pretty good, and all we had to do was click a few buttons! This is the true power of Google’s Cloud AutoML!

Accuracy score

Predictions

The logical next step is to start using our trained model for predictions on unlabelled data. There are three ways we can use our model, which is actually stored in the cloud.

  1. UI — We can use the below UI to predict the label for one text input at a time. This could be useful for cases where we do not have many instances of unseen data; however, in a production environment, or when we have thousands of unseen texts, this method is not of much use.
UI for Predictions

2. Python — We can use Python-based scripts on our own systems to leverage this model to predict labels for a data file present locally on the system.

For this we first need to create a service account (a JSON file, which we will save locally) which will help us securely connect to our cloud-based model. Follow the below steps:

  • Go to the ‘Service accounts’ page of your project; this should be present under the settings tab.
Service account page
  • Enter a display name for this service account and click ‘Create’.
  • Next, we need to grant this service account access to the project. Select ‘AutoML’ -> ‘AutoML Admin’ -> and again ‘AutoML Admin’. This will give the service account administrative access to the project.
  • The next step, granting users access, is optional; you can just press ‘Done’ for now.
  • Once all the above steps have been completed, you will see that a service account has been created. However, no key (the JSON file) is associated with it yet.
Service account created
  • To create the required key, click on the three dots under the ‘Actions’ tab, then select ‘JSON’ and click ‘Create’. This will create the JSON file and prompt you to save it locally on your system.
Click on Create key and save the .json file

Next, we need to install the Google Cloud AutoML Python package; on a Windows system, run the below in your command prompt.

pip install google-cloud-automl

We now have everything we need to run our script. Save the below script as ‘predict.py’ alongside the JSON file we created earlier. Remember to update the “GOOGLE_APPLICATION_CREDENTIALS” value below with the name of your JSON file.

import sys
import os
from google.cloud import automl_v1beta1 as automl

# Point the client library at the service account key saved earlier
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "My First Project-ccf4469590bb.json"

def get_prediction(content, project_id, model_id):
    prediction_client = automl.PredictionServiceClient()
    # Full resource name of the trained model
    name = 'projects/{}/locations/us-central1/models/{}'.format(project_id, model_id)
    payload = {'text_snippet': {'content': content, 'mime_type': 'text/plain'}}
    params = {}
    response = prediction_client.predict(name, payload, params)
    return response  # blocks until the response is returned

if __name__ == '__main__':
    content = sys.argv[1]
    project_id = sys.argv[2]
    model_id = sys.argv[3]
    print(get_prediction(content, project_id, model_id))

Now run the below command in your command prompt; make sure you have set the path to the folder where you have the Python script and the .json file.

Please note: ‘carbon-vault-244215’ is the Project ID and ‘TCN7164071333836492456’ is the Model ID; update these with your respective IDs.

python predict.py "Wayne Rooney is Manchester united's legend" carbon-vault-244215 TCN7164071333836492456

After running the above command we get the below results. They tell us that the text we passed has a 99% probability of being a sports article, which is absolutely correct. Similarly, they also show the probabilities for the other news categories.
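If you want more than the raw printed response, you can pick out the top-scoring label programmatically. This sketch assumes the response object exposes a payload list whose entries carry display_name and classification.score (the shape used by the v1beta1 client); the fake response at the bottom is just a stand-in for illustration, since the real one comes back from the cloud:

```python
from types import SimpleNamespace

def top_label(response):
    # Pick the annotation with the highest classification score
    best = max(response.payload, key=lambda p: p.classification.score)
    return best.display_name, best.classification.score

# Stand-in object mimicking the response shape, for illustration only
fake_response = SimpleNamespace(payload=[
    SimpleNamespace(display_name="Sports",
                    classification=SimpleNamespace(score=0.99)),
    SimpleNamespace(display_name="World",
                    classification=SimpleNamespace(score=0.01)),
])

print(top_label(fake_response))  # ('Sports', 0.99)
```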

Prediction

I hope you have not forgotten that there is a third way too. Let’s get on with it :-)

3. REST API — We can also make predictions by sending a POST request to our model’s custom API endpoint. This also requires a service account, and we can reuse the one we created above.

You can use the below script to make the API call; again, remember to update it with your Model and Project IDs. The results will be exactly like those from the Python script.

export GOOGLE_APPLICATION_CREDENTIALS=key-file-path

curl -X POST \
-H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json" \
https://automl.googleapis.com/v1beta1/projects/carbon-vault-244215/locations/us-central1/models/TCN7164071333836492456:predict \
-d '{
  "payload": {
    "textSnippet": {
      "content": "Wayne Rooney is Manchester united'\''s legend",
      "mime_type": "text/plain"
    }
  }
}'
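If you prefer to stay in Python rather than shelling out to curl, the same URL and JSON body can be built and inspected before sending. This sketch only constructs the request pieces; an actual POST would still need an access token, exactly like the curl call. The project and model IDs are the article’s examples, so replace them with your own:

```python
import json

project_id = "carbon-vault-244215"      # replace with your Project ID
model_id = "TCN7164071333836492456"     # replace with your Model ID

url = ("https://automl.googleapis.com/v1beta1/projects/{}/locations/"
       "us-central1/models/{}:predict".format(project_id, model_id))

body = {
    "payload": {
        "textSnippet": {
            "content": "Wayne Rooney is Manchester united's legend",
            "mime_type": "text/plain",
        }
    }
}

print(url)
print(json.dumps(body))
```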

End

So that is it from my end, folks! I hope you found it easy to work with Google’s Cloud Platform. The effort we spent on building a high-performing model, without relying on our own hardware, was remarkably small, and with its implementation through Python or the REST API, scaling the model will not be a concern either.

If you have any thoughts, suggestions please feel free to comment or if you want, you can reach me at bedigunjit@gmail.com, I will try to get back to you as soon as I can.

You can reach me through LinkedIn too.

Hit the clap button or share it if you like the post.

Voice Tech Podcast

Voice technology interviews & articles. Learn from the experts.

Gunjit Bedi

Written by

Machine Learning Enthusiast
