Image Source: https://globalcloudplatforms.com/2020/11/07/introducing-document-ai-platform-a-unified-console-for-document-processing/

GCP’s Natural Language AutoML magic for document classification

Daisy
Google Cloud - Community
5 min read · Aug 25, 2022

Text and document classification is a very prevalent ML use case in industry, where a considerable amount of textual information drives every sector. Be it retail, healthcare, e-commerce, automotive, banking, or finance, there will always be a use case that requires text and document classification. The input might be an image, text paragraphs, an HTML page, or a PDF.

PDFs and images are the most common document formats used in industry. Classifying a text snippet is easier than classifying a PDF or an image document, because the latter comes with the challenge of first extracting the text. The obvious solution for a developer would be to run an OCR engine to extract the text before classification. But adding an OCR engine to your pipeline carries the overhead of extra effort, affecting both the development time and the response time of the solution, and the classification performance then depends on the performance of the OCR extraction. Moreover, text extraction often fails to preserve the original layout of the text, which can mean a loss of information if the document's structure was an essential feature for classification in your use case.

This is where Google's Natural Language text and document classification model turns out to be a magical solution. The AutoML model enables developers with limited machine learning expertise to train and deliver high-quality custom models specific to their business needs in minutes. Google Cloud's AutoML solution accepts PDF files directly as classification input, removing the need to perform any text extraction before classification. I used AutoML for a use case where the dataset was highly unstructured and a lot of variety was present within each class; even so, the model's performance was 96% during training and above 90% on unseen test samples. I recommend AutoML for text and document classification, especially when processing PDF or image documents.

To train your custom model, you provide representative samples of the type of documents you want to analyze. Remember that the quality of your training data strongly impacts the effectiveness of the model you create and, by extension, the quality of its predictions. So try to include as much variety as possible for your use case.
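When assembling training data, one practical chore is building the CSV manifest that the import step expects: one row per document, with the file's GCS URI followed by a comma and its class label. A minimal sketch in Python, assuming hypothetical bucket paths and labels:

```python
import csv

# Hypothetical GCS URIs and class labels; replace with your own documents.
labeled_docs = [
    ("gs://my-bucket/train/invoice_001.pdf", "invoice"),
    ("gs://my-bucket/train/receipt_001.pdf", "receipt"),
    ("gs://my-bucket/train/contract_001.pdf", "contract"),
]

# Each row: GCS URI of the document, then its label.
with open("import_manifest.csv", "w", newline="") as f:
    csv.writer(f).writerows(labeled_docs)
```

Upload the resulting CSV to GCS (or pick it from your local machine) during the import step described below.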

How to use GCP AutoML for classifying Textual Documents

  1. Log in to your GCP account.
  2. In the left panel of GCP products, find Natural Language.
  3. Go to the Natural Language API.
  4. Select AutoML text and document classification.
  5. Click the New Dataset button at the top.
  6. Add a dataset name, select the region where you want your dataset to be created, and choose single-label classification if only one label is needed per document or multi-label classification if a document can carry more than one label, then click Create Dataset. A detailed description of dataset creation can be found at https://cloud.google.com/natural-language/automl/docs/datasets. To create a dataset through code, follow https://cloud.google.com/natural-language/automl/docs/datasets#python
  7. Once the dataset is created, you will see it added to the datasets page.
  8. Click on the dataset you created and go to the Import tab.
  9. There are three ways to import your dataset: i) by uploading a CSV from your local machine, ii) by selecting a CSV already in Google Cloud Storage (GCS), or iii) by directly uploading a zip folder from your local machine. For the CSV options, each row must contain the GCS URI of a document followed by a comma and its class name; for the zip option, the archive must contain one subfolder per class, named after the class, with that class's documents inside. Follow https://cloud.google.com/natural-language/automl/docs/datasets to know more about dataset import. To import your data through code, follow https://cloud.google.com/natural-language/automl/docs/datasets#python
  10. During the data import, you also need to provide a destination bucket in GCS, which AutoML uses as an output bucket while importing the dataset.
  11. Click Import. Importing a dataset usually takes some time; when the data is imported, you will receive an email notification on your GCP account.
  12. After the import is over, you will see your files with their assigned class labels in the Items tab inside the dataset.
  13. Once the labels are available, click the Train tab and start training. Training takes time, and once it completes, you will receive an email notification. The required training time depends on several factors, such as the size of the dataset, the nature of the training items, and the complexity of the model. To train your model through Python code, follow https://cloud.google.com/natural-language/automl/docs/models#python. See https://cloud.google.com/natural-language/automl/docs/models for a deeper understanding of AutoML model training.
  14. After the model is trained, go to the Test & Use tab and deploy your model. Deployment takes a few minutes, and you receive an email notification once it is done.
  15. Once the model is deployed, you can test it in the GCP AutoML GUI by selecting a file from GCS and clicking the Predict button in the Test & Use tab. For more details on prediction, follow https://cloud.google.com/natural-language/automl/docs/predict. To get the model's prediction from a script, use Google's AutoML client library and call the model's prediction endpoint in your code; the response is in JSON format. Documentation for the Python client library: https://cloud.google.com/natural-language/automl/docs/predict#python. AutoML also gives you the flexibility to perform batch prediction; to do so, follow https://cloud.google.com/natural-language/automl/docs/predict#python
  16. To check the model's performance metrics, go to the Evaluate tab. AutoML Natural Language provides an aggregate set of evaluation metrics indicating how well the model performs overall, as well as per-label metrics indicating how well the model performs for each category. You will find the model's performance on the test data from the split performed during training; AutoML automatically splits the input dataset into training, validation, and test sets (80–10–10). You can also edit the confidence threshold of your model to adjust the precision and recall values for your use case. An explanation of the evaluation metrics used by AutoML can be found at https://cloud.google.com/natural-language/automl/docs/evaluate. To get the evaluation metrics through code, follow https://cloud.google.com/natural-language/automl/docs/evaluate#python
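The single-prediction call from step 15 can be sketched with the Python client library. This is a minimal sketch under stated assumptions, not a definitive implementation: the project ID, model ID, and region are placeholders you must substitute, and the cloud call requires the google-cloud-automl package plus GCP credentials.

```python
def top_label(payload):
    """Pick the highest-scoring label from a prediction payload given as
    a list of {"display_name": ..., "score": ...} dicts (the shape of
    the JSON response mentioned in step 15)."""
    best = max(payload, key=lambda p: p["score"])
    return best["display_name"], best["score"]


def classify_text(project_id, model_id, content, location="us-central1"):
    """Call a deployed AutoML text classification model and return its
    label scores. Needs google-cloud-automl and GCP credentials."""
    from google.cloud import automl  # imported here so top_label stays usable offline

    client = automl.PredictionServiceClient()
    model_name = automl.AutoMlClient.model_path(project_id, location, model_id)
    payload = automl.ExamplePayload(
        text_snippet=automl.TextSnippet(content=content, mime_type="text/plain")
    )
    response = client.predict(name=model_name, payload=payload)
    return [
        {"display_name": a.display_name, "score": a.classification.score}
        for a in response.payload
    ]
```

A call might look like `top_label(classify_text("my-project", "TCN0000000000000000000", "Invoice #42, total due: $118.00"))`, where both IDs are placeholders for your own project and deployed model.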

In my next blog, I will explain the synchronous and asynchronous calls on AutoML and how to manage AutoML’s long-running operations.

Happy Learning. Keep Following!!!
