Harnessing the Power of the Custom Document Classifier in Document AI
Introduction to Document AI
Document AI is a cloud-based platform that uses machine learning to extract structured data from unstructured documents. This data can then be analyzed and consumed more easily, making it a valuable tool for businesses of all sizes.
Here are some of the benefits of using Document AI:
- Improved accuracy: Document AI uses machine learning to extract data with greater accuracy than traditional methods.
- Reduced time and costs: Document AI can automate document processing tasks, which can save businesses time and money.
- Improved compliance: Document AI can help businesses comply with regulations by ensuring that data is extracted accurately and consistently.
- Increased productivity: Document AI can free up employees to focus on other tasks, which can lead to increased productivity.
If you are looking for a way to improve your document processing, Document AI is a great option. It is a powerful tool that can help you save time and money while improving compliance.
What is Classification in Machine Learning?
Classification is a type of supervised machine learning that predicts the category of a new data point based on its features. The model is trained on a dataset of labeled data points and then evaluated on a separate, held-out set of labeled data points. If the model performs well on the evaluation set, it can be used to predict the category of new, unseen data points.
Here are some examples of classification tasks:
- Categorizing emails as spam or not spam
- Identifying objects in images
- Predicting whether a customer will churn
Classification is a powerful tool that can be used to solve a wide variety of problems. If you have a dataset of labeled data points, you can use classification to build a model that can predict the category of new data points.
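To make the idea concrete, here is a minimal, illustrative Python sketch of the spam example above using scikit-learn (the tiny email dataset and its labels are made up purely for demonstration):

```python
# A minimal supervised classification sketch: train on labeled emails,
# then predict the category of a new, unseen email.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Labeled training data: each email text is paired with a category.
emails = [
    "Win a free prize now",
    "Lowest price guaranteed, click here",
    "Meeting agenda for Monday",
    "Please review the attached invoice",
]
labels = ["spam", "spam", "not_spam", "not_spam"]

# Bag-of-words features plus a Naive Bayes classifier in one pipeline.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

# Predict the category of a new data point.
print(model.predict(["Claim your free prize today"]))  # expected: ['spam']
```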
Introduction to Custom Document Classifier
A Custom Document Classifier (CDC) is a machine learning model that classifies documents into different categories. CDCs are trained on a dataset of labeled documents and can then classify new documents with high accuracy.
CDCs are a powerful way to automate document processing tasks. They can classify invoices, contracts, medical records, legal documents, financial statements, and other types of documents.
Languages currently supported by the Document Classifier Processor
Note: For the detailed list of supported languages, visit https://cloud.google.com/document-ai/docs/languages
Currently Supported Formats
In the following architecture, you can see a Custom Document Classifier that accepts all supported formats for training and predicts the desired trained labels, such as bank statements, birth certificates, and resumes.
Use Case
In this blog, you will see how to create a Custom Document Classifier Processor to identify the type of label of the document, as shown in the previous architecture.
Think about your project requirements, such as which groups of documents you want to classify: for example, insurance documents, banking and finance statements, health diagnostic images, travel tickets, and so on.
Workflow
The workflow is the same for all custom document processors, as I have shown in my previous blog on Custom Document Extractor (link in references).
Create Custom Document Classifier Processor
Follow the steps in the standard product documentation to complete the steps below (a short Python sketch for the Cloud Storage part follows the list):
- Create a Processor
- Create a Cloud Storage bucket for the dataset
- Import your documents into a dataset
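As a convenience, here is a minimal Python sketch for the Cloud Storage part of the steps above, assuming the google-cloud-storage library is installed. The project ID, bucket name, and file paths are placeholders, and the dataset import itself is performed from the Document AI console as described in the product documentation.

```python
# Minimal sketch: create a Cloud Storage bucket and upload sample documents.
# Project ID, bucket name, and local file paths are placeholders.
from google.cloud import storage

client = storage.Client(project="your-project-id")

# Create the bucket that will hold the training documents.
bucket = client.create_bucket("your-docai-dataset-bucket", location="us")

# Upload a few local sample documents, grouped by label prefix.
for local_path, blob_name in [
    ("samples/bank_statement_01.png", "bank_statement/bank_statement_01.png"),
    ("samples/birth_certificate_01.jpg", "birth_certificate/birth_certificate_01.jpg"),
    ("samples/resume_01.webp", "resume/resume_01.webp"),
]:
    bucket.blob(blob_name).upload_from_filename(local_path)
    print(f"Uploaded gs://{bucket.name}/{blob_name}")
```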
Imported Dataset Information
To create an exhaustive dataset for this blog, sample documents of modern resumes, birth certificates, and bank statements were downloaded from multiple websites, uploaded to the cloud storage bucket, and then imported into the Document AI dataset.
Formats used in this dataset: JPEG, PNG, and WEBP
Note: The model can only be trained on documents that contain text, so narrow your dataset down to text-based documents.
Sample Dataset Documents for each label
Bank Statement
Birth Certificate
Resume
Creating the Labels
Once the import is complete, you must create the labels that will be used to classify the documents. You can do this step at any time before you start labeling.
Click the [EDIT SCHEMA] button on the left pane of the [TRAIN] tab.
Create all the desired labels for the document, and then click [Save].
Note: Choosing appropriate label names at the start of the process is very important. Once you train the model with the labels, they cannot be edited; they can only be enabled or disabled.
Labeling new documents / Auto-labeling documents during import
At the start of the project, labeling documents is a manual activity. However, labeling for Custom Document Classifier is super easy.
Click on the imported documents under [Unlabeled/Auto-labeled] to start labeling.
Documents appear in the [Unlabeled] pool only when you import them with the [NO-LABEL] option. For each of these documents, select the desired label from the drop-down, as shown below, and click [MARK AS LABELLED].
If you import documents with the [AUTO-LABEL] option, they appear in the [Auto-labeled] pool with a predicted label already mapped to each document. You can confirm or correct the predicted label and then save it. This saves time and speeds up labeling.
Note: Auto-labeling will not work unless you have at least one trained and deployed version of your model, so you can use this option only after a successful training run on manually labeled data.
You have three options for labeling when importing documents into the dataset:
- CHOOSE LABEL: Import a batch of documents under a single selected label at once.
- AUTO-LABEL: Auto-label documents using any trained and deployed version.
- NO-LABEL: Import all documents without labeling.
Assigning documents to the Training & Test sets
After completing the labeling process, you have to distribute the documents between the Training and Test sets.
Include a minimum of 10 labeled documents in the Training set and a minimum of 2 labeled documents in the Test set, and assign as many additional documents as you can to both sets.
Select all the desired documents under the Unassigned pool in the Data Split section and assign them to the Training and Test sets accordingly.
Training the processor
Once you have completed your data distribution with the minimum training criteria, click on the [TRAIN NEW VERSION] button to start training under the [TRAIN] tab.
Before starting the training, you can view the distribution of labels between the Training and Test sets by clicking the [VIEW LABEL STATS] button shown in the previous screenshot. In the following screenshot, you can see a green tick against the training guidelines, which indicates the dataset meets the requirements for good training.
Once you have satisfied the training requirements, click the [START TRAINING] button. Training takes time, so be patient after it starts.
Training time depends on the complexity and quantity of the documents present in the dataset.
Note: Always try to train with the maximum number of documents possible in the Training and Test sets. This will save you the time and cost of multiple training runs.
Deploying the processor
To test the processor, you must deploy the version. You can select the desired version to deploy under the [MANAGE VERSIONS] tab, as shown below.
Deployment will take a few minutes. After that, you can test the sample document.
Note: Deploy only the trained versions that will be used for prediction via the API, because every deployed version incurs a hosting charge. For more details, refer to the pricing page at https://cloud.google.com/document-ai/pricing, which also covers detailed pricing for the Custom Document Classifier.
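If you prefer to script this step, the following is a hedged sketch of deploying a trained version with the Python client library (google-cloud-documentai); the project, location, processor, and version IDs are placeholders.

```python
# Sketch: deploy a trained processor version programmatically.
# All IDs below are placeholders; replace them with your own values.
from google.cloud import documentai

# For non-"us" locations, point the client at the regional endpoint,
# e.g. client_options={"api_endpoint": "eu-documentai.googleapis.com"}.
client = documentai.DocumentProcessorServiceClient()

version_name = client.processor_version_path(
    project="your-project-id",
    location="us",
    processor="your-processor-id",
    processor_version="your-version-id",
)

# deploy_processor_version returns a long-running operation; wait for it.
operation = client.deploy_processor_version(name=version_name)
operation.result()
print("Deployed:", version_name)
```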
Evaluating the processor
Once deployment is complete, you can select the version of the model to evaluate under the [Evaluate & Test] tab.
As you can see, the F1 score, Precision, and Recall are all high in our example. This is what we want to see in a good model.
Confusion Matrix understanding for evaluation
Before we dive into the confusion matrix, let's review the basic terms used for evaluation: a true positive (TP) is a document correctly predicted as a given label, a false positive (FP) is a document incorrectly predicted as that label, a false negative (FN) is a document of that label that the model missed, and a true negative (TN) is a document correctly predicted as not belonging to that label.
With these terms in mind, you can refer to the standard confusion matrix below for a brief understanding.
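For a quick illustration of how these counts turn into the metrics shown in the evaluation tab, consider the following Python snippet (the counts are arbitrary example values):

```python
# How Precision, Recall, and F1 are derived from confusion-matrix counts.
# The counts below are arbitrary example values for illustration.
tp, fp, fn = 48, 2, 3  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of all documents predicted as the label, how many were correct
recall = tp / (tp + fn)     # of all documents truly of the label, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision={precision:.3f}, Recall={recall:.3f}, F1={f1:.3f}")
# -> Precision=0.960, Recall=0.941, F1=0.950
```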
Testing the processor
In the [EVALUATE & TEST] tab, you can upload a sample document from your dataset by clicking the [UPLOAD TEST DOCUMENT] button. The predictions of the trained labels will be displayed.
Resume Document Testing: The resume was predicted correctly with a confidence of 1, which is as expected.
Birth Certificate Document Testing: The birth certificate was predicted correctly with a confidence of 0.997, which is as expected.
Bank Statement Document Testing: The bank statement was predicted correctly with a confidence of 0.998, which is as expected.
The outputs show that our model is correctly trained with all labels.
Consuming the trained version via APIs
You can refer to the detailed code at https://cloud.google.com/document-ai/docs/send-request, which is available for multiple languages and interfaces, such as REST, Java, C#, Python, and Node.js.
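As one possible starting point, here is a minimal Python sketch of sending a document to the trained classifier with the google-cloud-documentai client library; the project ID, location, processor ID, and file path are placeholders you would replace with your own values.

```python
# Minimal sketch: classify a local document with a deployed processor.
# Project, location, processor ID, and file path are placeholders.
from google.cloud import documentai

project_id = "your-project-id"
location = "us"                     # or "eu"
processor_id = "your-processor-id"  # the Custom Document Classifier processor
file_path = "sample_bank_statement.pdf"

# Point the client at the regional endpoint that matches the processor.
client = documentai.DocumentProcessorServiceClient(
    client_options={"api_endpoint": f"{location}-documentai.googleapis.com"}
)

name = client.processor_path(project_id, location, processor_id)

with open(file_path, "rb") as f:
    raw_document = documentai.RawDocument(content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)

# For a classifier, the predicted labels and confidences come back as entities.
for entity in result.document.entities:
    print(entity.type_, round(entity.confidence, 3))
```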
You can create your own APIs or Cloud Functions on top of this Document AI API to consume the trained model, store the output in a database or data warehouse such as BigQuery, and make the data available for analytics.
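For example, one possible (illustrative) way to persist the classification results for analytics is to stream rows into BigQuery; the dataset and table names below are placeholders, and the table is assumed to already exist with matching columns.

```python
# Illustrative sketch: store classification results in BigQuery for analytics.
# The table reference is a placeholder and is assumed to already exist.
from google.cloud import bigquery

bq = bigquery.Client()
table_id = "your-project-id.docai_results.classified_documents"

rows = [{
    "document_uri": "gs://your-bucket/sample_bank_statement.pdf",
    "predicted_label": "bank_statement",
    "confidence": 0.998,
}]

errors = bq.insert_rows_json(table_id, rows)
if errors:
    print("BigQuery insert errors:", errors)
```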
Human-in-the-Loop (HITL)
Human-in-the-Loop is a feature that lets you introduce human verification and correction of predictions before they are consumed, which is especially valuable for business-critical applications where predictions below expectations cannot be afforded.
Note: For more information, please see https://cloud.google.com/document-ai/docs/hitl
Document Classifier Extended Architecture Example
Before concluding, take a look at an example of how you can use the Custom Document Extractor along with the Custom Document Classifier to address a broader range of use cases.
In the above example, you can see that after classifying documents via the Custom Document Classifier API, you can call the Custom Document Extractor API for each type of document individually if extraction is required at that level, and then store the data in separate tables for each label.
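A hedged Python sketch of this routing pattern is shown below; the processor resource names and the label-to-extractor mapping are placeholders for illustration.

```python
# Sketch of the extended architecture: classify first, then route the document
# to a label-specific Custom Document Extractor. All IDs are placeholders.
from google.cloud import documentai

client = documentai.DocumentProcessorServiceClient()

CLASSIFIER = "projects/your-project-id/locations/us/processors/classifier-id"
EXTRACTORS = {
    "bank_statement": "projects/your-project-id/locations/us/processors/bank-extractor-id",
    "birth_certificate": "projects/your-project-id/locations/us/processors/birth-extractor-id",
    "resume": "projects/your-project-id/locations/us/processors/resume-extractor-id",
}

def classify_and_extract(content: bytes, mime_type: str = "application/pdf"):
    raw = documentai.RawDocument(content=content, mime_type=mime_type)

    # Step 1: classify the document and take the highest-confidence label.
    classified = client.process_document(
        request=documentai.ProcessRequest(name=CLASSIFIER, raw_document=raw)
    ).document
    label = max(classified.entities, key=lambda e: e.confidence).type_

    # Step 2: call the matching extractor, if one is configured for the label.
    extractor = EXTRACTORS.get(label)
    if extractor is None:
        return label, None
    extracted = client.process_document(
        request=documentai.ProcessRequest(name=extractor, raw_document=raw)
    ).document
    return label, extracted.entities
```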
Conclusion
Document AI's Custom Document Classifier Processor is a valuable tool for businesses: it can help automate manual document classification tasks, improve data accuracy, enhance the customer experience, and improve compliance.
Consider which of your document workflows could benefit from automated classification and how it can help you modernize your business.
References
Document AI : https://cloud.google.com/document-ai
Document AI pricing : https://cloud.google.com/document-ai/pricing
Custom Document Extractor : https://medium.com/google-cloud/utilizing-the-power-of-custom-document-extractor-of-document-ai-2a6b89898a30
Feel free to follow me on LinkedIn and Medium, and send me a message if you have any questions, would like to learn more, or have a thought to share. I'll be in touch.