Document AI: Bulk import & results

Gerard Samuel
Google Cloud - Community
7 min read · Nov 21, 2023
Photo by Wesley Tingey on Unsplash

Continuing from my last post on Document AI, I am going to show what the bulk import experience is like and take a peek at the extracted data.

Schema additions and bulk upload

First, I’ve added more fields to the document schema to finish labeling the prior document.

Adding fields to the document’s schema
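
For anyone who prefers scripting the schema change, the same edit can be made through the dataset API. The following is only a minimal sketch under my assumptions: it uses documentai_v1beta3’s get_dataset_schema and update_dataset_schema methods, placeholder PROJECT_NUMBER and PROCESSOR_ID values, and appends one of the fields from this walkthrough to the first entity type:

from google.cloud import documentai_v1beta3

# Placeholders -- substitute your own project number and processor ID
dataset = "projects/PROJECT_NUMBER/locations/us/processors/PROCESSOR_ID/dataset"

client = documentai_v1beta3.DocumentServiceClient()

# Fetch the current dataset schema, append a new field, and write it back
schema = client.get_dataset_schema(name=f"{dataset}/datasetSchema")
schema.document_schema.entity_types[0].properties.append(
    documentai_v1beta3.DocumentSchema.EntityType.Property(
        name="discount_amount",  # one of the fields added in this walkthrough
        value_type="string",  # assumed plain-text value type
        occurrence_type=documentai_v1beta3.DocumentSchema.EntityType.Property.OccurrenceType.OPTIONAL_ONCE,
    )
)
client.update_dataset_schema(dataset_schema=schema)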

I have also uploaded a few more documents that need to be imported and labeled.

List of documents to label

Bulk Importing
From the Document AI home screen, go to My Processors -> the processor previously created -> Build
Click the Import Documents button to get started:

Section to import documents

Choose the storage location of the files in Cloud Storage. In this example, I am choosing Unassigned under Data Split. Then click the checkbox for auto-labeling and select a model that will be used to learn about the documents. Click Import. Once the import completes, click Manage Dataset.

Import options
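
The console covers this well, but the import can also be scripted. Here is a minimal sketch of how I understand the dataset import API: documentai_v1beta3’s import_documents pointed at a Cloud Storage prefix and targeting the Unassigned split. The bucket path and IDs are placeholders, and note that the auto-labeling checkbox above is a console feature this sketch does not reproduce:

from google.cloud import documentai_v1beta3

# Placeholders -- substitute your own values
project_number = "PROJECT_NUMBER"
location = "us"
processor_id = "PROCESSOR_ID"
gcs_uri_prefix = "gs://my-bucket/invoices/"  # assumed storage location

client = documentai_v1beta3.DocumentServiceClient()
dataset = f"projects/{project_number}/locations/{location}/processors/{processor_id}/dataset"

# Import everything under the Cloud Storage prefix into the Unassigned split
request = documentai_v1beta3.ImportDocumentsRequest(
    dataset=dataset,
    batch_documents_import_configs=[
        documentai_v1beta3.ImportDocumentsRequest.BatchDocumentsImportConfig(
            dataset_split=documentai_v1beta3.DatasetSplitType.DATASET_SPLIT_UNASSIGNED,
            batch_input_config=documentai_v1beta3.BatchDocumentsInputConfig(
                gcs_prefix=documentai_v1beta3.GcsPrefix(gcs_uri_prefix=gcs_uri_prefix)
            ),
        )
    ],
)

# import_documents returns a long-running operation; wait for it to finish
operation = client.import_documents(request=request)
operation.result()
print("Import complete")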

Labeling en masse

Click the first document

Beginning document labeling

So far, almost everything was captured correctly by the AI model I used when importing the documents.
There are a few issues, though: the discount_amount field was mislabeled, and item_sub_category has a trailing comma. Take a moment to consider this: if it were not for the Generative AI feature set, manually capturing all of these fields on every imported file would be very time consuming.

Labeled document (but with minor errors)

Now let me rectify the mistakes. I am going to delete the entries for discount_amount and item_sub_category by hovering over each field and choosing delete.

Process to remove the value from a label

Depending on the document layout, you can use either the bounding box tool or the text selector tool. In this example, I am capturing data with the text selector tool: select the text, then choose the relevant field from the drop-down. Do the same for the other fields that need correcting.

Using the select tool to capture text

Here is what the corrected document looks like. I am satisfied now, so I click Confirm on all auto-labeled fields and then Mark as Labeled in the bottom left.

Completed document after minor tweaks

Once I had a couple of documents labeled, I went back to Manage Dataset, highlighted the Labeled row, clicked to select all labeled documents, and then assigned them to the Training set.

Breaking up documents into Test/Training datasets

I’ll continue labeling documents, and when I have a few more completed, I’ll move those to the Test set, just as I did in the previous step.
Now is a good time to review the label stats of the documents completed so far.
In the lower right, click the View Label Stats button.

How to get to document label stats

Once there, I have a choice between Model based and Template based training. In this example, I am using the Template based option because it has a lower watermark (minimum document count) to be valid.
The watermark means I need at least three documents in each of the training and test sets, and each of those documents must have every possible label defined and labeled. In this example, I do not yet have enough documents covering the discount fields, so I am going to continue labeling the rest of the documents.

Viewing document label stats

Now that I have finished labeling, I shuffled documents between the training and test datasets so that:
1. The training-to-test ratio is roughly 4:1

2. There are enough documents and labels to meet the requirements for template based training

This is primarily required when training a Document AI model; I will discuss this topic in a future post.

Green checks for meeting the watermark with labeling
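
To double-check the watermark without clicking through every document, the counts can also be tallied with a script. This is a sketch under my assumptions: it presumes the DocumentMetadata items returned by list_documents expose a dataset_type field, and it uses the same client setup as main.py later in this post:

from collections import Counter, defaultdict
from google.cloud import documentai_v1beta3

# Placeholders -- set these as in main.py below
project_number, location, processor_id = "PROJECT_NUMBER", "us", "PROCESSOR_ID"

client = documentai_v1beta3.DocumentServiceClient()
dataset = f"projects/{project_number}/locations/{location}/processors/{processor_id}/dataset"

# For each split, count how many documents contain each label at least once
coverage = defaultdict(Counter)
request = documentai_v1beta3.ListDocumentsRequest(dataset=dataset)
for meta in client.list_documents(request=request):
    doc = client.get_document(
        request=documentai_v1beta3.GetDocumentRequest(
            dataset=dataset, document_id=meta.document_id
        )
    )
    for label in {entity.type_ for entity in doc.document.entities}:
        coverage[meta.dataset_type.name][label] += 1  # e.g. DATASET_SPLIT_TRAIN

# Flag any label that falls short of the three-document watermark
for split, counts in coverage.items():
    for label, n in sorted(counts.items()):
        marker = "" if n >= 3 else "  <-- below watermark"
        print(f"{split:24} {label:20} {n}{marker}")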

Reviewing document data

With all that completed, now what? Let’s see what has been extracted so far. In a new browser tab or window, open the Cloud Shell IDE.

In the terminal window below, execute the following commands to set up a Python environment:
mkdir doc-ai-temp
cd doc-ai-temp/
python3 -m venv .venv
source .venv/bin/activate
pip3 install google-cloud-documentai prettytable
touch main.py
cloudshell open main.py

admin_@cloudshell:~$ mkdir doc-ai-temp
admin_@cloudshell:~$ cd doc-ai-temp/
admin_@cloudshell:~/doc-ai-temp$ python3 -m venv .venv
admin_@cloudshell:~/doc-ai-temp$ source .venv/bin/activate
(.venv) admin_@cloudshell:~/doc-ai-temp$ pip3 install google-cloud-documentai prettytable
(.venv) admin_@cloudshell:~/doc-ai-temp$ touch main.py
(.venv) admin_@cloudshell:~/doc-ai-temp$ cloudshell open main.py
(.venv) admin_@cloudshell:~/doc-ai-temp$

Copy the following into main.py:

from google.cloud import documentai_v1beta3
from prettytable import PrettyTable

def get_document():
    # TODO: Before executing, set these three variables for your
    # Document AI processor
    project_number = ""  # The project number where Document AI was deployed
    location = ""  # The location where Document AI was deployed: us or eu
    processor_id = ""  # The processor ID that was used for this demo

    # Create a client
    client = documentai_v1beta3.DocumentServiceClient()
    dataset = "projects/{0}/locations/{1}/processors/{2}/dataset".format(
        project_number, location, processor_id
    )

    # Initialize a document list request
    list_request = documentai_v1beta3.ListDocumentsRequest(
        dataset=dataset,
    )

    # Execute the document list request
    page_result = client.list_documents(request=list_request)

    # Loop over the list document responses
    for response in page_result:
        document_id = response.document_id

        # Initialize a document get request
        document_request = documentai_v1beta3.GetDocumentRequest(
            dataset=dataset,
            document_id=document_id,
        )

        # Execute the document get request
        document_response = client.get_document(request=document_request)

        # Break out of the loop; I just want the first document
        break

    # Use PrettyTable to generate a nice table
    table = PrettyTable(["Field Name", "Confidence", "Field Value"])
    table.align["Field Value"] = "l"
    table.sortby = "Field Name"

    # Loop over document.entities to get the fields/data and add rows to the table
    for item in document_response.document.entities:
        table.add_row([item.type_, item.confidence, item.mention_text])

    # Print the table
    print(table)

if __name__ == "__main__":
    get_document()

In the Google Cloud console, I am grabbing the project_number, location and processor_id from the Document AI processor Overview page under Prediction. Make a note of the values and edit main.py in the TODO section to set these variables:

How to get processor information

In the terminal window, execute python3 main.py.
You may get a popup asking you to authorize Cloud Shell; click Authorize.
You should get output similar to this:

(.venv) admin_@cloudshell:~/doc-ai-temp$ python main.py
+-------------------+------------+-------------------------------------------------+
| Field Name | Confidence | Field Value |
+-------------------+------------+-------------------------------------------------+
| currency | 1.0 | $ |
| discount_amount | 1.0 | 20 |
| discount_total | 1.0 | 35.30 |
| freight_total | 1.0 | 4.31 |
| grand_total | 1.0 | 145.52 |
| invoice_date | 1.0 | Jul 31 2012 |
| invoice_id | 1.0 | 33135 |
| item_category | 1.0 | Furniture |
| item_description | 1.0 | 9-3/4 Diameter Round Wall Clock |
| item_product_code | 1.0 | FUR-FU-2877 |
| item_quantity | 1.0 | 4 |
| item_sub_category | 1.0 | Furnishings |
| item_total_price | 1.0 | 176.51 |
| item_unit_price | 1.0 | 44.13 |
| order_id | 1.0 | CA-2012-BF10975140-41121 |
| receiver_address | 1.0 | 28205, Charlotte, North Carolina, United States |
| receiver_name | 1.0 | Barbara Fisher |
| shipment_mode | 1.0 | Standard Class |
| sub_total | 1.0 | 176.51 |
+-------------------+------------+-------------------------------------------------+
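
Every field in this run came back with a confidence of 1.0, but that will not always be the case. One quick extension, reusing the document_response object from main.py with an assumed threshold, is to flag low-confidence fields for human review:

# Flag extracted fields whose confidence falls below a chosen threshold
# (0.8 is an assumed cut-off -- tune it for your own documents)
REVIEW_THRESHOLD = 0.8

for item in document_response.document.entities:
    if item.confidence < REVIEW_THRESHOLD:
        print(f"REVIEW: {item.type_} = {item.mention_text!r} ({item.confidence:.2f})")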

With the ability to capture data like this from a document, several use cases open up:

  1. Document data can be inserted into databases (BigQuery, MySQL, SQL Server) for fast and targeted querying (see the sketch after this list).
  2. Document data can be analyzed like regular data.
  3. Document data can be integrated with other data sources like Workday or Salesforce.
  4. Document data can be used in conversations via chat-bots and natural language processing search.
  5. Document data can be used in multi-lingual use-cases.
  6. Document data can be used to detect fraud.
  7. Add your why here..
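
As an example of the first use case, here is a minimal sketch of loading extracted fields into BigQuery with the google-cloud-bigquery client. The table name is hypothetical, and a table with matching columns is assumed to already exist:

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
table_id = "my-project.invoices.extracted_fields"  # hypothetical table

# One row per document, built from the entities shown above
rows = [{
    "invoice_id": "33135",
    "invoice_date": "Jul 31 2012",
    "receiver_name": "Barbara Fisher",
    "grand_total": 145.52,
}]

errors = client.insert_rows_json(table_id, rows)
print("Insert errors:", errors or "none")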

Conclusion

I have shown how to upload documents in bulk and leverage Document AI’s generative AI feature set to label them quickly. I also touched on label stats while labeling; label statistics will become clearer when training a custom model for situations where a pre-built model is not enough.
What I have shown so far is just the introduction. Document AI offers purpose-built processors for a wide variety of documents, or you can build a custom one as I did here. For processor pricing, review this page.

In a future post, I will use Document AI to develop a workflow process with other solutions to show how the extracted data can be used.

Thanks

The opinions stated here are my own, not necessarily those of my employer.
