Document AI Workbench

Neil Kolban
Google Cloud - Community
15 min read · Nov 21, 2022

The Google Document AI (DocAI) service provides the capability to ingest documents in a variety of formats including PDF and images. DocAI will then parse the documents and extract a structured representation of the data. Let’s take that apart in more detail.

Here is a dataset on Kaggle that contains images of fake US W2 forms. The W2 is a form issued by a US employer declaring what an employee earned and how much tax was withheld. At the end of the year, the W2 is sent to the employee either as a physical sheet of paper or in an electronic format (eg PDF). The employee then submits it as part of their tax return.

If we look at an example of a W2 form, we see that it contains a lot of information.

Our example instance (the image above) shows many of the expected fields. For example:

  • Employee Social Security Number (522-86-4190)
  • Employer (Hall Ltd Group)
  • Employee (Daniel Robinson)
  • Wages ($91,282.31)
  • Income tax withheld ($31,479.62)
  • … much more

As a person, we can look at this form and extract the information we need, using labels, positions and other clues. If I asked you to transcribe the above W2 into a JSON document, I believe you could write:

{
  "employee_ssn": "522-86-4190",
  "employer_name": "Hall Ltd Group",
  "employee": "Daniel Robinson",
  "wages": 91282.31,
  "income_tax_wh": 31479.62
}

This (hypothetical) JSON document represents a machine readable structured representation of the same data contained in the document (PDF/Image). What DocAI can do is automate this process by taking a form (image) as input and generating a JSON structured document as output. DocAI is able to intelligently detect values for anticipated fields in a document.

So far, we have been looking at US W2 forms but those are merely an example of a class of document. DocAI can process a large number of other document types including receipts, invoices, utility bills, other government documents and many more. When we present a document to DocAI for processing, we specify a parser that knows how to interpret that class of document. Typically, we use one of Google’s pre-supplied parsers.

However, what if we need to process a new type of document for which Google doesn’t have a suitable parser, or what if we find that some fields in our documents aren’t being recognized properly? This is where we can build our own parser or improve upon a Google supplied one. That is what the remainder of this article will focus on.

Let us consider an abstract document. Within that document we want to extract information and associate that information with specific concepts. For example, in a W2, the social security number or employee name. If I presented you with an example of such a document, you could likely identify those items.

To build our own parser, we start by making a list of all the distinct items that we wish to extract from an arbitrary document. DocAI calls these entities. Once we have the list of entities, we use these to create a schema which is basically a description of the multiple things we wish to extract. Each entity will have a name and a data type (string, number, date etc).
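Out of interest, the same schema concept is also exposed through the DocAI client libraries. Below is a minimal sketch in Python (assuming the google-cloud-documentai package); the label names match the ones we will create later in the UI, but treat the exact field values as illustrative rather than definitive:

from google.cloud import documentai

# A sketch of a schema with two entities. The value_type strings
# (eg "string", "number") describe the data type of each entity and the
# occurrence_type describes how often we expect it to appear.
schema = documentai.DocumentSchema(
    display_name="W2 schema",
    entity_types=[
        documentai.DocumentSchema.EntityType(
            name="w2_form",
            base_types=["document"],
            properties=[
                documentai.DocumentSchema.EntityType.Property(
                    name="EMPL_SSN",
                    value_type="string",
                    occurrence_type=documentai.DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_MULTIPLE,
                ),
                documentai.DocumentSchema.EntityType.Property(
                    name="CONTROL_NUMBER",
                    value_type="number",
                    occurrence_type=documentai.DocumentSchema.EntityType.Property.OccurrenceType.REQUIRED_MULTIPLE,
                ),
            ],
        )
    ],
)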

Now comes the fun part. We find ourselves a set of example documents of our desired class and, for each document, go through it and label each of the entities we desire to be extracted from the document. This labeling activity is performed manually. Labeling is the process of finding (by hand) the entities we desire and identifying them in the image of the document. Loosely, imagine drawing a box around the entity data and saying “This is the data for a Social Security Number” or “This is the data for an Employee Name”. We repeat this process for as many sample documents as we have time to manually label. At the end of this we have a labeled dataset … a collection of documents and their manually identified entities.

With this dataset available to us, the magic of Google DocAI comes into play. We present this dataset to DocAI to perform an activity called training. DocAI examines our manually labeled data and, through machine learning algorithms, learns how to recognize entities in future documents. Normally in computing, when we perform calculations, the answers are either right or wrong. In the domain of machine learning, we enter a world composed of shades of gray. When a new document is presented to DocAI, it makes predictions of what data corresponds to which entities. Each prediction is accompanied by a confidence value: either DocAI is absolutely certain (1.0) that the value is correct or, more likely, it believes the value is correct with some lesser confidence (eg 0.75). Obviously, we would like the predictions to be as accurate as possible. The way we improve accuracy is by training the parser with more examples of correctly labeled documents.
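In practice, the confidence value lets us decide what to do with each prediction. Here is a trivial sketch (the 0.8 threshold is an arbitrary number chosen for illustration, not a DocAI recommendation):

CONFIDENCE_THRESHOLD = 0.8  # arbitrary cut-off, for illustration only

def route_entity(entity):
    # Accept high-confidence predictions automatically; queue the rest
    # for review by a human.
    if entity.confidence >= CONFIDENCE_THRESHOLD:
        return "accept"
    return "human_review"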

So far, all our discussions have been notional. Now we turn our attention to the actual reality of DocAI and see how to achieve our tasks. DocAI has a component within it called Document AI Workbench (workbench). Workbench is the area of DocAI where we can bring in sample documents, label those documents, perform training and review the outcomes.

We start by having a Google Cloud project and visiting the DocAI page.

The home page looks like:

On the left, we see Workbench. Go ahead and click it. The first time we use DocAI we have to enable it in our project.

After it has been enabled, we will arrive at the workbench page:

Click on CREATE PROCESSOR.

We are now asked to give it a name:

After naming our processor, we are presented with details about its existence. Take a moment to look at the screen:

The important things to note:

  • Name — This is the display name of the processor. It is not unique and merely acts as a visual aid.
  • ID — This is the actual identifier of the processor. It is this value we will eventually use when we submit a final document for processing.

Our processor is going to need sample labeled documents to train and test against. This data is owned by the processor, which needs somewhere to store it. The next thing we need to do is tell the processor where we want it to store the data. Workbench uses Google Cloud Storage as its backing store for documents.

Click the SET DATASET LOCATION button.

We are next given the opportunity to specify (or create) a Google Cloud Storage bucket (and optional folder) which should be used for storage:

After clicking the CREATE DATASET button wait a few moments and the screen will change to:

Notice now that the Dataset is tracking documents. Let us now look at what this means. When we give a labeled document to Workbench, it will use that document for one of two purposes: either as an example for training or as an example for testing. Training is the exercise where DocAI learns (by example) how to extract entities from documents, while testing is where DocAI makes a prediction on a test document and compares the prediction against the actual/expected values to see how well it performed. A document that has been given to Workbench but not yet assigned to either group is considered unassigned and is used for neither training nor testing. The page shows us a summary of all of our documents.

Now let us bring in the first of our documents. To bring a document into workbench, it should initially exist somewhere in Google Cloud Storage. Google has created some sample documents that are used in the formal documentation and we will re-use those here.

There is a Google Cloud Storage bucket/folder called:

cloud-samples-data/documentai/Custom/W2/PDF

If we run:

gsutil ls gs://cloud-samples-data/documentai/Custom/W2/PDF

to list the content of the folder, we will find:

gs://cloud-samples-data/documentai/Custom/W2/PDF/W2_XL_input_clean_2950.pdf

In other words, there is a single PDF file at this location that if we were to download and look at, we would find it contains a W2 form.

We now switch to the TRAIN tab on Workbench:

And click on IMPORT DOCUMENTS:

Enter the path to our sample document (cloud-samples-data/documentai/Custom/W2/PDF) and select Unassigned for the Data split. Click IMPORT.

What is now happening is that DocAI is processing the document. While DocAI is primarily used to extract structured data from documents, it also provides general Optical Character Recognition (OCR) features. OCR is the process of recognizing words/numbers in an image: determining where on the image they can be found and what string of characters they are composed of. Distinguish this from entity extraction, which is OCR combined with semantics. Entity extraction is “I have found this text in the image and it is associated with a specific concept such as Social Security Number”. After the import of our single document, the screen will change to:
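To make the OCR versus entity extraction distinction concrete, here is a small sketch; it assumes a document object returned by a DocAI processing call (like the one shown near the end of this article):

# OCR output: the recognized text of the whole document as one string.
print(document.text[:200])

# Entity extraction output: spans of that text tied to named concepts.
for entity in document.entities:
    print(entity.type_, "->", entity.mention_text)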

We now see that we have one document in our dataset and that the document is unlabeled. By unlabeled, we mean that workbench knows that there is a document but we haven’t told it where the entities we wish to work with can be found in the document. It isn’t ready for training or testing.

If we were to look in the Google Cloud Storage bucket that we created for the workbench dataset, we would find that new files and folders have been created corresponding to the newly imported document. These should not be examined directly. They are the state of our processor, and their existence and formats are known only to workbench.

Remember that we are using W2 forms here merely as a convenient example. At this point, our new processor doesn’t know about the existence of any entities. All it sees is a document and has no idea what we may be interested in extracting. Now is when we get to define our schema, which describes what we may want to find in such documents.

Click the EDIT SCHEMA button:

Here is where we describe the existence and nature of our entities. Workbench uses the word “labels” to refer to entities. Label is the proper term in machine learning but, for our purposes, we will use the two interchangeably.

Click the CREATE LABEL button to create our first entity definition:

Here we specify:

  • The name of our entity (CONTROL_NUMBER)
  • The data type of our entity (Number)
  • How we expect it to occur in our document (it is required and there may be multiple occurrences of it)

Repeat the creation of entities for the following:

  • EMPL_SSN, Plain Text, Required multiple
  • EMPLR_ID_NUMBER, Plain Text, Required multiple
  • EMPLR_NAME_ADDRESS, Address, Required multiple
  • FEDERAL_INCOME_TAX_WH, Money, Required multiple
  • SS_TAX_WH, Money, Required multiple
  • SS_WAGES, Money, Required multiple
  • WAGES_TIPS_OTHER_COMP, Money, Required multiple

At the end we will have:

Click SAVE to save our work.

When we return to our main processor page, we will see that the associated schema is present:

For each of the labels (entities) we will see how many instances exist in our data (currently 0). Now that we have told workbench what entities we would like to detect, we can label our document to tell workbench (for this instance of a document) where these entities are found in this document. Click on the image of our document and we will see a new page appear:

Now we are going to go through the process of locating data on the page and labeling it as the correct entity. Click the Bounding Box selection icon on the menu bar and then select the Employer identification number value. From the popup menu, select EMPLR_ID_NUMBER. You have now performed your first entity labeling.

Repeat this process for as many of the other entities as you care to set:

At the conclusion, your page may look as above. Notice that the pairing of entities to their values is shown in the left panel. As you hover your mouse over the document you will also see the entity associated with the marked values.

We have now successfully labeled this document. Click the MARK AS LABELED button.

Our summary screen changes to:

Notice that we now have one labeled document and we can see that we have instances of entities associated with documents. Now we assign the document to our training data. Click the check box on the document and under ASSIGN TO SET, select Training.

We are well on our way. We now have a single document that has been labeled and assigned to our training documents. However, we are not yet ready to actually train our parser. Workbench requires at least 10 documents for training and 10 documents for testing, and even that is incredibly light. The more documents you have labeled and made available, the better (more accurate) the results will be. If you are building a production parser for your own documents, you should be prepared to provide and label hundreds of documents. For this article we thankfully don’t have to do that. A set of pre-labeled documents is available in our sample data and we will import these.

Click on IMPORT DOCUMENTS and provide cloud-samples-data/documentai/Custom/W2/JSON as the source path to the data and set the Data split to be auto-split. Specifying auto-split means that 80% of the documents will be used for training and 20% for testing.

This will take a few minutes to run. During this time, let’s consider what is happening. If we run:

gsutil ls gs://cloud-samples-data/documentai/Custom/W2/JSON

we will find that there are 50 JSON files in the bucket. Each one of these files represents a single document with labeling already attached. Think back to our activity of manually labeling a document. To do that, we had a base image (our source document) and then identified where in that document the text for entities existed. Google has defined a JSON representation of that result which contains the image (Base64 encoded) and the label information (eg at a given rectangle in the image, text exists with a certain value and should be associated with a named entity). The 50 JSON files in our sample bucket represent labeling that has already been done and made available to us via an export.
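If you are curious, you can copy one of these files locally (eg gsutil cp gs://cloud-samples-data/documentai/Custom/W2/JSON/<file>.json . where <file> is one of the listed names) and peek inside. A minimal sketch; the local file name below is a placeholder and the field names follow the JSON form of the DocAI Document format:

import json

# Load one of the pre-labeled sample files (the name is a placeholder).
with open("sample.json") as f:
    doc = json.load(f)

# Each labeled entity records its entity type and the text it is anchored to.
for entity in doc.get("entities", []):
    print(entity["type"], "=>", entity.get("mentionText"))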

At the completion of our import, we may see a screen that looks as follows:

Here we see the 50 imported documents and 41 have been assigned to training and 9 to test. Pick a random document and click on it and see that it has been labeled (just like we manually did with our own single document).

Before we move on, we need to correct a problem. Google requires that there be a minimum of 10 documents available for testing. If we look at our example, we see only 9. This can happen because the auto-split assigns documents at random, so a nominal 20% of 50 (exactly 10) may come out as 9, 10 or 11 instances. To fix this, we will move one of our documents from the training set to the test set.

Click the checkbox on the Training category (we will be shown only the documents that are flagged for training), then check the first document and move it to Test:

We will now have 40 documents for training and 10 documents for test.

We are now ready to perform our training. Click the TRAIN NEW VERSION button:

Give the version a name and click START TRAINING:

Training now starts. Google doesn’t say how long training will take. Experience with this sample shows it takes about an hour to complete. Go get lunch or come back tomorrow.

Note: At the time of writing (2022–11), the running of a training job may fail within the first 5 minutes with an internal error:

If this happens, simply retry the train operation. The DocAI Workbench is flagged as “preview” (beta) at this time.

With a trained parser available to us, we can now deploy it. Find the processor version in the MANAGE VERSIONS tab and execute a “Deploy version” request from its context menu:

A confirmation dialog will be shown:

This operation may take 10 minutes or more to complete. Please be patient.

Now it is time to test. We can download a previously un-seen document from this URL:

https://storage.googleapis.com/cloud-samples-data/documentai/LendingDocAI/W2Parser/W2_XL_input_clean_1000.pdf
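For example, we can fetch it from Python (a trivial sketch; any download method will do):

import urllib.request

URL = ("https://storage.googleapis.com/cloud-samples-data/documentai/"
       "LendingDocAI/W2Parser/W2_XL_input_clean_1000.pdf")

# Download the sample W2 PDF to the current directory.
urllib.request.urlretrieve(URL, "W2_XL_input_clean_1000.pdf")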

We now visit the EVALUATE & TEST page:

and click on UPLOAD TEST DOCUMENT. Supply the PDF file that we downloaded. The processing will take a few moments and results similar to the following are shown:

And this is perfect!!! We are now able to submit W2 documents to DocAI and DocAI will parse and extract structured data.
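We can also invoke the deployed processor programmatically instead of through the console. Here is a minimal sketch using the Python client library; the project, location and processor ID values are placeholders you would replace with your own (the processor ID is the ID we noted when the processor was created):

from google.cloud import documentai

# Placeholder values; substitute your own project, region and processor ID.
PROJECT_ID = "my-project"
LOCATION = "us"
PROCESSOR_ID = "my-processor-id"

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

# Read the W2 PDF we downloaded and wrap it as a raw document.
with open("W2_XL_input_clean_1000.pdf", "rb") as f:
    raw_document = documentai.RawDocument(
        content=f.read(), mime_type="application/pdf"
    )

# Ask DocAI to process the document with our trained parser.
result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document)
)

# Print each extracted entity along with its prediction confidence.
for entity in result.document.entities:
    print(f"{entity.type_}: {entity.mention_text} (confidence {entity.confidence:.2f})")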

With a basic parser available to us, we can use it to build an even better version through a process called bootstrapping. We know that we can ingest new documents, label them and retrain. However, once we have a basic parser, we can use that parser to make a first pass at labeling new documents. We should then have staff examine the output to validate its correctness and, very importantly, correct any errors. With this reviewed set of new documents, where the parser assisted in labeling, we can extend our set of training documents and re-train. The benefit of using the parser on new training documents is that its first pass / best guess will (hopefully) reduce the effort required from staff. Take care that the documents are reviewed carefully, as errors introduced by an early parser will be negatively reinforced if they are not corrected.

We can demonstrate this by going to the TRAIN tab and clicking Import Documents to import some new documents. A dialog will be presented:

For the Source path specify: cloud-samples-data/documentai/Custom/W2/AutoLabel which is a Google supplied bucket/folder that contains a further 5 unseen documents.

For data split specify Unassigned.

Check the “Import with auto labeling” option and select our latest deployed version.

Click IMPORT.

The documents found at the Google Cloud Storage location will be loaded into Document AI and have the parser run against them to generate a default set of labels.

We will see that we now have 5 documents that have been auto-labeled. Select each of the auto-labeled documents, one at a time, and review their correctness:

Once the document has been corrected / reviewed, mark it as labeled. We now have the opportunity to perform a retraining on our data. Each time we retrain with more data, we should expect the accuracy to improve.

The example used in this article was lifted from the Google getting started article.

The following video is a walk-through of this article:

Glossary:

  • custom document extractor — A new DocAI parser that can be used at runtime to extract entities from a document.
  • entities — Fields in a document that we wish to extract into a structured format.
  • uptraining — Enhancing a Google supplied DocAI parser to identify new entities or improve on existing ones.
  • parser — A trained model that can be used by DocAI to determine entities in a document.
  • processor — A synonym for parser.
