Document AI JSON Tool

Neil Kolban
Google Cloud - Community
5 min read · Jun 5, 2022

At the highest level, Document AI works like this: you send it a document, either a scanned or captured image or a PDF, and it returns a representation of the information the document contains. The returned data is a JSON structure with a wealth of detail about what was found, including discrete extracted pieces of information that Document AI calls “entities”. Contrast this with “simple” Optical Character Recognition (OCR), which returns only a blob of text corresponding to what the document contained, with no interpretation.

With simple OCR you might get back:

Bob Smith\nPoodle Cuts\n123 Elm Street\nDallas\nTexas\n555-123-4567\n$1,000\n$50\n6.25%\n12-3456\nxyzzy

You have no idea which value is a name, which is a phone number, which is a total amount, or which is an invoice number. The data has no context. For Document AI, Google has developed ML-based “parsers” that have been trained on huge numbers of documents. When you submit a document to Document AI for processing, what you get back is a contextual representation of the data:

  • Payer: Bob Smith
  • Company: Poodle Cuts
  • Billing address: 123 Elm Street, Dallas, Texas
  • Company phone: 555-123-4567
  • Total amount: $1,000
  • Shipping: $50
  • Tax Rate: 6.25%
  • Purchase order: 12-3456
  • Invoice number: xyzzy

Clearly, this is dramatically more useful than plain text.
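To make the contrast concrete, here is a minimal Python sketch. The OCR string is the one shown above. The entity list mimics the general shape of Document AI entities (`type`, `mentionText`, `confidence`), but the type names and confidence values are invented for illustration and are not the output of a real processor; `lookup` is a hypothetical helper.

```python
# Plain OCR: a single string. Nothing says which line is a name or a total.
ocr_text = (
    "Bob Smith\nPoodle Cuts\n123 Elm Street\nDallas\nTexas\n"
    "555-123-4567\n$1,000\n$50\n6.25%\n12-3456\nxyzzy"
)
ocr_lines = ocr_text.split("\n")

# Entity-style output: each value carries a label and a confidence score.
# ASSUMPTION: type names and confidences are made up for this sketch.
entities = [
    {"type": "payer", "mentionText": "Bob Smith", "confidence": 0.98},
    {"type": "company", "mentionText": "Poodle Cuts", "confidence": 0.97},
    {"type": "total_amount", "mentionText": "$1,000", "confidence": 0.95},
    {"type": "invoice_number", "mentionText": "xyzzy", "confidence": 0.91},
]


def lookup(entities, entity_type):
    """Return the text of the first entity with the given type, or None."""
    for entity in entities:
        if entity["type"] == entity_type:
            return entity["mentionText"]
    return None


# With OCR we can only ask for "line 10"; with entities we can ask by meaning.
print(ocr_lines[10])
print(lookup(entities, "invoice_number"))
```

With the labeled form, code that needs the invoice number no longer depends on where the value happened to land in the text.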

Document AI is exposed as an API service that we can invoke. We send it a document, and it returns a JSON structure.

So far so good. Now let us put on our programmer hats and think about actually using Document AI. Reading the API documentation, we quickly find that the response is, as expected, a JSON structure, but not a trivial one. It contains a lot of information, and a single response can run to thousands of fields. If we open it in a JSON editor, we will quickly become lost. We will undoubtedly turn to Google’s documentation of the JSON and start to make sense of its distinct components and categories. Unfortunately, when we hit puzzles such as fields not containing the values we expect, debugging them and correlating them with the source document can become a challenge.
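One quick way to get a feel for the size of such a response is to count the scalar values under each top-level key. The following Python sketch does that for any parsed JSON object. The `sample` dictionary is a tiny hand-written stand-in, not real Document AI output, and the helper names are hypothetical.

```python
def count_leaves(node):
    """Count scalar (non-container) values in a parsed JSON tree."""
    if isinstance(node, dict):
        return sum(count_leaves(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_leaves(item) for item in node)
    return 1


def summarize(doc):
    """Map each top-level key to the number of scalar values beneath it."""
    return {key: count_leaves(value) for key, value in doc.items()}


# ASSUMPTION: a hand-written miniature; a real response is far larger.
sample = {
    "text": "Bob Smith\nPoodle Cuts\n",
    "pages": [{"pageNumber": 1, "dimension": {"width": 100, "height": 200}}],
    "entities": [
        {"type": "payer", "mentionText": "Bob Smith", "confidence": 0.9},
    ],
}

for key, leaf_count in summarize(sample).items():
    print(f"{key}: {leaf_count}")
```

Run against a real response, a summary like this shows at a glance which sections account for most of the fields.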

The remainder of this article describes a sample open source tool that was created to help make sense of Document AI JSON.

First, let us look at what Google provides us out of the box. When we create a Document AI processor in the console, there is a button labeled “UPLOAD TEST DOCUMENT”.

When we click on this and upload a document, we may get a result similar to the following:

This is pretty good. On the left we see the detected entities and on the right we are shown where in the document an entity was found. Very nice. However, it has some limitations:

  • We can’t tell how this maps to the JSON structure against which we are going to work
  • We are not shown a variety of other fields such as confidence scores
  • We have to parse a document to see this result; there is no support for saving a parse or for loading bulk parses
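As an illustration of the second point, the confidence scores are present in the JSON even though the console view does not display them. Here is a minimal Python sketch that tabulates them. It assumes the document object has a top-level `entities` array whose items carry `type`, `mentionText`, and `confidence`; the sample values, the helper names, and the 0.8 threshold are invented for this sketch.

```python
def entity_rows(doc):
    """Yield (type, mentionText, confidence) for each top-level entity."""
    for entity in doc.get("entities", []):
        yield (
            entity.get("type", ""),
            entity.get("mentionText", ""),
            entity.get("confidence"),  # None if the field is absent
        )


def low_confidence(doc, threshold=0.8):
    """Return the rows whose confidence is present and below the threshold."""
    return [
        row for row in entity_rows(doc)
        if row[2] is not None and row[2] < threshold
    ]


# ASSUMPTION: illustrative values only, not real processor output.
doc = {
    "entities": [
        {"type": "payer", "mentionText": "Bob Smith", "confidence": 0.98},
        {"type": "total_amount", "mentionText": "$1,000", "confidence": 0.62},
        {"type": "invoice_number", "mentionText": "xyzzy", "confidence": 0.91},
    ]
}

for entity_type, text, confidence in entity_rows(doc):
    print(f"{entity_type:16} {text:12} {confidence}")
print("needs review:", low_confidence(doc))
```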

This brings us to our sample tool. The tool runs locally in a browser. We load a JSON file from local disk that contains the captured result of Document AI processing. The tool then shows us that result (similar to the Google version above) and goes on to let us drill into additional aspects, such as the JSON data for any selected entity.

From here, we can click on the information icon for each found entity and be shown the exact JSON that contributed to that entity. Other tabs include JSON, where we can see the JSON document as a whole, and DETAILS, where we are shown the details of the document as a whole in tabular format.
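The same kind of correlation can be done by hand. An entity typically points back into the document’s full `text` through a `textAnchor` containing `textSegments`. The Python sketch below rebuilds an entity’s text from those segments. It rests on two assumptions worth stating: the indices are treated as character offsets into `text`, and, because 64-bit integers are serialized as strings in this JSON, `startIndex` and `endIndex` arrive as strings, with `startIndex` omitted when it is zero. The sample data and the function name are hypothetical.

```python
def anchored_text(doc, entity):
    """Rebuild an entity's text from doc['text'] using its textAnchor."""
    full_text = doc.get("text", "")
    segments = entity.get("textAnchor", {}).get("textSegments", [])
    parts = []
    for segment in segments:
        # Indices may be strings; a missing startIndex means 0.
        start = int(segment.get("startIndex", 0))
        end = int(segment.get("endIndex", 0))
        parts.append(full_text[start:end])
    return "".join(parts)


# ASSUMPTION: a hand-written miniature document for illustration.
doc = {
    "text": "Bob Smith\nPoodle Cuts\n",
    "entities": [
        {"type": "payer",
         "textAnchor": {"textSegments": [{"endIndex": "9"}]}},
        {"type": "company",
         "textAnchor": {"textSegments": [{"startIndex": "10",
                                          "endIndex": "21"}]}},
    ],
}

for entity in doc["entities"]:
    print(entity["type"], "->", repr(anchored_text(doc, entity)))
```

Comparing the rebuilt text with an entity’s `mentionText` is one way to track down a field that does not contain the value we expected.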

What remains is for us to describe how to generate the JSON. This article isn’t going to cover the Document AI APIs; Google’s documentation on using Document AI is sufficient for that. However, here is a sample Makefile that can be used to execute a request:

# Fill in the placeholders below, then run: make submit
LOCATION=us
PROJECT_ID=<YOUR PROJECT ID>
PROCESSOR_ID=<YOUR PROCESSOR ID>
IMAGE_FILE=<YOUR IMAGE FILE>
IMAGE_MIME=image/png

# Recipe lines must be indented with a tab character.
.PHONY: submit
submit:
# Build the request body: a JSON wrapper around the base64-encoded image.
	printf '%s' '{"document": {"mimeType": "$(IMAGE_MIME)","content": "' > docai_request.json
	base64 < "$(IMAGE_FILE)" | tr -d '\n' >> docai_request.json
	printf '%s' '"}}' >> docai_request.json
# Call the processor and save the response.
	curl -X POST \
		-H "Authorization: Bearer $(shell gcloud auth application-default print-access-token)" \
		-H "Content-Type: application/json; charset=utf-8" \
		-d @docai_request.json \
		"https://$(LOCATION)-documentai.googleapis.com/v1beta3/projects/$(PROJECT_ID)/locations/$(LOCATION)/processors/$(PROCESSOR_ID):process" > result.json
	rm -f docai_request.json
	@echo "See result.json for results"

If you aren’t familiar with Makefiles, then here is an equivalent shell script:

#!/bin/bash
# Send an image to a Document AI processor and save the response as result.json.
set -euo pipefail

# Fill in the placeholders before running.
LOCATION=us
PROJECT_ID="<YOUR PROJECT ID>"
PROCESSOR_ID="<YOUR PROCESSOR ID>"
IMAGE_FILE="<YOUR IMAGE FILE>"
IMAGE_MIME=image/png

# Build the request body: a JSON wrapper around the base64-encoded image.
printf '%s' "{\"document\": {\"mimeType\": \"${IMAGE_MIME}\",\"content\": \"" > docai_request.json
base64 < "${IMAGE_FILE}" | tr -d '\n' >> docai_request.json
printf '%s' '"}}' >> docai_request.json

# Call the processor and save the response.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @docai_request.json \
  "https://${LOCATION}-documentai.googleapis.com/v1beta3/projects/${PROJECT_ID}/locations/${LOCATION}/processors/${PROCESSOR_ID}:process" > result.json
rm -f docai_request.json
echo "See result.json for results"

Running either of these generates a result.json file that contains the JSON output from sending an image to a Document AI processor. The key here is that we have a file (result.json) with which to work.
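Before working with result.json, it can be worth a quick sanity check. The curl command as written still produces a result.json when the request is rejected, in which case the file holds an error object rather than a parse. The Python sketch below assumes the usual Google API error shape (a top-level `error` object with a `message`) and assumes that a successful response wraps the parsed document under a top-level `document` key. Whether a given consumer expects the wrapper or the bare document is not covered here, so the hypothetical `unwrap` helper accepts both.

```python
import json


def unwrap(data):
    """Return the document object from a parsed response, or raise on error."""
    if "error" in data:
        message = data["error"].get("message", "unknown error")
        raise ValueError(f"Document AI request failed: {message}")
    # Accept either the full response or an already-unwrapped document.
    return data.get("document", data)


def load_document(path):
    """Read a saved response such as result.json and return its document."""
    with open(path, encoding="utf-8") as handle:
        return unwrap(json.load(handle))
```

For example, `load_document("result.json")` returns the document dictionary or raises `ValueError` carrying the API’s error message.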

You can run the sample application directly from here:

https://kolban-google.github.io/docai-dev/

The application runs exclusively in your browser and doesn’t send any data outside of your local environment. The source of the project is also available.

And finally … a short video illustrating the tool in action …

Neil Kolban
Google Cloud - Community

IT specialist with 30+ years industry experience. I am also a Google Customer Engineer assisting users to get the most out of Google Cloud Platform.