Data Extraction Using the Azure ‘Form Recognizer’ Custom Model

Sumit Kumar
Published in Version 1
Dec 11, 2020
Credit: Pixabay

The Need:

In the era of automation, machine learning and AI, many companies still manually read and process thousands of forms (invoices, tax forms, handwritten forms, etc.) and enter the data into a structured schema.

Some companies even hire third-party companies to process their forms, and the cost per page is prohibitive. In both cases, form processing is time-consuming, tedious, expensive and prone to human error.

Automated data extraction from printed and hand-written forms is now a tried-and-true technology. Microsoft Azure provides many Cognitive Services (Vision, Speech, Language, Knowledge, and Search) to assist developers in introducing AI into applications to make them smart.

These Cognitive Services can be combined to make applications more intelligent, engaging, and discoverable. One such service, part of Microsoft Azure Cognitive Services, is Azure Form Recognizer.

Image from Microsoft Azure Cognitive Services Demos

Overview:

Form Recognizer is an AI-powered document extraction service that understands forms and extracts key-value pairs, tables, and text from documents such as W2 tax statements, completion reports, invoices, and purchase orders.

Along with printed forms, Form Recognizer also supports handwritten and mixed-mode (printed and handwritten) forms.

E.g.: key-value pair extraction from a receipt.

Image from Microsoft Azure Cognitive Services Demos

Form Recognizer is composed of the following services:

· Layout API — Extract text, selection marks and table structures along with their bounding box coordinates from documents.

· Custom models — Extract text, key/value pairs, selection marks and table data from forms. These models are trained with our own data, so they are tailored to our forms.

· Prebuilt models — Extract data from unique form types using prebuilt models. The following prebuilt models are currently available: invoices, sales receipts and business cards.

In this blog we will mainly focus on the Custom Model, and how we can call it using a REST API or the client library SDKs to decrease complexity and integrate it into our workflow or application.

Form Recognizer — Custom Model:

Form Recognizer uses machine learning to extract data from documents. For this, it needs at least 5 forms of the same type to train a model, which then serves as the reference point when subsequent forms are processed. These training forms are the ones we label.

The tool that assists in labelling is the Form Recognizer Labelling Tool.

The Labelling Tool:

Microsoft has made this tool available via Docker. You can install Docker on your laptop or a VM (Virtual Machine), on any operating system: Linux, Windows, or macOS. This is an excellent method that lets us keep the modelling project on-prem. Some prefer a PaaS approach to hosting the tool, as it guarantees availability without relying on a laptop or VM. Please find the link to set up the tool here.

Pre-Requisites for building the custom model:

1. An Azure Subscription

2. A Form Recognizer resource (created via the Azure Portal)

3. An Azure Storage Account (to store the training document set)

4. Postman (an API testing tool)

We can train a custom model by manually labelling the training documents. Training with labels/tags leads to better performance in many settings. The returned CustomFormModel shows all the fields the model can extract, along with the estimated accuracy of each field.

To train with labels, we need special label information files (e.g. for a PDF training file: <filename>.pdf.labels.json) in the blob storage container alongside the training documents. The Form Recognizer sample labelling tool mentioned above provides a UI to help create these label files.
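The naming rule above can be checked before training starts. This is a minimal sketch, assuming we already have the list of blob names in the container (for example from a storage listing). The helper name find_unlabelled and the set of document extensions are our own illustrative choices, not part of Form Recognizer:

```python
from typing import Iterable, List

# Assumption: the document types we expect in the training container.
DOC_EXTENSIONS = (".pdf", ".jpg", ".jpeg", ".png", ".tif", ".tiff")
# Label files sit next to each document as <document name>.labels.json
LABEL_SUFFIX = ".labels.json"


def find_unlabelled(blob_names: Iterable[str]) -> List[str]:
    """Return training documents that have no companion label file."""
    names = set(blob_names)
    # Only names ending in a document extension count as training documents,
    # so the .json files themselves are ignored.
    documents = [n for n in names if n.lower().endswith(DOC_EXTENSIONS)]
    return sorted(d for d in documents if d + LABEL_SUFFIX not in names)
```

Calling find_unlabelled(["invoice_1.pdf", "invoice_1.pdf.labels.json", "invoice_2.pdf"]) returns ["invoice_2.pdf"], which tells us that one document still needs labelling.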

The model can be trained in the labelling tool itself, or developers can use the Form Recognizer client library to run the training programmatically in the following languages: Python, JavaScript, C#, and Java. For SDK sample code in each language, please visit the Microsoft GitHub page or the documentation.
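Besides the client libraries, training can be started with a plain REST call, which is also what we would send from Postman. The sketch below only builds the request and does not send it. The v2.1 path, the useLabelFile body field and the Ocp-Apim-Subscription-Key header reflect our understanding of the v2.x REST API and should be checked against the API reference for your resource's version. The endpoint, key and SAS URL are placeholders, and build_training_request is a hypothetical helper name:

```python
import json
import urllib.request

# Assumption: API version segment; adjust to the version your resource supports.
API_VERSION = "v2.1"


def build_training_request(endpoint: str, api_key: str,
                           container_sas_url: str,
                           use_labels: bool = True) -> urllib.request.Request:
    """Build (but do not send) a 'train custom model' request."""
    url = f"{endpoint.rstrip('/')}/formrecognizer/{API_VERSION}/custom/models"
    body = {
        "source": container_sas_url,  # SAS URL of the blob container with the training set
        "useLabelFile": use_labels,   # True = train with the *.labels.json files
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": api_key,
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would submit the job. As far as we know, the service answers with a Location header that points at the new model, which can then be polled until training finishes.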

After successful training, we can extract key/value information and other user-defined tags from a custom form using the trained model’s ID. The analysis can also be done in the labelling tool, or programmatically in any of the four languages.
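Once an analysis has finished, the labelled fields have to be pulled out of the JSON result. This is a minimal sketch, assuming the v2.x REST result shape in which fields sit under analyzeResult.documentResults, each with a text and a confidence value. That layout is our understanding of the API and should be verified against a real response. The function name and the confidence threshold are hypothetical:

```python
from typing import Any, Dict, Tuple


def extract_fields(result: Dict[str, Any],
                   min_confidence: float = 0.0) -> Dict[str, Tuple[str, float]]:
    """Map each labelled field name to (text, confidence)."""
    extracted: Dict[str, Tuple[str, float]] = {}
    documents = result.get("analyzeResult", {}).get("documentResults", [])
    for document in documents:
        for name, field in document.get("fields", {}).items():
            if field is None:
                # The field was not found on this form.
                continue
            confidence = field.get("confidence", 0.0)
            if confidence >= min_confidence:
                extracted[name] = (field.get("text"), confidence)
    return extracted
```

Raising min_confidence (for example to 0.8) gives a simple way to route low-confidence values to manual review instead of writing them straight into the target schema.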

Along with training and testing the model, we can also manage the different models created. The following operations can be performed in relation to managing the models.

1. Check the number of models in the Form Recognizer resource account.

2. List the models currently stored in the resource account.

3. Get a specific model using the model’s ID.

4. Delete a model from the resource account.

The above operations can be performed using any of the 4 languages.
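The four operations above can be sketched in Python as follows. The functions take an already constructed training client. We assume the method and attribute names of FormTrainingClient from the azure-ai-formrecognizer 3.x package (get_account_properties, list_custom_models, get_custom_model, delete_model), so newer SDK versions may differ. The wrapper function names are our own:

```python
from typing import Any, Dict

# Assumed client construction (needs the azure-ai-formrecognizer package and real values):
#   from azure.ai.formrecognizer import FormTrainingClient
#   from azure.core.credentials import AzureKeyCredential
#   client = FormTrainingClient(endpoint, AzureKeyCredential(key))


def summarise_models(training_client: Any) -> Dict[str, Any]:
    """Operations 1 and 2: count the models and list them."""
    props = training_client.get_account_properties()
    models = [(m.model_id, m.status) for m in training_client.list_custom_models()]
    return {
        "count": props.custom_model_count,
        "limit": props.custom_model_limit,
        "models": models,
    }


def field_accuracies(training_client: Any, model_id: str) -> Dict[str, float]:
    """Operation 3: fetch one model and report the estimated accuracy per field."""
    model = training_client.get_custom_model(model_id)
    return {
        name: field.accuracy
        for submodel in model.submodels
        for name, field in submodel.fields.items()
    }


def delete_if_present(training_client: Any, model_id: str) -> bool:
    """Operation 4: delete a model, but only if it exists in the account."""
    known_ids = {m.model_id for m in training_client.list_custom_models()}
    if model_id not in known_ids:
        return False
    training_client.delete_model(model_id)
    return True
```

Keeping the client as a parameter makes these helpers easy to exercise with a stub object before pointing them at a real resource.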

Credit: Christina Morillo

Going the extra mile:

Pre-Processing:

Azure Form Recognizer uses the Azure Read API to perform OCR on both handwritten and printed text. In the real world, documents are rarely perfectly scanned or oriented, and OCR performance can take a drastic hit.

We handled this with pre-processing steps that enhance the documents, correct their orientation and remove skew. The pre-processing was performed using the open-source library OpenCV.

Post-Processing:

Also, after fetching the results, the OCR values were often less accurate than expected when compared with the ground truth. This can be improved with a post-processing step of automatic text correction, built as a machine learning model trained on a domain-specific data set.
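To give a feel for what such a correction step does, here is a deliberately simpler stand-in: dictionary-based fuzzy matching with Python's standard difflib module, not the machine learning model described above. The vocabulary, the cutoff value and the function name are hypothetical:

```python
import difflib
from typing import Iterable


def correct_value(ocr_text: str, vocabulary: Iterable[str],
                  cutoff: float = 0.75) -> str:
    """Replace an OCR value with the closest known domain term, if any.

    Values with no sufficiently similar term are returned unchanged, so the
    step never invents a correction for text it does not recognise.
    """
    matches = difflib.get_close_matches(ocr_text, list(vocabulary), n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_text
```

With a vocabulary of ["Dublin", "Invoice", "Total"], the OCR value "Dub1in" is corrected to "Dublin", while an unrelated value such as "xyz" is left as it is. A trained model can go further by using context, but the input and output of the step look the same.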

Conclusion:

I strongly recommend using Azure Form Recognizer and adding the pre-processing and post-processing steps to complement the Azure service and to build a solution to suit your own specific requirements.

Learn More:

Sumit Kumar is a Data Analytics Consultant who has worked at Version 1 for the past two years, with a strong interest in Data Analytics, Data Engineering and Data Science. Sumit is currently working in the Version 1 Innovation Labs. The Innovation Labs has been in action since 2018 and has had many success stories in the form of successful collaborative Proofs of Value (PoVs). We are keen on engaging more with our current and new customers to demonstrate how the latest technologies can add value to their business. To find out more about Innovation at Version 1, visit us here.
