Extracting Form Data to JSON, Excel & Pandas with Azure Form Recognizer

Aaron (Ari) Bornstein
Published in Microsoft Azure
4 min read · Apr 23, 2020

TLDR; This post shows how to extract data from table images for pandas and Excel using the Azure Form Recognizer Service in Python.

What is Azure Form Recognizer Service?

The Azure Form Recognizer is a Cognitive Service that uses machine learning technology to identify and extract text, key/value pairs and table data from form documents. It ingests text from forms and outputs structured data that includes the relationships in the original file.

An example of a receipt that can be processed with the Azure Form Recognizer Service

There is a free tier of the service that provides up to 500 calls a month, which is more than enough to run this demo.

If you are new to Azure, you can get started with a free subscription using the link below.

How to Consume the Azure Form Recognizer Service

The Azure Form Recognizer Service can be consumed using a REST API or the following code in Python.
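
As a rough illustration of that workflow, here is a minimal sketch that uses only the Python standard library. It assumes the v2.0 REST route for the pre-built receipt model and the asynchronous submit-then-poll pattern, where the service answers the POST with an Operation-Location header that is then polled for the result. The names build_analyze_request, analyze_receipt and ANALYZE_PATH are illustrative, the sketch only handles local files, and the route should be checked against the API version of your own resource.

```python
import json
import time
import urllib.request

# Assumption: v2.0 route of the pre-built receipt model; adjust for your API version.
ANALYZE_PATH = "/formrecognizer/v2.0/prebuilt/receipt/analyze"


def build_analyze_request(endpoint, apim_key, data, file_type):
    """Build the POST request that submits a document for analysis."""
    url = endpoint.rstrip("/") + ANALYZE_PATH
    headers = {
        "Content-Type": file_type,              # e.g. image/png
        "Ocp-Apim-Subscription-Key": apim_key,  # key of the Form Recognizer resource
    }
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def analyze_receipt(endpoint, apim_key, file_path, file_type,
                    wait_seconds=2, max_tries=30):
    """Submit a local file and poll until the analysis succeeds or fails."""
    with open(file_path, "rb") as f:
        data = f.read()

    request = build_analyze_request(endpoint, apim_key, data, file_type)
    with urllib.request.urlopen(request) as response:
        # The service accepts the job and reports where the result will appear.
        result_url = response.headers["Operation-Location"]

    poll = urllib.request.Request(
        result_url, headers={"Ocp-Apim-Subscription-Key": apim_key}
    )
    for _ in range(max_tries):
        with urllib.request.urlopen(poll) as response:
            result = json.load(response)
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(wait_seconds)
    raise TimeoutError("Analysis did not finish in time")
```

Calling analyze_receipt("<endpoint>", "<apim_key>", "<file path>", "<file type>") would return the parsed JSON response as a Python dictionary.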

However, this code returns the result in JSON format, with a lot of additional information that is not relevant to the actual processing of the form data.

Pictured: Example JSON Response

The following code sample will show you how to reformat this JSON with Python into a pandas DataFrame so it can be processed in a traditional data science pipeline or even exported to Excel.
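
A minimal sketch of that reformatting step is shown here. It assumes the response follows the v2.0 receipt layout, with the extracted fields under analyzeResult, documentResults, fields, and each field carrying a text and a confidence entry. The sample response is hand-written for illustration rather than real service output, and fields_to_dataframe is a hypothetical helper, not the code from the gist.

```python
import pandas as pd


def fields_to_dataframe(result):
    """Flatten the recognized fields into one row per field."""
    rows = []
    documents = result.get("analyzeResult", {}).get("documentResults", [])
    for doc in documents:
        for name, field in doc.get("fields", {}).items():
            if not field:  # skip fields the service returned as null
                continue
            rows.append({
                "field": name,
                "text": field.get("text"),
                "confidence": field.get("confidence"),
            })
    return pd.DataFrame(rows, columns=["field", "text", "confidence"])


# Hand-written, heavily trimmed stand-in for a service response (assumed shape).
sample_result = {
    "status": "succeeded",
    "analyzeResult": {
        "documentResults": [
            {
                "docType": "prebuilt:receipt",
                "fields": {
                    "MerchantName": {"type": "string", "text": "Contoso", "confidence": 0.95},
                    "Total": {"type": "number", "text": "$14.50", "confidence": 0.9},
                },
            }
        ]
    },
}

df = fields_to_dataframe(sample_result)
print(df)
```

Everything outside the fields section, such as the raw OCR lines and bounding boxes, is simply ignored, which is what keeps the resulting table small.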

Form Data formatted in a tabular Pandas DataFrame

Prerequisites

We will use the pre-trained receipt model for this tutorial. The end-to-end code can be found in the following gist.

  • Replace <file path> with the file path of your form or table (for example, C:\temp\file.pdf). This can also be the URL of a remote file.
  • Replace <endpoint> and <apim_key> with the endpoint and key that you obtained with your Form Recognizer subscription. You can find these on your Form Recognizer resource Overview tab, pictured below.
  • Replace <file type> with the file type. Supported types: application/pdf, image/jpeg, image/png, image/tiff.
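
If you would rather not type the file type by hand, it can be derived from the file extension. The sketch below uses the standard library mimetypes module and checks the guess against the supported types listed above; guess_file_type is a hypothetical helper added for illustration and is not part of the gist.

```python
import mimetypes

# Content types listed as supported in the prerequisites above.
SUPPORTED_TYPES = {"application/pdf", "image/jpeg", "image/png", "image/tiff"}


def guess_file_type(file_path):
    """Return the Content-Type for file_path, or raise if it is not supported."""
    file_type, _ = mimetypes.guess_type(file_path)
    if file_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type for {file_path!r}: {file_type}")
    return file_type


print(guess_file_type("receipt.png"))
```

Failing early on an unsupported type avoids spending one of the free-tier calls on a request the service would reject anyway.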

Code

The end-to-end code can be found here: https://gist.github.com/aribornstein/b18b6b6b46ed0715510fc95b32b55f15

Exporting to Excel

Once your data is in a pandas DataFrame, it can be converted to CSV for processing with Excel in just one line of code.

df.to_csv("form_data.csv")  # can now be processed with Excel
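
For a slightly fuller picture, here is a small self-contained sketch of the export. The DataFrame is a hand-written stand-in for the form data, and the file is written to the system temporary directory so the example does not clutter your working folder; both choices are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hand-written stand-in for the DataFrame built from the form fields.
df = pd.DataFrame(
    {"field": ["MerchantName", "Total"], "text": ["Contoso", "$14.50"]}
)

csv_path = os.path.join(tempfile.gettempdir(), "form_data.csv")

# index=False leaves out the row index, so Excel does not show an extra unnamed column.
df.to_csv(csv_path, index=False)

# Reading the file back is a quick check that the export round-trips.
round_trip = pd.read_csv(csv_path)
print(round_trip)
```

pandas also provides DataFrame.to_excel for writing an .xlsx file directly; it needs an Excel engine such as openpyxl to be installed.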

Hope you enjoyed this demo of the power of the Azure Form Recognizer Cognitive Service. Check out the next steps to see how to train your own custom models and then use this code to extract their results to pandas and/or Excel.

Next Steps

  • Train a Custom Model on your own Form/Table Data
  • Link your model to a logic app to create an end-to-end data processing pipeline

About the Author

Aaron (Ari) Bornstein is an AI researcher with a passion for history, engaging with new technologies and computational medicine. As an Open Source Engineer on Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli Hi-Tech Community to solve real-world problems with game-changing technologies that are then documented, open sourced, and shared with the rest of the world.
