Beginner’s Guide to Extracting Receipt Information Using Deep Learning (OCR & NLP Models)

Mayank Ramina · Published in MyNextDeveloper · Feb 12, 2021

What is Information Extraction?
In the context of receipts, one can easily extract the text from a receipt image by using an OCR tool. With Information Extraction, we go a step further: we give meaning to the extracted text and convert it into information.

To solve this problem, I have used the approach provided by the ICDAR 2019 SROIE competition. The approach consists of three tasks.

Task 1 - Text Detection: Extract the positional coordinates (bounding boxes) of words in an image.
Task 2 - Text Recognition: Extract the actual characters of each word in the image.
Task 3 - Information Extraction: Assign meaning (labels) to the recognized text.

The following is my approach, which covers how one can use PaddleOCR and LayoutLM (an NLP model) to extract the hotel name, date, and total amount from a receipt image. Further, I will also discuss how you can extract other information, such as item rows, through this method.

PaddleOCR (Task 1 and Task 2)

It’s an OCR toolkit like Tesseract, but it performs better on robust reading tasks (text in natural, noisy scenes). With the help of PaddleOCR, one can easily get bounding boxes (Task 1) and recognize text (Task 2).

The formatted output of the code (Tasks 1 and 2 combined) consists of rows of bounding-box coordinates and the text within those boxes, as in the sketch below.
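To make this concrete, here is a minimal sketch of how one might run PaddleOCR to get both outputs in one call. It assumes the PaddleOCR 2.x Python API and a placeholder image path (‘receipt.jpg’); the exact result structure can vary slightly between versions.

```python
from paddleocr import PaddleOCR

# Detection + recognition in one call; models are downloaded on first run.
ocr = PaddleOCR(use_angle_cls=True, lang='en')

result = ocr.ocr('receipt.jpg', cls=True)  # 'receipt.jpg' is a placeholder path
for box, (text, confidence) in result[0]:
    # box: four [x, y] corner points of the detected text region (Task 1)
    # text: the recognized string inside that region (Task 2)
    print(box, text, confidence)
```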

LayoutLM (Task 3)

LayoutLM is a simple but effective multi-modal pre-training method of text, layout, and image for visually-rich document understanding and information extraction tasks, such as form understanding and receipt understanding.

In other words, with the help of a pre-trained LayoutLM model, one can process a document’s image and extract information from it.

To extract information from a document through LayoutLM, I need the positional data (Task 1) and recognition data (Task 2) of the text present in the document’s image, because this serves as the input to the LayoutLM model.

LayoutLM can perform two kinds of tasks:
1. Classification: predicting the corresponding category for each document image.
2. Sequence Labelling: extracting key-value pairs from scanned document images.

The information extraction task will require sequence labelling.

Implementation

LayoutLM has a good architecture for receipt information extraction, but it doesn’t come fully trained for that task. Therefore, it requires further training (fine-tuning).

Dataset and Weights:
As LayoutLM is a pre-trained model, one won’t need a large dataset to fine-tune it, though the amount needed depends on the visual structure of your receipts. If you want to target a wide range of receipt formats, you would need at least 200 images per targeted format for decent, if not good, performance. For fine-tuning, I am using the SROIE dataset from ICDAR 2019.

[Image: a sample receipt from the SROIE dataset]
[Image: SROIE Task 1 & 2 data for the above image from the dataset]
[Image: SROIE Task 3 data for the above image from the dataset]

For pre-trained weights, I am going to use this Kaggle model, which is the original LayoutLM pre-trained model further trained on a sequence labelling task. This gives us a good head start, as receipt information extraction is itself a sequence labelling task.

Data pre-processing (Transitioning from Tasks 1 & 2 to Task 3):
With the help of Tasks 1 & 2, I get the positional and recognition data of each bounding box. A bounding box can contain a single word or multiple words, but to utilise the layout information of the document, LayoutLM needs each word/token with its own location in the image. This means LayoutLM needs Task 1 and 2 data for every individual word, while the existing data is grouped into sets of words. To get data for every single word, I performed some arithmetic/processing for the conversion, as shown here and sketched below.
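A simple way to do this conversion is to split each box’s width in proportion to the length of each word. The helper below is a hypothetical illustration of that idea, not my exact code; `split_box_into_words` and its box format are assumptions.

```python
def split_box_into_words(box, text):
    """Split one OCR line box into per-word boxes.

    box:  [x_min, y_min, x_max, y_max] of the whole line
    text: the recognized string, words separated by spaces
    """
    x_min, y_min, x_max, y_max = box
    words = text.split()
    # Count characters plus the single spaces between words
    total_chars = sum(len(w) for w in words) + (len(words) - 1)
    char_width = (x_max - x_min) / max(total_chars, 1)

    word_boxes, cursor = [], x_min
    for word in words:
        word_width = char_width * len(word)
        word_boxes.append((word, [round(cursor), y_min,
                                  round(cursor + word_width), y_max]))
        cursor += word_width + char_width  # skip past the word and one space
    return word_boxes

# Example: a line box containing two words
print(split_box_into_words([100, 50, 300, 70], "TOTAL 9.50"))
# [('TOTAL', [100, 50, 200, 70]), ('9.50', [220, 50, 300, 70])]
```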

While performing the above processing steps to create a training set, one will need to define the “labels” (as mentioned in the above link) for the sequence labelling task, so that while predicting, the model classifies each word into one of these labels. This is why the OCR data needed to be converted into individual words. The labels are created based on BIO (IOB) tagging.

In the file below there are four labels, consisting of three target labels and a blank label (“O”), which is assigned to a word when no other target label applies.
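As a sketch, such a labels file could look like the following; the exact tag names depend on the preprocessing scripts (here I assume single-tag “S-” style labels for the three targets, a convention common in SROIE examples):

```
S-COMPANY
S-DATE
S-TOTAL
O
```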

After processing and label assignment, this is how the input and reference output for model training will look.

[Image: processed model input for training]
[Image: reference output for model training]

The processed data should be formatted in a certain way, into certain files, before it goes for training. This is covered better here.
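For orientation, the unilm LayoutLM sequence-labelling example expects the data split across parallel files roughly like this (a sketch; check the linked write-up for the exact format):

```
data/
├── labels.txt        # one label name per line
├── train.txt         # "word<TAB>label", blank line between receipts
├── train_box.txt     # "word<TAB>x1 y1 x2 y2" (coordinates scaled to 0-1000)
└── train_image.txt   # "word<TAB>x1 y1 x2 y2<TAB>width height<TAB>image_file"
```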

Now the data is processed and ready for training and testing, using the commands below.

Training Command:
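A fine-tuning run with the unilm `run_seq_labeling.py` script looks roughly like this (a sketch; the paths and flag values are assumptions, and flags can differ between versions of the repository):

```bash
python run_seq_labeling.py \
    --data_dir data \
    --labels data/labels.txt \
    --model_type layoutlm \
    --model_name_or_path path/to/pretrained_weights \
    --do_train \
    --max_seq_length 512 \
    --num_train_epochs 100 \
    --per_gpu_train_batch_size 16 \
    --output_dir output \
    --overwrite_output_dir
```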

Testing Command:
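Similarly, evaluation/prediction on the test split can be invoked along these lines (same caveats as above):

```bash
python run_seq_labeling.py \
    --data_dir data \
    --labels data/labels.txt \
    --model_type layoutlm \
    --model_name_or_path output \
    --do_predict \
    --output_dir results
```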

Result:
These are the results I got after training and testing on the SROIE dataset.

Prediction:
Now I can use this model to predict the hotel name, date, and total amount from any receipt.

This is the prediction for the receipt shown in Tasks 1 & 2. The model has labelled every word (token) with a corresponding label.

This prediction can be parsed into JSON format like this:
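As an illustration, a small post-processing step can collapse the per-word predictions into a JSON object. The word/label pairs below are made-up placeholders, and the grouping logic is an assumption, not my exact parser.

```python
import json

# Hypothetical (word, predicted label) pairs from the model
predictions = [
    ("BOOK", "S-COMPANY"), ("STORE", "S-COMPANY"),
    ("25/12/2018", "S-DATE"),
    ("TOTAL:", "O"), ("9.00", "S-TOTAL"),
]

label_to_field = {"S-COMPANY": "company", "S-DATE": "date", "S-TOTAL": "total"}
fields = {"company": [], "date": [], "total": []}

for word, label in predictions:
    field = label_to_field.get(label)
    if field:
        fields[field].append(word)

# Join multi-word fields back into single strings
print(json.dumps({k: " ".join(v) for k, v in fields.items()}, indent=2))
```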

Conclusion:

The prediction shows how powerful LayoutLM is. Even though the format of the predicted receipt is not exactly the same as the receipts in the SROIE dataset, the model still gives a decent prediction.

To improve accuracy, one can tune the hyperparameters or train on more data, depending on your needs.

Extracting more Information:

To extract further information, like item rows, you would need to create appropriate labels for each item-row field using the BIO format, as sketched below. You would also need to create custom training data annotated with your newly created labels.
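For example, an item row could be tagged with a scheme like the following (an illustrative label set and example line, not from the SROIE dataset):

```
# Hypothetical BIO labels for item rows:
#   B-ITEM_NAME, I-ITEM_NAME, B-ITEM_QTY, B-ITEM_PRICE, O
#
# "2  x  NASI  LEMAK  5.00" would then be tagged as:
#  B-ITEM_QTY  O  B-ITEM_NAME  I-ITEM_NAME  B-ITEM_PRICE
```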

Ending Note:

This was my approach to solving this problem. It is part of a product being developed for a client at MyNextDeveloper. If you have any doubts regarding the approach, I look forward to resolving them. If you have a different approach, please consider sharing it. You can also reach out to me on Twitter.
