New Google AI Neural Network Extracts Structured Information from Documents
Google AI recently published a neural network that extracts structured information from template documents. Unlike previous approaches, the model uses knowledge of the target field types to select and rank recognized spans of text in a document. Experiments on a corpus of invoices and receipts show that the neural network generalizes to types of documents on which it was not trained.
What is the problem
Template documents such as receipts, invoices, and insurance quotes have many business uses. At present, such documents are processed mostly by hand. Existing automated systems rely on heuristics that are not robust to errors or to variations in document format. The researchers propose a neural network approach for extracting information from template documents.
How the model works
The proposed approach allows developers to train and deploy a system for extracting data from documents of a certain type. The model takes as input a target schema, which lists the fields to extract and their types, and a small set of labeled documents.
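To make the idea of a target schema concrete, here is a minimal Python sketch. The class names, the field names, and the `fields_of_type` helper are all hypothetical and chosen for illustration; the article does not describe the actual schema format used by the researchers.

```python
from dataclasses import dataclass
from enum import Enum


class FieldType(Enum):
    """Field types mentioned in the article (identifiers are assumptions)."""
    DATE = "date"
    NUMBER = "number"
    ALPHANUMERIC = "alphanumeric"
    CURRENCY = "currency"
    PHONE = "phone"
    URL = "url"


@dataclass(frozen=True)
class Field:
    name: str
    type: FieldType


# Hypothetical schema for invoices; the field names are illustrative only.
INVOICE_SCHEMA = [
    Field("invoice_date", FieldType.DATE),
    Field("invoice_id", FieldType.ALPHANUMERIC),
    Field("total_amount", FieldType.CURRENCY),
]


def fields_of_type(schema, field_type):
    """Return the names of schema fields that expect the given type."""
    return [f.name for f in schema if f.type is field_type]
```

Declaring a type for each field is what would let a system look only at text spans of the matching type when filling that field.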
The model extracts data of the following types: dates, numbers, alphanumeric codes, currency amounts, telephone numbers, and URLs. The input document first goes through an optical character recognition (OCR) service, which converts it from a PDF or image into text. The resulting text is then passed to a candidate generator, which selects the spans of text that could be values for a given field type. Finally, a neural network ranks the candidates.
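The generate-then-rank pipeline can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the researchers' implementation: the regular expressions are simplified guesses at what a type-specific candidate generator might look like, and the `score` function is a keyword-counting stand-in for the neural ranker, which the article does not describe in detail. All function names are hypothetical.

```python
import re

# Toy type-specific candidate generators (assumed patterns, not the paper's).
CANDIDATE_PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "currency": re.compile(r"\$\d+(?:\.\d{2})?"),
}


def generate_candidates(text, field_type):
    """Return (span_text, start_offset) for every match of the type's pattern."""
    pattern = CANDIDATE_PATTERNS[field_type]
    return [(m.group(), m.start()) for m in pattern.finditer(text)]


def score(text, start, keywords, window=20):
    """Stand-in for the neural ranker: count the keywords that appear in the
    `window` characters immediately before the candidate."""
    context = text[max(0, start - window):start].lower()
    return sum(kw in context for kw in keywords)


def extract(text, field_type, keywords):
    """Rank all candidates of the given type and return the best span, or None."""
    candidates = generate_candidates(text, field_type)
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score(text, c[1], keywords))
    return best[0]


# Pretend this string came back from an OCR service.
ocr_text = (
    "Invoice date: 2020-03-01  Due date: 2020-03-31  "
    "Subtotal: $90.00  Total due: $99.00"
)

invoice_date = extract(ocr_text, "date", ["invoice"])
total_amount = extract(ocr_text, "currency", ["total due"])
```

Here the document contains two dates and two amounts, so candidate generation alone is ambiguous; the ranking step uses the nearby text to choose between them. In the system described above, a learned model plays that role instead of hand-picked keywords.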
Model performance evaluation
For training and validation, the researchers used a dataset of invoices in different formats. They tested the system on documents in a format that the model had not seen before.