From the Edge: Extracting Info from Forms

Jake Tauscher
Jul 28, 2020


This blog is part of a series on recent academic papers in the AI/ML community. By understanding what experts are spending their time researching, we will get a sense of the current limits and the future of the AI/ML world!

Researchers from Google and UC San Diego recently published a paper on a new approach to extracting key information (e.g., invoice date, invoice amount) from “form-like” documents (in this case, invoices).

Why is this interesting?

First, it could be super valuable! Automating the ‘extraction’ of key information from forms is surprisingly difficult for AI if it has not seen that precise form before. For example, if you train an AI model to extract info from an invoice where the amount due sits in the bottom right, it will struggle with an invoice where the amount due is instead in the top right.

And many companies receive forms, like invoices, in a wide variety of formats; each supplier, for example, might use a different one. So this type of AI model could save significant manual data-entry effort.

Second, I find their approach interesting because it relies on “teaching” the model a few rules the researchers developed before training it. Often in AI, the approach is to ‘under-engineer’ the model: the whole point is to give it the data and let it learn, without pre-biasing it in any way. So it caught my attention that these researchers took a slightly different path!

Tell me the details!

The researchers built a three-part model. First, they observe that each ‘field’ consistently holds the same data type. For example, if they are aiming to pull the ‘invoice date’ from an invoice, the value should always be a date. So, to develop a list of ‘candidates’ for the right choice, they pull all the dates off the form.
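To make that concrete, here is a minimal sketch of what type-based candidate generation for a date field might look like. This is my own illustration, not the paper’s code, and the OCR token format is a hypothetical assumption:

```python
import re

# Minimal sketch of type-based candidate generation (illustrative, not the
# paper's code). For a 'date' field, every date-like string on the form
# becomes a candidate, no matter where it appears.

DATE_PATTERN = re.compile(
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"   # e.g. 07/28/2020
    r"|[A-Za-z]+ \d{1,2}, \d{4})\b"        # e.g. July 28, 2020
)

def generate_date_candidates(ocr_tokens):
    """Return every date-like token as a candidate for a date field.

    ocr_tokens: list of (text, (x, y)) pairs -- a hypothetical OCR output
    format, used here purely for illustration.
    """
    return [
        {"text": text, "position": position}
        for text, position in ocr_tokens
        if DATE_PATTERN.search(text)
    ]
```

The same idea applies to amounts, addresses, and so on: a simple, high-recall extractor per data type produces the candidate pool, and the learned model only has to rank within it.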

Second, they feed these candidates (plus information about them, including the surrounding words and the relative positions of those words) through a neural network that gives each candidate a score, from 0 to 1, corresponding to its likelihood of being the value of the target field (in our case, the invoice date).
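Here is a rough sketch of that scoring idea in PyTorch. This is my own simplification rather than the paper’s actual architecture; the point is just that each candidate is represented by its neighboring words plus their positions relative to the candidate, and a small network maps that representation to a score between 0 and 1:

```python
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    """Toy candidate scorer (an illustrative simplification, not the paper's
    architecture). Each candidate is described by up to `max_neighbors`
    nearby words and their (x, y) offsets relative to the candidate."""

    def __init__(self, vocab_size=10_000, embed_dim=32, max_neighbors=10):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(max_neighbors * (embed_dim + 2), 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, neighbor_ids, neighbor_offsets):
        # neighbor_ids:     (batch, max_neighbors) word indices
        # neighbor_offsets: (batch, max_neighbors, 2) relative positions
        words = self.word_embed(neighbor_ids)                 # (B, N, E)
        feats = torch.cat([words, neighbor_offsets], dim=-1)  # (B, N, E+2)
        flat = feats.flatten(start_dim=1)                     # (B, N*(E+2))
        return torch.sigmoid(self.mlp(flat)).squeeze(-1)      # (B,) in [0, 1]

scorer = CandidateScorer()
ids = torch.randint(0, 10_000, (4, 10))   # 4 candidates, 10 neighbors each
offsets = torch.randn(4, 10, 2)           # relative (x, y) offsets
print(scorer(ids, offsets))               # four scores between 0 and 1
```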

Finally, they assign to each field the candidate most likely to be its value. This could simply be the highest-scoring candidate from step two, but the researchers note that by separating the ‘scoring’ step from the ‘assignment’ step, they leave room for business rules to play a part. For example, if the ‘due date’ must fall later than the ‘invoice date’, you can enforce that relationship during assignment, taking the highest-scoring assignment that also satisfies the constraint.
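Here is a toy example (again mine, not the paper’s, with made-up scores) of how separating scoring from assignment lets a business rule override the raw scores:

```python
from datetime import date
from itertools import product

# Candidates as (parsed value, model score) pairs -- made-up numbers.
invoice_date_candidates = [
    (date(2020, 7, 1), 0.90),
    (date(2020, 8, 1), 0.80),
]
due_date_candidates = [
    (date(2020, 7, 15), 0.70),
    (date(2020, 6, 15), 0.95),  # highest score, but violates the rule below
]

# Assignment: highest combined score among pairs satisfying the business
# rule that the due date must come after the invoice date.
best_invoice, best_due = max(
    (
        (inv, due)
        for inv, due in product(invoice_date_candidates, due_date_candidates)
        if due[0] > inv[0]
    ),
    key=lambda pair: pair[0][1] + pair[1][1],
)
print(best_invoice, best_due)  # (2020-07-01, 0.9) (2020-07-15, 0.7)
```

Note that the highest-scoring due-date candidate loses here, because it would put the due date before the invoice date.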

What did they (and we) learn?

First off, I think this approach is interesting from a business perspective, because it could create buy-in with users. By leveraging business rules (which the users themselves supply), it chips away at the ‘black box’ nature of AI.

As for the results of the experiment, the researchers found that their model performed well, outperforming ‘naïve’ benchmark models.

Additionally, the researchers uncovered some interesting insights into the relative importance of the various input factors: the most important signal for the prediction was the content of the neighboring text, followed by the candidate’s position on the form.

The researchers hope to continue this work by extending it to handle ‘repeated fields’ (fields that may appear multiple times in a document) and by adapting the model to specific domains.

And you can read the paper yourself! https://storage.googleapis.com/pub-tools-public-publication-data/pdf/59f3bb33216eae711b36f3d8b3ee3cc67058803f.pdf
