Using Neural Networks for Invoice Recognition

Published in Ixor · 4 min read · Jan 6, 2020

About two years ago, we started developing a machine learning model for named entity recognition (NER) on invoices. For a computer, this kind of unstructured data is far harder to analyze than it is for humans. In the blink of an eye, we understand the meaning of words and numbers, and we do so by taking document structure and layout into account. Creating an ML model that combines word meaning and document layout in the same way is a real challenge. It looks like a perfect problem for artificial neural networks, but training one requires a significant number of unique invoice templates. One year ago, we did not have that amount of data at IxorThink, so we built a first invoice recognition model using XGBoost. If you haven’t read it yet, you can find the blog post here.

A new model structure

For our new field-detection model, we drew a lot of inspiration from a recent paper, “CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor” (Xiaohui Zhao et al., June 2019).

Their proposed model applies convolutional neural networks to extract key information from both semantic meaning and document layout. Specifically, the network is applied to a grid of text, which captures the document structure, while the words are embedded as features that carry their semantic meaning. According to the paper, this approach performs much better on receipts than basic NER solutions because it exploits 2D layout information (instead of the 1D sequence information used by standard NLP NER solutions). In addition, the experimental results are surprisingly good on relatively small training datasets.

We used this idea as the basis for our invoice recognition model: a CNN that combines semantic and 2D structural information. We further improved accuracy by adding several types of data augmentation and an attention module inside the CNN.

Input data and network structure

To transform words into embeddings that capture semantic meaning, we used Word2Vec (readily available in Gensim, and it can be trained on custom data). Embedding models of this kind turn text into a numerical form that neural nets can work with, using the context in which every word appears. However, training a Word2Vec model directly on financial documents is not ideal: amounts, currencies, and dates are mostly unique, so their embeddings would be of very low quality.

Similar to the practice of stemming every word before training an embedding model, we transform every numeric value into its pattern. This means, for example, transforming ‘23,99’ to ‘xxpxx’ and a date like ‘23/09/2019’ to ‘xxpxxpxxxx’. This should help to learn good-quality word embeddings, even with a small number of documents.
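As a rough illustration of this normalization step, here is a minimal sketch. It assumes a simple character-by-character rule (every digit becomes ‘x’, every other character becomes ‘p’), which is only one reading of the examples in the text; the exact rule used in our pipeline is not spelled out here, and the function name `to_pattern` is hypothetical.

```python
import re

_DIGIT = re.compile(r"\d")
_LETTER = re.compile(r"[A-Za-z]")


def to_pattern(token):
    """Replace a purely numeric token by its shape; leave other tokens unchanged.

    Assumption: digits map to 'x' and every other character (comma, dot,
    slash, ...) maps to 'p'.
    """
    # Ordinary words (no digits) and mixed tokens such as "INV2019" are kept as-is.
    if not _DIGIT.search(token) or _LETTER.search(token):
        return token
    return "".join("x" if ch.isdigit() else "p" for ch in token)


tokens = ["Total", "23,99", "EUR"]
print([to_pattern(t) for t in tokens])  # ['Total', 'xxpxx', 'EUR']
```

With a rule like this, thousands of distinct amounts collapse into a handful of shared tokens, so the embedding model sees each pattern often enough to learn a useful vector.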

Model structure to combine structural and semantic information for NER on invoices.

The layout of each document is parsed into a grid. Using the location of every word, its embedding is placed in the matching cell of the grid, and cells without a word keep an empty (zero) embedding, so the layout is preserved. The convolutional neural network learns to map the embedding grid to an output grid, which has the same dimensions but contains a certainty (or ‘probability’, if you like) for every possible class. As proposed in the CUTIE paper, the network architecture consists of multiple atrous (dilated) convolutions, including an ASPP (Atrous Spatial Pyramid Pooling) module, which combines several dilated convolutional layers with different dilation rates in parallel. Dilated convolutions enlarge the receptive field without adding parameters, so the model can relate words that are far apart on the page. Exploiting both the 2D structural information and the word semantics should greatly improve our scores.
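The grid construction can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not our production code: word positions are assumed to be given as centre coordinates normalized to [0, 1), the grid size is fixed up front, and `build_grid`, `embed` and `toy_vectors` are hypothetical names standing in for the real OCR output and the trained Word2Vec lookup.

```python
def build_grid(words, embed, n_rows, n_cols, dim):
    """Place word embeddings in a grid according to each word's position.

    words: iterable of (text, x, y), with x and y normalized to [0, 1).
    embed: function mapping a token to a list of `dim` floats.
    Cells without a word keep a zero embedding, which preserves the layout.
    """
    grid = [[[0.0] * dim for _ in range(n_cols)] for _ in range(n_rows)]
    for text, x, y in words:
        row = min(int(y * n_rows), n_rows - 1)
        col = min(int(x * n_cols), n_cols - 1)
        # Simplification: if two words land in the same cell, the later one wins.
        grid[row][col] = list(embed(text))
    return grid


# Toy 2-dimensional embeddings, purely for illustration.
toy_vectors = {"invoice": [1.0, 0.0], "total": [0.0, 1.0], "xxpxx": [0.5, 0.5]}


def embed(text):
    return toy_vectors.get(text.lower(), [0.0, 0.0])


words = [("Invoice", 0.10, 0.05), ("Total", 0.60, 0.90), ("xxpxx", 0.85, 0.90)]
grid = build_grid(words, embed, n_rows=4, n_cols=4, dim=2)
```

In this toy example, “Invoice” ends up in the top-left cell, while “Total” and the amount pattern sit next to each other in the bottom row, which is exactly the kind of spatial relation a 2D convolution can pick up.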

Data Augmentation

Transforming documents into a grid of embedded text not only improves the model’s capacity to learn structure, it also makes data augmentation of the layout possible. We can easily add vertical and horizontal white space, move cells up and down, and swap words by switching the content of neighboring cells.
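A minimal sketch of three such layout augmentations follows. The function names are hypothetical, and the example works on a grid of tokens rather than embeddings to keep it readable; the same operations would apply to an embedding grid, and the label grid has to be transformed in exactly the same way so that inputs and targets stay aligned.

```python
def insert_blank_row(grid, index, blank=None):
    """Add vertical white space: insert an empty row before position `index`."""
    return grid[:index] + [[blank for _ in grid[0]]] + grid[index:]


def insert_blank_col(grid, index, blank=None):
    """Add horizontal white space: insert an empty column before position `index`."""
    return [row[:index] + [blank] + row[index:] for row in grid]


def swap_neighbours(grid, row, col):
    """Swap the content of cell (row, col) with its right-hand neighbour."""
    new_grid = [list(r) for r in grid]  # copy so the original grid is untouched
    new_grid[row][col], new_grid[row][col + 1] = (
        new_grid[row][col + 1],
        new_grid[row][col],
    )
    return new_grid


tokens = [["Total", "xxpxx"],
          [None, "EUR"]]

print(insert_blank_row(tokens, 1))   # [['Total', 'xxpxx'], [None, None], [None, 'EUR']]
print(insert_blank_col(tokens, 1))   # [['Total', None, 'xxpxx'], [None, None, 'EUR']]
print(swap_neighbours(tokens, 0, 0)) # [['xxpxx', 'Total'], [None, 'EUR']]
```

In practice, the row, column and cell indices would be drawn at random for every training sample, so the network sees many layout variants of the same invoice.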

CNN for the win?

Even with a relatively small dataset (around 600 unique invoice templates), our model performs surprisingly well. The results below show the big leap we made compared with our previous invoice model from one year ago (on a mixed test set of seen and unseen invoice templates).

Comparison of new CNN vs old XGBoost NER model

It gets even more interesting when we look at the recognition scores on new, unseen templates, a complex task that the previous version could not handle. These are very competitive scores, and they can be improved further by training on client-specific datasets, up to almost fully correct field detection.

Comparison of our CNN on a small dataset of unseen invoices vs competitors (also on our test set).

And now?

Of course, we keep improving our techniques for document analysis. At the moment we are developing techniques to fine-tune our model for a specific client without running into “catastrophic forgetting” (a neural net trained on new data tends to forget what it learned before). Furthermore, we are expanding the IxorDocs AI toolkit to support NER and classification on all kinds of unstructured and structured documents.

At IxorThink we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company, we can provide stable and fully developed solutions. Feel free to contact us for more information.