Comparing RNN and CNN models on invoice extraction: LSTM vs GRU vs TCN
My last post explained why it makes sense to use neural networks for invoice extraction. Now we will take a look at how different neural network architectures perform on this task.
Applying neural network research
The development and optimization of neural network models has always been a heuristic domain. In recent years, deep learning research and its applications have followed one of these paths:
- modify an existing neural network architecture to reach better results (more layers, more regularization, better parameters, etc.),
- transfer an existing architecture to a different task or research field, or
- develop a new architecture or regularization technique that performs better than the previous state of the art (on specific tasks).
While varying in complexity, these approaches have one thing in common: model performance is tested against standard benchmarks, such as the CoNLL-2003 task for Named Entity Recognition (NER). On one hand, such standard benchmarks provide standardized data sets and an easy way to compare performance, but on the other hand, they tend to be very simple and unrealistic.
This means that if you want to find a good model for your specific task, you will need to implement and test different models that seem promising in theory, even if they were developed for and tested on a different task. As soon as you want to use neural networks for a specific application like extracting information from invoices, you have to transfer different models to your problem setting and see how they perform. In this post, we will explore the application of recurrent neural networks (RNNs) and convolutional neural networks (CNNs) to the task of invoice extraction.
Recurrent neural networks (RNNs)
RNNs have been the state of the art in NER and multiple other domains of Natural Language Processing (NLP) for years, as they are designed to process input sequences (e.g. text strings) quite efficiently. Recurrent architectures, especially Long Short-Term Memory (LSTM) networks, became highly popular for NLP applications around 2015 thanks to technological advances and widely read blog posts by Karpathy and Olah, which showed how effectively RNNs can be implemented to achieve impressive results.
The LSTM iterates over a sequence (e.g. a text) element by element (e.g. word by word) in order to learn which sequence of elements leads to which type of result. An LSTM module contains three sigmoid gates that handle different tasks:
- forgetting (a certain amount of) the previous result (forget gate),
- including (a certain amount of) the current element (input gate), and
- calculating the result for usage in the next iteration step (output gate).
A sigmoid gate is a very simple function that basically multiplies an input by a value between 0 and 1, acting as a “filter” that can be trained to use a different factor depending on the context of an input (i.e. the previously seen elements). In order to coordinate these different gates, the LSTM calculates a cell state (which conveys the intermediate results from one iteration step to another, presenting the final result eventually) and a hidden state (which provides a snapshot of the current iteration step’s result).
In very simple words, this means that the LSTM is able to read a sentence word by word and create a result based on the sequence of these words, handing itself intermediate results from each step to keep track of the “bigger picture”.
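To make the gate mechanics concrete, here is a minimal sketch of a single LSTM step in PyTorch. The math is the standard LSTM cell described above; the weight shapes and the toy input are purely illustrative and are not the model trained in the experiments below.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM iteration step: x_t is the current element (e.g. a word
    embedding), h_prev/c_prev are the hidden and cell state from the last step."""
    # All four gate pre-activations computed in one affine transformation
    gates = x_t @ W + h_prev @ U + b
    i, f, o, g = gates.chunk(4, dim=-1)

    i = torch.sigmoid(i)       # input gate: how much of the current element to include
    f = torch.sigmoid(f)       # forget gate: how much of the previous cell state to keep
    o = torch.sigmoid(o)       # output gate: how much of the result to expose
    g = torch.tanh(g)          # candidate values for the cell state

    c_t = f * c_prev + i * g   # cell state conveys intermediate results between steps
    h_t = o * torch.tanh(c_t)  # hidden state: snapshot of this step's result
    return h_t, c_t

# Toy dimensions for illustration only
emb, hidden = 8, 16
W = torch.randn(emb, 4 * hidden)
U = torch.randn(hidden, 4 * hidden)
b = torch.zeros(4 * hidden)
x = torch.randn(1, emb)                            # one "word"
h, c = torch.zeros(1, hidden), torch.zeros(1, hidden)
h, c = lstm_step(x, h, c, W, U, b)                 # repeated for every word in a sentence
```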
The Gated Recurrent Unit (GRU) is another recurrent network model that was developed as a lighter and faster alternative to LSTM networks. In order to reduce complexity and shorten training time, the LSTM’s forget and input gates are combined into a single update gate, while cell and hidden states are merged into a single hidden state per iteration. Even though this design seems to convey less information, research has shown that the GRU can reach state-of-the-art performance in specific fields like Named Entity Recognition (NER).
In my own work on invoice extraction (i.e. Named Entity Recognition with a large number of classes/categories), the GRU could not achieve the same performance as the LSTM unless it was enhanced with residual connections, which were first introduced in ResNet. The cost comparison below therefore uses this ResGRU variant.
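This post does not pin down one canonical ResGRU layout, so the following PyTorch snippet is only a sketch of the general idea: stacking GRU layers and adding each layer’s input back onto its output. The class name ResGRUEncoder and all dimensions are made up for illustration.

```python
import torch
import torch.nn as nn

class ResGRUEncoder(nn.Module):
    """Hypothetical residual GRU stack: each GRU layer gets a skip connection
    from its input, in the spirit of ResNet's residual blocks."""

    def __init__(self, input_size: int, hidden_size: int, num_layers: int = 2):
        super().__init__()
        # Project the input once so the residual addition has matching dimensions
        self.proj = nn.Linear(input_size, hidden_size)
        self.layers = nn.ModuleList(
            nn.GRU(hidden_size, hidden_size, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, input_size), e.g. word embeddings of an invoice
        out = self.proj(x)
        for gru in self.layers:
            residual = out
            out, _ = gru(out)
            out = out + residual   # residual connection around each GRU layer
        return out

# Illustrative usage with toy dimensions
encoder = ResGRUEncoder(input_size=300, hidden_size=256)
tokens = torch.randn(4, 120, 300)   # batch of 4 "documents", 120 tokens each
features = encoder(tokens)          # (4, 120, 256), to be fed into a tagging head
```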
Convolutional neural networks (CNNs)
CNNs have been highly popular in image recognition ever since disruptive results on benchmarks like ImageNet demonstrated their strong performance. Even though images are usually two-dimensional data, a CNN can also be applied to one-dimensional text input with a few changes.
The Temporal Convolutional Network (TCN) is a good example of such an implementation. While standard CNNs can only work with fixed-size inputs and, due to their static convolutional filter size, usually focus on data elements in immediate proximity, the TCN employs techniques like multiple layers of dilated convolutions and padding of input sequences in order to handle varying sequence lengths and to detect dependencies between items (words) that are not next to each other but sit at different places in a sentence.
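To illustrate the dilated-convolution idea (this is a sketch, not the reference TCN implementation), a single causal, dilated 1-D convolution block in PyTorch could look like the following; stacking such blocks with growing dilation factors is what lets the network relate words that sit far apart in a sequence.

```python
import torch
import torch.nn as nn

class DilatedCausalConvBlock(nn.Module):
    """Sketch of one TCN-style block: a 1-D convolution whose dilation widens
    the receptive field, with padding trimmed so no position sees the future."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        # Pad symmetrically, then trim the right side below to make the convolution causal
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=self.pad, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, sequence_length)
        out = self.conv(x)
        out = out[:, :, :-self.pad]   # drop the extra positions on the right -> causal
        return self.relu(out + x)     # residual connection, as used in the TCN paper

# Stacking blocks with dilations 1, 2, 4, 8 grows the context exponentially
tcn = nn.Sequential(*[DilatedCausalConvBlock(64, dilation=2 ** i) for i in range(4)])
words = torch.randn(4, 64, 120)       # batch of 4 sequences, 64 features, 120 tokens
context = tcn(words)                  # same shape, each position now "sees" ~31 neighbours
```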
The researchers behind the TCN have shown that their implementation is able to outperform standard RNN implementations on various NLP benchmarks, especially in sequence modeling, a domain in which RNNs seemed to be the only architectures delivering state-of-the-art performance for years.
Comparing Performance, Cost, and Complexity
The neural network models have been trained on a large dataset of approximately 47,000 invoice documents annotated with 48 different labels at the word level. Testing has been performed on 1,100 additional documents that were not part of training or validation. All experiments were run on Azure servers equipped with an Nvidia Tesla V100 GPU with 16 GB of memory.
The test results show that the three models produce very similar results:
- ResGRU: F1 score of 0.6887 on word level and 0.6799 on entity level
- LSTM: F1 score of 0.6969 on word level and 0.6729 on entity level
- TCN: F1 score of 0.6933 on word level and 0.6858 on entity level
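For context on these two reporting levels: word-level F1 scores every token’s label on its own, while entity-level F1 only counts a field as correct if its complete span was labeled correctly. The sketch below shows how such scores could be computed; the BIO label names and the use of scikit-learn and seqeval are illustrative assumptions, not necessarily the tooling behind the numbers above.

```python
# Assumed tooling: scikit-learn for token-level F1, seqeval for entity-level F1
from sklearn.metrics import f1_score
from seqeval.metrics import f1_score as entity_f1_score

# Toy BIO-tagged example: one invoice "document" with two hypothetical fields
y_true = [["B-INVOICE_NO", "I-INVOICE_NO", "O", "B-TOTAL", "O"]]
y_pred = [["B-INVOICE_NO", "O",            "O", "B-TOTAL", "O"]]

# Word level: every token counts on its own
word_f1 = f1_score(sum(y_true, []), sum(y_pred, []), average="micro")

# Entity level: INVOICE_NO only counts if its whole span is predicted correctly
span_f1 = entity_f1_score(y_true, y_pred)

print(f"word-level F1: {word_f1:.4f}, entity-level F1: {span_f1:.4f}")
```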
However, varying neural network architectures can have significant differences in their computational complexity, which can be measured by the time a model needs to train or the amount of memory it needs. The following chart visualizes the relative cost difference between the three neural network architectures, LSTM, ResGRU and TCN:
Even though the TCN required the shortest time to finish a single epoch, the LSTM was able to reach its optimum in significantly fewer epochs and thus in shorter total time. The memory requirement showed the most significant differences: the TCN needed four times as much GPU memory as the LSTM, which is one reason why the TCN’s training took longer overall, since a high memory cost limits the batch size you can use for training (i.e. how many documents can be sent through the neural net per iteration). The ResGRU required only slightly more time per training epoch, but due to the higher complexity of its residual connections it needed a larger number of epochs to reach its optimum and almost three times as much total training time as the LSTM.
Conclusion
As we can see, the three neural network models achieved very similar performance in terms of F1 scores. However, prediction accuracy is not the only metric that shows how well a model works. When considering the time and memory each model needs to train, there are significant differences between the architectures. In conclusion, when choosing a model for your implementation, it is wise to consider not only prediction accuracy but also the cost of a model, as this impacts both the time to develop a production-ready model and the operational costs to run it.