Designing a Document Segmentation Architecture for Invoice Recognition

Ellen Schellekens
Ixor
Jan 25, 2022 · 4 min read

Choosing the right architecture for your deep learning task is difficult. To address this problem efficiently, it is important to understand why an architecture fails to provide good results. In this article, I’ll demonstrate my process of finding a suitable architecture to perform Named Entity Recognition on invoices.

First Attempt

A good first step is to look at the segmentation architectures that are already implemented in torchvision. These architectures are proven to work in general settings, and there is no need to reinvent the wheel! The implemented models consist of two components: the backbone, a convolutional network that extracts useful features from the input document, followed by a classification head that uses these extracted features to predict the labels.

DeeplabV3 ResNet50

The first architecture I experimented with was a ResNet50 backbone followed by a DeeplabV3 classifier head. The DeeplabV3 head uses Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale context, and additionally encodes image-level features through image pooling to improve performance.

The architecture of the DeeplabV3 classifier [1].

FCN ResNet50

The second architecture I tested was the FCN ResNet50 model. Here, the ResNet backbone is followed by a simple Fully Convolutional Network.

The FCN ResNet50 architecture [2].

Experimental Results

The experimental results of these two models were quite disappointing. But why was that? The ResNet50 backbone is designed to extract high-level features from images for image classification. To do this in a time- and memory-efficient manner, it generates a feature map with much lower spatial dimensions than the original input image. Afterwards, the classifier head brings the final segmentation map back to the input dimensions by interpolating the feature map.

Interpolation is used to go from one dimension size to another by calculating unknown values based on nearby known values. In this case, the feature map is interpolated from a lower dimension to a higher dimension, which causes some blurring. This is no problem for natural image segmentation, since the desired output is often the general region of the different segments. But in our use case of document segmentation, each word resides in exactly one pixel, and our desired output is a label for each individual word, and thus for each individual pixel. The blurring effect of interpolation prevents the prediction from being precise at the level of a single pixel. This is why these models don’t work well for us, and why we want to avoid interpolation. To make sure interpolation is not needed in the classifier head, we need a backbone network whose output dimensions match its input dimensions.
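The blurring effect can be demonstrated in isolation. The toy example below (a sketch, not code from the actual pipeline) places a single "word" in one pixel of a label map, downsamples it 4x the way a backbone would shrink its feature map, and interpolates back up: the sharp one-pixel activation comes back smeared and diluted.

```python
import torch
import torch.nn.functional as F

# A 16x16 "label map" where one word occupies exactly one pixel
x = torch.zeros(1, 1, 16, 16)
x[0, 0, 7, 7] = 1.0

# Simulate a backbone that downsamples 4x (here with average pooling)...
down = F.avg_pool2d(x, kernel_size=4)
# ...and a classifier head that interpolates back to the input size
up = F.interpolate(down, size=(16, 16), mode="bilinear", align_corners=False)

# The single sharp activation is now spread over a whole neighborhood,
# and its peak value is heavily diluted
print((x > 0).sum().item(), (up > 0).sum().item(), up.max().item())
```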

Let’s Try This Again

The previous bad results do not mean that the classifier heads aren’t useful; we just need a backbone that returns a feature map of the same size as the input. For this purpose, UNet-like architectures are the perfect choice. In this architecture, the downsampling performed by max pooling in the encoder is mirrored by upsampling in the decoder, with skip connections carrying fine detail across, so the output segmentation map has the same dimensions as the input image. In principle, UNet can already be used as a standalone method, but adding an extra classifier head on top of it, like DeeplabV3 or LR-ASPP, further improves performance by being able to look at the complete picture. Since UNet is used as the backbone, there is no longer any need for interpolation in the classifier head.
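To illustrate the shape-preserving property, here is a deliberately tiny two-level UNet-style sketch (far smaller than the full UNet of [3], and not the model used in this project): the encoder halves the resolution with max pooling, the decoder restores it with a transposed convolution, and a skip connection concatenates the full-resolution features back in.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    # two 3x3 convs with padding=1, so spatial size is preserved within a block
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """A minimal two-level UNet sketch: output spatial size equals input size."""
    def __init__(self, in_ch=3, feat=16):
        super().__init__()
        self.enc1 = block(in_ch, feat)
        self.enc2 = block(feat, feat * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(feat * 2, feat, kernel_size=2, stride=2)
        self.dec1 = block(feat * 2, feat)  # input is concat of skip + upsampled

    def forward(self, x):
        s1 = self.enc1(x)              # full resolution
        s2 = self.enc2(self.pool(s1))  # half resolution
        u = self.up(s2)                # back to full resolution
        return self.dec1(torch.cat([s1, u], dim=1))

x = torch.randn(1, 3, 64, 64)
y = TinyUNet()(x)
print(y.shape)  # feature map with the same spatial dims as the input
```

A classifier head placed on top of such a backbone can consume full-resolution features directly, with no interpolation step.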

The UNet architecture [3].

Experimental Results

Using this model, the results improved greatly. This shows that understanding your use case, and using that understanding to work out why some architectures don’t provide good results, is a crucial part of designing a deep learning architecture.

References

[1] Chen, L. C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.

[2] Labao, A. B., & Naval, P. C. (2017). Weakly-Labelled Semantic Segmentation of Fish Objects in Underwater Videos Using a Deep Residual Network. 255–265. 10.1007/978-3-319-54430-4_25.

[3] https://towardsdatascience.com/unet-line-by-line-explanation-9b191c76baf5

At IxorThink, the machine learning practice of Ixor, we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company, we can provide stable products from proof-of-concept to deployment. Feel free to contact us for more information.
