NLP Data Augmentation for Document Understanding

Jeremy Seow
SOGEDES tech savvy
Nov 23, 2022

Summary

In this blog post, we try to enhance the performance of a multilingual document understanding model on a semantic entity extraction task by applying data augmentation to the documents. In the end, we were able to increase the F1 scores on the FUNSD and XFUND datasets by translating, back-translating and re-labeling the documents in the datasets. But before we go deeper into the data augmentation method, let us first understand the concept of document understanding.

What is Document Understanding?

The idea of document understanding is to automate data extraction and document classification for different kinds of documents such as letters, receipts, invoices, etc. with the help of artificial intelligence (AI). It is widely used in industry to accelerate and simplify employees' work. You can imagine how much easier life is when you don't need to spend hours reading piles of documents to summarize the data you have to process every day. It sounds really cool to be able to do that, but how accurate can the AI be in extracting usable data for us?

LayoutXLM

To answer that question, a team from Microsoft presented the latest multilingual visually-rich document understanding model, LayoutXLM. The model was pre-trained on millions of documents covering 53 different languages. It has a multi-modal Transformer architecture which takes text, visual and layout information as input and integrates the cross-modal interaction end-to-end in a single framework. The architecture also includes a spatial-aware self-attention mechanism, and text-image alignment and text-image matching are used as pre-training strategies. Thanks to this model structure and these pre-training strategies, it achieved state-of-the-art (SOTA) performance on visually-rich document understanding tasks. The model was evaluated on the XFUND dataset, and the results can be found in the official report.
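Under the hood, LayoutXLM shares the LayoutLMv2 architecture, so it can be loaded through the Hugging Face transformers library. Below is a minimal sketch of loading the pre-trained checkpoint for semantic entity extraction; the label list mirrors the header/question/answer entities of FUNSD and XFUND, and it is only an illustration, not the exact fine-tuning setup used in this project.

```python
# Minimal sketch: loading LayoutXLM for semantic entity extraction with
# Hugging Face transformers. Illustrative only, not the exact training code.
from transformers import LayoutXLMProcessor, LayoutLMv2ForTokenClassification

# The processor prepares the text, bounding boxes and document image together.
processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")

# FUNSD/XFUND-style BIO labels for header, question and answer entities.
labels = ["O", "B-HEADER", "I-HEADER", "B-QUESTION", "I-QUESTION", "B-ANSWER", "I-ANSWER"]
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutxlm-base", num_labels=len(labels)
)
```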

Here are the results on the FUNSD and XFUND datasets from the official report: https://arxiv.org/pdf/2104.08836.pdf

We can get the dataset as shown in the link below:

After seeing how well the model performs on the multilingual dataset, another important question comes to mind: is there any way we can improve the results?

Dataset

It is important to know more about the datasets before we discuss how to improve the results. XFUND is a multilingual form understanding dataset consisting of 199 fully annotated forms for each of seven languages: German, Chinese, Japanese, Spanish, French, Italian, and Portuguese. It was originally prepared to validate the performance of the LayoutXLM model. In this implementation, only the German subset of XFUND was used for data augmentation. An example of an XFUND document is shown below:

Example of XFUND document
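For orientation, each annotated segment in XFUND comes with its text, an entity label (header, question, answer or other), a segment-level bounding box and word-level boxes. A simplified illustration of such an annotation (field names abbreviated from memory, not the exact released schema) looks roughly like this:

```python
# Simplified illustration of an XFUND-style annotation (not the exact schema).
segment = {
    "text": "Name des Antragstellers:",   # full segment text
    "label": "question",                  # header / question / answer / other
    "box": [82, 140, 310, 168],           # sentence-level bbox: x1, y1, x2, y2
    "words": [                            # word-level bboxes
        {"text": "Name", "box": [82, 140, 130, 168]},
        {"text": "des", "box": [136, 140, 168, 168]},
        {"text": "Antragstellers:", "box": [174, 140, 310, 168]},
    ],
}
```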

What about the FUNSD dataset? FUNSD is similar to XFUND: it is also a form understanding dataset consisting of 199 fully annotated forms, but in English, and it has a very different layout structure as it contains handwritten words and scanned forms. The documents are therefore noisier than those in XFUND. An example of a FUNSD document is shown below:

Example of FUNSD document

Now that we know more about the datasets, let's return to the previous question: how can we improve the results?

Data Augmentation

What if we could improve the results by performing data augmentation on those documents? As we already know, data augmentation is a remarkably effective technique for improving the performance of neural networks, not only in computer vision but also in Natural Language Processing (NLP), by generating new and different examples for training. In the field of computer vision, some common techniques are widely used, for example geometric transformations, color distortion, information deletion, etc. What about in the field of NLP? The common techniques used by the community are Easy Data Augmentation (EDA), NLP Albumentation, NLP Aug, translation, back-translation, etc.

We can learn more about those data augmentation methods in the link below:
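As a quick illustration of translation and back-translation, here is a sketch using the open MarianMT models from Hugging Face (Helsinki-NLP/opus-mt-*); this is only one possible way to do it, not necessarily the translation service used in this project.

```python
# Sketch: translation and back-translation with MarianMT (illustrative only).
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

sentences = ["Please fill in your name and address."]
german = translate(sentences, "Helsinki-NLP/opus-mt-en-de")  # translation
back = translate(german, "Helsinki-NLP/opus-mt-de-en")       # back-translation
print(german, back)  # the back-translated sentence is a paraphrase of the original
```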

Since documents consist of both image and text, how exactly can we perform data augmentation on them?

Approach

To create synthetic documents that keep almost the same structural layout and preserve the intent of the original document, we decided to use a text translation technique to maintain the semantic features of the text, and to apply an information transformation technique to modify the original text in the image. In the end, we obtain a new synthetic document with different wording but the same layout.

To align the translated text without overlapping other words, and to write it only into a specific part of the image, the bounding box of the whole sentence (the sentence bbox) is needed: it is used to calculate the area where the text will be replaced, and also to determine whether the original sentence is written horizontally or vertically. If the original sentence is written vertically, it is simply covered with white. Replacing vertical text would be possible, but the datasets contain only a very small amount of vertical text and it barely affects the texture of the synthetic documents, so this feature is ignored for now.
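One simple heuristic for the orientation check (an assumption on our side, not necessarily the exact rule used in the implementation) is to compare the width and height of the sentence bbox:

```python
# Heuristic orientation check: a sentence bbox that is taller than it is wide
# is treated as vertical text and simply masked with white. Illustrative only.
def is_vertical(bbox):
    x1, y1, x2, y2 = bbox
    return (y2 - y1) > (x2 - x1)
```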

Once the sentence bbox and the replacement text are known, the first word of the text is inserted at the start of the sentence bbox, and each following word is placed after the previous word's width plus a small space to separate them, until the end of the sentence bbox is reached. At that point the next word is inserted on a new row. This continues until the last word of the translated text has been placed. In the end, we obtain a new document with new coordinates for each of the new words (a small code sketch follows the figure below).

The approach of creating new synthetic documents
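A minimal sketch of this replacement step, assuming PIL/Pillow, a pre-computed sentence bbox and a placeholder font (the font path and sizing heuristic are illustrative assumptions, not the project's actual values):

```python
# Sketch: cover the original sentence and place the translated words into the
# sentence bbox, wrapping to a new row when the right edge is reached.
from PIL import ImageDraw, ImageFont

def replace_sentence(image, sentence_bbox, translated_text,
                     font_path="DejaVuSans.ttf", space=4):
    x1, y1, x2, y2 = sentence_bbox
    draw = ImageDraw.Draw(image)
    draw.rectangle([x1, y1, x2, y2], fill="white")   # erase the original text

    # Placeholder font sizing: roughly half the sentence bbox height.
    font = ImageFont.truetype(font_path, size=max(10, (y2 - y1) // 2))

    x, y = x1, y1
    new_word_bboxes = []
    for word in translated_text.split():
        left, top, right, bottom = draw.textbbox((x, y), word, font=font)
        w, h = right - left, bottom - top
        if x + w > x2:                # end of the sentence bbox reached
            x, y = x1, y + h + space  # continue on a new row
        draw.text((x, y), word, fill="black", font=font)
        new_word_bboxes.append((x, y, x + w, y + h))
        x += w + space                # small gap before the next word
    return new_word_bboxes            # new coordinates for each new word
```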

After augmenting both the FUNSD and XFUND datasets, the number of documents in the new augmented dataset grew to nine times the original, since each document was translated into four different languages and back-translated into the original language (one original plus four translated and four back-translated versions per document). A graph of the document counts is shown below.

Comparison of the number of documents between the original and augmented datasets

Now that the new dataset is augmented, we come to the most important question: how will it perform with the LayoutXLM model?

Result

The pre-trained model used during training is LayoutXLM. Two different methods are used to fine-tune the pre-trained model, and the difference between them is how the bounding boxes (bboxes) are labelled. The first method uses word bboxes as input: for example, if a noun phrase contains 4 words, there are 4 different bboxes, one per word. The second method uses sentence bboxes as input: the same 4 words all share one bbox, namely the sentence bbox.

Example of word bounding boxes and sentence bounding boxes
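To make the difference concrete, here is a small illustrative sketch (the coordinates and labels are made up) of the two ways of assigning bboxes to the same four words:

```python
# Illustrative comparison of the two labelling methods (made-up coordinates).
words = ["United", "States", "of", "America"]
word_bboxes = [(10, 20, 60, 35), (65, 20, 112, 35), (118, 20, 132, 35), (138, 20, 198, 35)]
labels = ["B-ANSWER", "I-ANSWER", "I-ANSWER", "I-ANSWER"]

def enclosing_bbox(bboxes):
    xs1, ys1, xs2, ys2 = zip(*bboxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))

# Method 1: each word keeps its own bbox.
word_level = list(zip(words, word_bboxes, labels))

# Method 2: every word shares the sentence bbox of the whole segment.
sentence_box = enclosing_bbox(word_bboxes)
sentence_level = [(w, sentence_box, l) for w, l in zip(words, labels)]
```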

After fine-tuning the LayoutXLM model with word bboxes as input, the F1 score on the FUNSD dataset increased from 0.7940 to 0.8209. Unfortunately, the F1 score on the XFUND German dataset did not improve. What surprised us was that, with sentence bboxes as input, the model started to learn better: the F1 score on FUNSD increased from 0.7940 to 0.8676 and the F1 score on XFUND increased from 0.8222 to 0.8671.

Comparison between the state-of-the-art results and the results with the new augmented dataset

When we feed the model with sentence bboxes as input, the results get noticeably better. With this method, the model understands better that, for example, the four words of "United States of America" actually form one noun phrase, and it is less likely to predict the wrong label for an individual word; for instance, the word "States" in "United States of America" might otherwise be predicted as B-Location instead of I-Location.

We think that the sentence bboxes simplify the prediction for the model, as it already knows that the words of a noun phrase are related to each other. This makes the predictions more accurate and leads to better results. Word bboxes, on the other hand, can sometimes confuse the model because the relation between the words is not captured, which leads to false predictions.

Conclusion

This data augmentation technique is simple to implement and works for most documents. It can be a useful tool for pre-processing a dataset before training a document understanding model. It not only helps to reduce the cost of collecting and labelling new data, but also tends to yield better predictions, since the accuracy normally increases.

I hope this article helps you understand more about document understanding and also about data augmentation, not only for images or text but also for documents. Last but not least, this project was part of my thesis, and I would like to thank SOGEDES GmbH for giving me the opportunity and assistance to finish this project in the company!

