Document AI | Inference APP and fine-tuning notebook for Document Understanding at paragraph level

Pierre Guillou
5 min read · Feb 16, 2023

Following the publication of the DocLayNet dataset (IBM Research) and of Document Understanding models on Hugging Face (for example, the LayoutLM series and LiLT), a Document Understanding model at paragraph level has been fine-tuned and published: a LiLT base model with overlapping chunks of 512 tokens that uses the XLM-RoBERTa base model. This model can label all paragraphs on all pages of any document (such as a PDF) in any language with 11 labels (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title). This post presents the fine-tuning method and the Inference APP in Hugging Face Spaces, along with their associated notebooks for testing the model. An inference notebook for putting the model into production is also provided.

Notebooks

To read (Layout XLM base — paragraph level)

To read (Layout XLM base — line level)

To read (LiLT base — paragraph level)

To read (LiLT base — line level)

DocLayNet + Layout models in Open Source: Document AI is (truly) starting!

The recent publication of the DocLayNet dataset (IBM Research) and of Document Understanding models (which detect layout and text) on Hugging Face (LayoutLM, LayoutLMv2, LayoutLMv3, LayoutXLM, LiLT) makes it possible to train such models on PDFs, slides, images with text, etc., with the labels that matter to most users (Header, Footer, Title, Text, Table, Figure, etc.).

Many companies and individuals are waiting for such models. Indeed, being able to automatically and quickly extract labeled text from documents makes it possible to fully exploit them: to search for information, classify documents, and interact with them via different NLP models such as QA, NER or even chatbots (hmm… who said ChatGPT?).

Moreover, in order to encourage AI professionals to train this kind of model, IBM Research has just launched a competition: ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents.

DocLayNet small/base/large and a DocLayNet Image Viewer APP: explore data to better understand it

In this context, and in order to help as many people as possible explore and better understand the DocLayNet dataset, I have already published 2 projects: the DocLayNet small/base/large datasets on the Hugging Face Hub and a DocLayNet Image Viewer APP.

Next step: the code to fine-tune Document Understanding models and get inferences :-)

The next step, of course, was to train the first Document Understanding models (at line and paragraph levels) on one of the DocLayNet datasets (small, base, large) and to publish the fine-tuning and inference code as notebooks, in order to help other AI professionals train even more models, and more efficient ones.

So, I fine-tuned a LiLT base model on the DocLayNet base dataset.

More precisely, I did 2 fine-tunings:

  • (line level) the fine-tuning of a LiLT base model at line level with overlapping chunks of 384 tokens, using the XLM-RoBERTa base model nielsr/lilt-xlm-roberta-base (check this notebook);
  • (paragraph level) the fine-tuning of a LiLT base model at paragraph level with overlapping chunks of 512 tokens, using the XLM-RoBERTa base model nielsr/lilt-xlm-roberta-base (check this notebook).

Thus, these fine-tuned Document Understanding models work on more than 100 languages (see XLM-RoBERTa)!
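
To make the setup concrete, here is a minimal sketch (not the notebooks' exact code) of loading this base checkpoint as a token classifier with the 11 DocLayNet labels. The id2label ordering below is an assumption; the notebooks define the exact mapping.

```python
# Minimal loading sketch, assuming the 11 DocLayNet labels listed above;
# the exact id2label ordering used in the notebooks may differ.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["Caption", "Footnote", "Formula", "List-item", "Page-footer",
          "Page-header", "Picture", "Section-header", "Table", "Text", "Title"]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in enumerate(labels)}

model_id = "nielsr/lilt-xlm-roberta-base"  # base checkpoint used for both fine-tunings
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, num_labels=len(labels), id2label=id2label, label2id=label2id
)
```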

This work was made possible thanks to notebooks published by Hugging Face, and in particular by Niels Rogge and Philipp Schmid.

Many thanks to them and to IBM Research team!

Finetuned Document Understanding model (at paragraph level)

(check the same model but at line level!)

For this second model, the fine-tuning was done at paragraph level (the first model was fine-tuned at line level, and the code is very similar).

The model was fine-tuned with chunks of 512 tokens and an overlap of 128 tokens, in order to process all the text on a page during the fine-tuning process.
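
As an illustration of this overlapping-chunks idea (a sketch, not the notebook's exact code, which also carries each word's bounding box through every chunk), a Hugging Face fast tokenizer can produce the overlapping 512-token windows directly:

```python
# Illustrative sketch: split a long sequence into 512-token chunks that
# overlap by 128 tokens, so every token appears in a full-context window.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nielsr/lilt-xlm-roberta-base")

page_text = "..."  # placeholder: all the text of one page, concatenated
encoding = tokenizer(
    page_text,
    max_length=512,                   # hard limit of the model
    stride=128,                       # overlap between consecutive chunks
    truncation=True,
    return_overflowing_tokens=True,   # emit the extra chunks instead of dropping them
    padding="max_length",
)
print(f"{len(encoding['input_ids'])} overlapping chunks of 512 tokens")
```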

Notebook: Document AI | Fine-tune LiLT on DocLayNet base in any language at paragraph level (chunk of 512 tokens with overlap)

Production inference (at paragraph level) notebook

In order to obtain predicted labels at paragraph level for any document (in PDF format, for example), an inference notebook has also been published.

It uses the open-source OCR engine Tesseract to get bounding boxes and texts. Then, the probabilities of the model predictions are processed to obtain the labels of these bounding boxes. As the model was fine-tuned with chunks of a 512-token limit, the same overlap idea (128 tokens) is used in order to get predictions for all the text on a page.
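
Here is a minimal sketch of the OCR step, under these assumptions: pytesseract as the Tesseract wrapper, a page already rendered to an image ("page.png" is a placeholder), and word boxes normalized to the 0–1000 coordinate range that LiLT expects.

```python
# OCR sketch with pytesseract (assumed wrapper); "page.png" is a placeholder.
import pytesseract
from PIL import Image

image = Image.open("page.png")
width, height = image.size

# Tesseract returns one entry per detected word, with pixel bounding boxes.
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

words, boxes = [], []
for i, word in enumerate(ocr["text"]):
    if word.strip():
        left, top = ocr["left"][i], ocr["top"][i]
        w, h = ocr["width"][i], ocr["height"][i]
        words.append(word)
        # Normalize pixel coordinates to the 0-1000 range used by LiLT.
        boxes.append([
            int(1000 * left / width),
            int(1000 * top / height),
            int(1000 * (left + w) / width),
            int(1000 * (top + h) / height),
        ])

# The model then scores each 512-token chunk; where chunks overlap (128 tokens),
# the per-token prediction probabilities can be combined (e.g. averaged)
# before taking the argmax label for each bounding box.
```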

Notebook: Document AI | Inference at paragraph level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)

Inference APP & notebook

In order to test this model or use it in production, you now have an APP and a notebook, too :-)

The Inference APP is hosted on Hugging Face Spaces and uses the LiLT base model combined with XLM-RoBERTa base, fine-tuned on the DocLayNet base dataset at paragraph level (chunk size of 512 tokens).
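
For programmatic access, a hedged sketch with gradio_client is shown below. Both the Space ID and the endpoint name are assumptions, not confirmed values; check the "Use via API" link on the actual Space page for the real API signature.

```python
# Hypothetical client call: the Space ID and api_name below are assumptions —
# verify them on the Space page before using this snippet.
from gradio_client import Client

client = Client("pierreguillou/Inference-APP-Document-Understanding-at-paragraphlevel-v1")  # assumed Space ID
result = client.predict("document.pdf", api_name="/predict")  # assumed endpoint
print(result)
```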

Notebook: Document AI | Inference APP at paragraph level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)

About the author: Pierre Guillou is an AI consultant in Brazil and France. Get in touch with him through his LinkedIn profile.
