Document AI | Inference APP and fine-tuning notebook for Document Understanding at paragraph level with LayoutXLM base

Pierre Guillou
Mar 31, 2023


Thanks to the publication of the DocLayNet dataset (IBM Research) and of Document Understanding models on Hugging Face (for example, the LayoutLM series and LiLT), a first Document Understanding model at paragraph level had already been published: a LiLT base model finetuned on the DocLayNet base dataset with overlap chunks of 512 tokens. Today, a new Document Understanding model at paragraph level is published: a LayoutXLM base model (Microsoft) finetuned on the same DocLayNet base dataset, also with overlap chunks of 512 tokens. This model can label all paragraphs on all pages of any document (a PDF, for example) in any language with 11 labels (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title). This post presents the fine-tuning method and the Inference APP hosted on Hugging Face Spaces, along with the associated notebooks to test this model. In addition, an inference notebook is provided to put the model into production.
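As a small illustration of the 11 labels listed above, here is a minimal sketch of the label mappings that a token classification model typically needs. The names LABELS, id2label and label2id are illustrative, and the index order (alphabetical, as listed in this post) is an assumption: check the configuration of the published model for the real mapping.

```python
# The 11 DocLayNet labels, in the order given in this post.
# Assumption: ids follow this order; the model config is the source of truth.
LABELS = [
    "Caption", "Footnote", "Formula", "List-item", "Page-footer",
    "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
]

# Mappings between integer class ids and label names.
id2label = dict(enumerate(LABELS))
label2id = {label: idx for idx, label in id2label.items()}
```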

Notebooks

To read (Layout XLM base — paragraph level)

To read (Layout XLM base — line level)

To read (LiLT base — paragraph level)

To read (LiLT base — line level)

DocLayNet + Layout models in Open Source: Document AI is (truly) starting!

The recent publication of the DocLayNet dataset (IBM Research) and of Document Understanding models (which detect both layout and text) on Hugging Face (LayoutLM, LayoutLMv2, LayoutLMv3, LayoutXLM, LiLT) makes it possible to train such models on PDFs, slides, images with text, etc., using labels that matter to most users (Header, Footer, Title, Text, Table, Figure, etc.).

Many companies and individuals are waiting for such models. Indeed, being able to automatically and quickly extract labeled text from their documents makes it possible to fully exploit them: to search for information, classify documents, and interact with them via different NLP models such as QA, NER or even chatbots (hmm… who is talking about ChatGPT here?).

Moreover, in order to encourage AI professionals to train this kind of model, IBM Research has just launched a competition: ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents.

DocLayNet small/base/large and a DocLayNet Image Viewer APP: explore data to better understand it

In this context and in order to help as many people as possible to explore and better understand the DocLayNet dataset, I have already published 5 projects:

  1. the DocLayNet small, base, large datasets to facilitate the use of DocLayNet with annotated text (and not only with bounding boxes) (to read: “Document AI | Processing of DocLayNet dataset to be used by layout models of the Hugging Face hub (finetuning, inference)”);
  2. an APP (DocLayNet image viewer) to visualize the annotated bounding boxes of lines and paragraphs of the documents of the dataset (to read: “Document AI | DocLayNet image viewer APP”);
  3. a LiLT base model finetuned on the dataset DocLayNet base with overlap chunks of 384 tokens at line level that uses the XLM-RoBERTa base model and its inference app and production code.
  4. a LiLT base model finetuned on the dataset DocLayNet base with overlap chunks of 512 tokens at paragraph level that uses the XLM-RoBERTa base model and its inference app and production code.
  5. a LayoutXLM base model finetuned on the dataset DocLayNet base with overlap chunks of 384 tokens at line level and its inference app and production code.

Next step: the code to fine-tune a Document Understanding LayoutXLM base model at paragraph level and get inferences :-)

The next step was of course to train new Document Understanding models (at paragraph level) based on the LayoutXLM base model (Microsoft) with one of the DocLayNet datasets (small, base, large), and to publish the code (fine-tuning and inference notebooks) in order to help other AI professionals train more models, and more efficient ones.

So, I finetuned a LayoutXLM base model at paragraph level on the DocLayNet base dataset.

More precisely, I did the following fine-tuning:

Thus, this finetuned Document Understanding model will work on more than 100 languages!

This work was made possible thanks to notebooks published by Hugging Face and in particular by Niels Rogge and Philipp Schmid:

Many thanks to them and to the IBM Research team! And of course, to Microsoft for its LayoutXLM base model :-)

Production inference (at paragraph level) notebook

In order to obtain predicted labels at paragraph level for any document (in PDF format, for example), an inference notebook has also been published.

It uses the open source OCR engine Tesseract to get bounding boxes and texts. Then, the probabilities of the model predictions are processed to obtain the labels of these bounding boxes. As the model was finetuned on chunks limited to 512 tokens, we use the same idea of overlap (128 tokens) at inference in order to get predictions for all the text on a page.
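To make the overlap idea concrete, here is a minimal sketch in plain Python. It is not the code of the notebook: the function names (chunk_starts, paragraph_labels) and the choice to sum token probabilities per bounding box before taking the best label are assumptions made for illustration. Only the chunk size (512) and the overlap (128) come from this post.

```python
from collections import defaultdict

MAX_LEN = 512  # chunk size in tokens, as stated in the post
OVERLAP = 128  # tokens shared by two consecutive chunks, as stated in the post


def chunk_starts(n_tokens, max_len=MAX_LEN, overlap=OVERLAP):
    """Return the start offsets of overlapping chunks covering n_tokens tokens."""
    if n_tokens <= 0:
        return []
    stride = max_len - overlap  # how far each new chunk advances
    starts = [0]
    while starts[-1] + max_len < n_tokens:
        starts.append(starts[-1] + stride)
    return starts


def paragraph_labels(token_box_ids, chunk_probs, labels,
                     max_len=MAX_LEN, overlap=OVERLAP):
    """Pick one label per bounding box from per-token probabilities.

    token_box_ids: for each token of the page, the index of its bounding box.
    chunk_probs: one list per chunk; each holds one probability vector
                 (one value per label) for every token of that chunk.
    Assumption: probabilities are summed per box, so a token located in an
    overlap zone contributes once per chunk that contains it.
    """
    totals = defaultdict(lambda: [0.0] * len(labels))
    starts = chunk_starts(len(token_box_ids), max_len, overlap)
    for start, probs in zip(starts, chunk_probs):
        for offset, token_probs in enumerate(probs):
            box = token_box_ids[start + offset]
            for k, p in enumerate(token_probs):
                totals[box][k] += p
    # For each box, keep the label with the highest accumulated score.
    return {
        box: labels[max(range(len(labels)), key=scores.__getitem__)]
        for box, scores in totals.items()
    }
```

With these defaults, each new chunk starts 384 tokens after the previous one, so a page of 1000 tokens is covered by three chunks. Summing is only one possible way to merge predictions in the overlap zones; averaging or keeping the most confident prediction would be reasonable alternatives.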

Notebook: Document AI | Inference at paragraph level with a Document Understanding model (LayoutXLM base fine-tuned on DocLayNet dataset)

Inference APP & notebook

In order to test this model or use it in production, you now have both an APP and a notebook :-)

The Inference APP is hosted on Hugging Face Spaces and uses the LayoutXLM base model finetuned on the DocLayNet base dataset at paragraph level (chunk size of 512 tokens).

Notebook: Document AI | Inference APP at paragraph level with a Document Understanding model (LayoutXLM base fine-tuned on DocLayNet base dataset)

About the author: Pierre Guillou is an AI consultant in Brazil and France. Get in touch with him through his LinkedIn profile.
