Document AI | Document Understanding model at line level with LiLT, Tesseract and DocLayNet dataset

Pierre Guillou
5 min read · Feb 10, 2023



The publication of the DocLayNet dataset (IBM Research) and of Document Understanding models on Hugging Face (for example, the LayoutLM series and LiLT) finally makes it possible to train such models on any document containing text (PDFs, slides, images with text) with the labels most users need (Header, Footer, Title, etc.). Many companies are waiting for such models in order to fully exploit their documents or to interact with them via NLP models or even chatbots (hmm... who said ChatGPT?). This post presents the results of training such a Document Understanding model, along with its finetuning and inference code in 2 notebooks. The finetuned model is also freely available on the Hugging Face Hub.

Notebooks

To read (Layout XLM base — paragraph level)

To read (Layout XLM base — line level)

To read (LiLT base — paragraph level)

To read (LiLT base — line level)

DocLayNet + Layout models in Open Source: Document AI is (truly) starting!

The recent publication of the DocLayNet dataset (IBM Research) and of Document Understanding models (which detect layout and text) on Hugging Face (LayoutLM, LayoutLMv2, LayoutLMv3, LayoutXLM, LiLT) makes it possible to train such models on PDFs, slides, images with text, etc., with the labels most users need (Header, Footer, Title, Text, Table, Figure, etc.).

Many companies and individuals are waiting for such models. Indeed, being able to automatically and quickly extract labeled text from their documents makes it possible to fully exploit them: search for information, classify documents, and interact with them via NLP models such as QA, NER, or even chatbots (hmm... who said ChatGPT?).

Moreover, in order to encourage AI professionals to train this kind of model, IBM Research has just launched a competition: ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents.

DocLayNet small/base/large and a DocLayNet Image Viewer APP: explore data to better understand it

In this context, and to help as many people as possible explore and better understand the DocLayNet dataset, I have already published 2 projects: the DocLayNet small/base/large datasets on the Hugging Face Hub and a DocLayNet Image Viewer APP.

Next step: the code to fine-tune a Document Understanding model and get inferences :-)

The next step was of course to train a first Document Understanding model on one of the DocLayNet datasets (small, base, large) and to publish the code (finetuning and inference notebooks) to help other AI professionals train more, and more efficient, models.

So, I finetuned a LiLT base model on the DocLayNet base dataset.

More precisely, I finetuned a LiLT base model at line level, with overlapping chunks of 384 tokens, on top of the XLM-RoBERTa base model: nielsr/lilt-xlm-roberta-base (check this notebook).

Thus, the finetuned Document Understanding model works in more than 100 languages (see XLM-RoBERTa)!

This work was made possible thanks to notebooks published by Hugging Face and in particular by Niels Rogge and Philipp Schmid:

Many thanks to them and to IBM Research team!

Finetuned Document Understanding model (at line level)

As a first model, the finetuning was done at line level (a next model will be finetuned at paragraph level; the code is very similar).

The model was finetuned with chunks limited to 384 tokens, with an overlap of 128 tokens, so that all the text on a page is processed during finetuning.
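As a sketch of this windowing scheme (a hypothetical helper, not the notebook's actual code): a page's token sequence is split into windows of at most 384 tokens, each starting 384 − 128 = 256 tokens after the previous one, so consecutive windows share 128 tokens and every token appears in at least one window.

```python
def chunk_with_overlap(tokens, max_len=384, overlap=128):
    """Split `tokens` into sliding windows of at most `max_len` items.

    Each window starts `max_len - overlap` positions after the previous
    one, so consecutive windows share `overlap` tokens and no token on
    the page is dropped.
    """
    if len(tokens) <= max_len:
        return [tokens]
    stride = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the page
    return chunks

# Example: a page of 1000 tokens yields 4 overlapping windows.
windows = chunk_with_overlap(list(range(1000)))
```

With the Hugging Face tokenizers, the same effect is obtained with `truncation=True`, `max_length=384`, `stride=128` and `return_overflowing_tokens=True`.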

Notebook: Document AI | Fine-tune LiLT on DocLayNet base in any language at line level (chunk of 384 tokens with overlap)

Inference (at line level)

In order to obtain predicted labels at line level for any document (in PDF format, for example), an inference notebook has also been published.

It uses the Open Source OCR engine Tesseract to get bounding boxes and texts. The probabilities of the model predictions are then processed to obtain the labels of these bounding boxes. As the model was finetuned with chunks limited to 384 tokens, the same overlap idea (128 tokens) is used to get predictions for all the text on a page.
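One way to merge the overlapping windows at inference time (a minimal sketch with hypothetical names, not the notebook's exact code): a line that falls inside the 128-token overlap is predicted by several windows, so its class probabilities are averaged across windows before taking the argmax as the line's label.

```python
from collections import defaultdict


def merge_window_predictions(window_preds, num_labels):
    """Aggregate per-line predictions coming from overlapping windows.

    `window_preds` is a list of dicts, one per 384-token window, mapping a
    line id (from the OCR bounding boxes) to the probability vector the
    model predicted for that line inside that window. Lines seen by
    several windows get their probabilities averaged, then argmax-ed.
    """
    sums = defaultdict(lambda: [0.0] * num_labels)
    counts = defaultdict(int)
    for preds in window_preds:
        for line_id, probs in preds.items():
            for i, p in enumerate(probs):
                sums[line_id][i] += p
            counts[line_id] += 1
    labels = {}
    for line_id, total in sums.items():
        avg = [p / counts[line_id] for p in total]
        labels[line_id] = max(range(num_labels), key=lambda i: avg[i])
    return labels


# Two windows overlap on line 1 and disagree; averaging its
# probabilities ([0.4, 0.6] and [0.8, 0.2] -> [0.6, 0.4]) resolves it.
labels = merge_window_predictions(
    [{0: [0.9, 0.1], 1: [0.4, 0.6]},
     {1: [0.8, 0.2], 2: [0.3, 0.7]}],
    num_labels=2,
)
```

Averaging is only one reasonable choice; taking the prediction from the window where the line sits farthest from the chunk boundary would also work.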

Notebook: Document AI | Inference at line level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)

Inference APP

To read about this APP: Document AI | Inference APP for Document Understanding at line level

About the author: Pierre Guillou is an AI consultant in Brazil and France. Get in touch with him through his LinkedIn profile.
