Document AI | Document Understanding model at line level with LiLT, Tesseract and DocLayNet dataset

Pierre Guillou
5 min read · Feb 10, 2023



The publication of the DocLayNet dataset (IBM Research) and of Document Understanding models on Hugging Face (for example, the LayoutLM series and LiLT) finally makes it possible to train such models on any document containing text (PDFs, slides, images with text) with the labels most users need (Header, Footer, Title, etc.). Many companies are waiting for such models in order to fully exploit their documents or to interact with them via NLP models or even chatbots (hmm... who said ChatGPT?). This post presents the results of training such a Document Understanding model, along with its finetuning and inference code in 2 notebooks. The finetuned model is also freely available on the Hugging Face Hub.

Notebooks

To read (Layout XLM base — paragraph level)

To read (Layout XLM base — line level)

To read (LiLT base — paragraph level)

To read (LiLT base — line level)

DocLayNet + Layout models in Open Source: Document AI is (truly) starting!

The recent publication of the DocLayNet dataset (IBM Research) and of Document Understanding models (which detect layout and text) on Hugging Face (LayoutLM, LayoutLMv2, LayoutLMv3, LayoutXLM, LiLT) makes it possible to train such models on PDFs, slides, images with text, etc., with the labels most users need (Header, Footer, Title, Text, Table, Figure, etc.).

Many companies and individuals are waiting for such models. Indeed, being able to automatically and quickly extract labeled text from their documents makes it possible to fully exploit them: search for information, classify documents, and interact with them via NLP models such as QA, NER, or even chatbots (hmm... who said ChatGPT?).

Moreover, in order to encourage AI professionals to train this kind of model, IBM Research has just launched a competition: ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents.

DocLayNet small/base/large and a DocLayNet Image Viewer APP: explore data to better understand it

In this context, and to help as many people as possible explore and better understand the DocLayNet dataset, I have already published 2 projects: the DocLayNet small/base/large datasets on the Hugging Face Hub and a DocLayNet Image Viewer APP.

Next step: the code to fine-tune a Document Understanding model and get inferences :-)

The next step was of course to train a first Document Understanding model on one of the DocLayNet datasets (small, base, large) and to publish the code (finetuning and inference notebooks) to help other AI professionals train more, and more efficient, models.

So, I finetuned a LiLT base model on the DocLayNet base dataset.

More precisely, I finetuned a LiLT base model at line level, with overlapping chunks of 384 tokens, on top of the XLM-RoBERTa base model: nielsr/lilt-xlm-roberta-base (check this notebook).

Thus, the finetuned Document Understanding model works in more than 100 languages (see XLM-RoBERTa)!

This work was made possible thanks to notebooks published by Hugging Face and in particular by Niels Rogge and Philipp Schmid:

Many thanks to them and to IBM Research team!

Finetuned Document Understanding model (at line level)

As a first model, the finetuning was done at line level (a next model will be finetuned at paragraph level; the code is very similar).

The model was finetuned with chunks limited to 384 tokens, with an overlap of 128 tokens, so that all the text on a page is processed during finetuning.
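As a sketch of this windowing scheme (a hypothetical helper, not the notebook's actual code): a page's token sequence is split into windows of at most 384 tokens, each starting 384 − 128 = 256 tokens after the previous one, so consecutive windows share 128 tokens and every token appears in at least one window.

```python
def chunk_with_overlap(tokens, max_len=384, overlap=128):
    """Split `tokens` into sliding windows of at most `max_len` items.

    Each window starts `max_len - overlap` positions after the previous
    one, so consecutive windows share `overlap` tokens and no token on
    the page is dropped.
    """
    if len(tokens) <= max_len:
        return [tokens]
    stride = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the page
    return chunks

# Example: a page of 1000 tokens yields 4 overlapping windows.
windows = chunk_with_overlap(list(range(1000)))
```

With the Hugging Face tokenizers, the same effect is obtained with `truncation=True`, `max_length=384`, `stride=128` and `return_overflowing_tokens=True`.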

Notebook: Document AI | Fine-tune LiLT on DocLayNet base in any language at line level (chunk of 384 tokens with overlap)

Inference (at line level)

In order to obtain predicted labels at line level for any document (in PDF format, for example), an inference notebook has also been published.

It uses the Open Source OCR engine Tesseract to get bounding boxes and texts. The probabilities of the model predictions are then processed to obtain the labels of these bounding boxes. As the model was finetuned with chunks limited to 384 tokens, the same overlap idea (128 tokens) is used to get predictions for all the text on a page.
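One way to merge the overlapping windows at inference time (a minimal sketch with hypothetical names, not the notebook's exact code): a line that falls inside the 128-token overlap is predicted by several windows, so its class probabilities are averaged across windows before taking the argmax as the line's label.

```python
from collections import defaultdict


def merge_window_predictions(window_preds, num_labels):
    """Aggregate per-line predictions coming from overlapping windows.

    `window_preds` is a list of dicts, one per 384-token window, mapping a
    line id (from the OCR bounding boxes) to the probability vector the
    model predicted for that line inside that window. Lines seen by
    several windows get their probabilities averaged, then argmax-ed.
    """
    sums = defaultdict(lambda: [0.0] * num_labels)
    counts = defaultdict(int)
    for preds in window_preds:
        for line_id, probs in preds.items():
            for i, p in enumerate(probs):
                sums[line_id][i] += p
            counts[line_id] += 1
    labels = {}
    for line_id, total in sums.items():
        avg = [p / counts[line_id] for p in total]
        labels[line_id] = max(range(num_labels), key=lambda i: avg[i])
    return labels


# Two windows overlap on line 1 and disagree; averaging its
# probabilities ([0.4, 0.6] and [0.8, 0.2] -> [0.6, 0.4]) resolves it.
labels = merge_window_predictions(
    [{0: [0.9, 0.1], 1: [0.4, 0.6]},
     {1: [0.8, 0.2], 2: [0.3, 0.7]}],
    num_labels=2,
)
```

Averaging is only one reasonable choice; taking the prediction from the window where the line sits farthest from the chunk boundary would also work.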

Notebook: Document AI | Inference at line level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)

Inference APP

To read about this APP: Document AI | Inference APP for Document Understanding at line level

About the author: Pierre Guillou is an AI consultant in Brazil and France. Get in touch with him through his LinkedIn profile.
