Document AI | APP to compare the Document Understanding LiLT and LayoutXLM (base) models at line level

Pierre Guillou
5 min read · Mar 25, 2023



Following the publication of the DocLayNet dataset (IBM Research) and of Document Understanding models on Hugging Face (for example, the LayoutLM series and LiLT), 2 Document Understanding models at line level have already been published: a LiLT base model and a LayoutXLM base (Microsoft) model, both finetuned on the DocLayNet base dataset with overlap chunks of 384 tokens and using the XLM-RoBERTa base tokenizer. These models can label all lines on all pages of any document (like a PDF) in any language with 11 labels (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title). Today, we put online an APP and published its notebook to compare the output of these 2 models.
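For readers who want to use these checkpoints directly, here is a minimal loading sketch with the transformers library. The two model ids below are illustrative assumptions, not guaranteed to be the exact names; check the Hub pages linked in the articles below.

```python
# Minimal sketch: load the two line-level checkpoints from the Hugging Face Hub.
# Both model ids are illustrative assumptions; check the Hub for the exact names.
from transformers import AutoTokenizer, AutoModelForTokenClassification

LILT_ID = "pierreguillou/lilt-xlm-roberta-base-finetuned-with-DocLayNet-base-at-linelevel-ml384"
LAYOUTXLM_ID = "pierreguillou/layout-xlm-base-finetuned-with-DocLayNet-base-at-linelevel-ml384"

tokenizer = AutoTokenizer.from_pretrained(LILT_ID)  # XLM-RoBERTa base tokenizer
lilt = AutoModelForTokenClassification.from_pretrained(LILT_ID)
layoutxlm = AutoModelForTokenClassification.from_pretrained(LAYOUTXLM_ID)

# Both heads predict the same 11 DocLayNet labels:
print(lilt.config.id2label)
```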

Notebook: Document AI | Inference APP at line level with 2 Document Understanding models (LiLT base and LayoutXLM base fine-tuned on DocLayNet base dataset)

To read (Layout XLM base — paragraph level)

To read (Layout XLM base — line level)

To read (LiLT base — paragraph level)

To read (LiLT base — line level)

DocLayNet + Layout models in Open Source: Document AI is (truly) starting!

The recent publication of the DocLayNet dataset (IBM Research) and of Document Understanding models (which detect layout and text) on Hugging Face (LayoutLM, LayoutLMv2, LayoutLMv3, LayoutXLM, LiLT) makes it possible to train such models on PDFs, slides, images with text, etc., with the labels that matter to most users (Header, Footer, Title, Text, Table, Figure, etc.).

The pages in DocLayNet can be grouped into six distinct categories, namely Financial Reports, Manuals, Scientific Articles, Laws & Regulations, Patents and Government Tenders.

Many companies and individuals are waiting for such models. Indeed, being able to automatically and quickly extract labeled text from their documents makes it possible to fully exploit them: searching for information, classifying documents, and interacting with them via NLP models such as QA, NER, or even chatbots (hmm… who is talking about ChatGPT here?).

Moreover, in order to encourage AI professionals to train this kind of model, IBM Research has just launched a competition: ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents.

DocLayNet small/base/large and a DocLayNet Image Viewer APP: explore data to better understand it

In this context and in order to help as many people as possible to explore and better understand the DocLayNet dataset, I have already published 5 projects:

  1. the DocLayNet small, base and large datasets, to facilitate the use of DocLayNet with annotated text and not only with bounding boxes (to read: “Document AI | Processing of DocLayNet dataset to be used by layout models of the Hugging Face hub (finetuning, inference)”); a loading sketch follows this list;
  2. an APP (DocLayNet image viewer) to visualize the annotated bounding boxes of lines and paragraphs of the documents of the dataset (to read: “Document AI | DocLayNet image viewer APP”);
  3. a LiLT base model finetuned on the DocLayNet base dataset with overlap chunks of 384 tokens at line level, which uses the XLM-RoBERTa base model, plus its inference APP and production code;
  4. a LiLT base model finetuned on the DocLayNet base dataset with overlap chunks of 512 tokens at paragraph level, which uses the XLM-RoBERTa base model, plus its inference APP and production code;
  5. a LayoutXLM base model finetuned on the DocLayNet base dataset with overlap chunks of 384 tokens at line level, which uses the XLM-RoBERTa base tokenizer, plus its inference APP and production code.
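Here is the loading sketch promised in item 1, using the Hugging Face datasets library. The Hub ids below are assumptions based on the naming used in these posts; check the Hub for the exact names.

```python
# Minimal sketch, assuming the processed datasets are published on the Hub
# under these ids (check the Hub for the exact names).
from datasets import load_dataset

doclaynet = load_dataset("pierreguillou/DocLayNet-base")  # also: DocLayNet-small, DocLayNet-large
print(doclaynet)                     # train / validation / test splits
print(doclaynet["train"][0].keys())  # annotated texts, bounding boxes, labels, page image, ...
```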

Let’s compare our 2 models (LiLT vs LayoutXLM)

APP

In order to compare these 2 models, there is an APP :-)

Notebook with Gradio APP

Here is the APP notebook :-)

This notebook runs a Gradio App that processes the first page of any uploaded PDF. Like our other Document Understanding APPs, it displays not only the line-labeled image of the first page for each of the 2 models but also the DataFrame of labeled texts.

Gradio App that processes the first page of any uploaded PDF and displays not only the line labelled image of the first page for each of the 2 models but also the DataFrame of labelled texts
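For readers curious about how such an APP is wired, here is a minimal Gradio sketch of its structure. The component layout and function body are assumptions for illustration, not the notebook's actual code; only the first-page rendering step (via pdf2image) is shown concretely, while OCR and model inference are elided.

```python
# Minimal sketch of the APP's structure (assumed layout, not the notebook's exact code).
import gradio as gr
import pandas as pd
from pdf2image import convert_from_path

def compare_models(pdf_file):
    # Gradio 3 passes a file object, newer versions a path string.
    pdf_path = getattr(pdf_file, "name", pdf_file)
    # Step 1: render the first page of the uploaded PDF as a PIL image.
    page = convert_from_path(pdf_path, first_page=1, last_page=1)[0]
    # Steps 2-3 (elided here): run OCR to get lines + bounding boxes, then
    # LiLT and LayoutXLM inference; draw predicted labels and fill the tables.
    placeholder = pd.DataFrame(columns=["text", "label"])
    return page, page, placeholder, placeholder

demo = gr.Interface(
    fn=compare_models,
    inputs=gr.File(label="PDF"),
    outputs=[
        gr.Image(label="LiLT (first page)"),
        gr.Image(label="LayoutXLM (first page)"),
        gr.Dataframe(label="LiLT labeled texts"),
        gr.Dataframe(label="LayoutXLM labeled texts"),
    ],
)
demo.launch()
```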

This notebook can be run on Google Colab. It is hosted on GitHub.

Notebook: Gradio_inference_on_LiLT_&_LayoutXLM_base_model_finetuned_on_DocLayNet_base_in_any_language_at_levellines_ml384.ipynb

Example

Let’s look at a report from the European Commission. Our Gradio app renders the first page of this PDF.

First page of a PDF processed by our Document Understanding LiLT base model (left) and LayoutXLM base model (right)

We can see from the line-labeled images that there are differences: our Document Understanding LayoutXLM base model seems to work better:

  • it labeled all Title and Footer texts well;
  • it does a better job of labeling bullet texts with Table instead of Section-header.

However, both models failed to label the word BRIEFING as Header.

About the author: Pierre Guillou is an AI consultant in Brazil and France. Get in touch with him through his LinkedIn profile.
