Document AI | APP to compare the Document Understanding LiLT and LayoutXLM (base) models at paragraph level

Pierre Guillou
5 min read · Apr 1, 2023



Following the publication of the DocLayNet dataset (IBM Research) and of Document Understanding models on Hugging Face (for example, the LayoutLM series and LiLT), 2 Document Understanding models at paragraph level have already been published: a LiLT base model and a LayoutXLM base (Microsoft) model, both finetuned on the DocLayNet base dataset with overlapping chunks of 512 tokens using the XLM-RoBERTa base tokenizer. These models can label all paragraphs on all pages of any document (like a PDF) in any language with 11 labels (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title). Today, we published an online APP and its notebook to compare the outputs of these 2 models.
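For readers who want to try the checkpoints directly, here is a minimal sketch of loading one of them from the Hugging Face Hub and building the overlapping 512-token chunks. The repository ID and the stride value are assumptions for illustration (check the Hub for the exact names), not taken from this article.

```python
# A minimal sketch, assuming a recent transformers version and a Hub
# repository ID in the naming style of this series of posts (assumed ID).
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "pierreguillou/lilt-xlm-roberta-base-finetuned-with-DocLayNet-base-at-paragraphlevel-ml512"  # assumed

tokenizer = AutoTokenizer.from_pretrained(model_id)  # XLM-RoBERTa base tokenizer
model = AutoModelForTokenClassification.from_pretrained(model_id)

print(model.config.id2label)  # the 11 DocLayNet paragraph labels

# Long pages are split into overlapping chunks of up to 512 tokens;
# `stride` sets the overlap between consecutive chunks (value assumed here).
page_words = ["Example", "words", "extracted", "from", "a", "PDF", "page"]
encoding = tokenizer(
    page_words,
    is_split_into_words=True,
    truncation=True,
    max_length=512,
    stride=128,
    return_overflowing_tokens=True,
)
print(len(encoding["input_ids"]))  # number of 512-token chunks for this page
```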

Notebook: Document AI | Inference APP at paragraph level with 2 Document Understanding models (LiLT base and LayoutXLM base fine-tuned on DocLayNet base dataset)

To read (LayoutXLM base — paragraph level)

To read (LayoutXLM base — line level)

To read (LiLT base — paragraph level)

To read (LiLT base — line level)

DocLayNet + Layout models in Open Source: Document AI is (truly) starting!

The recent publication of the DocLayNet dataset (IBM Research) and of Document Understanding models (which detect layout and text) on Hugging Face (LayoutLM, LayoutLMv2, LayoutLMv3, LayoutXLM, LiLT) allows the training of such models on PDFs, slides, images with text, etc., with the labels that interest the greatest number of users (Header, Footer, Title, Text, Table, Figure, etc.).

The pages in DocLayNet can be grouped into six distinct categories, namely Financial Reports, Manuals, Scientific Articles, Laws & Regulations, Patents and Government Tenders.

Many companies and individuals are waiting for such models. Indeed, being able to automatically and quickly extract labeled text from documents makes it possible to fully exploit them: searching for information, classifying documents, and interacting with them via different NLP models such as QA, NER or even chatbots (hmm… who is talking about ChatGPT here?).

Moreover, in order to encourage AI professionals to train this kind of model, IBM Research has just launched a competition: ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents.

DocLayNet small/base/large and a DocLayNet Image Viewer APP: explore data to better understand it

In this context and in order to help as many people as possible to explore and better understand the DocLayNet dataset, I have already published 6 projects:

  1. the DocLayNet small, base, large datasets, to facilitate the use of DocLayNet with annotated text (and not only with bounding boxes) (to read: “Document AI | Processing of DocLayNet dataset to be used by layout models of the Hugging Face hub (finetuning, inference)”; a loading sketch follows this list);
  2. an APP (DocLayNet image viewer) to visualize the annotated bounding boxes of lines and paragraphs of the documents of the dataset (to read: “Document AI | DocLayNet image viewer APP”);
  3. a LiLT base model finetuned on the DocLayNet base dataset with overlapping chunks of 384 tokens at line level, using the XLM-RoBERTa base model, with its inference APP and production code;
  4. a LiLT base model finetuned on the DocLayNet base dataset with overlapping chunks of 512 tokens at paragraph level, using the XLM-RoBERTa base model, with its inference APP and production code;
  5. a LayoutXLM base model finetuned on the DocLayNet base dataset with overlapping chunks of 384 tokens at line level, using the XLM-RoBERTa base tokenizer, with its inference APP and production code;
  6. a LayoutXLM base model finetuned on the DocLayNet base dataset with overlapping chunks of 512 tokens at paragraph level, with its inference APP and production code.
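As a starting point for item 1, here is a minimal sketch of loading one of these datasets with the Hugging Face datasets library. The repository ID follows the naming of this series and should be verified on the Hub, and the listed keys are indicative only.

```python
# A minimal sketch, assuming the dataset is published under this Hub ID.
from datasets import load_dataset

dataset = load_dataset("pierreguillou/DocLayNet-base")  # assumed ID

example = dataset["train"][0]
print(example.keys())  # e.g. texts, bounding boxes, categories, page image
```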

Let’s compare our 2 models (LiLT vs LayoutXLM)

APP

In order to compare these 2 models, there is now an APP :-)

Notebook with Gradio APP

Here is the APP notebook :-)

This notebook runs a Gradio APP that processes the first page of any uploaded PDF. Like our other Document Understanding APPs, it displays not only the paragraph-labeled image of the first page for each of the 2 models but also the DataFrame of labeled texts.

Gradio App that processes the first page of any uploaded PDF and displays not only the paragraph labelled image of the first page for each of the 2 models but also the DataFrame of labelled texts
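The shape of the APP's interface can be sketched as follows, assuming a recent Gradio version; predict_lilt and predict_layoutxlm are hypothetical placeholders for the notebook's real inference code, and the component labels are illustrative.

```python
# A minimal Gradio sketch of the comparison interface (not the notebook's
# actual code). predict_lilt and predict_layoutxlm are hypothetical
# stand-ins for the real per-model inference on the PDF's first page.
import gradio as gr

def predict_lilt(pdf_path):
    # placeholder: should return (annotated first-page image, DataFrame of labeled texts)
    raise NotImplementedError

def predict_layoutxlm(pdf_path):
    # placeholder: should return (annotated first-page image, DataFrame of labeled texts)
    raise NotImplementedError

def compare_models(pdf_path):
    image_lilt, df_lilt = predict_lilt(pdf_path)
    image_xlm, df_xlm = predict_layoutxlm(pdf_path)
    return image_lilt, df_lilt, image_xlm, df_xlm

demo = gr.Interface(
    fn=compare_models,
    inputs=gr.File(label="Upload a PDF", type="filepath"),
    outputs=[
        gr.Image(label="LiLT base (paragraph level)"),
        gr.Dataframe(label="LiLT labeled texts"),
        gr.Image(label="LayoutXLM base (paragraph level)"),
        gr.Dataframe(label="LayoutXLM labeled texts"),
    ],
)

demo.launch()
```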

This notebook can be run on Google Colab. It is hosted on GitHub.

Notebook: Gradio_inference_on_LiLT_&_LayoutXLM_base_model_finetuned_on_DocLayNet_base_in_any_language_at_levelparagraphs_ml512.ipynb

Example

Let’s look at a report from the European Commission.

Page 1

Our Gradio app renders the first page of this PDF.

First page of a PDF processed by our Document Understanding LiLT base model (left) and LayoutXLM base model (right) at paragraph level

We can see from the paragraph-labeled images that there are differences: our Document Understanding LiLT base model seems to work better:

  • it labeled the Page header text well;
  • it does a better job of labeling text blocks.

However, the 2 models failed to label the title of the page.

Page 2

Second page of a PDF processed by our Document Understanding LiLT base model (left) and LayoutXLM base model (right) at paragraph level

This time, we can see from the paragraph-labeled images that there are again differences, but it is our Document Understanding LayoutXLM base model that seems to work better:

  • it detects the Sub-Header very well.

About the author: Pierre Guillou is an AI consultant in Brazil and France. Get in touch with him through his LinkedIn profile.
