Document AI | Inference APP for Document Understanding at line level
Thanks to the publication of the DocLayNet dataset (IBM Research) and of Document Understanding models on Hugging Face (for example, the LayoutLM series and LiLT), a Document Understanding model has been fine-tuned and published: a LiLT base model at line level, with overlapping chunks of 384 tokens, that uses the XLM-RoBERTa base model. This model can label all lines on all pages of any document (like a PDF) in any language with 11 labels (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title). This post presents the Inference APP and its associated notebook to test this model, and an inference notebook to put it into production.
Notebooks
- (APP) Document AI | Inference APP at line level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)
- (production) Document AI | Inference at line level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)
To read (LayoutXLM base — paragraph level)
- (04/01/2023) Document AI | APP to compare the Document Understanding LiLT and LayoutXLM (base) models at paragraph level
- (03/31/2023) Document AI | Inference APP and fine-tuning notebook for Document Understanding at paragraph level with LayoutXLM base
- (03/25/2023) Document AI | APP to compare the Document Understanding LiLT and LayoutXLM (base) models at line level
To read (LayoutXLM base — line level)
- (03/05/2023) Document AI | Inference APP and fine-tuning notebook for Document Understanding at line level with LayoutXLM base
To read (LiLT base — paragraph level)
- (02/16/2023) Document AI | Inference APP and fine-tuning notebook for Document Understanding at paragraph level
To read (LiLT base — line level)
- (02/14/2023) Document AI | Inference APP for Document Understanding at line level
- (02/10/2023) Document AI | Document Understanding model at line level with LiLT, Tesseract and DocLayNet dataset
- (01/31/2023) Document AI | DocLayNet image viewer APP
- (01/27/2023) Document AI | Processing of DocLayNet dataset to be used by layout models of the Hugging Face hub (finetuning, inference)
Document Understanding model at line level
I recently finetuned a LiLT base model on the DocLayNet base dataset.
More precisely, I fine-tuned a LiLT base model at line level, with overlapping chunks of 384 tokens, that uses the XLM-RoBERTa base model: nielsr/lilt-xlm-roberta-base (check this notebook).
Thus, this fine-tuned Document Understanding model works in more than 100 languages (see XLM-RoBERTa)!
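To make the idea of "overlap chunks of 384 tokens" concrete, here is a minimal sketch in plain Python. It is not the code of the fine-tuning notebook: the function name chunk_with_overlap is hypothetical, the overlap of 128 tokens is an assumed value (this post only states the chunk size), and special tokens added by a real tokenizer are ignored.

```python
from typing import List


def chunk_with_overlap(
    token_ids: List[int], chunk_size: int = 384, overlap: int = 128
) -> List[List[int]]:
    """Split a token sequence into windows of at most `chunk_size` tokens.

    Consecutive windows share `overlap` tokens, so a line cut at the end of
    one window is seen again, with more context, in the next window.
    overlap=128 is an assumption made for this illustration.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not token_ids:
        return []

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this window already reaches the end of the page
        start += step
    return chunks


# A fake page of 1000 token ids, just to show the window sizes.
page_tokens = list(range(1000))
chunks = chunk_with_overlap(page_tokens)
print([len(c) for c in chunks])  # prints [384, 384, 384, 232]
```

With a Hugging Face fast tokenizer, a similar windowing can usually be obtained by passing truncation=True, max_length, stride and return_overflowing_tokens=True, so a hand-written helper like this one is mostly useful to understand what happens.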
This work was made possible thanks to notebooks published by Hugging Face and in particular by Niels Rogge and Philipp Schmid:
- LayoutLM, LayoutLMv2, LayoutLMv3, LayoutXLM, LiLT notebooks (Niels Rogge)
- Document AI: LiLT a better language agnostic LayoutLM model (Philipp Schmid)
Many thanks to them and to the IBM Research team!
Inference APP & notebooks
In order to test this model or use it in production, you now have an APP and 2 notebooks :-)
Inference APP
The Inference APP is hosted on Hugging Face Spaces and uses the LiLT base model combined with XLM-RoBERTa base, fine-tuned on the DocLayNet base dataset at line level (chunk size of 384 tokens).
LiLT (Language-Independent Layout Transformer) is a Document Understanding model that uses both layout and text in order to detect labels of bounding boxes.
Combined with the XLM-RoBERTa base model, this fine-tuned model can handle more than 100 languages. Fine-tuned on the DocLayNet base dataset, it can classify any bounding box (and its OCR text) into one of 11 labels (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title).
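As an illustration of this classification step, the sketch below maps the scores of one bounding box to one of the 11 labels. The index order of the labels is an assumption (the order in which they are listed in this post), and predict_label is a hypothetical helper; in practice, the mapping stored in the published model's configuration is the one to use.

```python
from typing import List

LABELS = [
    "Caption", "Footnote", "Formula", "List-item", "Page-footer",
    "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
]

# Assumed index order: the real mapping comes from the model configuration.
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def predict_label(logits: List[float]) -> str:
    """Return the label whose score is the highest (argmax over 11 classes)."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} scores, got {len(logits)}")
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return ID2LABEL[best_index]


# Made-up scores for one bounding box: index 9 has the highest value.
example_logits = [0.1, 0.0, -1.2, 0.3, -0.5, 0.2, -2.0, 1.1, 0.4, 3.7, 0.9]
print(predict_label(example_logits))  # prints Text (under the assumed order)
```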
It relies on an external OCR engine to get words and bounding boxes from the document image. Thus, this APP first runs an OCR engine (PyTesseract) to get the bounding boxes, then runs LiLT (already fine-tuned on the DocLayNet base dataset at line level) on the individual tokens, and finally visualizes the result at line level!
It returns the pages of any PDF (in any language) with bounding boxes labeled at line level, along with the associated dataframes of labeled data (bounding boxes, texts, labels) :-)
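The OCR step can be sketched as follows. This is not the code of the APP: group_words_into_lines and normalize_box are hypothetical helpers, the input is a hand-written dictionary that mimics the column layout returned by pytesseract.image_to_data with a dictionary output, and the scaling of boxes to a 0-1000 range follows the usual convention of layout models on Hugging Face, which I assume applies here too.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def normalize_box(box: Box, width: int, height: int) -> Box:
    """Scale a pixel box to the 0-1000 range (assumed model convention)."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    )


def group_words_into_lines(ocr: Dict[str, list]) -> List[dict]:
    """Merge OCR words that share (block, paragraph, line) numbers."""
    lines: Dict[Tuple[int, int, int], dict] = {}
    for i, text in enumerate(ocr["text"]):
        if not text.strip():
            continue  # skip the rows that do not carry a word
        key = (ocr["block_num"][i], ocr["par_num"][i], ocr["line_num"][i])
        x0, y0 = ocr["left"][i], ocr["top"][i]
        x1, y1 = x0 + ocr["width"][i], y0 + ocr["height"][i]
        line = lines.setdefault(key, {"words": [], "box": (x0, y0, x1, y1)})
        line["words"].append(text)
        bx0, by0, bx1, by1 = line["box"]
        # The line box is the union of the boxes of its words.
        line["box"] = (min(bx0, x0), min(by0, y0), max(bx1, x1), max(by1, y1))
    return [{"text": " ".join(l["words"]), "box": l["box"]} for l in lines.values()]


# Hand-written stand-in for an OCR result (values are made up).
ocr = {
    "text": ["", "Document", "AI", "Page", "1"],
    "block_num": [1, 1, 1, 2, 2],
    "par_num": [1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 1, 1],
    "left": [0, 100, 300, 100, 180],
    "top": [0, 50, 52, 900, 900],
    "width": [0, 180, 40, 60, 20],
    "height": [0, 30, 28, 20, 20],
}
lines = group_words_into_lines(ocr)
print(lines[0])  # prints {'text': 'Document AI', 'box': (100, 50, 340, 80)}
print(normalize_box(lines[0]["box"], width=500, height=1000))  # prints (200, 50, 680, 80)
```

On a real page image, the dictionary would come from something like pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT) instead of being typed by hand.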
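Since the model predicts on individual tokens while the result is shown per line, the token predictions have to be merged somehow. The sketch below uses a simple majority vote per line, which is an assumption made for illustration and not necessarily what the APP does; label_lines and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def label_lines(
    lines: List[dict], token_labels: List[str], token_line_ids: List[int]
) -> List[dict]:
    """Give each line the label most often predicted for its tokens.

    token_line_ids[i] is the index (in `lines`) of the line that token i
    belongs to. Majority voting is an assumed aggregation rule.
    """
    votes: Dict[int, Counter] = {}
    for label, line_id in zip(token_labels, token_line_ids):
        votes.setdefault(line_id, Counter())[label] += 1

    rows = []
    for line_id, line in enumerate(lines):
        counts = votes.get(line_id)
        label = counts.most_common(1)[0][0] if counts else None
        rows.append({"box": line["box"], "text": line["text"], "label": label})
    return rows


# Two made-up lines and five made-up token predictions.
sample_lines = [
    {"text": "Document AI", "box": (200, 50, 680, 80)},
    {"text": "Page 1", "box": (200, 900, 400, 920)},
]
sample_token_labels = ["Title", "Title", "Text", "Page-footer", "Page-footer"]
sample_token_line_ids = [0, 0, 0, 1, 1]

rows = label_lines(sample_lines, sample_token_labels, sample_token_line_ids)
for row in rows:
    print(row["label"], "|", row["text"])
```

A list of dictionaries like rows can be passed directly to pandas.DataFrame to obtain a table with the bounding boxes, texts and labels.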
However, the inference time per page can be high when running the model on CPU due to the number of line predictions to be made. Therefore, to avoid running this APP for too long, only the first 2 pages are processed by this APP.
If you want to increase this limit, you can either clone this APP in Hugging Face Spaces (or run its notebook on your own platform) and change the value of the parameter max_imgboxes, or run the inference notebook "Document AI | Inference at line level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)" on your own platform, as it does not have this limit.
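The effect of such a limit can be pictured with the sketch below. Only the parameter name max_imgboxes and the limit of 2 pages come from this post; the helper pages_to_process and the way the limit is applied are assumptions, not the actual code of the APP.

```python
from typing import List, Optional, Sequence


def pages_to_process(pages: Sequence, max_imgboxes: Optional[int] = 2) -> List:
    """Keep only the first `max_imgboxes` pages; None means no limit."""
    if max_imgboxes is None:
        return list(pages)
    if max_imgboxes < 0:
        raise ValueError("max_imgboxes must be >= 0 or None")
    return list(pages)[:max_imgboxes]


# Page images are replaced by simple names in this illustration.
pdf_pages = ["page_1", "page_2", "page_3", "page_4"]
print(pages_to_process(pdf_pages))                  # prints ['page_1', 'page_2']
print(pages_to_process(pdf_pages, max_imgboxes=None))  # prints all 4 pages
```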
Inference notebooks
As already mentioned, 2 inference notebooks are available online: one is the APP notebook, which lets you run the Gradio APP on any platform such as Google Colab, and the other lets you use the model in production with no limit on the number of pages labeled by the model.
Notebooks:
- (APP) Document AI | Inference APP at line level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)
- (production) Document AI | Inference at line level with a Document Understanding model (LiLT fine-tuned on DocLayNet dataset)
About the author: Pierre Guillou is an AI consultant in Brazil and France. Get in touch with him through his LinkedIn profile.