Document AI | Processing of DocLayNet dataset to be used by layout models of the Hugging Face hub (finetuning, inference)

Pierre Guillou
7 min read · Jan 27, 2023


This diagram illustrates all of the key document processing steps that are supported by Document AI and how they can connect to each other (credit and image source: Google Cloud)

Document AI is still a new area of NLP, but it affects all businesses and individuals. It consists of using AI models to visually and textually understand the content of documents such as PDFs. It then becomes possible to categorize the text, images and tables of documents (header, main text, footer, etc.), to extract this data, to carry out targeted searches, to classify documents, etc. To train these AI models, it is necessary to have annotated documents (bounding boxes, categories, extracted text, etc.). IBM Research has greatly helped research in Document AI through the online publication of its DocLayNet dataset. In order to make this dataset usable by as many people as possible, and in particular with the deep learning layout models and notebooks published by Hugging Face, this post describes how to use its different versions.

DocLayNet datasets

Notebook: Processing of DocLayNet dataset to be used by layout models of the Hugging Face hub (finetuning, inference)

APP: DocLayNet image viewer

To read (Layout XLM base — paragraph level)

To read (Layout XLM base — line level)

To read (LiLT base — paragraph level)

To read (LiLT base — line level)

DocLayNet dataset

The DocLayNet dataset (IBM Research) provides page-by-page layout segmentation ground truth using bounding boxes for 11 distinct class labels on 80,863 unique pages from 6 document categories.

To date, the dataset can be downloaded through direct links or as a dataset from Hugging Face datasets:

About PDFs languages

Quotation from page 3 of the DocLayNet paper:

We did not control the document selection with regard to language. The vast majority of documents contained in DocLayNet (close to 95%) are published in English language. However, DocLayNet also contains a number of documents in other languages such as German (2.5%), French (1.0%) and Japanese (1.0%). While the document language has negligible impact on the performance of computer vision methods such as object detection and segmentation models, it might prove challenging for layout analysis methods which exploit textual features.

About PDFs categories distribution

Quotation from page 3 of the DocLayNet paper:

The pages in DocLayNet can be grouped into six distinct categories, namely Financial Reports, Manuals, Scientific Articles, Laws & Regulations, Patents and Government Tenders. Each document category was sourced from various repositories. For example, Financial Reports contain both free-style format annual reports which expose company-specific, artistic layouts as well as the more formal SEC filings. The two largest categories (Financial Reports and Manuals) contain a large amount of free-style layouts in order to obtain maximum variability. In the other four categories, we boosted the variability by mixing documents from independent providers, such as different government websites or publishers. In Figure 2, we show the document categories contained in DocLayNet with their respective sizes.

DocLayNet PDFs categories distribution (source: DocLayNet paper)

Processing into a format facilitating its use by HF notebooks

The two download options cited in the paragraph “DocLayNet dataset” require downloading all the data (approximately 30 GiB), which takes time (about 45 min in Google Colab) and a lot of hard disk space. This could limit experimentation for people with limited resources.

Moreover, even when downloading via the HF datasets library, it is necessary to download the EXTRA zip separately (doclaynet_extra.zip, 7.5 GiB) to associate the annotated bounding boxes with the text extracted by OCR from the PDFs. This operation also requires additional code because the bounding boxes of the texts do not necessarily match the annotated ones (computing the percentage of area shared between an annotated bounding box and a text bounding box makes it possible to match them).
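As an illustration, here is a minimal sketch of such an overlap computation (the function name and the [x_min, y_min, x_max, y_max] box format are assumptions made for this example; the actual matching code is in the notebook linked above):

# Share of a text bounding box's area that lies inside an annotated bounding box.
# Boxes are assumed to be in [x_min, y_min, x_max, y_max] format.
def overlap_ratio(annotated_box, text_box):
    x_min = max(annotated_box[0], text_box[0])
    y_min = max(annotated_box[1], text_box[1])
    x_max = min(annotated_box[2], text_box[2])
    y_max = min(annotated_box[3], text_box[3])

    inter_area = max(0, x_max - x_min) * max(0, y_max - y_min)
    text_area = (text_box[2] - text_box[0]) * (text_box[3] - text_box[1])
    return inter_area / text_area if text_area > 0 else 0.0

# A text box is then assigned to the annotated box with which it shares the largest
# percentage of area (for example, above a threshold such as 0.5).
print(overlap_ratio([0, 0, 100, 50], [10, 10, 60, 40]))  # 1.0: text box fully inside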

Finally, in order to use the Hugging Face notebooks on fine-tuning layout models like LayoutLMv3 or LiLT, the DocLayNet data must be processed into a suitable format.

For all these reasons, I decided to process the DocLayNet dataset:

  • into 3 datasets of different sizes (random selection respectively from the train, val and test files for the small and base versions):

DocLayNet small (about 1% of DocLayNet): < 1,000 document images (691 train, 64 val, 49 test)

DocLayNet base (about 10% of DocLayNet): < 10,000 document images (6,910 train, 648 val, 499 test)

DocLayNet large (about 100% of DocLayNet): < 100,000 document images (69,103 train, 6,480 val, 4,994 test)

  • with associated texts,
  • and in a format facilitating their use by HF notebooks.

Note: the layout HF notebooks will greatly help participants of the IBM ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents!

DocLayNet small

DocLayNet small is about 1% of the DocLayNet dataset (random selection respectively from the train, val and test files).

Download code

# !pip install -q datasets

from datasets import load_dataset

dataset_small = load_dataset("pierreguillou/DocLayNet-small")

Dataset overview

DatasetDict({
    train: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 691
    })
    validation: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 64
    })
    test: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 49
    })
})
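Once loaded, each example exposes the features listed above. Here is a minimal sketch of how to inspect one page and normalize a bounding box to the 0-1000 scale expected by LayoutLM-style models (I assume here that 'texts', 'bboxes_block', 'bboxes_line' and 'categories' are parallel lists and that the boxes are given as [x_min, y_min, x_max, y_max] in the coco_width x coco_height image coordinates; check the notebook for the exact conventions):

example = dataset_small["train"][0]

# Page-level metadata
print(example["doc_category"])                         # document category of the page
print(example["coco_width"], example["coco_height"])   # size of the page image

# Text-level annotations (assumed to be parallel lists)
print(example["texts"][0])          # first extracted text
print(example["bboxes_line"][0])    # its line-level bounding box
print(example["bboxes_block"][0])   # its paragraph-level (block) bounding box
print(example["categories"][0])     # its layout category id (11 classes)

# LayoutLM-style models expect bounding boxes normalized to a 0-1000 scale
def normalize_bbox(bbox, width, height):
    return [
        int(1000 * bbox[0] / width),
        int(1000 * bbox[1] / height),
        int(1000 * bbox[2] / width),
        int(1000 * bbox[3] / height),
    ]

print(normalize_bbox(example["bboxes_line"][0],
                     example["coco_width"], example["coco_height"]))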

DocLayNet base

DocLayNet base is about 10% of the DocLayNet dataset (random selection respectively from the train, val and test files).

Download code

# !pip install -q datasets

from datasets import load_dataset

dataset_base = load_dataset("pierreguillou/DocLayNet-base")

Dataset overview

DatasetDict({
    train: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 6910
    })
    validation: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 648
    })
    test: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 499
    })
})

DocLayNet large

DocLayNet large is about 100% of the DocLayNet dataset, i.e. the full dataset.

WARNING: the following code downloads DocLayNet large, but it cannot run to completion in Google Colab because of the disk space needed to store the cache data and the CPU RAM needed to download the data (for example, the cache data in /home/ubuntu/.cache/huggingface/datasets/ needs almost 120 GB during the download process). And even with a suitable instance, the download time of the DocLayNet large dataset is around 1h50. This is one more reason to test your fine-tuning code on DocLayNet small and/or DocLayNet base 😊

Download code

# !pip install -q datasets

from datasets import load_dataset

dataset_large = load_dataset("pierreguillou/DocLayNet-large")
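Note: if disk space is the main blocker, the datasets library also supports a streaming mode that iterates over examples without caching the whole dataset locally. Here is a minimal sketch (this is an alternative I have not benchmarked here, not the approach used in the notebook):

from datasets import load_dataset

# Streaming avoids writing ~120 GB of cache to disk: examples are fetched on the fly
dataset_large_streamed = load_dataset(
    "pierreguillou/DocLayNet-large", split="train", streaming=True
)

# Quick check on a handful of examples
for i, example in enumerate(dataset_large_streamed):
    print(example["doc_category"], len(example["texts"]))
    if i == 4:
        break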

Dataset overview

DatasetDict({
    train: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 69103
    })
    validation: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 6480
    })
    test: Dataset({
        features: ['id', 'texts', 'bboxes_block', 'bboxes_line', 'categories', 'image', 'pdf', 'page_hash', 'original_filename', 'page_no', 'num_pages', 'original_width', 'original_height', 'coco_width', 'coco_height', 'collection', 'doc_category'],
        num_rows: 4994
    })
})

Visualizing annotated bounding boxes (by paragraphs and lines)

DocLayNet small and base make it easy to display a document image with the annotated bounding boxes of paragraphs or lines.

Check the notebook processing_DocLayNet_dataset_to_be_used_by_layout_models_of_HF_hub.ipynb to get the code.
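For a quick look without opening the notebook, here is a minimal sketch with Pillow (the red outline and the raw category ids are illustrative choices, and I assume the bounding boxes are expressed in the coordinates of the stored page image; the notebook contains the full version):

from PIL import ImageDraw

example = dataset_small["train"][0]
image = example["image"].copy()
draw = ImageDraw.Draw(image)

# Draw the line-level annotated bounding boxes with their category ids
for bbox, category in zip(example["bboxes_line"], example["categories"]):
    draw.rectangle(bbox, outline="red", width=2)
    draw.text((bbox[0], max(0, bbox[1] - 12)), str(category), fill="red")

image  # displays the annotated page in a notebook cell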

Paragraphs

Annotated DocLayNet document image with bounding boxes and categories of paragraphs

Lines

Annotated DocLayNet document image with bounding boxes and categories of lines

HF notebooks

About the author: Pierre Guillou is an AI consultant in Brazil and France. Get in touch with him through his LinkedIn profile.
