Document AI | DocLayNet image viewer APP

Pierre Guillou
Jan 31, 2023


Images with bounding boxes of labeled paragraphs and lines from DocLayNet

After creating three versions of DocLayNet for download (small, base, and large, which also make DocLayNet data easier to use in Hugging Face notebooks for fine-tuning document layout models), it was important to have an APP for viewing annotated images with labeled bounding boxes of paragraphs and lines. This visualization helps to better understand the nature of DocLayNet data and its research/commercial potential in Document AI. This post explains how the APP works and gives access to its online Space and notebook.

Credit: DocLayNet is a dataset of IBM Research.

Notebook: DocLayNet image viewer APP

To read (Layout XLM base — paragraph level)

To read (Layout XLM base — line level)

To read (LiLT base — paragraph level)

To read (LiLT base — line level)

DocLayNet dataset

The DocLayNet dataset (IBM Research) provides page-by-page layout segmentation ground truth using bounding boxes for 11 distinct class labels (Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title) on 80863 unique pages from 6 document domains (Financial Reports, Manuals, Scientific Articles, Laws & Regulations, Patents, Government Tenders).

Until now, the dataset could be downloaded through direct links or as a dataset from Hugging Face datasets:

DocLayNet small/base/large

The 2 download options mentioned in the section “DocLayNet dataset” require downloading all the data (approximately 30 GiB), which takes time (about 45 min in Google Colab) and a lot of disk space. This can limit experimentation for people with limited resources.

Moreover, even when downloading through the HF datasets library, it is necessary to download the EXTRA zip separately (doclaynet_extra.zip, 7.5 GiB) to associate the annotated bounding boxes with the text extracted by OCR from the PDFs. This operation also requires additional code because the bounding boxes of the texts do not necessarily match the annotated ones (computing the percentage of area shared between the annotated bounding boxes and the text bounding boxes makes it possible to compare them).
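To make the area comparison mentioned above concrete, here is a minimal sketch in Python. It is not the code used to build the datasets: the function names, the (x0, y0, x1, y1) box layout, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

# Assumed box layout: (x0, y0, x1, y1), top-left and bottom-right corners.
Box = Tuple[float, float, float, float]


def overlap_ratio(text_box: Box, annotated_box: Box) -> float:
    """Return the fraction of text_box's area that lies inside annotated_box."""
    tx0, ty0, tx1, ty1 = text_box
    ax0, ay0, ax1, ay1 = annotated_box
    # Width and height of the intersection rectangle (0 if the boxes are disjoint).
    inter_w = max(0.0, min(tx1, ax1) - max(tx0, ax0))
    inter_h = max(0.0, min(ty1, ay1) - max(ty0, ay0))
    text_area = (tx1 - tx0) * (ty1 - ty0)
    if text_area <= 0:
        return 0.0
    return inter_w * inter_h / text_area


def assign_label(
    text_box: Box,
    annotations: List[Tuple[str, Box]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the post
) -> Optional[str]:
    """Pick the label of the annotated box that covers the text box best."""
    best_label, best_ratio = None, 0.0
    for label, box in annotations:
        ratio = overlap_ratio(text_box, box)
        if ratio > best_ratio:
            best_label, best_ratio = label, ratio
    return best_label if best_ratio >= threshold else None
```

With this sketch, a text box that is half covered by an annotated box gets a ratio of 0.5, and a text box that barely touches an annotation is left unlabeled.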

Finally, in order to use Hugging Face notebooks for fine-tuning layout models like LayoutLMv3 or LiLT, DocLayNet data must be processed into a suitable format.

For all these reasons, the DocLayNet dataset was processed:

  • into 3 datasets of different sizes (for the small and base versions, pages were randomly selected from the train, val and test files respectively):

DocLayNet small (about 1% of DocLayNet) < 1,000 document images (691 train, 64 val, 49 test)

DocLayNet base (about 10% of DocLayNet) < 10,000 document images (6,910 train, 648 val, 499 test)

DocLayNet large (about 100% of DocLayNet) < 100,000 document images (69,103 train, 6,480 val, 4,994 test)

  • with associated texts,
  • and in a format facilitating their use by HF notebooks.
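The split sizes listed above are consistent with keeping the integer part of 1% and 10% of each split of the large version. The sketch below checks that arithmetic and shows one possible way to draw such a random subset. The helper names and the seed are hypothetical, and the actual selection procedure used for the datasets may differ.

```python
import random
from typing import Dict, List

# Split sizes of DocLayNet large, as listed in this post.
LARGE_SPLIT_SIZES: Dict[str, int] = {"train": 69103, "val": 6480, "test": 4994}


def subset_size(n_pages: int, percent: int) -> int:
    """Integer part of `percent` % of n_pages (integer arithmetic, no float rounding)."""
    return n_pages * percent // 100


def sample_page_indices(n_pages: int, percent: int, seed: int = 42) -> List[int]:
    """Draw a reproducible random subset of page indices (the seed is an assumption)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_pages), subset_size(n_pages, percent)))


small = {split: subset_size(n, 1) for split, n in LARGE_SPLIT_SIZES.items()}
base = {split: subset_size(n, 10) for split, n in LARGE_SPLIT_SIZES.items()}
print(small)  # {'train': 691, 'val': 64, 'test': 49}
print(base)   # {'train': 6910, 'val': 648, 'test': 499}
```

Sampling each split separately, as in `sample_page_indices`, keeps the train/val/test proportions of the full dataset in the smaller versions.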

You can get more information about these 3 DocLayNet formats in the blog post “Document AI | Processing of DocLayNet dataset to be used by layout models of the Hugging Face hub (finetuning, inference)”.

Note: the layout HF notebooks will greatly help participants of the IBM ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents!

DocLayNet image viewer

DocLayNet image viewer (video)

About the APP

After creating different formats (small, base and large) for downloading DocLayNet, it seemed important to have an APP for viewing annotated images with labeled bounding boxes.

Indeed, this visualization makes it possible to better understand the nature of DocLayNet data and their research/commercial potential in Document AI.

For example, it is possible to get:

  • a visualization of labeled bounding boxes by paragraph or line,
  • the textual content of the labeled bounding boxes, formatted according to the corresponding label (for example, content labeled “Text” has no line breaks, whereas content labeled “Table” or “Header” keeps them),
  • the coordinates of the bounding boxes of labeled paragraphs and lines,
  • PDF image download with and without bounding boxes,
  • the dataframes of labeled bounding boxes data (labels, texts, coordinates) of paragraphs and lines.
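As an illustration of the label-dependent text formatting in the list above, here is a minimal sketch. It is not the APP's code: the function name is hypothetical, and treating every label other than “Text” as keeping its line breaks is an assumption.

```python
from typing import List

# Assumption: only "Text" content is merged into a single line.
LABELS_JOINED_WITHOUT_BREAKS = {"Text"}


def format_box_text(label: str, lines: List[str]) -> str:
    """Join the text lines of one labeled box according to its label."""
    # Drop empty lines and surrounding whitespace before joining.
    cleaned = [line.strip() for line in lines if line.strip()]
    separator = " " if label in LABELS_JOINED_WITHOUT_BREAKS else "\n"
    return separator.join(cleaned)
```

For instance, two lines of a “Text” paragraph become one running sentence, while two rows of a “Table” stay on separate lines.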

In addition, the APP lets you choose the domain from which a PDF is randomly selected and, optionally, a category (label) that must be part of the categories list of that PDF.

Here are the lists of DocLayNet domains and categories (labels):

  • List of the 6 PDF domains of DocLayNet: Financial Reports, Manuals, Scientific Articles, Laws & Regulations, Patents, Government Tenders
  • List of the 11 DocLayNet categories (labels): Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, Title
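The random selection by domain and optional category described above can be sketched as follows. This is only an illustration: the page records (dictionaries with "domain" and "categories" keys) and the function name are hypothetical and do not reflect how the APP or the datasets actually store this information.

```python
import random
from typing import Dict, List, Optional

DOMAINS = [
    "Financial Reports", "Manuals", "Scientific Articles",
    "Laws & Regulations", "Patents", "Government Tenders",
]
CATEGORIES = [
    "Caption", "Footnote", "Formula", "List-item", "Page-footer", "Page-header",
    "Picture", "Section-header", "Table", "Text", "Title",
]


def pick_random_page(
    pages: List[Dict],
    domain: str,
    category: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> Optional[Dict]:
    """Return a random page of the given domain, optionally containing a category."""
    if domain not in DOMAINS:
        raise ValueError(f"Unknown domain: {domain}")
    if category is not None and category not in CATEGORIES:
        raise ValueError(f"Unknown category: {category}")
    rng = rng or random.Random()
    # Keep pages of the requested domain; if a category is given,
    # it must appear in the page's list of categories.
    candidates = [
        page for page in pages
        if page["domain"] == domain
        and (category is None or category in page["categories"])
    ]
    return rng.choice(candidates) if candidates else None
```

Returning `None` when no page matches lets a caller display a message instead of failing, which is one reasonable design choice for an interactive viewer.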

APP in Hugging Face Spaces

This APP is now available online in Hugging Face Spaces:

APP notebook

You can also run this APP in Google Colab with this notebook: DocLayNet image viewer APP

About the author: Pierre Guillou is an AI consultant in Brazil and France. Get in touch with him through his LinkedIn profile.
