Document Image Datasets

Jul 7, 2019

This is a living post that I’ll update periodically to track and summarize publicly available datasets related to document analysis.

General Document Analysis

IIT Dataset

After reading through the LayoutLM paper, I discovered the much larger IIT dataset (IIT-CDIP), which RVL-CDIP and several other document image analysis datasets are based on. It is huge. The readme has the details:

https://ir.nist.gov/cdip/README.txt

PubLayNet

https://arxiv.org/abs/1908.07836

PubTabNet

https://arxiv.org/abs/1911.10683

Document Image Classification

RVL-CDIP

RVL-CDIP is arguably the ImageNet of the document image community, and it is one of the more challenging benchmarks in document image classification. The dataset contains a lot of noise, and the composition of each document class varies widely. Uncompressed, the dataset is ~100GB and comprises 16 classes of document types with 25,000 samples per class, or 400,000 images in total. Example classes include email, resume, and invoice.
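To check the class balance described above, a minimal sketch along these lines can help. It assumes a label file where each line is an image path followed by an integer class id, separated by whitespace; the sample paths below are hypothetical and the real file layout should be verified against the dataset's own documentation.

```python
from collections import Counter


def parse_label_lines(lines):
    """Parse lines of the assumed form '<relative/path> <integer label>'."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last whitespace run so paths containing spaces survive.
        path, label = line.rsplit(maxsplit=1)
        pairs.append((path, int(label)))
    return pairs


def class_counts(pairs):
    """Count how many samples fall into each class id."""
    return Counter(label for _, label in pairs)


# Hypothetical sample lines, only for illustration.
sample_lines = [
    "images/a/doc_0001.tif 0",
    "images/b/doc_0002.tif 15",
    "images/c/doc_0003.tif 0",
]
pairs = parse_label_lines(sample_lines)
counts = class_counts(pairs)

# Figures stated in the post: 16 classes with 25,000 samples each.
NUM_CLASSES = 16
SAMPLES_PER_CLASS = 25_000
TOTAL_IMAGES = NUM_CLASSES * SAMPLES_PER_CLASS
```

On the full dataset, every value in `counts` should equal `SAMPLES_PER_CLASS` if the classes are balanced as described.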

SD02

The NIST Structured Forms Database, or “SD-02”, consists of 5,590 pages of binary, black-and-white images of synthesized documents. The documents are 12 different IRS tax forms from 1988. Eight of these forms contain two pages, or form faces, so there are 20 different form faces represented in the database. The document images appear to be real forms prepared by individuals, but they were automatically derived and synthesized using a computer. Overall, this dataset is much easier to classify with CNNs than RVL-CDIP. L. Kang et al. showed that classification with perfect accuracy on this dataset was possible with very few training examples (link).
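To experiment with the "very few training examples" setting, one option is a k-per-class split. The sketch below is an assumption about how such a split could be built, not the protocol used by Kang et al.; the function name `few_shot_split` and the demo data are hypothetical.

```python
import random
from collections import defaultdict


def few_shot_split(samples, k, seed=0):
    """Split (item, label) pairs into train/test with k training items per class."""
    by_class = defaultdict(list)
    for item, label in samples:
        by_class[label].append(item)

    rng = random.Random(seed)  # local RNG so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        items = list(by_class[label])
        rng.shuffle(items)
        train.extend((item, label) for item in items[:k])
        test.extend((item, label) for item in items[k:])
    return train, test


# Hypothetical demo: 20 form faces with 5 pages each.
demo = [(f"face{c:02d}_page{i}", c) for c in range(20) for i in range(5)]
train, test = few_shot_split(demo, k=1)
```

With `k=1`, the training set holds one page per form face and everything else is left for evaluation.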

Tobacco-3482

Tobacco-3482 is smaller than RVL-CDIP but somewhat similar. In fact, some of the images in Tobacco-3482 are also in RVL-CDIP. Tobacco-3482 consists of ten classes and, just like the title says, it has 3,482 images.

Medical Article Records Groundtruth (MARG)

From the website:

MARG is a freely-available repository of document page images and their associated textual and layout data. The data has been reviewed and corrected to establish its “ground truth”. Research in document image analysis and understanding is greatly facilitated by such repositories for the design, training, and testing of algorithms for data identification and extraction.

The dataset is quite interesting. Each class corresponds to a scientific journal layout, well-explained by the figure below. There are 1,553 samples in total, with a quite imbalanced class distribution. In the papers I’ve read that do classification on this dataset, the “Other” class is omitted. We report results both with and without the “Other” class.
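A minimal sketch of the two evaluation settings, with and without the “Other” class, might look like the following. The class names `TypeA` and `TypeB` and the sample ids are hypothetical placeholders, not the actual MARG labels.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def drop_class(samples, excluded="Other"):
    """Remove every (item, label) pair whose label equals `excluded`."""
    return [(item, label) for item, label in samples if label != excluded]


# Hypothetical samples, only for illustration.
demo = [("p1", "TypeA"), ("p2", "TypeA"), ("p3", "TypeB"), ("p4", "Other")]

with_other = class_distribution(label for _, label in demo)
without_other = class_distribution(label for _, label in drop_class(demo))
```

Comparing `with_other` and `without_other` shows how much the imbalance shifts once the “Other” class is removed, which is worth knowing before comparing accuracy across the two settings.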