Paper Review — DocParser: Hierarchical Structure Parsing of Document Renderings

Johannes Rausch, Octavio Martinez, Fabian Bissig, Ce Zhang, Stefan Feuerriegel

Krut Patel

Published in

The Startup

4 min read · Sep 23, 2020

ArXiv link — https://arxiv.org/abs/1911.01702

Research problem

The paper focuses on the problem of document layout analysis. Parsing a document’s rendering into a machine-readable hierarchical structure is a major part of many applications. Generating such a structure is challenging for three reasons: the entities themselves vary (lists can be ordered or unordered), the document layout varies (one column, two columns, etc.), and entities can be arbitrarily nested (for example, a list inside a table cell).
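To make the idea of a hierarchical structure concrete, here is a minimal sketch of how nested entities could be held in a tree. The `Entity` class, the category names, and the bounding-box values are all assumptions made up for illustration; this is not the output format used by DocParser.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entity:
    """One node of a document tree (hypothetical schema, not DocParser's format)."""
    category: str                        # e.g. "table", "table_cell", "list"
    bbox: Tuple[int, int, int, int]      # (x0, y0, x1, y1); values below are made up
    children: List["Entity"] = field(default_factory=list)


def walk(node, depth=0):
    """Yield (depth, category) pairs in depth-first order."""
    yield depth, node.category
    for child in node.children:
        yield from walk(child, depth + 1)


# A list nested inside a table cell, mirroring the nesting case described above.
item_list = Entity("list", (60, 120, 300, 180), [
    Entity("list_item", (60, 120, 300, 150)),
    Entity("list_item", (60, 150, 300, 180)),
])
cell = Entity("table_cell", (50, 110, 310, 190), [item_list])
table = Entity("table", (40, 100, 560, 400), [cell])
document = Entity("document", (0, 0, 612, 792), [table])

# Print the tree with one indentation level per nesting depth.
for depth, category in walk(document):
    print("  " * depth + category)
```

A flat list of detected boxes would lose the fact that the list belongs to the cell and the cell to the table. Keeping children on each node preserves that relationship, which is the part a hierarchical parser has to recover.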

Contributions

In this paper, the authors:

introduce an end-to-end system for parsing the structure of documents, including all text elements, figures, tables, and table cells.
release “arXivdocs”, a dataset for evaluating their hierarchical document structure parser, based on 127,472 scientific articles from the arXiv repository. The dataset is available at https://github.com/DS3Lab/arXivdocs
propose a novel, scalable weak-supervision learning framework for problems where domain-specific data is scarce.
release their source code at https://github.com/DS3Lab/DocParser

Written by Krut Patel