Paper Review — DocParser: Hierarchical Structure Parsing of Document Renderings
Johannes Rausch, Octavio Martinez, Fabian Bissig, Ce Zhang, Stefan Feuerriegel
Sep 23, 2020
ArXiv link — https://arxiv.org/abs/1911.01702
Research problem
The paper focuses on the problem of document layout analysis. Parsing a document’s rendering into a machine-readable hierarchical structure is a major part of many applications. Generating such a structure is challenging for three reasons: the entities themselves vary (lists can be ordered or unordered), the document layout varies (one column, two columns, etc.), and entities can be arbitrarily nested (for example, a list inside a table cell).
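To make the idea of a "hierarchical structure" concrete, here is a minimal sketch of how such a parse could be represented as a tree. The `Entity` class, the category labels and the bounding box values are all illustrative assumptions made for this review; they are not the data format used by the paper or its released code.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# Hypothetical bounding box as (x_min, y_min, x_max, y_max) in page pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class Entity:
    """One node of the document tree (illustrative, not the paper's schema)."""
    category: str
    bbox: BBox
    children: List["Entity"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Entity"]]:
        """Yield (depth, entity) pairs in depth-first, top-down order."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def max_depth(root: Entity) -> int:
    """Return the depth of the deepest entity, with the root at depth 0."""
    return max(depth for depth, _ in root.walk())


# A toy page: a heading plus a table whose single cell contains a list.
document = Entity("document", (0, 0, 612, 792), [
    Entity("heading", (50, 40, 560, 70)),
    Entity("table", (50, 100, 560, 400), [
        Entity("table_cell", (50, 100, 300, 250), [
            Entity("list", (60, 110, 290, 240), [
                Entity("list_item", (60, 110, 290, 170)),
                Entity("list_item", (60, 180, 290, 240)),
            ]),
        ]),
    ]),
])

# Print an indented outline of the tree.
for depth, entity in document.walk():
    print("  " * depth + entity.category)
```

A flat list of detected boxes cannot express that the list above belongs to a table cell; the parent-child links can, and recovering those links is the part of the task that goes beyond plain object detection.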
Contributions
In this paper, the authors:
- introduce an end-to-end system for parsing the structure of documents, including all text elements, figures, tables and table cells.
- release a dataset, “arXivdocs”, for evaluating their hierarchical document structure parser, based on 127,472 scientific articles from the arXiv repository. The dataset can be found here: https://github.com/DS3Lab/arXivdocs
- propose a novel, scalable weak supervision learning framework for problems where domain-specific data is scarce.
- release their source code: https://github.com/DS3Lab/DocParser