Paper Review — DocParser: Hierarchical Structure Parsing of Document Renderings

ArXiv link — https://arxiv.org/abs/1911.01702

Research problem

The paper focuses on the problem of document layout analysis. Parsing a document’s rendering into a machine-readable hierarchical structure is a major part of many applications. Generating such a hierarchical structure is a challenging task for three reasons: the entities vary (lists can be ordered or unordered), the structure of a document varies (one column, two columns, etc.), and the entities can be arbitrarily nested (for example, a list inside a table cell).
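To make the nesting problem concrete, here is a minimal sketch of what a hierarchical document structure could look like as a tree. The `Entity` class, the `walk` helper and the category names are hypothetical and chosen for illustration only; they are not taken from the paper or from the DocParser code.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Entity:
    """One node of a hypothetical document tree (not DocParser's actual schema)."""
    category: str  # e.g. "table", "table_cell", "list"
    children: List["Entity"] = field(default_factory=list)


def walk(node: Entity, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, category) pairs in pre-order, parents before children."""
    yield depth, node.category
    for child in node.children:
        yield from walk(child, depth + 1)


# The nesting case mentioned above: a list inside a table cell.
doc = Entity("document", [
    Entity("table", [
        Entity("table_cell", [
            Entity("list", [Entity("list_item"), Entity("list_item")]),
        ]),
    ]),
    Entity("figure"),
])

# Print the tree with two spaces of indentation per nesting level.
for depth, category in walk(doc):
    print("  " * depth + category)
```

A flat list of detected boxes would lose the fact that the list belongs to the table cell; a tree keeps that parent-child relation explicit, which is why the output of such a parser is described as hierarchical.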

Contributions

In this paper, the authors:

  • introduce an end-to-end system for parsing the structure of documents, including all text elements, figures, tables and table cells.
  • release a dataset, “arXivdocs”, for evaluating their hierarchical document structure parser, based on 127,472 scientific articles from the arXiv repository. The dataset can be found here: https://github.com/DS3Lab/arXivdocs
  • propose a novel, scalable weak supervision learning framework for problems where domain-specific data is scarce.
  • release their source code: https://github.com/DS3Lab/DocParser
