Development of a structure-aware PDF parser

Christian Hofer
Sep 6, 2020


This article briefly introduces pdfstructure, a PDF parsing library that I am currently developing. The library attempts to capture the original document hierarchy and to make the relations between chapters, headers and paragraphs accessible in a generic way.

Why I started to develop a PDF parsing library

When I work on customer projects that involve document parsing and textual data retrieval, I find myself working through the same questions again and again:

  • In what format is the textual data presented?
  • Is it structured data accessible through a database?
  • Or is it just a zipped excerpt of a folder structure containing unstructured documents of various types, like office documents and HTML files?

Many times I have been confronted with the latter scenario, where customer-specific documents had to be parsed for a particular use case.

Of course, libraries that cover raw text extraction already exist, but in my experience popular libraries like textract or PyPDF2 focus on extracting just the raw text.

Why is that a problem?

Valuable information about the original document hierarchy, such as the relations between paragraphs and chapter headers, is lost after parsing.
But exactly that textual structure can add significant value to use cases where it's important to represent and process data in a systematic, meaningful way.

A project can perform worse when unstructured data, i.e. text documents of different types, sizes and layouts, is modelled and applied uniformly to solve a business problem. For example, when all kinds of documents are ingested into a search index as they are, without analyzing the underlying data first.

Figure 1 shows an example with

  • Document A as a user handbook that covers a range of topics in chapters and paragraphs in great detail
  • Document B as a single-paged document with a small amount of text
Fig 1.: Example workflow for working with documents including analysis, pre-processing, parsing, modelling and application.

Instead of adding the raw text of those documents directly to the search index, document A could be split up into its top-level chapters (keeping a link to the original document). Additionally, the chapter title can be added as an analysed field to the search index. That could then boost such a sub-document to be identified as the correct search hit for a given query.
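As a rough illustration of that indexing strategy, the following sketch splits a chapter tree into one index entry per top-level chapter. The Section class and all field names here are hypothetical stand-ins for this example, not pdfstructure's API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    # hypothetical stand-in for a parsed chapter/section node
    title: str
    body: str
    children: List["Section"] = field(default_factory=list)


def to_index_entries(doc_name: str, chapters: List[Section]) -> List[dict]:
    """Create one search-index entry per top-level chapter,
    keeping a link back to the original document."""
    entries = []
    for chapter in chapters:
        # flatten the chapter's own text plus all nested sub-sections
        stack, text_parts = list(chapter.children), [chapter.body]
        while stack:
            node = stack.pop()
            text_parts.append(node.body)
            stack.extend(node.children)
        entries.append({
            "source_document": doc_name,      # link to the original PDF
            "chapter_title": chapter.title,   # analysed field used for boosting
            "text": " ".join(p for p in text_parts if p),
        })
    return entries


chapters = [
    Section("Setup", "Install steps.", [Section("Linux", "apt install ...")]),
    Section("Usage", "Run the tool."),
]
entries = to_index_entries("handbook.pdf", chapters)
```

Each entry then becomes its own search document, so a query matching "Linux" can surface the "Setup" chapter instead of the whole handbook.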

Introduction to automated textual structure parsing

I have started to develop the library pdfstructure in order to tackle the problem of parsing a document's structure in a generic way, independent of its layout.

How does it work?

pdfstructure is built on top of pdfminer.six that provides:

  • Text extraction directly from the PDF’s source code
  • Exposure of exact location, font and color of the extracted text
  • Layout analysis to group text into lines and paragraphs

Adding hierarchy

pdfstructure adds a processing step on top of the extracted flat paragraph list and creates a nested tree structure that should represent the original hierarchy.

On a high level, the algorithm works as follows:

  1. Analyze distribution of occurring character style features like font-size and font-name for a given document
    Often tons of different font sizes are used within a single document.
    To make life easier, font sizes are mapped to predefined sizes like small, medium or large
  2. Iterate through the paragraphs and annotate each of them with its predominant style, like the mapped text size and character weight (bold)
  3. Categorize each paragraph into header or content
  4. By leveraging the paragraph’s category, the document structure can be recreated as a general tree structure in one pass where
    — smaller headers are treated as a sub-section of larger headers (parent)
    — content paragraphs are children of a header paragraph
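The steps above can be sketched in plain Python. This is a simplified illustration of the idea, not pdfstructure's actual implementation: paragraphs are reduced to (font size, bold, text) triples, sizes are bucketed relative to an assumed body-text size, and a stack nests smaller headers under larger ones in a single pass.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Paragraph:
    font_size: float
    bold: bool
    text: str


@dataclass
class Node:
    paragraph: Paragraph
    children: List["Node"] = field(default_factory=list)


def bucket_size(size: float, body_size: float) -> int:
    """Steps 1+2: map raw font sizes to coarse levels (0=small .. 2=large).
    The 1.1x/1.5x thresholds are arbitrary for this sketch."""
    if size > body_size * 1.5:
        return 2
    if size > body_size * 1.1:
        return 1
    return 0


def is_header(p: Paragraph, body_size: float) -> bool:
    """Step 3: larger-than-body or bold paragraphs count as headers."""
    return bucket_size(p.font_size, body_size) > 0 or p.bold


def build_tree(paragraphs: List[Paragraph], body_size: float) -> List[Node]:
    """Step 4: one pass; smaller headers nest under larger ones,
    content paragraphs attach to the most recent header."""
    roots: List[Node] = []
    stack = []  # (header level, node), innermost header on top
    for p in paragraphs:
        node = Node(p)
        if is_header(p, body_size):
            level = bucket_size(p.font_size, body_size)
            # pop headers that are the same size or smaller
            while stack and stack[-1][0] <= level:
                stack.pop()
            (stack[-1][1].children if stack else roots).append(node)
            stack.append((level, node))
        else:
            (stack[-1][1].children if stack else roots).append(node)
    return roots


paragraphs = [
    Paragraph(20, True, "Chapter 1"),
    Paragraph(14, True, "Section 1.1"),
    Paragraph(10, False, "Body text."),
]
tree = build_tree(paragraphs, body_size=10.0)
```

Running this on the three sample paragraphs yields one root ("Chapter 1") with "Section 1.1" as a child and the body text nested beneath it.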

For humans, grouping paragraphs accordingly is an easy task, based on visual cues like boldness or text size.

# 1) An easy example — GitHub Page as PDF
The following example document uses distinctive style features to define the document's structure:

Fig 2.: Exemplary document (Source: TSiege/The Technical Interview Cheat Sheet.md)

Figure 3 showcases a subset of the parsed tree structure for the prior document.

Fig 3.: Captured tree structure of interview_cheatsheet.pdf (excerpt)

# 2) A somewhat harder example — Book parsing

Book parsing can be harder, since books are usually composed of many chapters and include specific layout features like headers, footers or text boxes that highlight a particular paragraph.

The following image (Figure 4) showcases a brief side by side comparison of the parsed document and the original PDF “Kafka: The Definitive Guide”.

  • Left image with PyCharm debugging into the document model
  • Right image rendering the book using a PDF Viewer
Fig 4.: Comparison of the parsed document structure (left, PyCharm) to the original PDF (right, PDF Viewer)

Note: Textual structure parsing is based purely on text style analysis; additional information like interactive links is not used.

Document Model

class StructuredDocument:
    metadata: dict
    sections: List[Section]

class Section:
    content: TextElement
    children: List[Section]
    level: int

class TextElement:
    text: LTTextContainer  # the extracted paragraph from pdfminer
    style: Style

Usage

The project is still in early development, but it is already able to handle and represent various kinds of documents pretty well.

Text extraction

from pdfstructure.hierarchy.parser import HierarchyParser
from pdfstructure.source import FileSource

parser = HierarchyParser()

# specify source (that implements source.read())
source = FileSource(path)

# analyse document and parse as nested data structure
document = parser.parse_pdf(source)

Export

The extracted text is stored as a tree and can be serialized to JSON, or for debugging purposes simply printed in a pretty string format.

from pdfstructure.printer import PrettyStringPrinter

pretty_string_printer = PrettyStringPrinter()
pretty_string = pretty_string_printer.print(document)
print(pretty_string)

Sample output (excerpt):

[Search Basics]
  [Breadth First Search]
    [Definition:]
      An algorithm that searches a tree (or graph) by searching levels
      of the tree first, starting at the root.
      It finds every node on the same level, most often moving left to
      right.
    [What you need to know:]
      Optimal for searching a tree that is wider than it is deep.
      Uses a queue to store information about the tree while it
      traverses a tree.
    [Time Complexity:]
      Search: Breadth First Search: O(V + E)
      E is number of edges
      V is number of vertices

The JsonFilePrinter implementation can be used to serialize the document to a file.

The document can of course be easily loaded from file whenever needed.

import json

from pdfstructure.model.document import StructuredPdfDocument

with open(path) as file:
    document = StructuredPdfDocument.from_json(json.load(file))

print(document.title)

"interview_cheatsheet.pdf"

Leveraging textual structure

Having all paragraphs and sections organised, it's straightforward to iterate through the layers and search for specific elements like headlines, or to extract all main headers like chapter titles.

A parsed document can be traversed using the provided in-order or level-order generator implementations.

from pdfstructure.hierarchy.traversal import traverse_level_order

sections = [e for e in traverse_level_order(document, max_depth=2)]

The sections can then be used however necessary.
The previously parsed document could then yield sections as follows:

          "Search"              "Sorting"
          /      \              /       \
      "BFS"    "DFS"       "Merge"   "Quick"
      / | \                 / | \
     d  w  t               d  w  t

## yield order ##
["Search", "Sorting", "BFS", "DFS", "Quick", "Merge"]
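For illustration, a minimal level-order generator with similar semantics could look like this. The Section class and the max_depth handling are assumptions for this sketch, not the library's actual code.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Section:
    # hypothetical stand-in for a parsed section node
    title: str
    children: List["Section"] = field(default_factory=list)


def level_order(roots: List[Section],
                max_depth: Optional[int] = None) -> Iterator[Section]:
    """Breadth-first traversal over the section tree,
    optionally capped at max_depth levels (root = depth 1)."""
    queue = deque((s, 1) for s in roots)
    while queue:
        section, depth = queue.popleft()
        if max_depth is not None and depth > max_depth:
            continue
        yield section
        queue.extend((c, depth + 1) for c in section.children)


tree = [
    Section("Search", [Section("BFS"), Section("DFS")]),
    Section("Sorting", [Section("Merge"), Section("Quick")]),
]
titles = [s.title for s in level_order(tree, max_depth=2)]
# titles == ["Search", "Sorting", "BFS", "DFS", "Merge", "Quick"]
```

A deque-backed queue keeps the traversal O(n) over all sections, and the depth counter lets callers stop at chapter level without walking the whole tree.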

The source code can be found on GitHub at ChrizH/pdfstructure. The library is written in Python 3 and is in pre-alpha.

I am happy for thoughts or input of any kind!


Christian Hofer

Passionate Software Engineer, problem solver and adventurer.