DATA STORIES | TEXT EXTRACTION | KNIME ANALYTICS PLATFORM

KNIME — Extract Text and Tables from PDF Files with Python in a Low-Code Environment

Whether you prefer a no-code or low-code extraction, KNIME has got it covered

Markus Lauber
Low Code for Data Science

A frequently asked question on the KNIME Forum is how to extract data such as tables and text from a PDF file. There has also been a “Just KNIME It!” challenge in which you had to extract a table from a PDF with the help of KNIME nodes.

“A happy yellow robot learning in the style Albrecht Dürer” (DALL·E).

In this article, I would like to show some approaches to extracting tables as well as specific text from PDF files with the help of Python packages like Camelot and PyMuPDF/fitz, all within the comfort of your favourite low-code tool. We will use this dummy two-page invoice:

A sample ‘invoice’ in a PDF with a table and addresses to extract from two pages.

The KNIME Analytics Platform will provide a wrapper to help integrate the extraction into the rest of the workflow. All examples and functions shown in this article can be found in this KNIME workflow on the Hub:

Handle your PDF files in a KNIME workflow (https://hub.knime.com/-/spaces/-/~pmkrtx1AKm7Fcukw/current-state/).

After listing the PDF files in a folder, the paths are sent to a Python Script node that handles the extraction itself. The results are written to .parquet files for further use.

Loop thru all PDF pages and extract all tables found into .parquet files (https://hub.knime.com/-/spaces/-/~pmkrtx1AKm7Fcukw/current-state/).
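
Inside the Python Script node, the hand-over of the paths could look roughly like the sketch below. It assumes the current `knime.scripting.io` API and a hypothetical input column called `Path` that holds plain strings (for example after a Path to String node); the names used in the Hub workflow may differ.

```python
from pathlib import Path


def select_pdf_paths(paths):
    """Keep only entries that look like PDF files (case-insensitive)."""
    return [str(p) for p in paths if Path(str(p)).suffix.lower() == ".pdf"]


try:
    # This module only exists inside a KNIME Python Script node.
    import knime.scripting.io as knio
except ImportError:
    knio = None

if knio is not None:
    # "Path" is an assumed column name coming from the file listing.
    input_df = knio.input_tables[0].to_pandas()
    pdf_paths = select_pdf_paths(input_df["Path"])
    result = input_df[input_df["Path"].astype(str).isin(pdf_paths)]
    knio.output_tables[0] = knio.Table.from_pandas(result)
```

Outside of KNIME the import fails quietly, so the helper function can still be tried in a plain Python session.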

The most relevant setting for this use of the Camelot package is the flavor. It has two options, and you will most likely have to test which one works best:

  • ‘lattice’ (for PDFs with clear grid lines)
  • ‘stream’ (for PDFs where the columns are separated by whitespace rather than by lines)
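
One way to compare the two flavors is to look at the parsing report Camelot attaches to every table. The following is a minimal sketch, not the code from the workflow: the helper names are made up for illustration, and the accuracy value is only a rough hint, so a visual check of the result is still advisable.

```python
def mean_accuracy(reports):
    """Average the 'accuracy' value of Camelot parsing reports (0 if empty)."""
    if not reports:
        return 0.0
    return sum(r["accuracy"] for r in reports) / len(reports)


def best_flavor(reports_by_flavor):
    """Return the flavor whose tables have the highest mean accuracy."""
    return max(reports_by_flavor,
               key=lambda flavor: mean_accuracy(reports_by_flavor[flavor]))


def collect_reports(pdf_path, pages="all"):
    """Run Camelot once per flavor and collect the parsing reports."""
    import camelot  # imported lazily so the helpers above work without it

    return {
        flavor: [table.parsing_report
                 for table in camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)]
        for flavor in ("lattice", "stream")
    }
```

With Camelot installed, `best_flavor(collect_reports("invoice.pdf"))` would return either `"lattice"` or `"stream"`; the file name is a placeholder.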

In this case we use lattice, and the package identifies two tables in our invoice. The format and the headers might not be exactly right yet, but they can easily be adjusted later:

The data stored in tables — to be processed later (https://hub.knime.com/-/spaces/-/~pmkrtx1AKm7Fcukw/current-state/).
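
A stripped-down version of this extraction step might look like the sketch below. It assumes that `camelot-py` and a parquet engine such as `pyarrow` are installed; the function names and the file naming scheme are illustrative and not taken from the workflow.

```python
from pathlib import Path


def parquet_name(pdf_path, page, order):
    """Build a file name such as 'invoice_p1_t2.parquet'."""
    return f"{Path(pdf_path).stem}_p{page}_t{order}.parquet"


def tables_to_parquet(pdf_path, out_dir, flavor="lattice"):
    """Extract all tables with Camelot and write one .parquet file per table."""
    import camelot  # lazy import, needs camelot-py

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for table in camelot.read_pdf(str(pdf_path), pages="all", flavor=flavor):
        df = table.df.copy()
        # Camelot numbers the columns 0, 1, 2, ...; give them string names
        # so that the parquet writer accepts them.
        df.columns = [f"col_{i}" for i in range(df.shape[1])]
        target = out_dir / parquet_name(pdf_path, table.page, table.order)
        df.to_parquet(target, index=False)
        written.append(str(target))
    return written
```

The real headers usually sit in the first data row, which is why they are left for a later cleaning step here.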

If you want to know more about automation options and KNIME, you can read: “KNIME, Paths and Loops — Automate Everything”.

Extract a special Area of the PDF

The next approach is to use PyMuPDF/fitz to always read the same special area of the PDF, for example a header that contains the addresses. You will most likely have to experiment with the x and y coordinates that define the area. You can do this for multiple pages at once. In this case we extract the sender and receiver addresses from pages 1 and 2 (that is 0 and 1 in the logic of Python).

Define an area in the PDF that you want to extract (https://hub.knime.com/-/spaces/-/~pmkrtx1AKm7Fcukw/current-state/).
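
The idea can be sketched as follows. This is an assumption-laden example rather than the workflow's own code: the area is given as fractions of the page so that it is easier to experiment with, and both helper names are hypothetical. PyMuPDF measures coordinates in points from the top-left corner of the page.

```python
def area_in_points(page_width, page_height, left, top, right, bottom):
    """Convert fractions of the page (0..1) into (x0, y0, x1, y1) in points."""
    return (left * page_width, top * page_height,
            right * page_width, bottom * page_height)


def extract_area(pdf_path, fractions, page_numbers=(0, 1)):
    """Return {page_number: text} for one rectangular area on several pages."""
    import fitz  # PyMuPDF, imported lazily

    texts = {}
    with fitz.open(pdf_path) as doc:
        for page_number in page_numbers:
            page = doc[page_number]
            clip = fitz.Rect(*area_in_points(page.rect.width,
                                             page.rect.height, *fractions))
            # Only text inside the clip rectangle is returned.
            texts[page_number] = page.get_text("text", clip=clip)
    return texts
```

A call such as `extract_area("invoice.pdf", (0.0, 0.0, 1.0, 0.25))` would read the top quarter of the first two pages; the file name and the fractions are placeholders to adjust.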

You could later work with this information and bring the data into a proper form. One idea is to feed such texts into a Large Language Model and ask for a JSON file with the structured results. Alternatively, use some RegEx to prepare your data, or identify keywords like “Receiver” from which to continue.
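
As a small illustration of the RegEx and keyword idea, the sketch below splits a text block at label lines. The labels and the sample text are made up; real invoices will need their own patterns.

```python
import re

# Hypothetical labels; adjust them to the wording in your own documents.
LABELS = ("Sender", "Receiver")


def split_by_labels(text, labels=LABELS):
    """Split a text block into {label: content} using the labels as markers."""
    pattern = re.compile(
        r"^\s*(" + "|".join(map(re.escape, labels)) + r")\s*:?\s*",
        re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    result = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        content = text[current.end():end]
        # Collapse line breaks so that every address becomes one line.
        result[current.group(1)] = ", ".join(
            line.strip() for line in content.splitlines() if line.strip()
        )
    return result


sample = "Sender:\nACME Ltd\n1 Main Street\nReceiver:\nJane Doe\n2 Side Road\n"
print(split_by_labels(sample))
# {'Sender': 'ACME Ltd, 1 Main Street', 'Receiver': 'Jane Doe, 2 Side Road'}
```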

The extracted text from the area will be stored in a table.

Just get all the Text from the PDF and deal with it later

The third option is to simply extract all the text per page and store it in a table. This is similar to using the Tika Parser node in KNIME.

Just extract all the text in the PDF — and deal with it later.
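
In Python this can be as short as the following sketch, which assumes PyMuPDF is installed. The column names `file`, `page` and `text` are my own choice for illustration.

```python
def pages_to_rows(source, page_texts):
    """Turn a list of page texts into table rows with a 1-based page number."""
    return [
        {"file": source, "page": number, "text": text.strip()}
        for number, text in enumerate(page_texts, start=1)
    ]


def read_page_texts(pdf_path):
    """Read the plain text of every page with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

The rows can then be turned into a table, for example with `pandas.DataFrame(pages_to_rows(path, read_page_texts(path)))`.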

Another approach using “pdfplumber”

You can also try pdfplumber: extract all the tables it can find, store them in .parquet files and deal with the formats later. You might have to identify which are the genuine tables you want and leave out the other ones. Some loops might come in handy to do this.

Use “pdfplumber” to extract an invoice from a PDF (https://forum.knime.com/t/pdf-parsing-and-extraction/80343/6?u=mlauber71)
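
A possible shape for such a loop is sketched below. The filter is a simple heuristic with made-up thresholds (at least two rows, two columns and half of the cells filled), so treat it as a starting point and not as the logic of the linked workflow.

```python
def is_genuine_table(rows, min_rows=2, min_cols=2):
    """Heuristic filter: enough rows, enough columns, at least half the cells filled."""
    if len(rows) < min_rows or max((len(r) for r in rows), default=0) < min_cols:
        return False
    cells = [cell for row in rows for cell in row]
    filled = [cell for cell in cells if cell not in (None, "")]
    return len(filled) >= len(cells) / 2


def find_tables(pdf_path):
    """Yield (page_number, rows) for every accepted table pdfplumber finds."""
    import pdfplumber  # imported lazily

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for rows in page.extract_tables():
                if is_genuine_table(rows):
                    yield page.page_number, rows
```

Each `rows` value is a list of lists, so it can be passed to `pandas.DataFrame` and written to a .parquet file from there.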

Now you have the basics of how to quickly extract information from your PDF files. In a real-life scenario you will have to do some more configuration and testing, and the packages offer many more options. You can also ask ChatGPT, for example, to give you back JSON files.

There are more examples on the KNIME Forum and Hub about how to deal with content from PDF files.

In case you enjoyed this story you can follow me on Medium (https://medium.com/@mlxl) or on the KNIME Hub (https://hub.knime.com/mlauber71) or KNIME Forum (https://forum.knime.com/u/mlauber71/summary).

This is the basic Python setup in a YML file that you would need. You can read about how to manage KNIME and Python here: Setting up and managing Conda environments.

# conda env create -f="/Users/m_lauber/Dropbox/knime-workspace-50/Examples/PDF - Python package Camelot to extract Text and Tables/data/data/py3_knime.yml"
# conda env create -f="C:\\Users\\x123456\\knime-workspace\\forum\\2023\\kn_forum_70050_pdf_table_extract_indy_car\\data\\py3_knime.yml"

# conda remove -n py3_knime --all

# conda activate py3_knime
# conda update -n py3_knime --all

# conda env update --name py3_knime --file "/Users/m_lauber/Dropbox/knime-workspace-50/Examples/PDF - Python package Camelot to extract Text and Tables/data/data/py3_knime.yml" --prune
# conda env update --name py3_knime --file "C:\\Users\\x123456\\knime-workspace\\forum\\2023\\kn_forum_70050_pdf_table_extract_indy_car\\data\\py3_knime.yml" --prune

# conda env update --name py3_knime --file "/Users/m_lauber/Dropbox/knime-workspace-50/Examples/PDF - Python package Camelot to extract Text and Tables/data/data/py3_knime.yml"
# conda env update --name py3_knime --file "C:\\Users\\x123456\\knime-workspace\\forum\\2023\\kn_forum_70050_pdf_table_extract_indy_car\\data\\py3_knime.yml"
# conda update -n base conda

# KNIME official Python integration guide
# https://docs.knime.com/latest/python_installation_guide/index.html#_introduction

# KNIME and Python - Setting up and managing Conda environments
# https://medium.com/p/2ac217792539

# conda activate py3_knime

# conda install -n py3_knime -c conda-forge camelot-py
# conda install -n py3_knime -c conda-forge tabula-py
# conda install -n py3_knime -c conda-forge pdfplumber
# conda install -n py3_knime -c conda-forge pymupdf

# file: py3_knime.yml with some modifications
# THX Carsten Haubold (https://hub.knime.com/carstenhaubold) for hints
name: py3_knime # Name of the created environment
channels: # Repositories to search for packages
# - defaults # edit: removed to just use conda-forge
# - anaconda # edit: removed to just use conda-forge
- conda-forge
# https://anaconda.org/knime
- knime # conda search knime-python-base -c knime --info # to see what is in the package
dependencies: # List of packages that should be installed
- python=3.9 # Python
- knime-python-base # dependencies of KNIME - Python integration
# - knime-python-scripting # everything you need to also build Python packages for KNIME
- cairo # SVG support
- pillow # Image inputs/outputs
- matplotlib # Plotting
- IPython # Notebook support
- nbformat # Notebook support
- scipy # Scientific computing
- jpype1 # A Python to Java bridge
# Jupyter Notebook support
- jupyter # Jupyter Notebook
- pandas-profiling # create overview of your data
- sweetviz # In-depth EDA (target analysis, comparison, feature analysis, correlation) in two lines of code!
- plotly
- pdfplumber # Plumb a PDF for detailed information about each text character, rectangle, and line.
- camelot-py # Camelot: PDF Table Extraction for Humans https://pypi.org/project/camelot-py/
- pip # Python installer
- pip:
  - pymupdf # https://pypi.org/project/PyMuPDF/ PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents
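
Once the environment has been created, a quick check like the following sketch can confirm that the PDF packages are importable. Note that the import names differ from the package names in the YML file; the list of modules is my own selection.

```python
from importlib.util import find_spec

# Import names differ from package names: camelot-py -> camelot, pymupdf -> fitz.
REQUIRED_MODULES = ("camelot", "fitz", "pdfplumber", "pandas")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the modules that cannot be found in the active environment."""
    return [name for name in modules if find_spec(name) is None]


print("Missing modules:", missing_modules() or "none")
```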


Senior Data Scientist working with KNIME, Python, R and Big Data Systems in the telco industry