DATA STORIES | TEXT EXTRACTION | KNIME ANALYTICS PLATFORM
KNIME — Extract Text and Tables from PDF Files with Python in a Low-Code Environment
Whether you prefer a no-code or low-code extraction, KNIME has got it covered
A frequently asked question on the KNIME Forum is how to extract data such as tables and text from a PDF file. There has also been a “Just KNIME It!” challenge in which you had to extract a table from a PDF with the help of KNIME nodes.
In this article, I would like to show some approaches to extracting tables as well as specific text from PDF files with the help of Python packages like Camelot and PyMuPDF/fitz, all within the comfort of your favourite low-code tool. We will use this dummy two-page invoice:
The KNIME Analytics Platform will provide a wrapper to help integrate the extraction into the rest of the workflow. All examples and functions shown in this article can be found in this KNIME workflow on the Hub:
After listing the PDF files in a folder, the paths are sent to a Python Script node that handles the extraction itself. The results are written to .parquet files for further use.
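Outside of KNIME, the listing step can be sketched in plain Python. This is only a minimal sketch and not the code from the workflow: the helper name `list_pdf_files` is an assumption made for illustration. In the workflow itself the paths come from a KNIME node and reach the Python Script node as an input table.

```python
from pathlib import Path


def list_pdf_files(folder):
    """Return the PDF files in `folder`, sorted by name (hypothetical helper)."""
    folder = Path(folder)
    # Compare the suffix case-insensitively so "SCAN.PDF" is picked up as well.
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Calling `list_pdf_files("data")` would then return the paths that are handed to the extraction step, assuming the PDFs sit in a folder called `data`.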
The most relevant setting for this use of the Camelot package is the flavor, which has two options. You will most likely have to test which one works best:
- ‘lattice’ (for PDFs with clear grid lines)
- ‘stream’ (for PDFs where the table cells are separated by whitespace rather than lines)
In this case we use lattice and the package identifies two tables in our invoice. The format and the headers might not be exactly there yet but they can easily be adjusted later:
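The Camelot step described above might look roughly like the following. It is a sketch under assumptions rather than the exact script from the workflow: the helper names `parquet_name` and `extract_tables` and the output naming scheme are made up for illustration, and writing Parquet requires a Parquet engine such as pyarrow in the environment.

```python
from pathlib import Path


def parquet_name(pdf_path, table_index):
    """Build an output file name such as 'invoice_table_0.parquet'."""
    return f"{Path(pdf_path).stem}_table_{table_index}.parquet"


def extract_tables(pdf_path, out_dir, flavor="lattice"):
    """Extract all tables Camelot finds and write each one to a .parquet file."""
    import camelot  # imported here so parquet_name works without Camelot

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # pages="all" scans every page; flavor is "lattice" or "stream".
    tables = camelot.read_pdf(str(pdf_path), pages="all", flavor=flavor)

    written = []
    for i, table in enumerate(tables):
        df = table.df  # pandas DataFrame; the header is still an ordinary row
        # Camelot numbers the columns 0, 1, 2, ...; string names are the
        # safer choice when writing Parquet.
        df.columns = [str(c) for c in df.columns]
        target = out_dir / parquet_name(pdf_path, i)
        df.to_parquet(target, index=False)
        written.append(target)
    return written
```

Switching `flavor` to `"stream"` is the quickest experiment if the lattice result misses a table.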
If you want to know more about automation options and KNIME you can read: “KNIME, Paths and Loops — Automate Everything”
Extract a specific Area of the PDF
The next approach is to use PyMuPDF/fitz to extract a fixed area of the PDF, for example a header that always contains the addresses. You will most likely have to experiment with the x and y coordinates that define the area. You can do this for multiple pages at once. In this case we extract the sender and receiver addresses from pages 1 and 2 (that is, 0 and 1 in Python's zero-based indexing).
You could later work with this information and bring the data into a proper form. One idea is to feed such texts into a Large Language Model and ask it to return a JSON file with the structured results. Alternatively, use some RegEx to prepare your data, or identify keywords like “Receiver” as anchors from which to continue.
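A possible shape for this area extraction is sketched below. PyMuPDF measures coordinates in points (72 per inch) from the top-left corner of the page, so a small millimetre converter can make the experimenting easier. The box coordinates, the `AREAS` dictionary and the function names are placeholders chosen for illustration and will not match the real invoice.

```python
def mm_to_pt(mm):
    """Convert millimetres to PDF points (72 points per inch, 25.4 mm per inch)."""
    return mm * 72.0 / 25.4


# Hypothetical areas as (x0, y0, x1, y1) in points, measured from the
# top-left corner of the page. These numbers are placeholders to tune.
AREAS = {
    "sender": (mm_to_pt(20), mm_to_pt(20), mm_to_pt(100), mm_to_pt(60)),
    "receiver": (mm_to_pt(110), mm_to_pt(20), mm_to_pt(190), mm_to_pt(60)),
}


def extract_areas(pdf_path, page_numbers=(0, 1), areas=AREAS):
    """Return one row per page and area with the text found inside the box."""
    import fitz  # PyMuPDF; imported lazily so mm_to_pt works without it

    rows = []
    with fitz.open(pdf_path) as doc:
        for page_number in page_numbers:
            page = doc[page_number]  # zero-based: 0 is the first page
            for name, box in areas.items():
                # clip restricts the text extraction to the given rectangle.
                text = page.get_text("text", clip=fitz.Rect(*box))
                rows.append(
                    {"page": page_number + 1, "area": name, "text": text.strip()}
                )
    return rows
```

Keeping the boxes in one dictionary means the coordinates can be adjusted in a single place while testing.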
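The keyword idea could be sketched with the standard `re` module as follows. It assumes, purely for illustration, that each keyword stands alone on its own line (optionally followed by a colon) and that the lines after it belong to that block; real invoice text may well be laid out differently. The helper name `split_by_keywords` is hypothetical.

```python
import re


def split_by_keywords(text, keywords=("Sender", "Receiver")):
    """Split `text` into {keyword: following lines} blocks (hypothetical helper)."""
    # Matches a keyword that stands alone on its line, with an optional colon.
    pattern = re.compile(
        r"^(" + "|".join(re.escape(k) for k in keywords) + r"):?[ \t]*$",
        flags=re.MULTILINE,
    )
    # With one capture group, split() returns
    # [text before first keyword, keyword, block, keyword, block, ...].
    parts = pattern.split(text)
    result = {}
    for keyword, block in zip(parts[1::2], parts[2::2]):
        # Drop empty lines and surrounding whitespace inside each block.
        result[keyword] = [
            line.strip() for line in block.splitlines() if line.strip()
        ]
    return result
```

Each block comes back as a list of lines, which is easy to turn into table columns later on.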
Just get all the Text from the PDF and deal with it later
The third option is to simply extract all the text per page and store it in a table. This is similar to using the Tika Parser node in KNIME.
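A minimal sketch of this page-by-page extraction with PyMuPDF could look like this. The function names and the column names `source`, `page` and `text` are assumptions for illustration, not the ones used in the workflow.

```python
def pages_to_rows(pages, source):
    """Turn an iterable of page objects into one row per page.

    Each page only needs a get_text() method, which PyMuPDF pages provide.
    """
    return [
        {"source": source, "page": number, "text": page.get_text()}
        for number, page in enumerate(pages, start=1)
    ]


def extract_all_text(pdf_path):
    """Open a PDF with PyMuPDF and return one row per page."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        # Iterating over the document yields its pages in order.
        return pages_to_rows(doc, str(pdf_path))
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(...)` to get the table.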
Another approach using “pdfplumber”
You can also try pdfplumber: extract all the tables it can find, store them in .parquet files, and deal with the formats later. You might have to identify which are the genuine tables you want and leave out the others. Some loops might come in handy for this.
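The loop-and-filter idea might be sketched like this. The filter `looks_like_table` is a made-up heuristic (minimum number of rows and columns, at least one non-empty cell) and only one of many ways to decide what counts as a genuine table.

```python
def looks_like_table(rows, min_rows=2, min_cols=2):
    """Heuristic filter (an assumption, tune as needed): require enough rows
    and columns and at least one non-empty cell."""
    if not rows or len(rows) < min_rows:
        return False
    if max(len(row) for row in rows) < min_cols:
        return False
    return any(cell not in (None, "") for row in rows for cell in row)


def extract_plumber_tables(pdf_path):
    """Return (page_number, table_index, rows) for every table passing the filter."""
    import pdfplumber  # imported lazily so looks_like_table works on its own

    found = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            # extract_tables() returns a list of tables; each table is a list
            # of rows, and each row is a list of cell strings (or None).
            for table_index, rows in enumerate(page.extract_tables()):
                if looks_like_table(rows):
                    found.append((page_number, table_index, rows))
    return found
```

Raising `min_rows` or `min_cols` is a simple way to drop small fragments that pdfplumber reports as tables.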
Now you have the basics of how to quickly extract information from your PDF files. In a real-life scenario you will have to do some more configuration and testing, since the packages have many more options. You can also ask ChatGPT, for example, to give you back JSON files:
There are more examples on the KNIME Forum and Hub about how to deal with content from PDFs:
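If you go down the JSON route, it can help to check the reply before using it. The sketch below only covers that checking step, not the call to the model itself, and the key names `sender` and `receiver` are hypothetical examples of what you might ask for.

```python
import json


def parse_invoice_json(reply, required_keys=("sender", "receiver")):
    """Parse a JSON reply and make sure the expected keys are present."""
    data = json.loads(reply)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"missing keys in reply: {missing}")
    return data
```

A failed check is a signal to adjust the prompt or to fall back to one of the other extraction approaches.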
- Extract Table from PDF with the help of Python package “camelot” (an initial approach with a more complex handling of the resulting tables)
- use R and KNIME to extract text from PDF file — search for page where text appears
- Camelot For extraction of tables from PDF (a component by qtmi_)
- Extract Image and Text from a PDF file (Python Code)
In case you enjoyed this story you can follow me on Medium (https://medium.com/@mlxl) or on the KNIME Hub (https://hub.knime.com/mlauber71) or KNIME Forum (https://forum.knime.com/u/mlauber71/summary).
This is the basic Python setup in a YML file that you would need. You can read about how to manage KNIME and Python here: Setting up and managing Conda environments.
# conda env create -f="/Users/m_lauber/Dropbox/knime-workspace-50/Examples/PDF - Python package Camelot to extract Text and Tables/data/data/py3_knime.yml"
# conda env create -f="C:\\Users\\x123456\\knime-workspace\\forum\\2023\\kn_forum_70050_pdf_table_extract_indy_car\\data\\py3_knime.yml"
# conda remove -n py3_knime --all
# conda activate py3_knime
# conda update -n py3_knime --all
# conda env update --name py3_knime --file "/Users/m_lauber/Dropbox/knime-workspace-50/Examples/PDF - Python package Camelot to extract Text and Tables/data/data/py3_knime.yml" --prune
# conda env update --name py3_knime --file "C:\\Users\\x123456\\knime-workspace\\forum\\2023\\kn_forum_70050_pdf_table_extract_indy_car\\data\\py3_knime.yml" --prune
# conda env update --name py3_knime --file "/Users/m_lauber/Dropbox/knime-workspace-50/Examples/PDF - Python package Camelot to extract Text and Tables/data/data/py3_knime.yml"
# conda env update --name py3_knime --file "C:\\Users\\x123456\\knime-workspace\\forum\\2023\\kn_forum_70050_pdf_table_extract_indy_car\\data\\py3_knime.yml"
# conda update -n base conda
# KNIME official Python integration guide
# https://docs.knime.com/latest/python_installation_guide/index.html#_introduction
# KNIME and Python - Setting up and managing Conda environments
# https://medium.com/p/2ac217792539
# conda activate py3_knime
# conda install -n py3_knime -c conda-forge camelot-py
# conda install -n py3_knime -c conda-forge tabula-py
# conda install -n py3_knime -c conda-forge pdfplumber
# conda install -n py3_knime -c conda-forge pymupdf
# file: py3_knime.yml with some modifications
# THX Carsten Haubold (https://hub.knime.com/carstenhaubold) for hints
name: py3_knime # Name of the created environment
channels: # Repositories to search for packages
# - defaults # edit: removed to just use conda-forge
# - anaconda # edit: removed to just use conda-forge
- conda-forge
# https://anaconda.org/knime
- knime # conda search knime-python-base -c knime --info # to see what is in the package
dependencies: # List of packages that should be installed
- python=3.9 # Python
- knime-python-base # dependencies of KNIME - Python integration
# - knime-python-scripting # everything you need to also build Python packages for KNIME
- cairo # SVG support
- pillow # Image inputs/outputs
- matplotlib # Plotting
- IPython # Notebook support
- nbformat # Notebook support
- scipy # Notebook support
- jpype1 # A Python to Java bridge
# Jupyter Notebook support
- jupyter # Jupyter Notebook
- pandas-profiling # create overview of your data
- sweetviz # In-depth EDA (target analysis, comparison, feature analysis, correlation) in two lines of code!
- plotly
- pdfplumber # Plumb a PDF for detailed information about each text character, rectangle, and line.
- camelot-py # Camelot: PDF Table Extraction for Humans https://pypi.org/project/camelot-py/
- pip # Python installer
- pip:
    - pymupdf # https://pypi.org/project/PyMuPDF/ PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents