Unlocking Text from PDFs with Python: My Exploration

Pasham Sathwik Reddy
5 min read · Jun 17, 2024

Ever found yourself stuck trying to pull information from a pile of PDFs for a project?

I’ve been there too!

Imagine this: you are tasked with extracting text from a pile of PDFs for crucial project work.

I thought just copying all the text from the PDF and pasting it into a text file would be a simple solution.

Sounds simple, right?

But here’s the catch: I quickly realized that the data wasn’t as structured as I’d hoped. Some information was missing due to various font formats used in the PDFs, and to make matters worse, text inside images was getting copied too. While this approach worked fine for simple PDFs, it turned into a complete mess with more complex ones.

Being a Python enthusiast, I decided to explore different libraries for data extraction from PDFs. That’s when I discovered that handling PDF data extraction is a complex task in the world of data processing.

After some research, I discovered several prominent libraries designed for PDF data extraction:

  • PyPDF2
  • pikepdf
  • pdfrw
  • PDFMiner
  • pdfminer.six
  • Camelot
  • tabula-py
  • PyMuPDF

With numerous options, I felt overwhelmed choosing the right library.

When selecting a library for PDF text extraction, several factors come into play, such as:

  • Document complexity
  • Output format
  • Ease of use
  • Accuracy
  • Performance
  • Community support
  • Features

Each library has its strengths and weaknesses.

Skip the tedious documentation dive; I’ve already tackled the research and laid out the results for you:

  • PyMuPDF: High performance, image extraction, metadata retrieval, accuracy, supports many document types.
  • pikepdf: PDF manipulation, high performance, Python interface, accuracy, advanced features.
  • PyPDF2: Simplicity, basic tasks, ease of use, good performance.
  • pdfminer.six: Flexibility, Python 2/3 support, complex structures, community support.
  • pdfrw: PDF reading and writing, basic tasks, metadata retrieval, accuracy, image extraction.
  • pdfplumber: Accurate text extraction, page layout analysis, text conversion.
  • PDFMiner: Advanced control, complex structures, precise extraction, accuracy.
  • Camelot & tabula-py: Accurate table extraction, structured data, ease of use.

For a more extensive feature comparison, you can refer to this link:

https://pymupdf.readthedocs.io/en/latest/about.html

Why I chose PyMuPDF

PyMuPDF is a high-performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

My requirements were:

  • I want to extract text from PDFs, no matter their format.
  • Then, I want to change that text into different formats like HTML, JSON, dictionaries, etc., so I can analyze the data.

After a careful comparison, PyMuPDF aligned most closely with my requirements.

PyMuPDF can handle more types of documents compared to other libraries. It not only extracts text from PDFs but also provides diverse metadata in different formats.

Set up and installation:

Follow the instructions below to create a virtual environment and install the library:

  1. Open your terminal or command prompt.
  2. Navigate to the directory where you want to create your Python virtual environment.
  3. Run the following command to create a virtual environment named “myenv”:

python -m venv myenv

  4. Once the virtual environment is created, activate it. On Windows, run:

myenv\Scripts\activate

On Unix or macOS, use:

source myenv/bin/activate

  5. Install the package using pip:

pip install PyMuPDF

Exploring Document and Page Attributes and Methods

After installation, to open a PDF document use the command below:

import fitz
doc = fitz.open(pdf_file_path)

Wait wait… why do we use fitz when we installed PyMuPDF?

When you install PyMuPDF with pip, the package exposes a top-level module named fitz (a historical name inherited from MuPDF’s rendering engine). So when you import fitz, you’re actually importing PyMuPDF to handle PDF documents.

Now you can explore various attributes and methods associated with the document and its pages.

Some common operations are:

# Get the number of pages in the document
num_pages = doc.page_count

# Get metadata (e.g., author, title, creation date)
metadata = doc.metadata

# Get the table of contents (TOC) as a list of [level, title, page number] entries
toc = doc.get_toc()

# Load a specific page (e.g., first page)
page_num = 0 # Page numbers are zero-based
page = doc.load_page(page_num)
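As a quick illustration of working with the TOC, here is a minimal sketch that renders the entries as an indented outline. The toc list below is a hand-written sample shaped like what doc.get_toc() returns ([level, title, page number] per entry), not real extraction output:

```python
# Hand-written sample shaped like doc.get_toc() output:
# each entry is [level, title, page_number]
toc = [
    [1, "Introduction", 1],
    [2, "Background", 2],
    [2, "Motivation", 4],
    [1, "Methods", 6],
]

# Render the TOC as an indented outline, one heading per line
outline = "\n".join(
    "  " * (level - 1) + f"{title} .... p.{page}" for level, title, page in toc
)
print(outline)
```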

You have access to a wealth of Document and Page attributes and methods in PyMuPDF. If you want to learn more, do not forget to check out the official documentation.

Text Extraction

To extract the text of a single page:

import fitz

# Open the PDF document
with fitz.open("your_document.pdf") as doc:
    # Load the specific page (e.g., page 0 for the first page)
    page_number = 0
    page = doc.load_page(page_number)

    # Extract text from the page
    text = page.get_text()

    # Print or use the extracted text
    print(text)

And if you want to extract text from the entire document and save it to a file:

import fitz

# Open the PDF document
with fitz.open("your_document.pdf") as doc:
    # Initialize an empty string to store all the text
    all_text = ""

    # Loop through each page in the document
    for page_number, page in enumerate(doc):
        # Add a delimiter indicating the start of a new page
        all_text += f"\nPage {page_number}\n{'-' * 20}\n"

        # Extract text from the page and append it to 'all_text'
        all_text += page.get_text()

# Write all the text to a file
with open("output.txt", "w", encoding="utf-8") as file:
    file.write(all_text)

Text Conversion Formats

Now that we’ve successfully extracted text from our PDF documents, let’s explore how we can convert this text into different formats like HTML, JSON, dictionaries, etc., to facilitate data analysis.

Plain Text Extraction

text = page.get_text("text")

By default, this method extracts plain text from the page. It retrieves the text content without any additional formatting or markup.

Block-Based Extraction

blocks = page.get_text("blocks")

It returns the output as a list of text blocks, each grouping the lines that belong together along with bounding-box coordinates.

Word-Based Extraction

words = page.get_text("words")

It returns the output as a list of single words, each with its bounding box (bbox) information.
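As a rough illustration of post-processing word-level output, the sketch below groups words back into lines. The tuples are hand-written stand-ins shaped like PyMuPDF’s (x0, y0, x1, y1, word, block_no, line_no, word_no) entries, not real extraction output:

```python
# Hand-written sample shaped like page.get_text("words") output:
# (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = [
    (72.0, 100.0, 110.0, 112.0, "Hello", 0, 0, 0),
    (115.0, 100.0, 150.0, 112.0, "world", 0, 0, 1),
    (72.0, 120.0, 140.0, 132.0, "Second", 0, 1, 0),
    (145.0, 120.0, 170.0, 132.0, "line", 0, 1, 1),
]

# Group words back into lines using the (block_no, line_no) keys
lines = {}
for x0, y0, x1, y1, word, block_no, line_no, word_no in words:
    lines.setdefault((block_no, line_no), []).append(word)

# Join each group into a reconstructed line of text
reconstructed = [" ".join(ws) for ws in lines.values()]
print(reconstructed)
```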

HTML, XHTML, and XML Extraction

html_content = page.get_text("html")
xhtml_content = page.get_text("xhtml")
xml_content = page.get_text("xml")

These options enable you to extract text from the PDF page and format it directly into HTML, XHTML, or XML.

  • HTML: The extracted text will be in HTML format, suitable for direct display in a web browser as a webpage.
  • XHTML: Similar to HTML, but XHTML follows stricter XML rules, making it suitable for well-formed documents and compatibility with XML parsers.
  • XML: The extracted text will be in XML format, which can be useful for further processing or integration with other XML-based systems.
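If you want to preview the HTML output in a browser, one approach is to wrap the fragment in a minimal page and save it to disk. A small sketch, where html_content is a hand-written placeholder standing in for page.get_text("html") output:

```python
# Placeholder standing in for page.get_text("html") output
html_content = "<div><p>Hello from page 1</p></div>"

# Wrap the fragment in a minimal HTML page and save it for browser viewing
page_html = f"<!DOCTYPE html>\n<html><body>{html_content}</body></html>"
with open("page1.html", "w", encoding="utf-8") as f:
    f.write(page_html)
```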

Dictionary and JSON Extraction

dict_content = page.get_text("dict")
json_content = page.get_text("json")

  • These options extract the text from the PDF page directly into a Python dictionary or a JSON string.
  • Both enable easy integration with other Python data structures, or serialization for storage and exchange.
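The dictionary output nests blocks, lines, and spans, which makes layout-aware analysis possible. Here is a minimal sketch of a crude heading detector; page_dict is a hand-written sample shaped like page.get_text("dict") output (the real structure carries more keys, such as bbox and color, per span):

```python
# Hand-written sample shaped like page.get_text("dict") output
page_dict = {
    "width": 595, "height": 842,
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Title", "size": 18.0, "font": "Helvetica-Bold"}]},
            {"spans": [{"text": "Body text", "size": 11.0, "font": "Helvetica"}]},
        ]},
    ],
}

# Collect all text spans larger than 14pt -- a crude heading detector
headings = [
    span["text"]
    for block in page_dict["blocks"] if block["type"] == 0  # 0 = text block
    for line in block["lines"]
    for span in line["spans"]
    if span["size"] > 14
]
print(headings)
```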

Raw Dictionary and Raw JSON Extraction

raw_dict_content = page.get_text("rawdict")
raw_json_content = page.get_text("rawjson")

  • These options return the same structure as "dict"/"json" but broken down to character level, which can be useful for advanced text analysis or custom data manipulation.

When using these extraction methods, consider the specific requirements of your project and choose the output format that best suits your needs.
