Read or Extract Text from PDF with Python — A Comprehensive Guide
PDF documents such as research papers, legal documents, contracts, or reports often contain important textual information. By extracting the text from these documents, you can make the information accessible in a format that can be easily searched, copied, or modified for further analysis or reference. In this article, we will explore how to read or extract text from PDF documents using Python.
We’ll discuss the following topics:
- Extract Text from an Entire PDF in Python
- Extract Text from a Particular Page in PDF in Python
- Extract Text from a Rectangular Area of a Page in PDF in Python
- Extract Highlighted Text from a PDF in Python
Python Library for Text Extraction from PDF
To perform text extraction on PDF files with Python, we can use the Spire.PDF for Python library.
Spire.PDF for Python is a library for creating, reading, editing, and converting PDF files within Python applications. With this library, you can perform a wide range of manipulations on PDFs, including adding text or images, extracting text or images, adding digital signatures, adding or deleting pages, merging or splitting PDFs, creating bookmarks, adding text or image watermarks, and inserting fillable forms. You can also convert PDF files to various file formats, such as Word, Excel, images, HTML, SVG, XPS, OFD, PCL, and PostScript.
Installing Spire.PDF for Python is incredibly easy. Just follow these simple steps:
- Open your project’s terminal.
- Execute this pip command:
pip install spire.pdf
Extract Text from an Entire PDF in Python
To extract text from an entire PDF document, iterate through the pages in the document, create a PdfTextExtractor for each page, and call its ExtractText() method with a PdfTextExtractOptions object. Collect the text returned for each page and write it out once the loop finishes.
Here is a simple example that shows how to extract text from an entire PDF document using Python and Spire.PDF for Python:
from spire.pdf.common import *
from spire.pdf import *

def extract_text_from_pdf(file_path, output_file):
    # Load a PDF document
    doc = PdfDocument()
    doc.LoadFromFile(file_path)

    extracted_text = []

    # Iterate over the pages of the document
    for i in range(doc.Pages.Count):
        page = doc.Pages.get_Item(i)
        # Extract the text from the page
        textExtractor = PdfTextExtractor(page)
        option = PdfTextExtractOptions()
        text = textExtractor.ExtractText(option)
        extracted_text.append(text)

    # Save the extracted text to a text file
    with open(output_file, "w", encoding="utf-8") as text_file:
        text_file.write("\n".join(extracted_text))

    doc.Close()

# Example usage
file_path = "Sample.pdf"
output_file = "DocumentText.txt"
extract_text_from_pdf(file_path, output_file)
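The example above joins the page texts with a single newline, so page boundaries are not visible in the output file. If you want to keep them, one option is to label each page before writing. The following is a minimal sketch in plain Python; join_pages is a hypothetical helper, not part of Spire.PDF, and it only assumes you already have a list of per-page strings such as extracted_text.

```python
def join_pages(page_texts):
    """Join per-page text, prefixing each page with a 1-based marker line.

    Hypothetical helper for illustration; not part of Spire.PDF.
    """
    sections = []
    for number, text in enumerate(page_texts, start=1):
        # Strip leading/trailing whitespace so the markers stay tidy
        sections.append(f"--- Page {number} ---\n{text.strip()}")
    # Separate pages with a blank line
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(join_pages(["First page text.", "Second page text."]))
```

In the function above, you could pass join_pages(extracted_text) to text_file.write() instead of the plain "\n".join(...) call.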
Extract Text from a Particular Page in PDF in Python
To extract text from a particular page, access that page from the document's page collection by its zero-based index (PdfDocument.Pages[pageindex]), then create a PdfTextExtractor for it and call ExtractText() to extract the text from that page.
Here is a simple example that shows how to extract text from a particular page of a PDF document using Python and Spire.PDF for Python:
from spire.pdf.common import *
from spire.pdf import *

def extract_text_from_page(file_path, page_num, output_file):
    # Load a PDF document
    doc = PdfDocument()
    doc.LoadFromFile(file_path)

    # Get a specific page
    # page_num starts from 0
    page = doc.Pages[page_num]

    # Extract the text from the page
    textExtractor = PdfTextExtractor(page)
    option = PdfTextExtractOptions()
    text = textExtractor.ExtractText(option)

    # Save the extracted text to a text file
    with open(output_file, "w", encoding="utf-8") as text_file:
        text_file.write(text)

    doc.Close()

# Example usage
file_path = "Sample.pdf"
page_num = 0
output_file = "PageText.txt"
extract_text_from_page(file_path, page_num, output_file)
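Because page_num is zero-based, a page number taken from a PDF viewer (where the first page is usually page 1) needs to be shifted by one before it is used as an index. Here is a small sketch of that conversion with a range check; to_page_index is a hypothetical helper introduced for illustration, and page_count stands for the value you would read from doc.Pages.Count.

```python
def to_page_index(page_number, page_count):
    """Convert a 1-based page number to a 0-based index.

    Hypothetical helper for illustration; raises ValueError when the
    page number falls outside 1..page_count.
    """
    if not 1 <= page_number <= page_count:
        raise ValueError(
            f"page_number must be between 1 and {page_count}, got {page_number}"
        )
    return page_number - 1


if __name__ == "__main__":
    # Page 3 of a 10-page document maps to index 2
    print(to_page_index(3, 10))
```

Validating the number up front gives a clear error message instead of whatever the library reports for an out-of-range index.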
Extract Text from a Rectangular Area of a Page in PDF in Python
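In addition to extracting text from an entire document or a particular page, you can extract text from a rectangular area of a page. To do this, create a RectangleF from the area's x, y, width, and height, assign it to the ExtractArea property of PdfTextExtractOptions, and then call ExtractText() with those options.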
Here is a simple example that shows how to extract text from a rectangular area of a page of a PDF document using Python and Spire.PDF for Python:
from spire.pdf.common import *
from spire.pdf import *

def extract_text_from_page_area(file_path, page_num, x, y, width, height, output_file):
    # Load a PDF document
    doc = PdfDocument()
    doc.LoadFromFile(file_path)

    # Get a specific page
    # page_num starts from 0
    page = doc.Pages[page_num]

    # Define a rectangle to specify the page area for text extraction
    rectangle = RectangleF(x, y, width, height)

    # Extract the text from the specified rectangle area on the page
    textExtractor = PdfTextExtractor(page)
    option = PdfTextExtractOptions()
    option.ExtractArea = rectangle
    text = textExtractor.ExtractText(option)

    # Save the extracted text to a text file
    with open(output_file, "w", encoding="utf-8") as text_file:
        text_file.write(text)

    doc.Close()

# Example usage
file_path = "Sample.pdf"
page_num = 0
x = 0.0
y = 180.0
width = 500.0
height = 200.0
output_file = "PageAreaText.txt"
extract_text_from_page_area(file_path, page_num, x, y, width, height, output_file)
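The rectangle values in the example are plain numbers with no unit attached. The default unit of PDF user space is the point (1/72 inch), and the sketch below assumes the library interprets the rectangle in points; please verify this against the Spire.PDF documentation before relying on it. Under that assumption, an area measured in millimetres could be converted as follows. Both mm_to_points and area_mm_to_points are hypothetical helpers, not library functions.

```python
POINTS_PER_INCH = 72.0  # default PDF user space unit is 1/72 inch
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert a length in millimetres to PDF points."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def area_mm_to_points(x_mm, y_mm, width_mm, height_mm):
    """Convert an (x, y, width, height) area from millimetres to points.

    Assumption: the target rectangle is expressed in points.
    """
    return tuple(mm_to_points(v) for v in (x_mm, y_mm, width_mm, height_mm))


if __name__ == "__main__":
    x, y, width, height = area_mm_to_points(0, 60, 180, 70)
    print(x, y, width, height)
```

The resulting four values could then be passed as the x, y, width, and height arguments of extract_text_from_page_area().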
Extract Highlighted Text from a PDF in Python
When a section of text is highlighted in a PDF, a highlight annotation is created to represent that highlight. The annotation includes information about the position and extent of the highlighted text, as well as the appearance properties such as color and opacity.
To extract highlighted text from a PDF page, find the highlight annotations on the page, read the area each annotation covers from its Bounds property (a RectangleF object), and then call PdfTextExtractor.ExtractText() with ExtractArea set to that rectangle to get the text inside it.
Here is a simple example that shows how to extract highlighted text on a page of a PDF document using Python and Spire.PDF for Python:
from spire.pdf.common import *
from spire.pdf import *

def extract_text_from_annotations(file_path, page_num, output_file):
    # Load a PDF document
    doc = PdfDocument()
    doc.LoadFromFile(file_path)

    # Get a specific page
    # page_num starts from 0
    page = doc.Pages[page_num]

    extracted_text = []

    # Get the annotations collection of the page
    annotations = page.AnnotationsWidget

    # Iterate over the annotations in the collection
    for i in range(annotations.Count):
        text_markup_annotation = annotations.get_Item(i)
        # Check if the annotation is of type PdfTextMarkupAnnotationWidget
        if isinstance(text_markup_annotation, PdfTextMarkupAnnotationWidget):
            # Extract the text marked by the annotation
            text_extractor = PdfTextExtractor(page)
            options = PdfTextExtractOptions()
            options.ExtractArea = text_markup_annotation.Bounds
            text = text_extractor.ExtractText(options)
            extracted_text.append(text)

    # Save the extracted text to a text file
    with open(output_file, "w", encoding="utf-8") as text_file:
        text_file.write("\n".join(extracted_text))

    doc.Close()

# Example usage
file_path = "Sample.pdf"
page_num = 0
output_file = "HighlightedText.txt"
extract_text_from_annotations(file_path, page_num, output_file)
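Text pulled from a bounding rectangle may come back with stray line breaks or surrounding whitespace, and some annotations may yield no text at all. If that happens with your files, a light clean-up pass before saving can help. The following is a sketch in plain Python; clean_snippets is a hypothetical helper, and it assumes only that you have a list of strings such as extracted_text.

```python
def clean_snippets(snippets):
    """Collapse whitespace in each snippet and drop empty results.

    Hypothetical helper for illustration; not part of Spire.PDF.
    """
    cleaned = []
    for snippet in snippets:
        # split() with no arguments splits on any run of whitespace,
        # so joining with a single space collapses newlines and tabs
        normalized = " ".join(snippet.split())
        if normalized:
            cleaned.append(normalized)
    return cleaned


if __name__ == "__main__":
    print(clean_snippets(["  first\n highlight ", "", "second highlight"]))
```

Note that collapsing whitespace also removes intentional line breaks inside a highlight, so skip this step if you need to preserve the original layout.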
Conclusion
This article demonstrated how to extract text from PDF documents using Python and Spire.PDF for Python: from an entire document, from a particular page, from a rectangular area of a page, and from highlighted passages. We hope you find it helpful.