PDF-Python
asposepdf
Published in
3 min read · Nov 21, 2023

How to Extract Text from PDF using Aspose.PDF for Python via .NET

PDFs are crucial in our digital world because they simplify sharing information across platforms. Yet users run into challenges with these files every day, in tasks that range from basic to intricate.

One major hurdle for users is text extraction and editing. Despite PDFs’ static nature, the need to extract or modify text is common.

Python offers various libraries (like PDFMiner) to tackle this issue. Today, let’s delve into Aspose.PDF for Python via .NET. Although .NET-based, it’s self-contained with all necessary components included. To install, use this command:

pip install aspose-pdf

Note: get a temporary license to try working with text without any limitations.

Retrieve Text from PDF Document

Let’s start with the basics! This Python code snippet demonstrates text extraction from a PDF using the Aspose.PDF library. Here’s a breakdown of the steps:

  • Import the required module from the Aspose.PDF library.
  • Load the PDF file (“input.pdf”) with the Document class, storing it in the pdfDocument variable.
  • Create a TextAbsorber object to extract text from the PDF.
  • Call textAbsorber.visit(pdfDocument.pages[1]) to process the chosen page. Page numbering starts at 1, so this extracts text from the first page.
  • Finally, print the extracted text content or perform other actions.
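
import aspose.pdf as pdf

# Load the PDF document
pdfDocument = pdf.Document("input.pdf")
# Create the object that collects text
textAbsorber = pdf.text.TextAbsorber()
# Process the first page (page numbering starts at 1)
textAbsorber.visit(pdfDocument.pages[1])
print(textAbsorber.text)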
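
If you need more than one page, the same steps can be repeated in a loop. The sketch below is one possible way to do it, not an official recipe: the helper names extract_text_by_page and extract_pdf_text_by_page are made up for this post, and the code assumes that pdfDocument.pages can be iterated with a for loop. A fresh TextAbsorber is created for every page so that the text of different pages is kept apart.

```python
def extract_text_by_page(document, make_absorber):
    """Map each 1-based page number to the text found on that page.

    `document` only needs a `pages` attribute that can be iterated
    (an assumption about the Aspose.PDF page collection), and
    `make_absorber` is any callable returning an object with
    `visit(page)` and `text`, such as pdf.text.TextAbsorber.
    """
    pages_text = {}
    for number, page in enumerate(document.pages, start=1):
        absorber = make_absorber()  # fresh absorber, so pages do not mix
        absorber.visit(page)
        pages_text[number] = absorber.text
    return pages_text


def extract_pdf_text_by_page(path):
    """Hypothetical convenience wrapper; requires the aspose-pdf package."""
    import aspose.pdf as pdf  # imported lazily, only when the wrapper is called

    return extract_text_by_page(pdf.Document(path), pdf.text.TextAbsorber)
```

Calling extract_pdf_text_by_page("input.pdf") should then give a dictionary such as {1: "...", 2: "..."}, which you can print or pass on to other code.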

As we can see, TextAbsorber returns all the text of a page in one piece. For a more detailed analysis, let’s explore alternative tools.

Extracting Specific Text Segments from PDF Documents

TextFragmentAbsorber retrieves text as small segments called text fragments. By looping through these fragments, you can access properties of each one, such as Text and Position (XIndent, YIndent).

The extraction process involves the following steps:

  • Import the necessary module.
  • Load the PDF document.
  • Initialize a TextFragmentAbsorber.
  • Process a specific page: Use textFragmentAbsorber.visit(pdfDocument.pages[1]) to process the text on the first page of the PDF.
  • Iterate through text fragments: Use a for loop to go over each text fragment detected by the TextFragmentAbsorber.
  • Print or execute actions: Use print(textFragment.text) to display the content of each extracted text fragment, or perform any other action.

import aspose.pdf as pdf

# Load the PDF document
pdfDocument = pdf.Document("input.pdf")
textFragmentAbsorber = pdf.text.TextFragmentAbsorber()
# Process the first page (page numbering starts at 1)
textFragmentAbsorber.visit(pdfDocument.pages[1])
# Print the text of every detected fragment
for textFragment in textFragmentAbsorber.text_fragments:
    print(textFragment.text)

This code performs a neat trick: it extracts the text of a specific PDF page fragment by fragment using the Aspose.PDF library. It’s a handy way to grab text for future analysis or experimentation in Python!
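
As an example of such analysis, the position of each fragment can be used to rebuild the lines of a page. The function below is only a sketch: fragments_to_lines and its y_tolerance parameter are hypothetical names, and the code assumes that the Position properties are spelled position.x_indent and position.y_indent in Python. It also assumes the usual PDF coordinate system, where larger Y values are closer to the top of the page.

```python
def fragments_to_lines(fragments, y_tolerance=2.0):
    """Group text fragments into lines, top to bottom and left to right.

    Each fragment must provide `text`, `position.x_indent` and
    `position.y_indent` (assumed attribute names). Fragments whose Y
    value is within `y_tolerance` of the first fragment of a line are
    treated as part of that line.
    """
    items = sorted(
        ((f.position.y_indent, f.position.x_indent, f.text) for f in fragments),
        key=lambda item: (-item[0], item[1]),  # highest Y first, then leftmost
    )
    lines = []        # each entry is a list of (x, text) pairs
    current_y = None  # Y value of the first fragment in the current line
    for y, x, text in items:
        if current_y is None or abs(current_y - y) > y_tolerance:
            lines.append([])
            current_y = y
        lines[-1].append((x, text))
    # Within a line, order the fragments by X and join them with spaces
    return [" ".join(text for _, text in sorted(line)) for line in lines]
```

With the snippet above, fragments_to_lines(textFragmentAbsorber.text_fragments) would return a list with one string per line. The tolerance is a guess that depends on font size, so expect to tune it for your documents.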

Extracting Text Paragraphs from PDF Documents

ParagraphAbsorber works like the previous tools, but it organizes text into paragraphs and exposes them through its own collection.

While this post doesn’t delve deeply into ParagraphAbsorber, here’s a concise example and overview:

  • Import the necessary module and load the PDF document.
  • Initialize a ParagraphAbsorber and process a specific page to obtain a collection of text sections.
  • Iterate through text sections: The outer loop covers each text section detected by the ParagraphAbsorber. The inner for-loop concatenates text fragments within a section, forming complete paragraphs.

import aspose.pdf as pdf

pdfDocument = pdf.Document("input.pdf")
paragraphAbsorber = pdf.text.ParagraphAbsorber()
# Process the first page (page numbering starts at 1)
paragraphAbsorber.visit(pdfDocument.pages[1])
# Outer loop: each text section detected by the absorber
for section in paragraphAbsorber.page_markups:
    paragraphText = ""
    # Inner loop: concatenate the text fragments of the section
    for textFragment in section.text_fragments:
        paragraphText = paragraphText + textFragment.text
    paragraphText = paragraphText + "\r\n"
    print(paragraphText)

In short, this snippet extracts paragraphs from a selected PDF page. It’s a useful way to capture text for analysis or any other processing you might need.
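
If you want to process the result rather than print it, the same two loops can fill a list. This is a minimal sketch under the same assumptions as the snippet above: it relies only on the page_markups and text_fragments attributes used there, and collect_section_texts is a hypothetical helper name. Empty sections are skipped and surrounding whitespace is trimmed.

```python
def collect_section_texts(page_markups):
    """Return one string per non-empty section, built from its text fragments."""
    texts = []
    for section in page_markups:
        # Join the fragments of this section into a single string
        text = "".join(fragment.text for fragment in section.text_fragments).strip()
        if text:
            texts.append(text)
    return texts
```

After paragraphAbsorber.visit(...) has run, collect_section_texts(paragraphAbsorber.page_markups) would give a plain Python list of strings, ready for searching, counting words, or saving to a file.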
