How to Extract Text from PDFs and Images for LLM Use

Gaurav Garg
4 min read · Aug 22, 2023

Large language models like GPT-3 are trained on vast amounts of text. Many open datasets are available, but sometimes you need to extract text from PDF documents or image files to customize and enrich your training data. Here are some techniques for extracting text from these unstructured sources.

Extracting Text from PDFs

PDF documents often contain large amounts of useful text. However, a PDF stores text as positioned, formatted content meant for display rather than as a plain stream of characters, so we need to extract the raw text before feeding it to a language model. Here are two options for extracting text from PDFs.

Using PDF Parsing Libraries

Several Python libraries, such as PyPDF2, pdfplumber, and pdfminer, can extract text from PDFs that contain a text layer. PyPDF2 provides a simple way to extract all text from a PDF.

Using PyPDF2 Library

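import PyPDF2

# Uses the PyPDF2 3.x API; older releases used PdfFileReader, numPages, getPage, and extractText
with open('document.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    text = ""
    for page in pdf_reader.pages:
        # Pages without a text layer yield little or no text
        text += page.extract_text() or ""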

Using pdfplumber Library

The pdfplumber library can extract text more cleanly by identifying text blocks:

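import pdfplumber

page_texts = []
with pdfplumber.open('document.pdf') as pdf:
    for page in pdf.pages:
        # Collect every page instead of overwriting the previous page's text
        page_texts.append(page.extract_text() or "")

text = "\n".join(page_texts)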

Using Google Cloud Vision API

Google Cloud Vision provides advanced OCR capability to extract text from scanned PDFs. First, we need to convert each page of the PDF to an image. Then the Vision API can detect text in each image:

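import io
from google.cloud import vision
from pdf2image import convert_from_bytes  # needs the Poppler utilities installed

client = vision.ImageAnnotatorClient()

with open('scanned.pdf', 'rb') as pdf:
    pages = convert_from_bytes(pdf.read())  # one PIL image per page

full_text = ""
for page in pages:
    # Vision expects encoded image bytes (here PNG), not raw pixel data
    buffer = io.BytesIO()
    page.save(buffer, format='PNG')
    image = vision.Image(content=buffer.getvalue())
    response = client.document_text_detection(image=image)
    full_text += response.full_text_annotation.text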

The Vision API approach works well for scanned or image-based PDFs where parsing libraries may fail.
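In practice the two approaches are often combined: try a parsing library first, and send a page to OCR only when the parser returns little or no text. Here is a minimal sketch of that decision. The function names and the min_chars threshold are my own illustrative choices, not part of any library, and the right threshold depends on your documents.

```python
def needs_ocr(page_text, min_chars=20):
    """Return True when a page's parsed text looks too sparse to be a real text layer.

    min_chars is an illustrative threshold, not a recommended value.
    """
    if page_text is None:
        return True
    # Count only non-whitespace characters so blank-padded pages still count as empty
    visible = sum(1 for ch in page_text if not ch.isspace())
    return visible < min_chars


def pages_needing_ocr(page_texts, min_chars=20):
    """Return the zero-based indexes of pages that should be sent to OCR."""
    return [i for i, text in enumerate(page_texts) if needs_ocr(text, min_chars)]


# Example: page 0 has a text layer, pages 1 and 2 do not
parsed = ["Quarterly report for the finance team, 2023 edition.", "", None]
print(pages_needing_ocr(parsed))  # prints [1, 2]
```

The list passed in would be the per-page output of a parser such as PyPDF2 or pdfplumber, and only the flagged pages would then go through the image conversion and Vision API steps.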

Extracting Text from Images

We can also extract text embedded in image files like JPEGs and PNGs using similar OCR techniques:

Using Google Cloud Vision API

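The Cloud Vision API provides a simple text_detection method to extract text from images. Here client is the ImageAnnotatorClient created earlier, and image is a vision.Image built from the encoded bytes of the image file:

response = client.text_detection(image=image)
# The first annotation holds the full detected text; later ones are individual words
annotations = response.text_annotations
text = annotations[0].description if annotations else ""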

It can detect text in various sizes, fonts, and orientations.

Using OpenCV and Tesseract OCR

OpenCV can be used to detect text regions in an image. Then Tesseract OCR can extract text from those regions:

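import cv2
import pytesseract

img = cv2.imread('image.jpg')

# Detect text regions. This is one simple approach, chosen here for illustration:
# binarize, dilate so nearby characters merge into blocks, then take contour boxes.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 5))  # kernel size is a guess to tune
dilated = cv2.dilate(thresh, kernel, iterations=1)
# OpenCV 4.x returns (contours, hierarchy); 3.x returns three values
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
rects = sorted((cv2.boundingRect(c) for c in contours), key=lambda r: (r[1], r[0]))

# Extract text from regions
text = ""
for x, y, w, h in rects:
    text += pytesseract.image_to_string(img[y:y+h, x:x+w])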

This approach provides more control over the text detection and OCR process.
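One place where that extra control matters is the order in which regions are read. Detected boxes do not necessarily arrive in reading order, and sorting strictly by the top coordinate can interleave boxes on the same line whose tops differ by a pixel or two. The sketch below groups boxes into lines first. The function name and the line_tolerance default are hypothetical, and it assumes a single-column layout.

```python
def sort_reading_order(rects, line_tolerance=10):
    """Sort (x, y, w, h) boxes top-to-bottom, then left-to-right within a line.

    A box joins the current line when its top edge is within line_tolerance
    pixels of the first box on that line. The default is illustrative only.
    """
    lines = []  # each entry is a list of boxes on one visual line
    for rect in sorted(rects, key=lambda r: r[1]):
        if lines and abs(rect[1] - lines[-1][0][1]) <= line_tolerance:
            lines[-1].append(rect)
        else:
            lines.append([rect])

    ordered = []
    for line in lines:
        # Within a line, read left to right
        ordered.extend(sorted(line, key=lambda r: r[0]))
    return ordered


# Three boxes on one slightly uneven line, then one box on the next line
boxes = [(200, 12, 50, 20), (10, 10, 50, 20), (10, 60, 50, 20), (120, 8, 50, 20)]
print(sort_reading_order(boxes))
```

Passing the rects list through a function like this before the OCR loop keeps the extracted text closer to how a person would read the image.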

Preprocessing and Cleaning Extracted Text

The raw text extracted from PDFs and images often contains artifacts, irregular spacing, and other issues. Here are some tips for cleaning the text before feeding it to a language model:

  • Collapse repeated spaces, punctuation marks, and newline characters
  • Standardize whitespace between words
  • Fix common OCR errors through regex rules or spellcheck
  • Remove page numbers, headers, footers, and other repeated text
  • Split text into sentences or paragraphs

Proper preprocessing ensures higher quality text data for better language model training.
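Several of these tips can be sketched with a handful of regular expressions. The function names and rules below are illustrative assumptions rather than a standard recipe, and each rule has trade-offs that are noted in the comments.

```python
import re


def clean_extracted_text(text):
    """Apply a few conservative cleanup rules to text from a PDF parser or OCR."""
    # Normalize line endings first
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split across a line break: "extrac-\ntion" -> "extraction".
    # Limitation: this also joins genuine compounds such as "machine-\nreadable".
    text = re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)
    # Drop lines holding only a page number, e.g. "12" or "Page 12".
    # Limitation: a legitimate line that is just a number is dropped too.
    text = re.sub(r"(?im)^[ \t]*(?:page[ \t]+)?\d+[ \t]*$", "", text)
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    # Trim spaces that sit next to a newline
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse three or more newlines into a single paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def split_paragraphs(text):
    """Split cleaned text on blank lines and unwrap each paragraph onto one line."""
    paragraphs = []
    for block in text.split("\n\n"):
        joined = " ".join(line.strip() for line in block.split("\n") if line.strip())
        if joined:
            paragraphs.append(joined)
    return paragraphs


raw = "Text   extrac-\ntion is  useful.\n\n12\n\n\nSecond  paragraph\nwraps here.\n"
print(split_paragraphs(clean_extracted_text(raw)))
# prints ['Text extraction is useful.', 'Second paragraph wraps here.']
```

Fixing OCR misreads with spellcheck is left out here because it depends heavily on the language and domain of your documents.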

What are the most common use cases for extracting text from PDFs and images?

Some common use cases include:

  • Augmenting training data for machine learning models like large language models. Once cleaned, the extracted text can supply additional data to customize and improve a model.
  • Extracting text from scanned documents and book scans for natural language processing tasks like search indexing and metadata generation. This makes the information in the scans machine-readable.
  • Mining data from research papers, reports, news articles etc. published in PDF form for text analytics and knowledge extraction. The text can be analyzed to identify trends, insights etc.
  • Retrieving text from image-heavy presentations, magazines, posters etc. for better indexing and searchability. This makes the text content discoverable.
  • Testing image-based CAPTCHAs against OCR as part of bot detection and prevention work. If OCR can read a CAPTCHA reliably, the CAPTCHA offers little protection against automated solving.

What are some challenges faced while extracting text from these sources?

Some common challenges are:

  • Scanner artifacts, creases, and low resolution in scanned PDFs can reduce OCR accuracy.
  • Multiple columns, figures, tables, and odd formatting in PDFs make text extraction difficult.
  • Non-standard fonts, sizes, colors, orientations, and noise in images reduce OCR performance.
  • Heavy use of symbols, math expressions, and diagrams in academic papers poses extra challenges.
  • Redundant text like headers, footers, and watermarks must be removed.
  • Spell-checking and text normalization are required to correct OCR errors.
  • Extracted text needs significant preprocessing and cleaning before usage.
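Repeated headers and footers are one of the easier items on this list to handle automatically, because the same line shows up on most pages. Here is a minimal sketch, assuming you kept the text of each page separate. The function name and the min_fraction threshold are hypothetical choices for illustration.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Remove lines that recur on at least min_fraction of pages.

    pages is a list of strings, one per page. Digits are masked before
    comparing, so "Page 3 of 10" and "Page 4 of 10" count as the same line.
    Limitation: body lines that differ only in digits across pages, such as
    repeated "Table 1" captions, would be removed as well.
    """
    def normalize(line):
        return re.sub(r"\d+", "#", line.strip().lower())

    counts = Counter()
    for page in pages:
        # Use a set so each distinct line is counted once per page
        counts.update({normalize(l) for l in page.split("\n") if l.strip()})

    # Require at least two pages so a single-page document is left untouched
    threshold = max(2, min_fraction * len(pages))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [l for l in page.split("\n") if normalize(l) not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "ACME Annual Report\nRevenue grew.\nPage 1 of 3",
    "ACME Annual Report\nCosts fell.\nPage 2 of 3",
    "ACME Annual Report\nOutlook is stable.\nPage 3 of 3",
]
print(strip_repeated_lines(pages))
# prints ['Revenue grew.', 'Costs fell.', 'Outlook is stable.']
```

Watermarks that OCR reads as text can sometimes be caught the same way, but multi-column layouts and tables need layout-aware tools and are outside what this sketch covers.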

Conclusion

Extracting text from PDFs and images lets us tap into a wealth of useful data for training large language models. Libraries like PyPDF2 and pdfplumber, and services like Google Cloud Vision, provide convenient ways to extract text. The extracted text still requires preprocessing and cleaning before it is ready for a language model. With some effort, PDFs and images can become valuable additions to your model training data.
