Extract Images from PDF Documents in Python

Alexander Stock
3 min readOct 12, 2023

--

Extracting images from a PDF file can be a useful and practical task in various situations. Whether you need to repurpose images for a presentation, create a digital photo album, or simply save images for future reference, the ability to extract images from a PDF can save you time and effort. In this article, you will learn how to extract images from a PDF document in Python using Spire.PDF for Python.

Install Dependency

This solution requires Spire.PDF for Python to be installed as the dependency, which is a Python library for reading, creating and manipulating PDF documents in a Python program. You can install it by running the following pip command.

pip install Spire.PDF

Extract Images from a Specific Page in Python

Spire.PDF for Python offers the PdfPageBase.ExtractImages() method to extract images from a specified page. The following are the detailed steps.

  • Create a PdfDocument object.
  • Load a PDF document using PdfDocument.LoadFromFile() method.
  • Get a particular page through PdfDocument.Pages[index] property.
  • Extract images from the page using PdfPageBase.ExtractImages() method and return a list of images.
  • Write each image in the list as a PNG file.
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/input.pdf')

# Get a specific page
page = doc.Pages[1]

# Extract images from the page
images = []
for image in page.ExtractImages():
images.append(image)

# Save images to specified location with specified format extension
index = 0
for image in images:
imageFileName = 'C:/Users/Administrator/Desktop/Extracted/Image-{0:d}.png'.format(index)
index += 1
image.Save(imageFileName, ImageFormat.get_Png())
doc.Close()

Extract All Images from a PDF Document in Python

To extract all images from an entire PDF document, loop through the pages in the document and then retrieve the images from each page separately. Here are the detailed steps.

  • Create a PdfDocument object.
  • Load a PDF document using PdfDocument.LoadFromFile() method.
  • Iterate through the pages in the document, and get the images from each page using PdfPageBase.ExtractImages() method.
  • Write all extracted images as individual PNG files.
from spire.pdf.common import *
from spire.pdf import *

# Create a PdfDocument object
doc = PdfDocument()

# Load a PDF document
doc.LoadFromFile('C:/Users/Administrator/Desktop/input.pdf')

images = []

# Loop through the pages in the document
for i in range(doc.Pages.Count):
page = doc.Pages.get_Item(i)

# Extract images from a specific page
for image in page.ExtractImages():
images.append(image)

# Save images to specified location with specified format extension
index = 0
for image in images:
imageFileName = 'C:/Users/Administrator/Desktop/Extracted/Image-{0:d}.png'.format(index)
index += 1
image.Save(imageFileName, ImageFormat.get_Png())
doc.Close()

Conclusion

This blog post provides valuable insights into extracting images from PDF documents using Python. It covers two main techniques: extracting images from a specific page and extracting all images from a PDF document. These techniques offer practical solutions for developers and enthusiasts looking to manipulate and extract images from PDF files programmatically.

Related Topics

--

--

Alexander Stock

I'm Alexander Stock, a software development consultant and blogger with 10+ years' experience. Specializing in office document tools and knowledge introduction.