Extract Images and Image Information from PDF with Python

4 min read · Jul 8, 2024

PDF (Portable Document Format) files are widely used for sharing and archiving documents because they keep their formatting consistent across devices. Besides text, PDFs often contain images. Extracting these images, together with details such as their position (x and y coordinates), width, and height, makes it possible to analyze them, edit them, or reuse them in other projects. In this blog post, we will explore how to extract images and image information from PDF files using Python.

Extract Images from PDF with Python
Extract Image Information from PDF with Python

Python Library to Extract Images and Image Information from PDF

To extract images and image information from PDF files in Python, we will use Spire.PDF for Python. It is a feature-rich and user-friendly library designed to create, read, edit, and convert PDF files within Python applications.

You can install Spire.PDF for Python from PyPI using the following pip command:

pip install Spire.Pdf

If you already have Spire.PDF for Python installed and would like to upgrade to the latest version, use the following pip command:

pip install --upgrade Spire.Pdf

For detailed installation steps, see the official documentation: How to Install Spire.PDF for Python in VS Code.

Extract Images from PDF with Python

The PdfImageHelper class in Spire.PDF for Python provides a convenient way to deal with images in PDFs.

To get the images on a PDF page, call the PdfImageHelper.GetImagesInfo(page: PdfPageBase) method. It returns a list of PdfImageInfo objects, each representing one image on that page. You can then call PdfImageInfo.Image.Save() on each object to save the image to a file.

The code below demonstrates how to extract images from a PDF file using Python and Spire.PDF for Python:

import os  # used below to build the output file paths

from spire.pdf.common import *
from spire.pdf import *

def extract_images_from_pdf(pdf_path, output_dir):
    """
    Extracts all images from a PDF file and saves them to the specified output directory.
    
    Args:
        pdf_path (str): The path to the PDF file.
        output_dir (str): The directory where the extracted images will be saved.
    """
    # Create a PdfDocument object and load the PDF file
    doc = PdfDocument()
    doc.LoadFromFile(pdf_path)

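    # Make sure the output directory exists before saving into it
    os.makedirs(output_dir, exist_ok=True)

    # Create a PdfImageHelper object
    image_helper = PdfImageHelper()

    # Running counter so file names stay unique across pages
    image_count = 1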
    # Iterate over all pages in the PDF
    for page_index in range(doc.Pages.Count):
        # Get the image information for the current page
        image_infos = image_helper.GetImagesInfo(doc.Pages[page_index])

        # Extract and save the images
        for image_index in range(len(image_infos)):
            # Get the image
            image = image_infos[image_index].Image
            # Specify the output file name
            output_file = os.path.join(output_dir, f"Image-{image_count}.png")
            # Save the image
            image.Save(output_file)
            image_count += 1

    # Close the PdfDocument object
    doc.Close()

# Example usage
extract_images_from_pdf("Sample.pdf", "C:/Users/Administrator/Desktop/Images")

Extract Image Information from PDF with Python

To extract image information, such as position (x and y coordinates), width, and height from a PDF, you can use the PdfImageInfo.Bounds.X, PdfImageInfo.Bounds.Y, PdfImageInfo.Bounds.Width and PdfImageInfo.Bounds.Height properties.
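
The Bounds values are plain numbers, so they are easy to convert to other units. Below is a minimal sketch that assumes the values are expressed in PDF points (1/72 inch, the default PDF user-space unit); this is an assumption worth checking against your own documents. The helper names bounds_to_inches and size_in_pixels are hypothetical and are not part of Spire.PDF.

```python
from typing import NamedTuple

# Assumption: bounds are given in PDF points, where 1 point = 1/72 inch.
POINTS_PER_INCH = 72.0


class BoxInches(NamedTuple):
    """A bounding box expressed in inches."""
    x: float
    y: float
    width: float
    height: float


def bounds_to_inches(x, y, width, height):
    """Convert a bounding box from points to inches."""
    return BoxInches(
        x / POINTS_PER_INCH,
        y / POINTS_PER_INCH,
        width / POINTS_PER_INCH,
        height / POINTS_PER_INCH,
    )


def size_in_pixels(width_pt, height_pt, dpi=96):
    """Return the (width, height) in whole pixels at the given resolution."""
    return (
        round(width_pt / POINTS_PER_INCH * dpi),
        round(height_pt / POINTS_PER_INCH * dpi),
    )


# Example with made-up values: a 216 x 108 point image placed at (72, 144)
print(bounds_to_inches(72.0, 144.0, 216.0, 108.0))
print(size_in_pixels(216.0, 108.0, dpi=300))
```

With a real document, you would pass image_info.Bounds.X, image_info.Bounds.Y, image_info.Bounds.Width, and image_info.Bounds.Height to these helpers.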

The code below demonstrates how to print the position and size of every image in a PDF file using Python and Spire.PDF for Python:

from spire.pdf.common import *
from spire.pdf import *

def print_pdf_image_info(pdf_path):
    """
    Prints information about the images in a PDF file.
    
    Args:
        pdf_path (str): The path to the PDF file.
    """
    # Create a PdfDocument object and load the PDF file
    doc = PdfDocument()
    doc.LoadFromFile(pdf_path)

    # Create a PdfImageHelper object
    image_helper = PdfImageHelper()

    # Iterate over all pages in the PDF
    for page_index in range(doc.Pages.Count):
        page = doc.Pages[page_index]

        # Get the image information for the current page
        image_infos = image_helper.GetImagesInfo(page)

        # Print the image information
        for image_index, image_info in enumerate(image_infos):
            print(f"Page {page_index + 1}, Image {image_index + 1}:")
            print(f"  Image position: ({image_info.Bounds.X}, {image_info.Bounds.Y})")
            print(f"  Image size: {image_info.Bounds.Width} x {image_info.Bounds.Height}")

    # Close the PdfDocument object
    doc.Close()

# Example usage
print_pdf_image_info("Sample.pdf")
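
If you would rather keep the image information than print it, one option is to collect it into records and write them out as CSV. The sketch below uses only the standard library. It relies on the Bounds.X, Bounds.Y, Bounds.Width, and Bounds.Height attributes described above, and uses simple stand-in objects in place of real PdfImageInfo objects so that it runs without a PDF. The names ImageRecord, collect_records, and records_to_csv are hypothetical and are not part of Spire.PDF.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from types import SimpleNamespace


@dataclass
class ImageRecord:
    """Position and size of one image on one page."""
    page: int
    index: int
    x: float
    y: float
    width: float
    height: float


def collect_records(pages):
    """Build records from an iterable of per-page image-info lists.

    Each image-info object only needs a Bounds attribute exposing
    X, Y, Width, and Height.
    """
    records = []
    for page_number, image_infos in enumerate(pages, start=1):
        for image_number, info in enumerate(image_infos, start=1):
            b = info.Bounds
            records.append(
                ImageRecord(page_number, image_number, b.X, b.Y, b.Width, b.Height)
            )
    return records


def records_to_csv(records):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(ImageRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


def fake_info(x, y, width, height):
    """Stand-in for a PdfImageInfo object (illustration only)."""
    return SimpleNamespace(Bounds=SimpleNamespace(X=x, Y=y, Width=width, Height=height))


# Made-up data: one image on page 1, none on page 2, two on page 3
pages = [
    [fake_info(72.0, 100.0, 200.0, 150.0)],
    [],
    [fake_info(10.0, 20.0, 30.0, 40.0), fake_info(50.0, 60.0, 70.0, 80.0)],
]
records = collect_records(pages)
print(records_to_csv(records))
```

With a real document, the pages argument could be built by calling image_helper.GetImagesInfo(doc.Pages[i]) for each page index, as in the example above.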


Conclusion

This blog post demonstrated how to extract images from PDF files using Python and Spire.PDF for Python. It also showed how to retrieve each image's position (x and y coordinates), width, and height.


Written by Alice Yang