Extract Images from Word Documents with Python

Alice Yang
6 min readMar 28, 2024

--

Extract Images from Word Documents with Python

Images are often an integral part of Word documents, providing visual context and enhancing the overall presentation. Extracting these images from Word documents can be beneficial for various purposes. Whether you want to use them in another document, share them separately, or incorporate them into a presentation or design project, knowing how to extract images from a Word document is a valuable skill. In this article, we will explore how to extract images from Word Doc or Docx documents using Python.

We will discuss the following topics:

Python Library to Extract Images from Word Documents

To extract images from Word Doc or Docx documents with Python, we can use the Spire.Doc for Python library.

Spire.Doc for Python is a feature-rich and easy-to-use library for creating, reading, editing, and converting Word files within Python applications. With this library, you can work with a wide range of Word formats, including Doc, Docx, Docm, Dot, Dotx, Dotm, and more. Moreover, you are also able to render Word documents to other types of file formats, such as PDF, RTF, HTML, Text, Image, SVG, ODT, PostScript, PCL, and XPS.

You can install Spire.Doc for Python from PyPI by running the following command in your terminal:

pip install Spire.Doc

For more detailed information about the installation, you can check this official documentation: How to Install Spire.Doc for Python in VS Code.

Extract Images from a Word Document with Python

Extracting images from a Word document makes it easier to share them separately with others. You can send the images as standalone files, allowing recipients to view and utilize the visuals without needing the entire Word document.

In Spire.Doc for Python, images in a Word document are represented by DocPicture objects. To extract images from a Word document with Spire.Doc for Python, you need to traverse through the document elements and identify the objects of the DocPicture type.

Here is a simple example that demonstrates how to extract images from a Word document using Python and Spire.Doc for Python:

import queue
from spire.doc import *
from spire.doc.common import *
import os

# Specify the input file path
input_file = "Template.docx"
# Specify the output directory path
output_path = "DocumentImages/"
# Create the output directory if it doesn't exist
os.makedirs(output_path, exist_ok=True)

# Create a Document instance
document = Document()
# Load the input Word document
document.LoadFromFile(input_file)

# Create a list to store the extracted image data
images = []

# Initialize a queue to store document elements for traversal
nodes = queue.Queue()
nodes.put(document)

# Traverse through the document elements
while not nodes.empty():
node = nodes.get()
for i in range(node.ChildObjects.Count):
obj = node.ChildObjects[i]
# Find the images
if isinstance(obj, DocPicture):
picture = obj
# Append the image data to the list
data_bytes = picture.ImageBytes
images.append(data_bytes)
elif isinstance(obj, ICompositeObject):
nodes.put(obj)

# Save the image data to image files
for i, image_data in enumerate(images):
file_name = f"Image-{i}.png"
with open(os.path.join(output_path, file_name), 'wb') as image_file:
image_file.write(image_data)

document.Close()

Extract Images from a Specific Page in a Word Document with Python

Word documents can span multiple pages, and sometimes, you may only need to extract images from a particular page or set of pages.

Here is a simple example that demonstrates how to extract images from a specific page in a Word document using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Specify the input file path
input_file = "Template.docx"
# Specify the output directory path
output_path = "PageImages/"
# Create the output directory if it doesn't exist
os.makedirs(output_path, exist_ok=True)

# Create a Document instance
document = Document()
# Load the input Word document
document.LoadFromFile(input_file)

# Create a list to store the extracted image data
images = []

# Create an object of the FixedLayoutDocument class and pass the Document object to the class constructor as a parameter
# This step is to transfer the document into a fixed layout document
layoutDoc = FixedLayoutDocument(document)

# Access the first page of the document
layoutPage = layoutDoc.Pages[0]

# Traverse through each row in the first column of the first page
for i in range(layoutPage.Columns[0].Lines.Count):
line = layoutPage.Columns[0].Lines[i]
# Get the paragraph of the current line
paragraph = line.Paragraph

# Traverse through the child objects in the paragraph
for j in range(paragraph.ChildObjects.Count):
obj = paragraph.ChildObjects[j]
# Find the images
if isinstance(obj, DocPicture):
picture = obj
# Append the image data to the list
data_bytes = picture.ImageBytes
images.append(data_bytes)

# Save the image data to image files
for i, image_data in enumerate(images):
file_name = f"Image-{i}.png"
with open(os.path.join(output_path, file_name), 'wb') as image_file:
image_file.write(image_data)

document.Close()

Extract Images from a Specific Table in a Word Document with Python

Tables are a common organizational tool in Word documents, often used to present data, statistics, or other structured information. In some cases, these tables may contain embedded images, such as logos, charts, or illustrations. Being able to extract images from a specific table can be incredibly useful when you need to repurpose or separate these visual elements from the tabular data.

Here is a simple example that demonstrates how to extract images from a specific table in a Word document using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Specify the input file path
input_file = "AddImagesToTable.docx"
# Specify the output directory path
output_path = "TableImages/"
# Create the output directory if it doesn't exist
os.makedirs(output_path, exist_ok=True)

# Create a Document instance
document = Document()
# Load the input Word document
document.LoadFromFile(input_file)

# Get the first section
section = document.Sections[0]

# Get the first table in the section
table = section.Tables[0]

# Create a list to store the extracted image data
images = []

# Traverse through all rows in the table
for i in range(table.Rows.Count):
row = table.Rows[i]
# Traverse through all cells in each row
for j in range(row.Cells.Count):
cell = row.Cells[j]
# Traverse through all paragraphs in each cell
for k in range(cell.Paragraphs.Count):
paragraph = cell.Paragraphs[k]
# Traverse through all child objects in each paragraph
for o in range(paragraph.ChildObjects.Count):
obj = paragraph.ChildObjects[o]
# Find the images
if isinstance(obj, DocPicture):
picture = obj
# Append the image data to the list
data_bytes = picture.ImageBytes
images.append(data_bytes)

# Save the image data to image files
for i, image_data in enumerate(images):
file_name = f"Image-{i}.png"
with open(os.path.join(output_path, file_name), 'wb') as image_file:
image_file.write(image_data)

document.Close()

Extract Images from Header and Footer in a Word Document with Python

Headers and footers in Word documents often contain important visual elements such as logos, graphics, or branding information. Extracting these images from the header and footer sections can be incredibly useful when you need to separate or repurpose these visual elements.

Here is a simple example that demonstrates how to extract images from the header and footer in a Word document using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Specify the input file path
input_file = "HeaderAndFooter.docx"
# Specify the output directory path
output_path = "HeaderFooterImages/"
# Create the output directory if it doesn't exist
os.makedirs(output_path, exist_ok=True)

# Create a Document instance
document = Document()
# Load the input Word document
document.LoadFromFile(input_file)

# Get the first section
section = document.Sections[0]

# Create a list to store the extracted image data
images = []

# Get the header
header = section.HeadersFooters.Header

# Traverse through all paragraphs in the header
for i in range(header.Paragraphs.Count):
paragraph = header.Paragraphs[i]
# Traverse through all child objects in each paragraph
for j in range(paragraph.ChildObjects.Count):
obj = paragraph.ChildObjects[j]
# Find the images
if isinstance(obj, DocPicture):
picture = obj
# Append the image data to the list
data_bytes = picture.ImageBytes
images.append(data_bytes)

# Get the footer
footer = section.HeadersFooters.Footer

# Traverse through all paragraphs in the footer
for i in range(footer.Paragraphs.Count):
paragraph = footer.Paragraphs[i]
# Traverse through all child objects in each paragraph
for j in range(paragraph.ChildObjects.Count):
obj = paragraph.ChildObjects[j]
# Find the images
if isinstance(obj, DocPicture):
picture = obj
# Append the image data to the list
data_bytes = picture.ImageBytes
images.append(data_bytes)

# Save the image data to image files
for i, image_data in enumerate(images):
file_name = f"Image-{i}.png"
with open(os.path.join(output_path, file_name), 'wb') as image_file:
image_file.write(image_data)

document.Close()

Conclusion

This article demonstrated various scenarios for extracting images from Word documents using Python. We hope you find it helpful.

Related Topics

--

--

Alice Yang

Skilled senior software developers with five years of experience in all phases of software development life cycle using .NET, Java and C++ languages.