5 Ways to Split Word Documents in Python

Alice Yang
Aug 21, 2024


Why Split a Word Document?

Splitting a Word document can be beneficial for several reasons:

  • Organization: Large documents can become unwieldy. Splitting them into sections or chapters can make navigation and editing easier.
  • Collaboration: When multiple people are working on a document, splitting it allows different team members to work on separate sections concurrently.
  • File Management: Smaller files are easier to manage, share, and archive.
  • Performance: Large documents can slow down Word’s performance. Dividing them can help maintain optimal performance.

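In this article, we will explore how to split a Word document into multiple documents using Python. We will cover the following topics:

  • Python Library for Splitting Word Documents
  • Split a Word Document by Sections in Python
  • Split a Word Document by Headings in Python
  • Split a Word Document by Bookmarks in Python
  • Split a Word Document by Page Breaks in Python
  • Split a Word Document into HTML Pages in Python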

Python Library for Splitting Word Documents

To split Word documents using Python, we’ll use the Spire.Doc for Python library. It offers a comprehensive set of features for working with Word documents, including creating, reading, editing, converting, merging, and splitting them.

To start, ensure you have Spire.Doc for Python installed. If not, you can install it via pip:

pip install spire.doc
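
Before running the examples, it can help to confirm that the package is visible to the interpreter you are using. The following is a minimal sketch that relies only on the standard library; the helper name is_installed is hypothetical and not part of Spire.Doc.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located by the current interpreter."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. "spire" in "spire.doc") is missing
        return False


if __name__ == "__main__":
    if is_installed("spire.doc"):
        print("Spire.Doc for Python is available.")
    else:
        print("Spire.Doc not found; run: pip install spire.doc")
```

The check uses the same import name, spire.doc, that the examples in this article import from.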

Split a Word Document by Sections in Python

Sections in Word divide a document into distinct parts, each with its own headers, footers, page orientations, margins, and other formatting options. Splitting a Word document by sections allows you to create separate files for each part, making it easier to navigate, edit, and collaborate on specific sections without affecting the entire document.

Key Steps to Split a Word Document by Sections

  • Open the Source Document: Initialize a Document instance and load the source Word document you want to split.
  • Iterate through the Sections: Iterate through all sections in the source document. For each section:
    - Create a New Document: Create a new document for the section.
    - Copy the Section: Copy the section from the source document and add the copied section to the new document.
    - Save the Document: Save the new document to a separate file.

Here is a simple example of how to split a Word document by sections using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Sections.docx")

    # Iterate through all sections in the document
    for sec_index in range(document.Sections.Count):
        # Access the current section
        section = document.Sections[sec_index]

        # Create a new document for the current section
        with Document() as new_document:
            # Add a clone of the current section to the new document
            new_document.Sections.Add(section.Clone())

            # Copy themes and styles from the source document to ensure consistency
            document.CloneThemesTo(new_document)
            document.CloneDefaultStyleTo(new_document)

            # Save the new document with a unique filename for each section
            output_file = f"Output/Section{sec_index + 1}.docx"
            new_document.SaveToFile(output_file, FileFormat.Docx2016)
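
The examples in this article save into an Output folder. In case the library does not create a missing folder for you (an assumption worth checking in your environment), the sketch below builds the numbered file path and creates the folder first. The helper name output_path is hypothetical.

```python
from pathlib import Path


def output_path(directory: str, prefix: str, index: int, ext: str = ".docx") -> str:
    """Create `directory` if needed and return a numbered file path inside it."""
    out_dir = Path(directory)
    # parents=True creates intermediate folders; exist_ok=True makes repeat calls safe
    out_dir.mkdir(parents=True, exist_ok=True)
    return str(out_dir / f"{prefix}{index}{ext}")
```

With this helper, the output_file line in the loop could be written as output_file = output_path("Output", "Section", sec_index + 1).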

Split a Word Document by Headings in Python

Another common method for splitting a Word document is by using headings. This approach divides the document into separate files based on specified heading styles (e.g., Heading1).

Key Steps to Split a Word Document by Headings

  • Open the Source Document: Initialize a Document instance and load the source Word document you want to split.
  • Iterate through the Sections: Iterate through all sections in the source document. For each section:
    - Identify Headings: Look for paragraphs styled as “Heading1”.
    - Create a New Document: When a Heading1 paragraph is found, create a new document and copy the heading into it.
    - Copy Content: Continue copying content into the new document until the next Heading1 is found.
    - Save the Document: Save the new document to a separate file.

Here is a simple example of how to split a Word document by headings (Heading 1) using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as source_document:
    source_document.LoadFromFile("Headings.docx")

    # Initialize variables
    new_documents = []
    new_document = None
    new_section = None
    is_inside_heading = False

    # Iterate through all sections in the source document
    for sec_index in range(source_document.Sections.Count):
        # Access the current section
        section = source_document.Sections[sec_index]

        # Iterate through all objects in the current section
        for obj_index in range(section.Body.ChildObjects.Count):
            # Access the current object
            obj = section.Body.ChildObjects[obj_index]
            # Check if the current object is a paragraph
            if isinstance(obj, Paragraph):
                para = obj
                # Check if the paragraph style is "Heading1"
                if para.StyleName == "Heading1":
                    # Add the previous document to the list if it exists
                    if new_document is not None:
                        new_documents.append(new_document)

                    # Create a new document
                    new_document = Document()
                    # Add a new section to the new document
                    new_section = new_document.AddSection()

                    # Copy section settings
                    section.CloneSectionPropertiesTo(new_section)
                    # Copy the heading paragraph to the new section of the new document
                    new_section.Body.ChildObjects.Add(para.Clone())

                    # Mark that content now belongs to a heading
                    is_inside_heading = True
                elif is_inside_heading:
                    # Copy the paragraph to the new section until the next Heading1
                    new_section.Body.ChildObjects.Add(para.Clone())
            elif is_inside_heading:
                # Copy non-paragraph objects (e.g. tables) to the new section
                new_section.Body.ChildObjects.Add(obj.Clone())

    # Add the last document to the list if it exists
    if new_document is not None:
        new_documents.append(new_document)

    # Iterate through all documents in the list
    for i, doc in enumerate(new_documents):
        # Copy themes and styles from the source document to ensure consistency
        source_document.CloneThemesTo(doc)
        source_document.CloneDefaultStyleTo(doc)

        # Save the document to a separate file, then release it
        output_file = f"Output/HeadingContent{i + 1}.docx"
        doc.SaveToFile(output_file, FileFormat.Docx2016)
        doc.Close()
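
The grouping logic in the example above can be easier to follow without the document API around it. Below is a library-free sketch in which each body object is represented by a hypothetical (style name, text) tuple. It only illustrates the control flow and is not how Spire.Doc models content.

```python
from typing import List, Optional, Tuple

Block = Tuple[str, str]  # (style name, text): a stand-in for a Word body object


def group_by_heading(blocks: List[Block], heading_style: str = "Heading1") -> List[List[Block]]:
    """Start a new group at every heading and append following blocks to it."""
    groups: List[List[Block]] = []
    current: Optional[List[Block]] = None  # no group is open before the first heading
    for block in blocks:
        style, _ = block
        if style == heading_style:
            current = [block]
            groups.append(current)
        elif current is not None:
            current.append(block)
        # Blocks that appear before the first heading are skipped
    return groups


sample = [
    ("Normal", "Preface text"),
    ("Heading1", "Chapter 1"),
    ("Normal", "Body of chapter 1"),
    ("Heading1", "Chapter 2"),
    ("Table", "A table in chapter 2"),
]
groups = group_by_heading(sample)
print(len(groups))  # prints 2
```

As in the Spire.Doc example, content that precedes the first Heading1 is not written to any output file. If you need to keep it, open a group before the loop starts.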

Split a Word Document by Bookmarks in Python

Bookmarks are placeholders within a document that indicate specific locations or sections, making them ideal for customized content management. By splitting a document at these bookmarked points, you can create separate files tailored to your needs, such as isolating distinct chapters, sections, or segments for easier navigation, distribution, or editing.

Key Steps to Split a Word Document by Bookmarks

  • Open the Source Document: Initialize a Document instance and load the source Word document you want to split.
  • Iterate through the Bookmarks: Iterate through all bookmarks in the source document. For each bookmark:
    - Create a New Document: Create a new document for each bookmark.
    - Add a New Section: Add a new section to the new document.
    - Copy Bookmark Content: Use BookmarksNavigator objects to get the content of the current bookmark and add it to the new document.
    - Save the New Document: Save the new document to a separate file.

Here is a simple example of how to split a Word document by bookmarks using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Bookmarks.docx")

    # Iterate through all bookmarks in the document
    for bookmark_index in range(document.Bookmarks.Count):
        # Access the current bookmark
        bookmark = document.Bookmarks[bookmark_index]

        # Create a new document for the current bookmark
        with Document() as new_document:
            # Add a new section to the new document
            new_section = new_document.AddSection()

            # Copy section settings
            document.Sections[0].CloneSectionPropertiesTo(new_section)

            # Create a bookmark navigator for the source document
            bookmarks_navigator = BookmarksNavigator(document)
            # Navigate to the current bookmark
            bookmarks_navigator.MoveToBookmark(bookmark.Name)
            # Get the bookmark content
            text_body_part = bookmarks_navigator.GetBookmarkContent()

            # Add a paragraph to the new document
            paragraph = new_section.AddParagraph()
            # Add a bookmark to the paragraph with the same bookmark name
            paragraph.AppendBookmarkStart(bookmark.Name)
            paragraph.AppendBookmarkEnd(bookmark.Name)

            # Create a bookmark navigator for the new document
            new_bookmarks_navigator = BookmarksNavigator(new_document)
            # Navigate to the newly added bookmark
            new_bookmarks_navigator.MoveToBookmark(bookmark.Name)
            # Replace the content of the new bookmark with the content
            # of the current bookmark in the source document
            new_bookmarks_navigator.ReplaceBookmarkContent(text_body_part, True)

            # Copy themes and styles from the source document to ensure consistency
            document.CloneThemesTo(new_document)
            document.CloneDefaultStyleTo(new_document)

            # Save the new document to a separate file
            output_file = f"Output/Bookmark{bookmark_index + 1}.docx"
            new_document.SaveToFile(output_file, FileFormat.Docx2016)
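
The example above names the output files by bookmark index. If you would rather name them after the bookmarks themselves, the names may need cleaning first, since they can contain characters that are awkward in file paths. Here is a small sketch using only the standard library; safe_filename is a hypothetical helper, not a Spire.Doc function.

```python
import re


def safe_filename(name: str, fallback: str = "bookmark") -> str:
    """Replace runs of characters other than letters, digits, '_', '.', '-' with '_'."""
    cleaned = re.sub(r"[^\w.-]+", "_", name).strip("._")
    # Fall back to a generic name if nothing usable is left
    return cleaned or fallback
```

In the loop, this could be used as output_file = f"Output/{safe_filename(bookmark.Name)}.docx". Two different bookmark names can map to the same cleaned name, so keeping the index in the file name as well is a reasonable precaution.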

Split a Word Document by Page Breaks in Python

In Microsoft Word, a page break marks the end of one page and the start of a new one within a document. The approach below detects only manually inserted page breaks, not the automatic page boundaries Word creates when text overflows, so you need to insert a page break wherever you want a division to occur before splitting.

Key Steps to Split a Word Document by Page Breaks

  • Open the Source Document: Initialize a Document instance and load the source Word document.
  • Create a New Document: Set up a new document and add an initial section.
  • Iterate through the Sections: Iterate through all sections in the source document. For each section:
    - Identify Page Breaks: Look for page breaks in paragraphs of the section.
    - Split at Page Breaks: When a page break is found in a paragraph, save the current document and create a new one. Copy the content after the page break to the new document.
  • Save the Last Document: After the loop, save the remaining document, which holds the content after the final page break.

Here is a simple example of how to split a Word document by page breaks using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("PageBreaks.docx")

    # Create a new document
    new_document = Document()
    # Add a new section to the new document
    new_section = new_document.AddSection()
    # Copy themes and styles from the source document to ensure consistency
    document.CloneDefaultStyleTo(new_document)
    document.CloneThemesTo(new_document)

    index = 0
    # Iterate through all sections in the source document
    for sec_index in range(document.Sections.Count):
        section = document.Sections[sec_index]
        # Iterate through all body child objects of each section
        for sec_obj_index in range(section.Body.ChildObjects.Count):
            sec_obj = section.Body.ChildObjects[sec_obj_index]

            # Check if the current object is a paragraph
            if isinstance(sec_obj, Paragraph):
                para = sec_obj
                # Copy section settings
                section.CloneSectionPropertiesTo(new_section)
                # Add a clone of the paragraph to the section of the new document
                new_section.Body.ChildObjects.Add(para.Clone())

                # Iterate through all child objects of the paragraph
                for para_obj_index in range(para.ChildObjects.Count):
                    para_obj = para.ChildObjects[para_obj_index]

                    # Check if the current object is a page break
                    if isinstance(para_obj, Break) and para_obj.BreakType == BreakType.PageBreak:
                        # Get the index of the page break in the paragraph
                        i = para.ChildObjects.IndexOf(para_obj)
                        # Remove the page break from the cloned paragraph
                        new_section.Body.LastParagraph.ChildObjects.RemoveAt(i)
                        # Save the current document, then release it
                        output_file = f"Output/SplitDocByPageBreak-{index}.docx"
                        new_document.SaveToFile(output_file, FileFormat.Docx2016)
                        new_document.Close()
                        index += 1

                        # Create a new document
                        new_document = Document()
                        # Add a section to the new document
                        new_section = new_document.AddSection()
                        document.CloneDefaultStyleTo(new_document)
                        document.CloneThemesTo(new_document)
                        section.CloneSectionPropertiesTo(new_section)
                        # Add the paragraph to the section of the new document
                        new_section.Body.ChildObjects.Add(para.Clone())

                        if new_section.Paragraphs[0].ChildObjects.Count == 0:
                            # Remove the first paragraph if it's blank
                            new_section.Body.ChildObjects.RemoveAt(0)
                        else:
                            # Remove the page break and the child objects before it
                            while i >= 0:
                                new_section.Paragraphs[0].ChildObjects.RemoveAt(i)
                                i -= 1

            else:
                # Copy non-paragraph objects to the new section
                new_section.Body.ChildObjects.Add(sec_obj.Clone())

    # Save the last document, then release it
    output_file = f"Output/SplitDocByPageBreak-{index}.docx"
    new_document.SaveToFile(output_file, FileFormat.Docx2016)
    new_document.Close()
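
To make the intended result of the split easier to picture, here is a library-free sketch that treats a document as a flat list of items and uses a hypothetical PAGE_BREAK marker in place of a real Break object. It shows the general idea only (content before a break stays in the current part, content after it starts the next part, and the break itself is dropped) and is not a substitute for the Spire.Doc code above.

```python
from typing import List

PAGE_BREAK = "<page-break>"  # hypothetical marker standing in for a Break object


def split_at_page_breaks(items: List[str], marker: str = PAGE_BREAK) -> List[List[str]]:
    """Split a flat list of items into parts, starting a new part at each marker."""
    pages: List[List[str]] = [[]]
    for item in items:
        if item == marker:
            # Close the current part and start a new one; the marker is not copied
            pages.append([])
        else:
            pages[-1].append(item)
    return pages


content = ["Intro", "Details", PAGE_BREAK, "Chapter 2", PAGE_BREAK, "Appendix"]
pages = split_at_page_breaks(content)
print(len(pages))  # prints 3
```

A document with N page breaks yields N + 1 parts, which matches the file numbering in the example above (the index starts at 0 and the last file is saved after the loop).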

Split a Word Document into HTML Pages in Python

When splitting a Word document into HTML pages, you convert its content to HTML and divide it into separate web pages, so the document can be viewed as a series of pages in a browser.

Key Steps to Split a Word Document into HTML Pages

  • Open the Source Document: Initialize a Document instance and load the source Word document you want to split.
  • Iterate through the Sections: Iterate through all sections in the source document. For each section:
    - Create a New Document: Create a new document for the section.
    - Copy the Section: Copy the section from the source document and add the copied section to the new document.
    - Embed CSS and Images: Configure the new document’s HTML export options to embed CSS styles and images directly into the HTML page.
    - Save the Document to HTML: Save the new document to a separate HTML file.

Here is a simple example of how to split each section of a Word document into a separate HTML page using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Sections.docx")

    # Iterate through all sections in the document
    for sec_index in range(document.Sections.Count):
        # Access the current section
        section = document.Sections[sec_index]

        # Create a new document for the current section
        new_document = Document()
        # Add a clone of the current section to the new document
        new_document.Sections.Add(section.Clone())

        # Copy themes and styles from the source document to ensure consistency
        document.CloneThemesTo(new_document)
        document.CloneDefaultStyleTo(new_document)

        # Embed CSS style and image data into the HTML page
        new_document.HtmlExportOptions.CssStyleSheetType = CssStyleSheetType.Internal
        new_document.HtmlExportOptions.ImageEmbedded = True

        # Save the new document as an HTML file
        output_file = f"Output/Section-{sec_index + 1}.html"
        new_document.SaveToFile(output_file, FileFormat.Html)

        new_document.Close()

Besides HTML, you can save the split documents in many other formats, such as PDF, XPS, and Markdown, by changing the FileFormat argument passed to SaveToFile.
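
One way to keep the output extension and the format in sync is a small lookup table. The sketch below maps file extensions to FileFormat member names as strings. Docx2016 and Html appear in the examples above, while PDF, XPS, and Markdown are assumed member names that you should verify against your installed version of the library. The helper format_name_for is hypothetical.

```python
from pathlib import Path

# Extension -> FileFormat member name (PDF, XPS, Markdown are assumptions to verify)
FORMAT_BY_EXTENSION = {
    ".docx": "Docx2016",
    ".html": "Html",
    ".pdf": "PDF",
    ".xps": "XPS",
    ".md": "Markdown",
}


def format_name_for(path: str) -> str:
    """Return the FileFormat member name that matches the file extension of `path`."""
    ext = Path(path).suffix.lower()
    try:
        return FORMAT_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"No format mapped for extension {ext!r}") from None
```

Inside the loop, the save call could then look up the member by name, for example new_document.SaveToFile(output_file, getattr(FileFormat, format_name_for(output_file))).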

Conclusion

This article demonstrated 5 different ways to split a Word document using Python. We hope you find it helpful.

