5 Ways to Split Word Documents in Python

Aug 21, 2024

Why Split a Word Document?

Splitting a Word document can be beneficial for several reasons:

Organization: Large documents can become unwieldy. Splitting them into sections or chapters can make navigation and editing easier.
Collaboration: When multiple people are working on a document, splitting it allows different team members to work on separate sections concurrently.
File Management: Smaller files are easier to manage, share, and archive.
Performance: Large documents can slow Word down. Dividing them into smaller files helps keep editing responsive.

In this article, we will explore how to split a Word document into multiple documents using Python. We will discuss the following topics:

Split a Word Document by Sections in Python
Split a Word Document by Headings in Python
Split a Word Document by Bookmarks in Python
Split a Word Document by Page Breaks in Python
Split a Word Document into HTML Pages in Python

Python Library for Splitting Word Documents

To split Word documents using Python, we’ll use the Spire.Doc for Python library. It offers a comprehensive set of features for working with Word documents, including creating, reading, editing, converting, merging, and splitting them.

To start, ensure you have Spire.Doc for Python installed. If not, you can install it via pip:

pip install spire.doc

Split a Word Document by Sections in Python

Sections in Word divide a document into distinct parts, each with its own headers, footers, page orientations, margins, and other formatting options. Splitting a Word document by sections allows you to create separate files for each part, making it easier to navigate, edit, and collaborate on specific sections without affecting the entire document.

Key Steps to Split a Word Document by Sections

Open the Source Document: Initialize a Document instance and load the source Word document you want to split.
Iterate through the Sections: Iterate through all sections in the source document. For each section:
- Create a New Document: Create a new document for the section.
- Copy the Section: Copy the section from the source document and add the copied section to the new document.
- Save the Document: Save the new document to a separate file.

Here is a simple example of how to split a Word document by sections using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Sections.docx")

    # Iterate through all sections in the document
    for sec_index in range(document.Sections.Count):
        # Access the current section
        section = document.Sections[sec_index]

        # Create a new document for the current section
        with Document() as new_document:
            # Add a clone of the current section to the new document
            new_document.Sections.Add(section.Clone())

            # Copy themes and styles from the source document to ensure consistency
            document.CloneThemesTo(new_document)
            document.CloneDefaultStyleTo(new_document)

            # Save the new document with a unique filename for each section
            output_file = f"Output/Section{sec_index + 1}.docx"
            new_document.SaveToFile(output_file, FileFormat.Docx2016)
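
The examples in this article write their results to an Output folder. The article does not say whether SaveToFile creates a missing folder, so it may be safer to create it up front. The following is a minimal sketch using only the standard library. The helper names ensure_output_dir and section_output_path are hypothetical and not part of Spire.Doc:

```python
from pathlib import Path


def ensure_output_dir(path="Output"):
    """Create the output folder (and any parents) if it is missing."""
    out_dir = Path(path)
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir


def section_output_path(out_dir, sec_index):
    """Build the file name used in the example: Section1.docx, Section2.docx, ..."""
    return str(Path(out_dir) / f"Section{sec_index + 1}.docx")
```

Calling ensure_output_dir() once before the loop and passing section_output_path(out_dir, sec_index) to SaveToFile would keep the rest of the example unchanged.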

Split a Word Document by Headings in Python

Another common method for splitting a Word document is by using headings. This approach divides the document into separate files based on specified heading styles (e.g., Heading1).

Key Steps to Split a Word Document by Headings

Open the Source Document: Initialize a Document instance and load the source Word document you want to split.
Iterate through the Sections: Iterate through all sections in the source document. For each section:
- Identify Headings: Look for paragraphs styled as “Heading1”.
- Create a New Document: When the Heading1 is found, create a new document and copy the heading into it.
- Copy Content: Continue copying content into the new document until the next Heading1 is found.
- Save the Document: Save the new document to a separate file.

Here is a simple example of how to split a Word document by headings (Heading 1) using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as source_document:
    source_document.LoadFromFile("Headings.docx")

    # Initialize variables
    new_documents = []
    new_document = None
    new_section = None
    is_inside_heading = False

    # Iterate through all sections in the source document
    for sec_index in range(source_document.Sections.Count):
        # Access the current section
        section = source_document.Sections[sec_index]

        # Iterate through all objects in the current section
        for obj_index in range(section.Body.ChildObjects.Count):
            # Access the current object
            obj = section.Body.ChildObjects[obj_index]
            # Check if the current object is a paragraph
            if isinstance(obj, Paragraph):
                para = obj
                # Check if the paragraph style is "Heading1"
                if para.StyleName == "Heading1":
                    # Add the document to the list if it exists
                    if new_document is not None:
                        new_documents.append(new_document)

                    # Create a new document 
                    new_document = Document()
                    # Add a new section to the new document
                    new_section = new_document.AddSection()

                    # Copy section settings
                    section.CloneSectionPropertiesTo(new_section)
                    # Copy the paragraph to the new section of the new document
                    new_section.Body.ChildObjects.Add(para.Clone())

                    # Set the is_inside_heading flag to True
                    is_inside_heading = True
                else:
                    if is_inside_heading:
                        # Copy the paragraph to the new section of the new document until the next Heading1
                        new_section.Body.ChildObjects.Add(para.Clone())
            else:
                if is_inside_heading:
                    # Copy non-paragraph objects to the new section
                    new_section.Body.ChildObjects.Add(obj.Clone())

    # Add the last document to the list if it exists
    if new_document is not None:
        new_documents.append(new_document)

    # Iterate through all documents in the list
    for i, doc in enumerate(new_documents):
        # Copy themes and styles from the source document to ensure consistency
        source_document.CloneThemesTo(doc)
        source_document.CloneDefaultStyleTo(doc)

        # Save the document to a separate file
        output_file = f"Output/HeadingContent{i + 1}.docx"
        doc.SaveToFile(output_file, FileFormat.Docx2016)
        # Release the resources held by the new document
        doc.Close()
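
To see the grouping logic in isolation from the library, here is a small sketch that applies the same idea to plain Python data. It assumes each block is a (style_name, text) tuple, which is a stand-in for real paragraphs and not a Spire.Doc type. The function name split_by_heading is hypothetical:

```python
def split_by_heading(blocks, heading_style="Heading1"):
    """Group (style_name, text) tuples into chunks that each start at a heading.

    Blocks that appear before the first heading are skipped, mirroring the
    is_inside_heading flag in the example above.
    """
    chunks = []
    current = None
    for style_name, text in blocks:
        if style_name == heading_style:
            # A new heading closes the previous chunk and opens a new one
            if current is not None:
                chunks.append(current)
            current = [text]
        elif current is not None:
            # Content after a heading belongs to the chunk that heading opened
            current.append(text)
    # Keep the chunk that was still open when the input ended
    if current is not None:
        chunks.append(current)
    return chunks


blocks = [
    ("Normal", "Preface"),
    ("Heading1", "Chapter 1"),
    ("Normal", "First paragraph"),
    ("Heading1", "Chapter 2"),
    ("Normal", "Second paragraph"),
]
print(split_by_heading(blocks))
# [['Chapter 1', 'First paragraph'], ['Chapter 2', 'Second paragraph']]
```

Note that "Preface" is not part of any chunk: as in the Spire.Doc example, content that precedes the first Heading1 is not written to any output file.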

Split a Word Document by Bookmarks in Python

Bookmarks are placeholders within a document that indicate specific locations or sections, making them ideal for customized content management. By splitting a document at these bookmarked points, you can create separate files tailored to your needs, such as isolating distinct chapters, sections, or segments for easier navigation, distribution, or editing.

Key Steps to Split a Word Document by Bookmarks

Open the Source Document: Initialize a Document instance and load the source Word document you want to split.
Iterate through the Bookmarks: Iterate through all bookmarks in the source document. For each bookmark:
- Create a New Document: Create a new document for each bookmark.
- Add a New Section: Add a new section to the new document.
- Copy Bookmark Content: Use BookmarksNavigator objects to get the content of the current bookmark and add it to the new document.
- Save the New Document: Save the new document to a separate file.

Here is a simple example of how to split a Word document by bookmarks using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Bookmarks.docx")

    # Iterate through all bookmarks in the document
    for bookmark_index in range(document.Bookmarks.Count):
        # Access the current bookmark
        bookmark = document.Bookmarks[bookmark_index]

        # Create a new document for the current bookmark
        with Document() as new_document:
            # Add a new section to the new document
            new_section = new_document.AddSection()

            # Copy section settings
            document.Sections[0].CloneSectionPropertiesTo(new_section)

            # Create a bookmark navigator for the source document
            bookmarks_navigator = BookmarksNavigator(document)
            # Navigate to the current bookmark
            bookmarks_navigator.MoveToBookmark(bookmark.Name)
            # Get the bookmark content
            textBodyPart = bookmarks_navigator.GetBookmarkContent()

            # Add a paragraph to the new document
            paragraph = new_section.AddParagraph()
            # Add a bookmark to the paragraph with the same bookmark name
            paragraph.AppendBookmarkStart(bookmark.Name)
            paragraph.AppendBookmarkEnd(bookmark.Name)

            # Create a bookmark navigator for the new document
            new_bookmarks_navigator = BookmarksNavigator(new_document)
            # Navigate to the newly added bookmark
            new_bookmarks_navigator.MoveToBookmark(bookmark.Name)
            # Replace the content of the newly added bookmark in the new document with the content of the current bookmark in the source document
            new_bookmarks_navigator.ReplaceBookmarkContent(textBodyPart, True)

            # Copy themes and styles from the source document to ensure consistency
            document.CloneThemesTo(new_document)
            document.CloneDefaultStyleTo(new_document)

            # Save the new document to a separate file
            output_file = f"Output/Bookmark{bookmark_index + 1}.docx"
            new_document.SaveToFile(output_file, FileFormat.Docx2016)

Split a Word Document by Page Breaks in Python

In Microsoft Word, a page break marks the end of one page and the start of a new one within a document. Before you can split a Word document by page breaks, you need to insert these breaks where you want the divisions to occur.

Key Steps to Split a Word Document by Page Breaks

Open the Source Document: Initialize a Document instance and load the source Word document.
Create a New Document: Set up a new document and add an initial section.
Iterate through the Sections: Iterate through all sections in the source document. For each section:
- Identify Page Breaks: Look for page breaks in paragraphs of the section.
- Split at Page Breaks: When a page break is found in a paragraph, save the current document and create a new one. Copy the content after the page break to the new document.
Save the Final Document: After the loop finishes, save the remaining content to a separate file.

Here is a simple example of how to split a Word document by page breaks using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("PageBreaks.docx")

    # Create a new document
    new_document = Document()
    # Add a new section to the new document
    new_section = new_document.AddSection()
    # Copy themes and styles from the source document to ensure consistency
    document.CloneDefaultStyleTo(new_document)
    document.CloneThemesTo(new_document)

    index = 0
    # Iterate through all sections in the source document
    for sec_index in range(document.Sections.Count):
        section = document.Sections[sec_index]
        # Iterate through all body child objects of each section
        for sec_obj_index in range(section.Body.ChildObjects.Count):
            sec_obj = section.Body.ChildObjects[sec_obj_index]

            # Check if the current object is a paragraph
            if isinstance(sec_obj, Paragraph):
                para = sec_obj
                # Copy section settings
                section.CloneSectionPropertiesTo(new_section)
                # Add a clone of the paragraph to the section of the new document
                new_section.Body.ChildObjects.Add(para.Clone())

                # Iterate through all child objects of the paragraph
                for para_obj_index in range(para.ChildObjects.Count):
                    para_obj = para.ChildObjects[para_obj_index]

                    # Check if the current object is a page break
                    if isinstance(para_obj, Break) and para_obj.BreakType == BreakType.PageBreak:
                        # Get the index of page break in paragraph
                        i = para.ChildObjects.IndexOf(para_obj)
                        # Remove the page break from its paragraph
                        new_section.Body.LastParagraph.ChildObjects.RemoveAt(i)
                        # Save the current document, then release its resources
                        output_file = f"Output/SplitDocByPageBreak-{index}.docx"
                        new_document.SaveToFile(output_file, FileFormat.Docx2016)
                        new_document.Close()
                        index += 1

                        # Create a new document
                        new_document = Document()
                        # Add a section to the new document
                        new_section = new_document.AddSection()
                        document.CloneDefaultStyleTo(new_document)
                        document.CloneThemesTo(new_document)
                        section.CloneSectionPropertiesTo(new_section)
                        # Add the paragraph to the section of the new document
                        new_section.Body.ChildObjects.Add(para.Clone())

                        if new_section.Paragraphs[0].ChildObjects.Count == 0:
                            # Remove the first paragraph if it's blank
                            new_section.Body.ChildObjects.RemoveAt(0)
                        else:
                            # Remove the child objects before the page break
                            while i >= 0:
                                new_section.Paragraphs[0].ChildObjects.RemoveAt(i)
                                i -= 1
            
            else:
                # Copy non-paragraph objects to the new section
                new_section.Body.ChildObjects.Add(sec_obj.Clone())

    # Save the last document, then release its resources
    result = f"Output/SplitDocByPageBreak-{index}.docx"
    new_document.SaveToFile(result, FileFormat.Docx2016)
    new_document.Close()
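
The core idea of the example above (content before a page break stays in the current file, content after it starts the next file) can be sketched on plain Python data. The PAGE_BREAK sentinel and the function split_at_page_breaks are hypothetical stand-ins for illustration and are not part of Spire.Doc:

```python
# Hypothetical marker that stands in for a page break object
PAGE_BREAK = object()


def split_at_page_breaks(items):
    """Split a flat list of items into pages at every PAGE_BREAK marker."""
    pages = [[]]
    for item in items:
        if item is PAGE_BREAK:
            # The break itself is dropped; the next item starts a new page
            pages.append([])
        else:
            pages[-1].append(item)
    return pages


content = ["Intro", "Details", PAGE_BREAK, "Appendix"]
print(split_at_page_breaks(content))
# [['Intro', 'Details'], ['Appendix']]
```

As in the Spire.Doc example, the last page is always emitted, so input that ends with a page break yields a final empty page.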

Split a Word Document into HTML Pages in Python

When splitting a Word document into HTML pages, you convert the content of the Word document into HTML format and divide it into separate web pages. This process enables the document to be viewed as a series of web pages on a browser.

Key Steps to Split a Word Document into HTML Pages

Open the Source Document: Initialize a Document instance and load the source Word document you want to split.
Iterate through the Sections: Iterate through all sections in the source document. For each section:
- Create a New Document: Create a new document for the section.
- Copy the Section: Copy the section from the source document and add the copied section to the new document.
- Embed CSS and Images: Configure the new document’s HTML export options to embed CSS styles and images directly into the HTML page.
- Save the Document to HTML: Save the new document to a separate HTML file.

Here is a simple example of how to split each section of a Word document into a separate HTML page using Python and Spire.Doc for Python:

from spire.doc import *
from spire.doc.common import *

# Load the source document
with Document() as document:
    document.LoadFromFile("Sections.docx")
    
    # Iterate through all sections in the document
    for sec_index in range(document.Sections.Count):
        # Access the current section
        section = document.Sections[sec_index]
        
        # Create a new document for the current section
        new_document = Document()
        # Add a clone of the current section to the new document
        new_document.Sections.Add(section.Clone())
 
        # Copy themes and styles from the source document to ensure consistency
        document.CloneThemesTo(new_document)
        document.CloneDefaultStyleTo(new_document)
            
        # Embed CSS style and image data into HTML page
        new_document.HtmlExportOptions.CssStyleSheetType = CssStyleSheetType.Internal
        new_document.HtmlExportOptions.ImageEmbedded = True
            
        # Save the new document as an HTML file
        output_file = f"Output/Section-{sec_index + 1}.html"
        new_document.SaveToFile(output_file, FileFormat.Html)

        new_document.Close()

Besides HTML, you can save each split part in many other formats, such as PDF, XPS, and Markdown, by changing the FileFormat parameter passed to SaveToFile.
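
One way to switch the FileFormat parameter without editing every SaveToFile call is to derive it from the output file's extension. The sketch below is illustrative only: the helper names format_name_for and save_as are hypothetical, Docx2016 and Html are the FileFormat members used in the examples above, and PDF is assumed to be the member name for PDF output.

```python
from pathlib import Path

# Extension -> name of the FileFormat member to use.
# Docx2016 and Html appear in the examples above; PDF is an assumption.
FORMAT_NAMES = {
    ".docx": "Docx2016",
    ".html": "Html",
    ".pdf": "PDF",
}


def format_name_for(path):
    """Return the FileFormat member name implied by the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in FORMAT_NAMES:
        raise ValueError(f"No format configured for '{suffix}'")
    return FORMAT_NAMES[suffix]


def save_as(document, path):
    """Save a Spire.Doc Document using the format implied by the extension."""
    # Imported here so format_name_for can be used without the library installed
    from spire.doc import FileFormat

    document.SaveToFile(str(path), getattr(FileFormat, format_name_for(path)))
```

With this in place, replacing the SaveToFile call in the loop with save_as(new_document, f"Output/Section-{sec_index + 1}.pdf") would be enough to change the output format.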

Conclusion

This article demonstrated five ways to split a Word document using Python: by sections, headings, bookmarks, and page breaks, and into HTML pages. We hope you find it helpful.