Python-docx: A Comprehensive Guide to Creating and Manipulating Word Documents in Python
Document Automation in Python.
DOCX is the file format used for documents created by Microsoft Word, the word processor in the Microsoft Office suite. The "X" in DOCX refers to the format's XML-based structure. The format was introduced as an improvement over the older DOC format, which was based on a binary file structure.
docx Python library
The python-docx library (imported in code as docx) is a popular tool for working with Microsoft Word files in the .docx format. It allows you to create, modify, and extract information from Word documents programmatically using Python code.
The .docx format is based on the Office Open XML (OOXML) standard, which Microsoft introduced with Microsoft Office 2007. This format replaced the older .doc binary format used in earlier versions of Microsoft Word.
The python-docx library was not developed by Microsoft but was created as an open-source project by individuals in the Python community to provide a way to programmatically create and manipulate .docx files. The library abstracts the complexities of the OOXML format and provides a user-friendly Python interface for interacting with Word documents.
The library grew out of an earlier docx module by developer Mike MacCana, and the current python-docx API comes from a later rewrite led by Steve Canny. Since then, the library has been maintained and improved by the open-source community, receiving contributions from various developers.
Usage
With the docx library, you can perform a variety of tasks, such as:
Creating New Documents: You can use the library to generate new Word documents from scratch or based on templates. This is useful for generating reports, letters, and other types of documents automatically.
Modifying Existing Documents: You can open existing Word documents and modify their content, formatting, styles, and more using the library. This is particularly handy for automating updates to documents that follow a specific structure.
Adding Content: You can add paragraphs, headings, tables, images, and other elements to a document using the library. This is helpful for dynamically populating documents with data.
Formatting: The library allows you to apply various formatting options to the text and elements within the document, such as changing fonts, colors, alignment, and more.
Extracting Information: You can also extract text, images, tables, and other content from existing Word documents for further analysis or processing.
Docx functions
The following functions and methods, grouped by task, are among the most commonly used in the python-docx library:
Document Creation and Saving
Document(): Create a new Word document.
document.save('filename.docx'): Save the document to a file.
Paragraphs and Text
add_paragraph('text'): Add a new paragraph with the specified text.
paragraph.text: Get or set the text content of a paragraph.
Headings
add_heading('text', level=n): Add a heading with the specified text and level (0 to 9, where level 0 uses the Title style).
Styles and Formatting
paragraph.style = 'StyleName': Apply a specific paragraph style.
run = paragraph.add_run('text'): Add a run of text that can carry its own formatting.
run.bold, run.italic, etc.: Apply formatting to a run.
Tables
add_table(rows, cols): Add a table with the specified number of rows and columns.
table.cell(row, col): Get a specific cell in the table.
cell.text: Get or set the text content of a cell.
table.rows, table.columns: Access rows and columns of the table.
Images
document.add_picture('image_path'): Add an image to the document in a new paragraph.
run.add_picture('image_path'): Add an image to a specific run.
Document Properties
document.core_properties.title: Get or set the title of the document.
document.core_properties.author: Get or set the author of the document.
document.core_properties.keywords: Get or set keywords for the document.
Sections and Page Setup
section = document.sections[0]: Get the first section of the document.
section.page_width, section.page_height: Get or set the page dimensions.
Lists
add_paragraph('text', style='List Bullet'): Add a bulleted list item (the style name contains a space).
add_paragraph('text', style='List Number'): Add a numbered list item.
Hyperlinks
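python-docx does not provide a built-in method for adding hyperlinks. Recent versions (1.0 and later) can read existing links through paragraph.hyperlinks, but creating a link requires building the underlying w:hyperlink XML element yourself.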
Document Modification
document.paragraphs: Access all paragraphs in the document.
document.tables: Access all tables in the document.
document.styles: Access and manipulate document styles.
Document Reading
Document('filename.docx'): Open an existing Word document.
document.paragraphs[0].text: Access the text of a paragraph.
Installation
To install the python-docx library, use the pip package manager. Note that the package is installed as python-docx but imported in code as docx. Open your command prompt or terminal and run the following command:
pip install python-docx
Examples
Example of creating a simple document
This example creates a document that includes text, headings, a bulleted list, a table, an image, and formatting. The script performs the following steps with python-docx:
- Create a new document.
- Add a title with centered alignment.
- Add a paragraph with bold and italic text.
- Add a heading and a bulleted list.
- Add a table with custom column widths.
- Add an image to the document.
- Save the document with the name 'example_document.docx'.
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
# Create a new document
doc = Document()
# Add a title with centered alignment
title = doc.add_heading('Document Creation Example', level=1)
title.alignment = WD_ALIGN_PARAGRAPH.CENTER
# Add a paragraph with bold and italic text
paragraph = doc.add_paragraph('This is a sample document created using the python-docx library.')
run = paragraph.runs[0]
run.bold = True
run.italic = True
# Add a heading
doc.add_heading('Section 1: Introduction', level=2)
# Add a bulleted list: one paragraph per item, using the built-in 'List Bullet' style
bullets = [('Bullet 1', 'This is the first bullet point.'),
           ('Bullet 2', 'This is the second bullet point.')]
for label, text in bullets:
    item = doc.add_paragraph(style='List Bullet')
    item.add_run(label).bold = True
    item.add_run(' - ' + text)
# Add a table with one header row and three data rows
doc.add_heading('Section 2: Data', level=2)
table = doc.add_table(rows=4, cols=3)
table.style = 'Table Grid'
table.autofit = False  # keep the explicit cell widths set below
for row in table.rows:
    for cell in row.cells:
        cell.width = Pt(100)
table.cell(0, 0).text = 'Name'
table.cell(0, 1).text = 'Age'
table.cell(0, 2).text = 'City'
for i, data in enumerate([('Alice', '25', 'New York'), ('Bob', '30', 'San Francisco'), ('Charlie', '22', 'Los Angeles')], start=1):
    table.cell(i, 0).text = data[0]
    table.cell(i, 1).text = data[1]
    table.cell(i, 2).text = data[2]
# Add an image (replace the placeholder with the path to a real image file)
doc.add_heading('Section 3: Image', level=2)
doc.add_paragraph('Here is an image:')
doc.add_picture('path_to_your_image.jpg', width=Pt(300))
# Save the document
doc.save('example_document.docx')
Example of modifying a document
This example uses several python-docx features to modify an existing Word document.
In this script:
- Open an existing Word document ('existing_document.docx').
- Modify the text, formatting, and alignment of the first paragraph.
- Add a new heading.
- Add a new paragraph with a hyperlink.
- Add a new table with custom column widths and data.
- Save the modified document as 'modified_document.docx'.
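from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
from docx.opc.constants import RELATIONSHIP_TYPE
def add_hyperlink(paragraph, url, text):
    # python-docx has no public method for adding hyperlinks, so this helper
    # (not part of the library) builds the w:hyperlink element directly.
    # The link text is left unstyled; apply formatting separately if needed.
    r_id = paragraph.part.relate_to(url, RELATIONSHIP_TYPE.HYPERLINK, is_external=True)
    hyperlink = OxmlElement('w:hyperlink')
    hyperlink.set(qn('r:id'), r_id)
    run = OxmlElement('w:r')
    text_element = OxmlElement('w:t')
    text_element.text = text
    run.append(text_element)
    hyperlink.append(run)
    paragraph._p.append(hyperlink)
# Open an existing document
doc = Document('existing_document.docx')
# Access the first paragraph and modify its text and formatting
first_paragraph = doc.paragraphs[0]
first_paragraph.text = 'Updated Text'  # replaces the paragraph content with a single run
run = first_paragraph.runs[0]
run.bold = True
run.italic = True
run.font.size = Pt(16)
first_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
# Add a new heading
doc.add_heading('New Section', level=1)
# Add a new paragraph with a hyperlink
new_paragraph = doc.add_paragraph('Visit our website: ')
add_hyperlink(new_paragraph, 'https://www.example.com', 'www.example.com')
# Add a new table with one header row and three data rows
doc.add_heading('Table Section', level=2)
table = doc.add_table(rows=4, cols=3)
table.style = 'Table Grid'
table.autofit = False  # keep the explicit cell widths set below
for row in table.rows:
    for cell in row.cells:
        cell.width = Pt(100)
table.cell(0, 0).text = 'Name'
table.cell(0, 1).text = 'Age'
table.cell(0, 2).text = 'City'
for i, data in enumerate([('David', '28', 'London'), ('Emma', '35', 'New York'), ('John', '22', 'Los Angeles')], start=1):
    table.cell(i, 0).text = data[0]
    table.cell(i, 1).text = data[1]
    table.cell(i, 2).text = data[2]
# Save the modified document
doc.save('modified_document.docx')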
What did the docx document say to the PDF?
"You're so flat, you can't even format properly!"