The anatomy of a Microsoft Word file

3 min read · Jun 5, 2020

For many years, Microsoft Word stored its files in the binary DOC format. This changed with Word 2007, which introduced the DOCX file format.

The DOCX format is XML based and not compatible with earlier formats. The “X” in DOCX stands for XML.

Now we are going to dive into a DOCX file and see what is inside.

The first thing to know is that a .docx file is nothing more than a .zip archive.

To see this for yourself, take a .docx file, rename it to .zip, and extract it.

You get a folder that typically contains a [Content_Types].xml file and the directories _rels, docProps, and word.
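The same check can be done without renaming anything. Below is a minimal sketch using Python's standard zipfile module. To keep it runnable without a real document, it builds a small stand-in archive in memory; the part names mirror a typical .docx layout, and the helper name list_parts is just an illustrative choice.

```python
import io
import zipfile


def list_parts(docx_bytes):
    """Return the names of all parts stored in a .docx (ZIP) container."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


# Stand-in archive (assumption): placeholder contents, typical part names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ("[Content_Types].xml", "_rels/.rels",
                 "docProps/core.xml", "word/document.xml"):
        archive.writestr(name, "<placeholder/>")

sample_bytes = buffer.getvalue()

# A .docx passes the ordinary ZIP signature check.
print(zipfile.is_zipfile(io.BytesIO(sample_bytes)))
for part in list_parts(sample_bytes):
    print(part)
```

For a real file, opening it directly with zipfile.ZipFile("my.docx") should work the same way, where my.docx is a placeholder path.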

The _rels directories store the relationships between the individual parts, in a separate file with the extension .rels for each part. Such a relationship could point, for example, to an embedded image.
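To give an idea of what such a .rels file looks like, here is a small sketch that parses one with Python's xml.etree.ElementTree. The sample XML is hand-written for illustration (the id rId1 and the target media/image1.png are assumptions), and relationships is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"

# Hand-written sample of a relationships part pointing to an embedded image.
RELS_XML = f"""<Relationships xmlns="{REL_NS}">
  <Relationship Id="rId1"
      Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
      Target="media/image1.png"/>
</Relationships>"""


def relationships(xml_text):
    """Map each relationship id to (short type name, target path)."""
    root = ET.fromstring(xml_text)
    result = {}
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        # The type is a long URI; its last path segment is enough here.
        short_type = rel.get("Type").rsplit("/", 1)[-1]
        result[rel.get("Id")] = (short_type, rel.get("Target"))
    return result


print(relationships(RELS_XML))
```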

The docProps directory contains properties such as the author or the date of the last modification.
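As a rough sketch of how these properties can be read, the snippet below parses a simplified, hand-written version of docProps/core.xml. The author name and timestamp are made-up sample values, real files usually carry more elements and attributes, and core_properties is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Simplified sample of docProps/core.xml with made-up values.
CORE_XML = f"""<cp:coreProperties xmlns:cp="{NS['cp']}"
    xmlns:dc="{NS['dc']}" xmlns:dcterms="{NS['dcterms']}">
  <dc:creator>Jane Doe</dc:creator>
  <dcterms:modified>2020-06-05T10:00:00Z</dcterms:modified>
</cp:coreProperties>"""


def core_properties(xml_text):
    """Return the author and last-modified timestamp, or None if missing."""
    root = ET.fromstring(xml_text)
    return {
        "author": root.findtext("dc:creator", namespaces=NS),
        "modified": root.findtext("dcterms:modified", namespaces=NS),
    }


print(core_properties(CORE_XML))
```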

The basic document data lies within the word folder, and the most interesting part of it is the file document.xml.

This file contains the text of the Word document.
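A minimal sketch of pulling the plain text out of document.xml might look like the following. It works on a tiny hand-written sample rather than a real file, and paragraph_texts is a hypothetical helper name. The w prefix has to be bound to the WordprocessingML namespace for the parser to accept it.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written stand-in for word/document.xml.
DOCUMENT_XML = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>First paragraph.</w:t></w:r></w:p>
    <w:p>
      <w:r><w:t xml:space="preserve">Second </w:t></w:r>
      <w:r><w:t>paragraph.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraph_texts(xml_text):
    """Return one string per <w:p>, joining the text of all its <w:t> elements."""
    root = ET.fromstring(xml_text)
    texts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        pieces = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        texts.append("".join(pieces))
    return texts


print(paragraph_texts(DOCUMENT_XML))
```

This ignores everything except text elements, so tabs, line breaks, and similar content are not reproduced.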

A Word document is composed of paragraphs (<w:p>) and tables (<w:tbl>).

The formatting can be declared directly or indirectly by reference to a style.

The paragraph formatting is done within a <w:pPr>.

<w:p>
<w:pPr>
<w:pStyle w:val="Normal"/>
<w:spacing w:before="120" w:after="120"/>
</w:pPr>
<w:r>
<w:t xml:space="preserve">This is my text...</w:t>
</w:r>
</w:p>

The content of the paragraph is contained in one or more runs (<w:r>). Runs define text areas. Like paragraphs, they consist of formatting/property definitions followed by content. Formatting is specified within a <w:rPr> and can be direct formatting, indirect formatting via a style reference, or both.

The content of a run consists mainly of text elements (<w:t>).

Below you can see an example of a very simple paragraph.

<w:p>
<w:pPr>
<w:jc w:val="center"/>
</w:pPr>
<w:r>
<w:rPr>
<w:b/>
</w:rPr>
<w:t>This is a text.</w:t>
</w:r>
</w:p>
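To illustrate how paragraph and run properties can be read back, here is a small sketch that parses a centered paragraph with one bold run. The helper name describe_paragraph is hypothetical, and the bold check is simplified: it only treats w:val="0" or w:val="false" as switching bold off.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# A centered paragraph containing a single bold run.
PARAGRAPH_XML = f"""<w:p xmlns:w="{W_NS}">
  <w:pPr><w:jc w:val="center"/></w:pPr>
  <w:r>
    <w:rPr><w:b/></w:rPr>
    <w:t>This is a text.</w:t>
  </w:r>
</w:p>"""


def describe_paragraph(xml_text):
    """Return (alignment, [(run text, is_bold), ...]) for one <w:p>."""
    paragraph = ET.fromstring(xml_text)

    # Paragraph-level formatting lives in <w:pPr>.
    jc = paragraph.find("w:pPr/w:jc", NS)
    alignment = jc.get(f"{{{W_NS}}}val") if jc is not None else None

    runs = []
    for run in paragraph.findall("w:r", NS):
        # Run-level formatting lives in <w:rPr>.
        b = run.find("w:rPr/w:b", NS)
        bold = b is not None and b.get(f"{{{W_NS}}}val", "true") not in ("0", "false")
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        runs.append((text, bold))
    return alignment, runs


print(describe_paragraph(PARAGRAPH_XML))
```

Only direct formatting is inspected here. Formatting inherited from a referenced style would require looking it up in the style definitions as well.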

The DOCX file format does not define pages, only paragraphs and text. However, it allows you to store certain information that is important for page composition, such as page size, page orientation, and margins. This is done using sections. A section is a grouping of paragraphs that share a specific set of properties defining the pages on which the text is to appear.

The properties of a section are stored in a <w:sectPr> element.

<w:p>
<w:pPr>
<w:sectPr>
<w:pgSz w:w="10240" w:h="13840"/>
</w:sectPr>
</w:pPr>
</w:p>
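The page size values are given in twentieths of a point, so 1440 units correspond to one inch. A short sketch of converting them follows; page_size_inches is a hypothetical helper name, and the sample reuses the numbers from the snippet above.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
TWIPS_PER_INCH = 1440  # 20 units per point, 72 points per inch

SECTION_XML = f"""<w:p xmlns:w="{W_NS}">
  <w:pPr>
    <w:sectPr>
      <w:pgSz w:w="10240" w:h="13840"/>
    </w:sectPr>
  </w:pPr>
</w:p>"""


def page_size_inches(xml_text):
    """Return (width, height) in inches for the first <w:pgSz>, or None."""
    root = ET.fromstring(xml_text)
    pg_sz = root.find(".//w:sectPr/w:pgSz", NS)
    if pg_sz is None:
        return None
    width = int(pg_sz.get(f"{{{W_NS}}}w")) / TWIPS_PER_INCH
    height = int(pg_sz.get(f"{{{W_NS}}}h")) / TWIPS_PER_INCH
    return round(width, 2), round(height, 2)


print(page_size_inches(SECTION_XML))  # (7.11, 9.61)
```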

Bookmarks mark a defined part of the document and are a legacy of the .doc era. They can begin and end anywhere within a document, so representing them with ordinary XML start and end tags could break the proper nesting that XML requires. For this reason, a bookmark is not enclosed between start and end tags. Instead, its beginning is marked by an empty element (<w:bookmarkStart>) and its end by another empty element (<w:bookmarkEnd>). The two elements together define a region through their common w:id attribute.

<w:bookmarkStart w:id="0" w:name="myBookmark"/>
<w:r>
<w:t>bookmark</w:t>
</w:r>
<w:bookmarkEnd w:id="0"/>
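Because the two markers are only linked by their id, recovering the bookmarked text means walking the elements in document order and tracking which bookmarks are currently open. The following is one possible sketch; bookmark_texts is a hypothetical helper name, and the runs before and after the bookmark are added purely for illustration.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

BODY_XML = f"""<w:p xmlns:w="{W_NS}">
  <w:r><w:t xml:space="preserve">before </w:t></w:r>
  <w:bookmarkStart w:id="0" w:name="myBookmark"/>
  <w:r><w:t>bookmark</w:t></w:r>
  <w:bookmarkEnd w:id="0"/>
  <w:r><w:t xml:space="preserve"> after</w:t></w:r>
</w:p>"""


def bookmark_texts(xml_text):
    """Map each bookmark name to the text between its start and end markers."""
    root = ET.fromstring(xml_text)
    open_bookmarks = {}  # id -> name, for bookmarks started but not yet ended
    collected = {}       # name -> list of text pieces

    # iter() visits elements in document order, which is what matters here.
    for element in root.iter():
        if element.tag == W + "bookmarkStart":
            name = element.get(W + "name")
            open_bookmarks[element.get(W + "id")] = name
            collected[name] = []
        elif element.tag == W + "bookmarkEnd":
            open_bookmarks.pop(element.get(W + "id"), None)
        elif element.tag == W + "t":
            for name in open_bookmarks.values():
                collected[name].append(element.text or "")

    return {name: "".join(parts) for name, parts in collected.items()}


print(bookmark_texts(BODY_XML))  # {'myBookmark': 'bookmark'}
```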

There are many more possible elements within a document. See http://officeopenxml.com/index.php if you would like to know more.

You could modify the XML files, create a new .zip file, and rename it to .docx to see the results in Word. I do not recommend messing around with the XML, but this knowledge can be useful for seeing what is going on inside a Word document.

Written by Christian Regli