Your slide deck is a zip file in disguise

In 2007, Microsoft Office’s complex, proprietary binary format was replaced with an open, XML-based format, the familiar docx, xlsx and pptx. That day, sighs of relief from developers reverberated throughout the world.

Ever since that announcement, I assumed that a PowerPoint file was simply a single XML file containing all the information needed to define my slides. What I didn’t know was that it is instead a collection of many XML files, packaged as a zip file!

Show me!

Want to see for yourself? Open a .pptx file in a hex editor, or a text editor with a hex view (here’s a sample file), and look at the first 4 bytes. A zip file typically starts with “50 4b 03 04” in hexadecimal notation (the first two bytes spell “PK”, the initials of Phil Katz, who developed the zip format).
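If you would rather not hunt for a hex editor, the same check can be sketched in a few lines of Python. This is only an illustration: the names looks_like_zip and ZIP_MAGIC are made up for this example, and the file path is whatever you pass in.

```python
# The four bytes 50 4b 03 04 mentioned above ("PK" followed by 03 04).
ZIP_MAGIC = b"PK\x03\x04"


def looks_like_zip(path):
    """Return True if the file at `path` starts with the usual zip signature."""
    with open(path, "rb") as f:
        return f.read(4) == ZIP_MAGIC
```

Calling looks_like_zip("sample.pptx") on a normal PowerPoint file should return True. An empty zip archive starts with a different signature, so treat this as a quick sanity check rather than a full validation.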

I need more proof

Since filename extensions are really just suggestions to your operating system, rename your file from sample.pptx to sample.zip (you may need to enable displaying file extensions in your OS to do this).

Then, simply unzip your sample.zip file. Usually, double-clicking it does the trick.

You should see something like this:

As you can see, not only does a PPTX not consist of a single XML file, it consists of a lot of XML files.
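You can also peek inside without renaming anything, because zip tools look at the file's contents rather than its extension. Here is a minimal sketch using Python's standard zipfile module; the helper names list_parts and count_xml_parts are invented for this example.

```python
import zipfile


def list_parts(pptx_path):
    """Return the names of all entries inside the .pptx archive, sorted."""
    with zipfile.ZipFile(pptx_path) as archive:
        return sorted(archive.namelist())


def count_xml_parts(pptx_path):
    """Count how many entries in the archive are .xml files."""
    return sum(1 for name in list_parts(pptx_path) if name.endswith(".xml"))
```

Printing each name from list_parts("sample.pptx") should show entries such as [Content_Types].xml and ppt/presentation.xml, along with everything else in the deck.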

Neat. But how is this useful?

Before discarding this as a mere party trick, it’s worth thinking through why this can be powerful:

Use Case 1: Extracting all images from a slide deck

After unzipping your PowerPoint file, you may have noticed a folder called ppt/media/:

This folder contains all the images embedded into the PowerPoint presentation! So if you need to extract several pictures from a slide deck, this approach is much faster than Right-Click -> Save as Picture on every image. On top of that, you get the image files exactly as they are stored in the deck, which are often higher quality than what Save as Picture produces.
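If you do this often, the unzip-and-copy routine can be scripted. The sketch below assumes the images live under ppt/media/ as described above; extract_media and its out_dir argument are names made up for this example.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_media(pptx_path, out_dir):
    """Copy every file under ppt/media/ in the archive into out_dir.

    Returns the list of paths that were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(pptx_path) as archive:
        for name in archive.namelist():
            # Skip anything outside ppt/media/ and skip directory entries.
            if name.startswith("ppt/media/") and not name.endswith("/"):
                target = out / PurePosixPath(name).name
                target.write_bytes(archive.read(name))
                written.append(target)
    return written
```

For example, extract_media("sample.pptx", "images") would drop the embedded pictures into an images folder next to your script, leaving the original deck untouched.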

Use Case 2: Extracting notes from each slide

Another use of this trick would be to extract all the notes in your PowerPoint presentation. As it turns out, slide notes are stored in separate XML files in the folder ppt/notesSlides:

So we could write a Bash script to loop through each .xml file in that folder and extract the notes you wrote down on each slide (the code below makes use of the xmllint command-line utility):

function extractSlideNotes()
{
    local i=0
    local DIR="${1:-.}"  # Default is current folder
    # XML path to slide notes: <p:txBody> -> <a:p> -> <a:r> -> <a:t> (see notesSlide.xml files)
    local XPATH="//*[local-name()='txBody']/*[local-name()='p']/*[local-name()='r']/*[local-name()='t']/text()"
    local FILE NOTES
    # Loop through each notes file (glob order is lexical, so notesSlide10.xml sorts before notesSlide2.xml)
    for FILE in "${DIR}"/ppt/notesSlides/*.xml
    do
        i=$(( i + 1 ))
        # Fetch and output the notes (the quotes keep the shell from expanding the * and [] in the XPath)
        NOTES=$(xmllint --xpath "${XPATH}" "${FILE}")
        printf '%s\t%s\n' "${i}" "${NOTES}"
    done
}

After calling the extractSlideNotes function, your output will look something like this:

$ extractSlideNotes ~/Desktop/sample/
1 Notes on slide 1
2 Notes on slide 2
3 Notes on slide 3

You can imagine extending this to extract other kinds of information from a PowerPoint file, simply by changing the $XPATH variable to fetch a different XML path from the file.
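As one example of such an extension, here is a rough Python sketch that pulls the text out of the slides themselves instead of the notes, reading straight from the .pptx without unzipping it first. It follows the same <a:r> -> <a:t> idea as the XPath above. The function names and the folder parameter are invented for this example, and the sketch assumes the usual slideN.xml naming.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace, where the <a:r> and <a:t> elements live.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"


def part_number(name):
    """Pull the trailing number out of a name such as ppt/slides/slide10.xml."""
    match = re.search(r"(\d+)\.xml$", name)
    return int(match.group(1)) if match else 0


def extract_text(pptx_path, folder="ppt/slides/"):
    """Map each XML part in `folder` to the text of its <a:r>/<a:t> runs.

    Parts are visited in numeric order, so slide2.xml comes before slide10.xml.
    """
    result = {}
    with zipfile.ZipFile(pptx_path) as archive:
        names = [n for n in archive.namelist()
                 if n.startswith(folder) and n.endswith(".xml")]
        for name in sorted(names, key=part_number):
            root = ET.fromstring(archive.read(name))
            runs = []
            for run in root.iter(f"{{{A_NS}}}r"):
                text = run.find(f"{{{A_NS}}}t")
                if text is not None and text.text:
                    runs.append(text.text)
            result[name] = " ".join(runs)
    return result
```

Passing folder="ppt/notesSlides/" should give you the notes instead of the slide text. Keep in mind that the number in the file name is not guaranteed to match the slide's position in the deck.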

What about adding slides?

If, like me, you often find yourself demoing your software tools in PowerPoint, you may be wondering whether we can automate inserting hundreds of new slides, where each slide would contain one screenshot obtained from a specified folder. Doing that manually can be painful (although in the Windows version of PowerPoint, it seems you can do Insert -> Photo Album).

As we’ll see here, although it is possible to insert slides programmatically, it is not nearly as easy as retrieving information about those slides.

For example, inserting a 4th slide requires two steps. First, we need to edit the following XML files to define our new slide and its relationship to the rest of the deck:

  • [Content_Types].xml
  • ppt/presentation.xml
  • ppt/_rels/presentation.xml.rels

You may also need to edit docProps/app.xml to update the number of slides in your deck along with the slide titles, although it seems to work without making that change.
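To get a feel for how these files refer to each other, it helps to read them before trying to edit them. The sketch below lists the slide files in presentation order by following each slide entry in ppt/presentation.xml to its target in ppt/_rels/presentation.xml.rels. The function name slides_in_order is made up for this example, and the code assumes the relationship targets are relative paths such as slides/slide1.xml.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def slides_in_order(pptx_path):
    """Return slide part names in the order the presentation lists them."""
    with zipfile.ZipFile(pptx_path) as archive:
        pres = ET.fromstring(archive.read("ppt/presentation.xml"))
        rels = ET.fromstring(archive.read("ppt/_rels/presentation.xml.rels"))

    # Relationship id (for example "rId2") -> target (for example "slides/slide1.xml").
    targets = {rel.get("Id"): rel.get("Target")
               for rel in rels.iter(f"{{{REL_NS}}}Relationship")}

    ordered = []
    for sld_id in pres.iter(f"{{{P_NS}}}sldId"):
        rel_id = sld_id.get(f"{{{R_NS}}}id")
        # Assumption: targets are relative to the ppt/ folder.
        ordered.append(posixpath.normpath(posixpath.join("ppt", targets[rel_id])))
    return ordered
```

A new slide has to show up in both places, as a slide entry in presentation.xml and as a matching relationship in the .rels file, which is part of why inserting is fiddlier than reading.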

Next, we need to create two new files:

  • ppt/slides/slide4.xml
  • ppt/slides/_rels/slide4.xml.rels

These files will define the contents of the slide and refer to common elements it uses (e.g. reusing an image from a previous slide).

In other words, this is quite a bit more complex than fetching data!

In fact, it would probably be easier to use a VBA macro to automate this:

Sub Test()
    Dim pptSlide As Slide, pptLayout As CustomLayout
    Dim Index As Long, slideID As Long
    Set pptLayout = ActivePresentation.Slides(1).CustomLayout

    For Index = 1 To 5
        ' Position of the new slide: one past the current last slide
        slideID = ActivePresentation.Slides.Count + 1

        ' Create new slide
        Set pptSlide = ActivePresentation.Slides.AddSlide(slideID, pptLayout)

        ' Add a new image to that slide from /Users/robert/Desktop/images/figure*.png
        pptSlide.Shapes.AddPicture fileName:="Users:robert:Desktop:images:figure" & Index & ".png", LinkToFile:=msoTrue, SaveWithDocument:=msoTrue, Left:=0, Top:=100
    Next
End Sub

It’s a wrap

This concludes our exploration into the wonders hiding behind the .pptx file. If you have ideas about how this trick can be used for other practical purposes, feel free to share them below.