Processing PPTX Files

Published in Social Impact Analytics

3 min read · Apr 29, 2021

Extracting text from PowerPoint files for Natural Language Processing Analysis

From business organizations to schools, information is delivered and shared via PowerPoint slide decks. Rich with language, these decks can play a vital role in any natural language processing analysis. Slide deck analysis can be crucial in understanding the topics, keywords and themes that the creator of a presentation finds important or wants to emphasize to the presentation’s intended audience.

The PPTX Library

The pptx library (installed as the python-pptx package) has a lot of functionality for extracting text from pptx files. There may be use cases for extracting specific elements from a presentation for analysis, but in most cases the quickest way to gather information from a PowerPoint with this library is to loop over each slide’s shapes and read the text_frame of every shape that has one, returning all text detected.

The function below takes in a list of pptx files and returns a dictionary in which each key is a file name and each value is all of the text detected on the slides of that slide show.
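A minimal sketch of such a function, assuming the python-pptx package; the names slide_text and extract_pptx_text are illustrative rather than taken from the original notebook. It reads only shapes that expose a text frame, so text inside tables or grouped shapes would need extra handling.

```python
from pathlib import Path


def slide_text(slide):
    """Join the text of every shape on a slide that has a text frame."""
    return "\n".join(
        shape.text_frame.text for shape in slide.shapes if shape.has_text_frame
    )


def extract_pptx_text(pptx_paths):
    """Return {file name: all text detected across that deck's slides}."""
    # Imported inside the function so slide_text can be reused (and tested)
    # without python-pptx installed.
    from pptx import Presentation

    text_by_file = {}
    for path in pptx_paths:
        deck = Presentation(str(path))
        text_by_file[Path(path).name] = "\n".join(
            slide_text(slide) for slide in deck.slides
        )
    return text_by_file
```

Keying the dictionary on the file name alone (not the full path) keeps later references short, at the cost of collisions if two folders contain decks with the same name.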

Function taking in pptx files and returning a data dictionary

This function can then be applied to your own list of PowerPoint files.
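One way to build that list of files is sketched here with the standard library only. The helper name find_pptx_files and the idea of keeping all decks in a single folder are assumptions made for illustration.

```python
from pathlib import Path


def find_pptx_files(folder):
    """Return the sorted paths of the .pptx files directly inside folder."""
    return sorted(
        str(path)
        for path in Path(folder).glob("*.pptx")
        # PowerPoint leaves "~$name.pptx" lock files next to open decks; skip them.
        if not path.name.startswith("~$")
    )
```

The returned list can be passed straight to whichever extraction function you are using.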

Transforming & Cleaning the Data

Once all of your PowerPoint text data has been stored in a dictionary, that dictionary can be transformed into a dataframe with the file name, original text and cleaned text columns. This allows for easy comparison between clean and original text and easy reference to your data via file name.
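A sketch of that transformation with pandas follows. The column names (file_name, original_text, clean_text) and the specific cleaning rule (collapse whitespace, then lowercase) are assumptions, since the original code is not reproduced here.

```python
import pandas as pd


def build_text_dataframe(text_by_file):
    """Turn {file name: raw text} into a dataframe with a cleaned-text column."""
    df = pd.DataFrame(
        list(text_by_file.items()), columns=["file_name", "original_text"]
    )
    # Basic cleaning: collapse newlines, tabs and repeated spaces into single
    # spaces, then lowercase the result.
    df["clean_text"] = df["original_text"].apply(
        lambda text: " ".join(text.split()).lower()
    )
    return df
```

Keeping original_text untouched next to clean_text is what makes the side-by-side comparison possible.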

Note: the red line in the code editor indicates that a line of code is longer than the 79 characters recommended by PEP 8, Python’s style guide. As a coder it’s always important to consider readability when programming, and whether one longer line or several shorter lines will be easier to read. In this case a longer line of code was deemed better from a readability standpoint.

The code above applies only basic cleaning to the text; additional cleaning to remove other escape codes or text-specific errors may be needed. To add it, follow the same format, using a lambda function to apply changes to each row’s text within the dataframe.
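As one illustration of an extra cleaning step in that lambda format, the sketch below strips punctuation. The two-row dataframe and the clean_text column name are hypothetical stand-ins for your own data, and the rule itself is just an example of the pattern.

```python
import re

import pandas as pd

# Hypothetical two-row dataframe standing in for your own cleaned text.
df = pd.DataFrame(
    {"clean_text": ["q1 results: revenue up 12%!", "next steps (draft)"]}
)

# Replace anything that is not a letter, digit, underscore or whitespace
# with a space, then collapse the repeated spaces this leaves behind.
df["clean_text"] = df["clean_text"].apply(
    lambda text: " ".join(re.sub(r"[^\w\s]", " ", text).split())
)
```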

Ensuring UTF-8 Encoding

One of the most common additional cleaning steps you may need to take is to ensure that your text data is UTF-8 encoded. Applying a loop over the text columns of your dataframe will ensure that all data is formatted in UTF-8 and compatible with Python packages.
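A sketch of such a loop is below; the helper names and the choice to silently drop un-encodable characters are assumptions. Python 3 strings are already Unicode, so for most text this round trip changes nothing; it mainly removes stray characters (such as lone surrogates) that cannot be written out as UTF-8.

```python
import pandas as pd


def to_utf8(text):
    """Round-trip text through UTF-8, dropping anything that cannot be encoded."""
    return text.encode("utf-8", errors="ignore").decode("utf-8")


def ensure_utf8(df, columns):
    """Apply to_utf8 to every value in the given text columns."""
    for column in columns:
        df[column] = df[column].apply(to_utf8)
    return df
```

Legitimate accented characters survive unchanged, because they are valid UTF-8.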

Saving Your Dataframe to CSV

Finally, you can save your dataframe to a CSV file to be examined in Sheets/Excel or referenced later in a different working notebook.
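A short sketch of the save step, with a matching load for use in a later notebook. The function names are illustrative, and any file name you pass in (for example pptx_text.csv) is your own choice.

```python
import pandas as pd


def save_text_dataframe(df, csv_path):
    """Write the dataframe to CSV as UTF-8, leaving out the numeric row index."""
    df.to_csv(csv_path, index=False, encoding="utf-8")


def load_text_dataframe(csv_path):
    """Read the saved CSV back into a dataframe."""
    return pd.read_csv(csv_path, encoding="utf-8")
```

If Excel shows accented characters incorrectly, writing with encoding="utf-8-sig" adds a byte order mark that Excel uses to detect UTF-8.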

And there you have it! A working dataframe of text from PowerPoint slide decks, ready for Natural Language Processing analysis!

Written by Kristen Davis