Processing PPTX Files
Extracting text from PowerPoint files for Natural Language Processing Analysis
From business organizations to school classrooms, information is delivered and shared via PowerPoint slide decks. Rich with language, these decks can play a vital role in any natural language processing analysis. Slide deck analysis can be crucial in understanding the topics, keywords and themes that the creator of the presentation finds important or wants to emphasize to the presentation’s intended audience.
The PPTX Library
The python-pptx library (imported as pptx) offers a lot of functionality for extracting text from .pptx files. Some use cases call for extracting specific elements from a presentation, but in most cases the simplest way to gather text from a PowerPoint file with this library is to loop over each slide’s shapes and read the text_frame of every shape that has one, which returns all the text detected.
The function below takes in a list of pptx files and returns a dictionary in which each file name is a key and the value is all the text detected on every slide of that slide show.
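A minimal sketch of such a function is shown here. The names slide_text and extract_pptx_text are illustrative assumptions, not the author's original code; the python-pptx calls used (Presentation, slides, shapes, has_text_frame, text_frame.text) are part of the library's documented interface.

```python
import os


def slide_text(slide):
    """Return all text found in one slide's shapes, joined by newlines."""
    chunks = []
    for shape in slide.shapes:
        # Pictures, connectors and similar shapes have no text frame; skip them.
        if getattr(shape, "has_text_frame", False):
            chunks.append(shape.text_frame.text)
    return "\n".join(chunks)


def extract_pptx_text(pptx_paths):
    """Map each file name to the text of every slide in that deck."""
    # Imported here so slide_text can be reused without python-pptx installed.
    from pptx import Presentation

    text_by_file = {}
    for path in pptx_paths:
        presentation = Presentation(path)
        slides_text = [slide_text(slide) for slide in presentation.slides]
        # Hypothetical choice: key on the bare file name rather than the full path.
        text_by_file[os.path.basename(path)] = "\n".join(slides_text)
    return text_by_file
```

This sketch reads only shapes that carry a text frame, so text inside tables, grouped shapes and speaker notes would need extra handling.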
This function can then be applied to your specific list of PowerPoint files with the following lines of code:
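One way to build that list is sketched below. The helper find_pptx_files and the folder name presentations are assumptions made for illustration, and the final call is left as a comment because extract_pptx_text is a hypothetical name for the extraction function described above.

```python
import glob
import os


def find_pptx_files(folder):
    """Return the sorted paths of the .pptx files directly inside folder."""
    paths = glob.glob(os.path.join(folder, "*.pptx"))
    # PowerPoint writes "~$name.pptx" lock files while a deck is open; skip them.
    return sorted(p for p in paths if not os.path.basename(p).startswith("~$"))


pptx_files = find_pptx_files("presentations")  # hypothetical folder name

# Hypothetical call to the extraction function described above:
# text_by_file = extract_pptx_text(pptx_files)
```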
Transforming & Cleaning the Data
Once all of your PowerPoint text data has been stored in a dictionary, that dictionary can be transformed into a dataframe with the file name, original text and cleaned text columns. This allows for easy comparison between clean and original text and easy reference to your data via file name.
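The transformation might look something like the sketch below, which uses pandas. The column names, the sample dictionary and the basic_clean helper are assumptions for illustration; here, basic cleaning simply means collapsing newlines, tabs and other whitespace escape characters into single spaces.

```python
import re

import pandas as pd


def basic_clean(text):
    """Collapse escape characters and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical stand-in for the dictionary returned by the extraction step.
text_by_file = {
    "deck_a.pptx": "Quarterly\nResults\t2021",
    "deck_b.pptx": "  Roadmap \x0b Q3 ",
}

# One row per file: the dictionary key becomes file_name, the value original_text.
df = pd.DataFrame(list(text_by_file.items()), columns=["file_name", "original_text"])

# Apply the cleaning step to each row's text with a lambda function.
df["cleaned_text"] = df["original_text"].apply(lambda text: basic_clean(text))
```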
Note: the red line indicates that the line of code is longer than the maximum line length recommended by PEP 8 (79 characters). As a coder, it’s always important to consider readability and whether one longer line or several shorter lines will read better. In this case, a longer line of code was deemed more readable.
The above code applies basic cleaning to the text; additional cleaning may be needed to remove other escape codes or text-specific errors. To add those steps, follow the format outlined above, using a lambda function to apply changes to each row’s text within the dataframe.
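As a small illustration of that pattern, the sketch below adds two extra cleaning steps. The sample rows and the characters being removed (non-breaking spaces and bullet characters) are assumptions; substitute whatever artifacts appear in your own decks.

```python
import pandas as pd

# Hypothetical sample rows containing a non-breaking space and bullet characters.
df = pd.DataFrame(
    {"cleaned_text": ["Revenue\xa0up 10% \u2022 Costs down", "Plan \u2022 Build \u2022 Ship"]}
)

# Step 1: replace non-breaking spaces with regular spaces.
df["cleaned_text"] = df["cleaned_text"].apply(lambda text: text.replace("\xa0", " "))

# Step 2: drop bullet characters, then collapse the leftover double spaces.
df["cleaned_text"] = df["cleaned_text"].apply(
    lambda text: " ".join(text.replace("\u2022", " ").split())
)
```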
Ensuring UTF-8 Encoding
One of the most common additional cleaning steps you may need to take is to ensure that your text data can be encoded as UTF-8. Applying the following loop to your dataframe will ensure that all text is valid UTF-8 and compatible with Python packages.
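A loop along these lines is one possible approach. The helper to_utf8_safe, the sample rows and the column names are assumptions; the round trip through encode and decode silently drops any character that cannot be represented in UTF-8, which may or may not be what you want for your data.

```python
import pandas as pd


def to_utf8_safe(text):
    """Drop characters that cannot be encoded as UTF-8 (such as lone surrogates)."""
    return text.encode("utf-8", errors="ignore").decode("utf-8")


# Hypothetical sample dataframe; valid text passes through unchanged.
df = pd.DataFrame(
    {
        "original_text": ["caf\u00e9 menu\n2021", "r\u00e9sum\u00e9 tips"],
        "cleaned_text": ["caf\u00e9 menu 2021", "r\u00e9sum\u00e9 tips"],
    }
)

# Loop over every text column and re-encode each row's text.
for column in ["original_text", "cleaned_text"]:
    df[column] = df[column].apply(to_utf8_safe)
```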
Saving Your Dataframe to CSV
Finally, you can save your dataframe to a CSV file to be examined in Sheets/Excel or referenced later in a different working notebook.
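A short sketch of that step using the pandas to_csv method follows. The sample dataframe and the output file name pptx_text.csv are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample dataframe standing in for the one built earlier.
df = pd.DataFrame(
    {
        "file_name": ["deck_a.pptx", "deck_b.pptx"],
        "original_text": ["Quarterly\nResults", "Roadmap  Q3"],
        "cleaned_text": ["Quarterly Results", "Roadmap Q3"],
    }
)

output_path = "pptx_text.csv"  # hypothetical output file name

# index=False leaves the row numbers out of the file.
# If Excel shows garbled accented characters, encoding="utf-8-sig" may help.
df.to_csv(output_path, index=False, encoding="utf-8")
```

The file can be loaded again in another notebook with pd.read_csv(output_path).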
And there you have it! A working dataframe of text from PowerPoint slide decks, ready for natural language processing analysis!