Data visualizations at their best tell a story with the data that is both compelling to look at and easy to digest. The Scattertext tool created by Jason Kessler (check out the original documentation here) does just that. This guide will show you how to implement Scattertext with your data and bring the visual WOW to your work.
Psst! If you don’t need help formatting your data, skip down to “Turning Your Text into a Scattertext Corpus”!
Scattertext allows you to visualize unique terms in your corpora and how their frequency differs from one category to another. In the example above, I am comparing terms used in an educational standards document developed by the NGSS with those in the educational standards document developed by the state of Alaska, but any text with two categories you wish to compare will do.
Formatting Your Text Data
For this project, I was working with PDFs that, in their raw form, were not suitable for modeling. To format the text data quickly, I created a custom function that took in a file, lowercased the words, tokenized them, removed stop words and punctuation, and returned a list of cleaned words.
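The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the original function: it assumes the text has already been extracted from the PDF, and the name clean_text and the tiny STOP_WORDS set are hypothetical stand-ins (a real project would likely use a fuller stop-word list).

```python
import re

# Tiny illustrative stop-word list (an assumption, not the author's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with", "by"}


def clean_text(raw_text, stop_words=STOP_WORDS):
    """Lowercase, tokenize, and drop stop words and punctuation."""
    lowered = raw_text.lower()
    # Treat runs of letters and digits as tokens; punctuation is discarded.
    tokens = re.findall(r"[a-z0-9]+", lowered)
    return [token for token in tokens if token not in stop_words]


sample = "Students analyze data, and the students plan an investigation."
cleaned = clean_text(sample)
print(cleaned)
# ['students', 'analyze', 'data', 'students', 'plan', 'investigation']
```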
Building a Clean DataFrame
Using the custom-built function, read in the PDF and build a data frame. I added a few extra cleaning steps, such as casting the text column to a string so that I could strip the unwanted brackets that appeared when I created the data frame from the list.
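Here is one way the data frame step might look. The token lists and the column names standard and text are made up for illustration. Instead of casting the list to a string and stripping brackets afterward, this sketch joins the tokens with spaces, which sidesteps the brackets entirely; either route should leave you with one plain string per row.

```python
import pandas as pd

# Hypothetical cleaned token lists, standing in for the cleaning function's output.
ngss_words = ["students", "analyze", "data"]
alaska_words = ["students", "describe", "patterns"]

df = pd.DataFrame({
    "standard": ["ngss", "alaska"],      # the category column
    "text": [ngss_words, alaska_words],  # one token list per document
})

# Joining the tokens gives a plain string with no list brackets or quotes.
df["text"] = df["text"].apply(" ".join)
print(df)
```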
Turning Your Text into a Scattertext Corpus
Building the Corpus
You’ll need to have the spaCy and Scattertext libraries. A guide to installing the spaCy library can be found here and the Scattertext library here. Once you have them installed, the first thing you will need to do is build a Scattertext corpus.
Breaking down each argument inside the parentheses starting on line 6, you have the name of the data frame you are using, then the name of the column with the categorical information, and finally the name of the column with your text data.
Identify Corpus Unique Words
Once you have your corpus built, you can quickly find words that are highly unique to your corpus, compared with a background of general English word frequencies, using the get_scaled_f_scores_vs_background method.
Identify Words Most Aligned to a Category
You can also use term frequency to identify words most associated with a given category in your corpus.
Heads Up: You need to run this word frequency count for BOTH of your categories before attempting to graph the Scattertext visualization!
Build an Interactive Scattertext Visualization
Finally, the moment you’ve all been waiting for! If you have not rendered HTML in your Jupyter Notebook before, make sure you import display and HTML from IPython.core.display as shown below.
Breaking down each argument inside the parentheses starting on line 3, you have your Scattertext corpus, the category you wish to graph (note that it is a value from the category column, not the column name), how you want that category to appear on your visual, how you want the name of the opposite category to appear on your visual, and finally the width of the visual in pixels.
This may take a minute or two to run and will return a set of numbers within the Jupyter Notebook when complete. It will also save the visual to the specified file. To access the visual, go to the folder you created the visualization in and single-click the link.
Depending on the size of your visualization, it may take anywhere from several seconds to many minutes for it to load. When it does, boom! There you go! A beautiful interactive text visualization tool!
Bonus: Host Your Visualization
One final note: what’s the good of an interactive tool that lives only inside your notebook? Use a site like tiiny.host to deploy your Scattertext visualization and host it live with a clickable, shareable link!