Visualizing Words

PCA and clustering in Python

Marcus Alder
5 min read · Feb 22, 2020

In this post, I’ll show how to use a few NLP techniques to transform words into mathematical representations and plot them as points, as well as provide some examples. The graph below was created from the Star Wars wiki Wookieepedia and colored with a clustering algorithm.

3D Plot of Star Wars characters
Plot of characters, locations, and organizations from Star Wars

The words’ coordinates are created from word embeddings (word vectors), which are built from the contexts each word appears in. The vectors have properties related to the words’ meanings, approximately satisfying equations like (vector for “Paris”) - (vector for “France”) + (vector for “Italy”) ≈ (vector for “Rome”), i.e. you can take Paris, substitute France out for Italy, and you’ll get Rome. Clustering and plotting also reveal interesting patterns; if you’ve watched Star Wars you might notice the clustering algorithm has unknowingly separated people, places, and organizations.

All the code is available at github.com/LogicalShark/wordvec

Collecting Data

Find an appropriate corpus for your analysis; for more general tasks like analyzing countries or movies, a generic text corpus works. Get enough data to produce meaningful word embeddings, though as little as 50KB can be sufficient if it’s all relevant.

The dump will be XML, which you may want to preprocess

I analyzed proper names from franchises I personally like, and if you want to do the same, I recommend checking the franchise’s Fandom wiki for a database dump at “whatever.fandom.com/wiki/Special:Statistics” (although sadly some don’t provide one).
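As one way to preprocess such a dump, here’s a minimal sketch that pulls the article text out of a MediaWiki XML export using only the standard library. The cleanup regex is a rough assumption for illustration, not the exact preprocessing the repo uses:

```python
import re
import xml.etree.ElementTree as ET

def extract_page_texts(xml_path):
    """Yield the text of each <text> element in a MediaWiki XML dump."""
    for _, elem in ET.iterparse(xml_path):
        # Tags in MediaWiki exports carry an XML namespace prefix like
        # {http://www.mediawiki.org/xml/export-0.10/}text, so match the local name
        if elem.tag.rsplit("}", 1)[-1] == "text" and elem.text:
            # Crude wiki-markup cleanup: drop link/template brackets and pipes
            yield re.sub(r"[\[\]{}|=']", " ", elem.text)
        elem.clear()  # free memory as we stream through the dump
```

Using `iterparse` instead of `ET.parse` keeps memory flat even on multi-hundred-megabyte dumps, which matters for the larger wikis.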

Generating Word Vectors

wvgen.py on github

To create the vectors we need the words they correspond to, which requires splitting the text into words. We can then use Word2Vec (a word vector creation model) to create the vectors. To get a list of words I used NLTK’s word tokenizer, and for a Word2Vec implementation I used gensim. Here are some more details on processing the text:

Memory Limitations: The input was too large to manipulate all at once, but since Word2Vec can take an iterator as input, I made a custom iterator that reads the file one line at a time for tokenization.
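A sketch of what such an iterator could look like, using a simple regex tokenizer as a stand-in for NLTK’s word_tokenize so the example has no external dependencies:

```python
import re

def simple_tokenize(line):
    # Stand-in for nltk.word_tokenize, to keep this sketch dependency-free
    return re.findall(r"\w+", line)

class CorpusIterator:
    """Yields one tokenized line at a time; Word2Vec can iterate it repeatedly."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = simple_tokenize(line)
                if tokens:  # skip blank lines
                    yield tokens
```

Because `__iter__` reopens the file each time, gensim can make the multiple passes it needs (one to build the vocabulary, then the training epochs) without the whole corpus ever living in memory.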

Synonyms: There are some names written in multiple ways that we want represented in one vector, like Donkey Kong = DK or Obi-Wan Kenobi = Ben Kenobi. Instead of combining the output vectors, we can prevent the problem with string replacement on the input. For example, replacing all instances of “Donkey Kong” with “DK” means this character is represented by a single vector for the word “DK,” instead of two.
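The replacement step can be as simple as a dictionary of aliases applied to each line before tokenization; the alias table here is just an illustration:

```python
# Map each alias to the canonical form we want a single vector for (illustrative)
SYNONYMS = {
    "Donkey Kong": "DK",
    "Ben Kenobi": "Obi-Wan Kenobi",
}

def normalize(line):
    """Rewrite known aliases so each entity maps to exactly one token string."""
    for alias, canonical in SYNONYMS.items():
        line = line.replace(alias, canonical)
    return line
```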

Multi-word Expressions: Sometimes we want one vector to represent multiple words (e.g. Han Solo, Peach’s Castle), but the words get split by tokenization and become separate vectors. I used NLTK’s multi-word expression tokenizer (MWETokenizer), which lets you add these names as custom phrases to be re-concatenated after the word tokenization. An alternative would be string replacement again (replace “Han Solo” with “HanSolo”), but I found MWETokenizer to be simpler.
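A small example of MWETokenizer in action; the phrases and separator here are illustrative:

```python
from nltk.tokenize import MWETokenizer

# Register multi-word names as token tuples; matched runs get re-joined
# with the separator after ordinary tokenization has split them apart
mwe = MWETokenizer([("Han", "Solo"), ("New", "Donk", "City")], separator="_")

tokens = "Han Solo visits New Donk City".split()
print(mwe.tokenize(tokens))  # ['Han_Solo', 'visits', 'New_Donk_City']
```

Each multi-word name then surfaces as a single token like “Han_Solo,” so Word2Vec learns one vector for it.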

Summary of the custom iterator

Summary: The iterator takes each line in the file, makes replacements for synonym consistency, performs word tokenization, and finally condenses multi-word expressions. Word2Vec takes the iterator as an argument in lieu of a word list and generates a model with the word vectors.

Graphing Word Vectors

wvplot.py on github

Words appear in greatly varied contexts, meaning the vectors have many features. Instead of graphing just three of those features, we can use Principal Component Analysis (PCA) to calculate linear combinations of features giving the orthogonal axes with the greatest variance. To further visualize patterns, each point’s text color is set with k-means++ clustering (using sklearn), which automatically partitions the points into k “clusters.” I used matplotlib for graphing, giving an interactive graph like this:

Plot of names in Super Mario
Plot of characters, locations, and games in the Super Mario franchise. The overlapping red points include all the locations (e.g. Peach’s Castle, New Donk City) and some characters (e.g. Dry Bones, Hammer Bro)

The Super Mario Fandom wiki was used to generate the vectors. A k=3 clustering seems to create a “main character” cluster, a “location/secondary character” cluster, and a “game” cluster. Daisy, Waluigi and Toad are appropriately positioned between main and secondary characters. If Donkey Kong seems a bit closer to the game cluster than the other main characters, it’s because “Donkey Kong” is both a character and the original arcade game!
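The reduction and clustering behind these plots can be sketched with sklearn as below, using random vectors as a stand-in for the trained embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def reduce_and_cluster(vectors, k=3):
    """Project high-dimensional word vectors to 3D and assign cluster labels.

    Note that clustering runs on the full vectors, so cluster membership can
    reflect features the 3D projection doesn't show.
    """
    coords = PCA(n_components=3).fit_transform(vectors)
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=0).fit_predict(vectors)
    return coords, labels

# Random stand-in vectors; real ones would come from model.wv
rng = np.random.default_rng(0)
vectors = rng.normal(size=(40, 100))
coords, labels = reduce_and_cluster(vectors, k=3)
# To plot: ax = plt.figure().add_subplot(projection="3d")
#          ax.scatter(*coords.T, c=labels)
```

Clustering on the original vectors rather than the projected coordinates is why, as in the League of Legends plot below, a cluster can look scattered in 3D yet still be coherent in the full feature space.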

2D plot of characters in The Office
2D Plot example using characters from The Office (TV), k=4

Sometimes 2D plots are enough to show patterns. The x-axis positions and clusters in the above plot approximately correlate with the screen presence of each character, matching this chart:

Graph of characters from The Office by their number of lines of dialogue
Source and more data on this reddit post

Minor spoilers for Hollow Knight in the second plot below!

3D plot of League of Legends champions
All League of Legends champions with k=5. Clusters are determined not only by features visible in the plot but also many unseen features, which is why the red cluster is not a clear, separate group

More Plots and Word Vector Arithmetic

wvarith.py on github

Locations and characters from the indie game Hollow Knight, k=3

For arithmetic, the function most_similar_cosmul finds approximate solutions to word-vector equations: it combines the vectors of the words in positive by addition and those in negative by subtraction. I found it useful to pass one more positive word than negative, for a net of one word vector; the simplest case is a single positive word and an empty negative list.

wvlinear.py uses this function to search for equations. However, unrelated words often combine by pure coincidence, so I recommend looking for relationships yourself with wvarith.py.

Thanks for reading! All the code and some sample models are available at github.com/LogicalShark/wordvec. Let me know if you have any questions!


Marcus Alder

Software engineer on Google’s Kubernetes API team, recent CMU graduate in CS, Linguistics, and Game Design