# Word2Vec on Harry Potter

Jul 22, 2016 · 6 min read

“You shall know a word by the company it keeps.” -J.R. Firth

Over the past few months, I’ve become fascinated by how machine learning applies to natural language problems. In the earlier Harry Potter Text Analysis project, I wrote Python code to extract insights. By using machine learning, I can take a more sophisticated approach. I’ve been specifically learning about the open source word2vec ML algorithm from Google that aims to learn the meaning behind words.

I built an interactive word2vec model of the Harry Potter series. You can type in a word from the HP books and my application returns the 7 most similar words from the series. From this page, you can continue exploring more words. I recommend first trying out a few words here and then reading this post to understand how w2v works.

## How does it work?

In order to be processed by a machine, words must be given a numerical value. W2V takes the text input and turns it into many vectors. In math, a vector is a quantity that has both a direction and a magnitude. Vectors describe the position of one point in space relative to another, and they can be added to and subtracted from other vectors. Word vectors can be manipulated to group words with similar meanings into clusters and to produce analogies.
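To make the vector arithmetic concrete, here is a minimal sketch in plain Python. The two vectors and the helper names `add` and `subtract` are invented for illustration; real w2v vectors have many more dimensions.

```python
# Toy 2-D vectors; the numbers are made up for illustration only.
def add(u, v):
    """Add two vectors element by element."""
    return [a + b for a, b in zip(u, v)]

def subtract(u, v):
    """Subtract vector v from vector u element by element."""
    return [a - b for a, b in zip(u, v)]

u = [3.0, 1.0]
v = [1.0, 2.0]

total = add(u, v)            # [4.0, 3.0]
difference = subtract(u, v)  # [2.0, -1.0]
print(total, difference)
```

Analogies in w2v are built from this same kind of element-wise arithmetic, just on much longer vectors.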

## Turning Sentences into Vectors

The list of unique words, or features, from the 2 sentences is: {a, dumbledore, greatest, is, harry, the, wizard}.

This is used to create feature vectors for each sentence by counting the number of times each word appears in each sentence.

In sentence 1, “dumbledore”, “greatest” and “the” appear 0 times, and “a”, “is”, “harry” and “wizard” appear 1 time each.
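The counting step can be sketched in a few lines of Python. The two sentences below are my assumption, reconstructed from the word list and the counts described above, and `feature_vector` is a hypothetical helper name. The vocabulary is sorted alphabetically here, so its order differs slightly from the list above.

```python
from collections import Counter

# Assumed sentences: reconstructed from the 7 unique words, not quoted from the post.
sentences = [
    "harry is a wizard",
    "dumbledore is the greatest wizard",
]

# The vocabulary is the sorted set of unique words across all sentences.
vocabulary = sorted({word for s in sentences for word in s.split()})

def feature_vector(sentence, vocabulary):
    """Count how many times each vocabulary word appears in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

vectors = [feature_vector(s, vocabulary) for s in sentences]

print(vocabulary)  # ['a', 'dumbledore', 'greatest', 'harry', 'is', 'the', 'wizard']
print(vectors[0])  # [1, 0, 0, 1, 1, 0, 1]
print(vectors[1])  # [0, 1, 1, 0, 1, 1, 1]
```

Each sentence becomes a list of 7 numbers, one per unique word.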

In this example, there is a maximum of 7 features since there are only 7 unique words. In real-world applications, there may be tens of thousands of features (or words) the algorithm must consider. More features generally mean better results, but sometimes reducing the number of features is worthwhile because it makes the model faster. This is called dimensionality reduction.

The Harry Potter series has 1,086,621 total words across its 7 books. I haven’t calculated the number of unique words in the whole series, but I know HP 5 (the longest book) has 12,624 unique words. I’ll use that as an example. Let’s assume the entire series has 12,624 unique words (even though we know it’s more). I can run a dimensionality reduction to cut this down to only 10,000 features and simplify my model. However, it would never make sense to have 15,000 features, as I don’t have 15,000 unique words.
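As a rough sketch of the idea, one simple way to cap the number of features is to keep only the most frequent words. This is an assumption on my part about how the cap could be applied, not necessarily what w2v does internally. The tiny `text` string and the helper `top_k_vocabulary` are hypothetical.

```python
from collections import Counter

def top_k_vocabulary(words, k):
    """Return the k most frequent words (fewer if there are fewer unique words)."""
    return [word for word, _ in Counter(words).most_common(k)]

# A made-up stand-in for the full text of a book.
text = "the boy who lived the boy who flew the cat"
words = text.split()

total_words = len(words)        # every word, repeats included
unique_words = len(set(words))  # the maximum possible number of features

vocab = top_k_vocabulary(words, 3)          # keep only the 3 most common words
oversized = top_k_vocabulary(words, 15000)  # asking for more than exists

print(total_words, unique_words)  # 10 6
print(vocab)                      # ['the', 'boy', 'who']
print(len(oversized))             # 6
```

Asking for 15,000 features still returns only the 6 unique words that exist, which is why a cap larger than the vocabulary makes no sense.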

In text analytics, stop words like “the”, “of”, and “and” are often removed. However, this is not recommended in w2v since it accepts a list of sentences and removing stop words may alter the meaning of a sentence.
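Here is a small sketch of what stop word removal does to a sentence. The stop word set contains only the three examples named above, and the sample sentence and the helper `remove_stop_words` are made up for illustration.

```python
# Only the three stop words mentioned in the text; real lists are much longer.
STOP_WORDS = {"the", "of", "and"}

def remove_stop_words(sentence):
    """Lowercase the sentence and drop any word found in STOP_WORDS."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

sentence = "Harry and the Order of the Phoenix"
filtered = remove_stop_words(sentence)

print(filtered)  # ['harry', 'order', 'phoenix']
```

The filtered list loses the words that tie "harry", "order" and "phoenix" together, which is the kind of context w2v relies on.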

## Word Similarity

In a right triangle, the cosine of an angle (see green below) is the ratio of the side adjacent to that angle to the hypotenuse.

This can be used to calculate unknown lengths or angles in triangles. Trigonometric functions are used in electronic communications, signal processing, designing buildings, and many other real-life applications.

My understanding of this part is a little fuzzy, and I didn’t find any documentation that explains it simply enough (check out pg 5 of this paper: even they aren’t exactly sure what’s going on… but it works!). Here’s the general concept. After turning words into vectors, we can graph these vectors. The cosine of the angle between two word vectors can be used to compare how similar the words are. The smaller the angle, the larger (closer to 1) the cosine value will be. In this case, a larger cosine value indicates word vectors with similar meanings.

On the left, the angle between the words “gryffindor” and “slytherin” is small, so the cosine of the angle between them is large. This means they are similar. This makes sense: they are both Hogwarts houses. On the right, the words “gryffindor” and “petunia” point in different directions. The angle between them is large, so the cosine of the angle between them is small, which means these words are not similar.

This is an example in a two-dimensional space with only an X and Y axis. In w2v, vectors are in a high-dimensional space, meaning there are many more axes or directions in the space.
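The comparison can be sketched in plain Python. The cosine of the angle between two vectors is their dot product divided by the product of their lengths. The 2-D coordinates below are invented for illustration and are not taken from my model; `most_similar` is a hypothetical helper that ranks words the way the interactive page does, assuming the 7 results come from ranking by cosine similarity.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over product of lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D word vectors; real w2v vectors have many more dimensions.
vectors = {
    "gryffindor": [0.9, 0.4],
    "slytherin": [0.8, 0.6],
    "petunia": [-0.3, 0.9],
}

def most_similar(word, vectors, topn=7):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

houses = cosine_similarity(vectors["gryffindor"], vectors["slytherin"])
unrelated = cosine_similarity(vectors["gryffindor"], vectors["petunia"])

print(houses > unrelated)                   # True
print(most_similar("gryffindor", vectors))  # slytherin ranks first
```

The same function works unchanged for vectors with hundreds of dimensions, since it only relies on element-wise products and sums.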

## Why Do I Care?

Stitchfix, an online personal styling service for women, has customers fill out a web form requesting specific styles of clothing. They use a w2v algorithm to parse through this text and determine when to send a box of vacation clothing for a trip to Belize instead of maternity outfits.

Spotify, a music streaming service, uses w2v to find and recommend related artists to users based on the tracks they are streaming. This predicts that a 2pac fan is much more likely to enjoy Notorious B.I.G. than Florence + the Machine.

Dice, an online technology job board, uses w2v to create clusters of similar words in job postings. They can infer that job postings containing “data mining” should be clustered with those that contain “statistics” since these words are similar.

W2V is not 100% accurate and it is not wise to assume otherwise. The quality of the model depends on the amount and quality of the data fed into it. In real-world applications, it is typical for a human to review the w2v results before putting the output into production.

If you want to learn about how I built the front-end and back-end of my application, read my next post “Behind the Scenes of Word2Vec on Harry Potter”.
