Word2Vec on Harry Potter

“You shall know a word by the company it keeps.” – J.R. Firth

Over the past few months, I’ve become fascinated by how machine learning applies to natural language problems. In the earlier Harry Potter Text Analysis project, I wrote Python code to extract insights. By using machine learning, I can take a more sophisticated approach. I’ve been specifically learning about the open source word2vec ML algorithm from Google that aims to learn the meaning behind words.

I built an interactive word2vec model of the Harry Potter series. You can type in a word from the HP books and my application returns the 7 most similar words from the series. From this page, you can continue exploring more words. I recommend first trying out a few words here and then reading this post to understand how w2v works.

www.zareenfarooqui.com/w2v

What is Word2Vec?

Word2Vec is a machine learning algorithm released by Google in 2013 which makes surprisingly accurate guesses about a word’s meaning from nothing but raw text. It helps computers understand language with no human supervision: the model is never told what any word means, only the text it appears in.

How does it work?

Word2Vec is a family of shallow, two-layer neural network models. A neural network is a computer program modeled loosely on the human brain and nervous system. Neural networks are organized in layers: they take in a number of inputs (red layer below), pass those to one or more hidden layers where the inputs are processed via weighted connections (blue layers below), and generate an output (green layer below). In a human brain, for example, the input layer might be vision, scent, and hearing, which get fed to the brain to process; the output would be what the human says in response to these inputs.

The input layer for w2v is a large amount of text, which is processed in the hidden layer. For my application, the output is a list of similar words. As w2v is fed more input text, its word representations improve and it gives better results. This training process is complicated under the hood, but it is essential to creating programs with artificial intelligence.

Source: https://cs231n.github.io/neural-networks-1/

In order to be processed by a machine, words must be given a numerical value. W2V takes the input text and turns it into many vectors. In math, a vector is a quantity with a direction and a magnitude; vectors describe a point in space relative to another point. Vectors can be added to and subtracted from one another, and in w2v they can be manipulated to group words with similar meanings into clusters and produce analogies.
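To make the arithmetic concrete, here is a tiny sketch in plain Python. The numbers are made up for illustration; real w2v vectors have hundreds of dimensions. The “king − man + woman ≈ queen” analogy is the classic example from the original word2vec research:

```python
# Toy 2-D word vectors (made-up numbers, just to illustrate the arithmetic).
king = [0.9, 0.8]
man = [0.5, 0.2]
woman = [0.5, 0.9]

def add(v, w):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(v, w)]

def sub(v, w):
    """Element-wise vector subtraction."""
    return [a - b for a, b in zip(v, w)]

# The classic word2vec analogy: king - man + woman should land near "queen".
result = add(sub(king, man), woman)
print(result)  # close to [0.9, 1.5]
```

In a trained model, the vector closest to `result` would (hopefully) be the word “queen” — the model learns this from text alone.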

Turning Sentences into Vectors

Let’s explore the example below:

Sentences to vectors

The list of unique words, or features, from the 2 sentences is: {a, dumbledore, greatest, is, harry, the, wizard}.

This vocabulary is used to create a feature vector for each sentence by counting the number of times each word appears in that sentence.

In sentence 1, “dumbledore”, “greatest” and “the” appear 0 times, and “a”, “is”, “harry” and “wizard” appear 1 time each.
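The counting above can be sketched in a few lines of Python. This is a minimal bag-of-words example, not the actual w2v pipeline, and it assumes the two example sentences are “harry is a wizard” and “dumbledore is the greatest wizard” (consistent with the counts above):

```python
# Build count-based feature vectors for two sentences (bag-of-words sketch).
sentences = ["harry is a wizard", "dumbledore is the greatest wizard"]

# The vocabulary is the sorted set of unique words across both sentences.
vocab = sorted({word for s in sentences for word in s.split()})
print(vocab)  # ['a', 'dumbledore', 'greatest', 'harry', 'is', 'the', 'wizard']

# Each sentence becomes a vector of word counts, one slot per vocabulary word.
vectors = [[s.split().count(word) for word in vocab] for s in sentences]
print(vectors[0])  # sentence 1: [1, 0, 0, 1, 1, 0, 1]
print(vectors[1])  # sentence 2: [0, 1, 1, 0, 1, 1, 1]
```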

In this example, there is a maximum of 7 features since there are only 7 unique words. In real world applications there may be tens of thousands of features (or words) the algorithm must consider. More features generally mean better results, but sometimes reducing the number of features is beneficial for speed improvements. This is called dimensionality reduction.

The Harry Potter series has 1,086,621 total words across 7 books. I haven’t calculated the number of unique words in the whole series, but I know HP 5 (the longest book) has 12,624 unique words, so let’s assume the entire series has 12,624 unique words (even though we know it’s more). I could run dimensionality reduction to cut that down to only 10,000 features and simplify my model. However, it would never make sense to have 15,000 features, since there aren’t 15,000 unique words to begin with.

In text analytics, stop words like “the”, “of”, and “and” are often removed. However, this is not recommended in w2v since it accepts a list of sentences and removing stop words may alter the meaning of a sentence.

Word Similarity

How does the model know when two words are similar? It uses cosine similarity to compare vectors. Don’t worry, you don’t need to know linear algebra to understand the basic concept behind this.

The cosine of an angle (see green below) is the ratio of the adjacent side of a triangle to the hypotenuse.

This can be used to calculate unknown lengths or angles in triangles. Trigonometry functions are used in electronic communications, signal processing, designing buildings, and many other real-life applications.

My understanding of this part is a little fuzzy and I didn’t find any documentation which explains it simply enough (check out pg 5 of this paper; even the authors aren’t exactly sure what’s going on… but it works!), but here’s the general concept. After turning words into vectors, we can graph those vectors. The cosine of the angle between two word vectors measures how similar the words are: the smaller the angle, the larger (closer to 1) the cosine value. A larger cosine value indicates word vectors with similar meanings.

comparing word vectors

On the left, the angle between the words “gryffindor” and “slytherin” is small, so the cosine of the angle between them is large. This means they are similar, which makes sense: they are both Hogwarts houses. On the right, the words “gryffindor” and “petunia” point in different directions. The angle between them is large, so the cosine of the angle between them is small, which means these words are not similar.

This is an example in a two-dimensional space with only an X and Y axis. In w2v, vectors are in a high-dimensional space meaning there are many more axes or directions in the space.
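The cosine similarity calculation itself is short enough to write out. Here is a plain-Python sketch using made-up 2-D vectors; the word names and numbers are illustrative assumptions, not values from my actual model:

```python
import math

def cosine_similarity(v, w):
    """Cosine of the angle between two vectors: dot(v, w) / (|v| * |w|)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(a * a for a in w))
    return dot / (norm_v * norm_w)

# Made-up 2-D vectors: two that point the same way, one perpendicular.
gryffindor = [1.0, 2.0]
slytherin = [2.0, 4.0]   # same direction as gryffindor
petunia = [4.0, -2.0]    # perpendicular to gryffindor

print(cosine_similarity(gryffindor, slytherin))  # close to 1.0: similar
print(cosine_similarity(gryffindor, petunia))    # close to 0.0: not similar
```

The same formula works unchanged in hundreds of dimensions — only the length of the vectors grows.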

Why Do I Care?

W2V helps solve many difficult real world problems: conversational systems (think chatbots), recommendation algorithms, and knowledge extraction. Many companies use w2v models to interact with their users more effectively.

Stitchfix, an online personal styling service for women, has customers fill out a web form requesting specific styles of clothing. They use a w2v algorithm to parse through this text and determine when to send a box of vacation clothing for a trip to Belize instead of maternity outfits.

Spotify, a music streaming service, uses w2v to find and recommend related artists to users based on the tracks they are streaming. This predicts that a 2pac fan is much more likely to enjoy Notorious B.I.G. than Florence + the Machine.

Dice, an online technology job board, uses w2v to create clusters of similar words in job postings. They can infer that job postings containing “data mining” should be clustered with those that contain “statistics” since these words are similar.

Want to learn more?

Some popular implementations of w2v include: gensim (what I use), Google tensorflow, and Spark MLlib.

W2V is not 100% accurate and it is not wise to assume otherwise. The quality of the model depends on the amount and quality of the data fed into the model. In real world applications, it is typical for a human to review the w2v results before putting its output in production.

front-end and back-end of my app

If you want to learn about how I built the front-end and back-end of my application, read my next post “Behind the Scenes of Word2Vec on Harry Potter”.