Practice NLTK, Word2vec, PCA, wordcloud, Jieba on the Harry Potter Series and Chinese content

Jan 15, 2019

Recently I have been learning NLP and find it quite challenging. NLP seems more complicated than Computer Vision because pixel patterns are universal but language is diverse; the way to build a language model in Chinese is different from English or German. Take tokenization as an example: we can use spaces to tokenize the English sentence “I am a boy”, but how do we tokenize the Chinese sentence “我是一個男孩”?

In this article I will briefly demonstrate how to apply NLTK (Natural Language Toolkit) and Word2Vec to do some basic analysis of the Harry Potter series, then visualize the relationships between word vectors with PCA and word clouds, and finally apply Jieba to Chinese content. Let’s begin!

Data

Prepare the Harry Potter corpus as txt files; you can obtain the txt directly from Google or convert it from pdf. Note that encoding errors are common, so I suggest opening the files with encoding=”utf-8" to avoid potential issues.
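A minimal sketch of loading the text, assuming the book is saved as hp1.txt (the file name is just a placeholder):

```python
# Load the raw text; specifying utf-8 avoids most decoding errors.
with open("hp1.txt", "r", encoding="utf-8") as f:
    content = f.read()

print(len(content))  # number of characters in the raw string
```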

Apply the NLTK library for tokenization and data cleansing

Depending on how you read the txt, the content is most commonly just one long string. We can tokenize it into a list of words or sentences using nltk.word_tokenize(content) or nltk.sent_tokenize(content).
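For example (assuming content holds the raw string loaded above):

```python
import nltk
nltk.download("punkt")  # tokenizer models, only needed once

words = nltk.word_tokenize(content)      # list of word tokens
sentences = nltk.sent_tokenize(content)  # list of sentence strings

print(words[:10])
print(sentences[:2])
```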

Original whole text before tokenization
After tokenization into words
After tokenization into sentences

In NLP, “stopwords” are commonly used but semantically light words, such as “and”, “to”, “this”. We can filter these words out to avoid skewing the analysis. NLTK provides a built-in list via nltk.corpus.stopwords, and we can also create a custom stopword list. Here is an example:
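A rough sketch of the filtering step; the extra junk words added to the set are just illustrative choices:

```python
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")  # only needed once

stop_words = set(stopwords.words("english"))
# extend with custom junk words found in the corpus (illustrative examples)
stop_words.update(["said", "n't", "'s"])

# keep alphabetic tokens that are not stopwords
filtered = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]
```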

After that we can compute some basic statistics, such as the number of words and sentences, lexical diversity, and a dispersion plot.
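For instance, something along these lines (using the filtered tokens and sentences from above):

```python
from nltk import FreqDist

num_words = len(filtered)
num_sentences = len(sentences)
# lexical diversity = unique words / total words
lexical_diversity = len(set(filtered)) / len(filtered)

print(num_words, num_sentences, round(lexical_diversity, 3))
print(FreqDist(filtered).most_common(10))  # most frequent words
```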

Some basic info obtained from Harry Potter Book 1

Dispersion plot

I like the built-in dispersion plot function very much. It visualizes where specific words occur across the whole corpus. From the plot we get a rough idea of the relative importance of the characters (assuming important characters occur more frequently in the corpus). Most of us are already familiar with the characters in Harry Potter.

Even someone who has never read Harry Potter could still guess that Harry is the most important character, while Ron, Hermione and Hagrid are likely important as well.

Dispersion plot of Harry Potter Bk1
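The plot can be produced with NLTK’s built-in helper; the character list here is just an example:

```python
import nltk

# nltk.Text wraps the token list and exposes the dispersion plot
text = nltk.Text(words)
text.dispersion_plot(["Harry", "Ron", "Hermione", "Hagrid", "Dumbledore"])
```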

Let’s repeat the basic analysis above on all 7 books of Harry Potter and make some comparisons.

Total word count after removing punctuation:

The fifth book, Harry Potter and the Order of the Phoenix, has the largest word count, more than double that of the first book!

Word count after removing punctuation across the 7 books

Average sentence length:

The average sentence length follows a similar shape to the total word count graph above, except for book 4, Harry Potter and the Goblet of Fire.

Average sentence length (no. of words) across the 7 books

Lexical diversity:

Lexical diversity measures the richness of the vocabulary: it is the ratio of the number of unique words to the total number of words. Surprisingly, the first few books have higher lexical diversity, probably because they contain far fewer words, almost half as many as the later books, so there is less repetition. Therefore, if someone wants to learn English by reading Harry Potter, the second book, Harry Potter and the Chamber of Secrets, may be the most time-efficient choice.

lexical diversity after removing punctuation across 7 books

Why do the word counts differ from the official numbers?

This is probably due to NLTK’s tokenization behaviour. Whenever NLTK encounters a word containing “n’t”, it splits it into two tokens. As you can see in the picture below, “didn’t” is tokenized into “did” and “n’t” instead of being kept as “didn’t”. Different tokenization methods have their own pros and cons, so choose one depending on your situation. Sometimes text.split() or a regular expression can perform pretty well.

Different way to tokenize a sample sentence
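As a small illustration (the sample sentence is made up), here is how the same sentence comes out under a few approaches:

```python
import re
import nltk

sample = "I didn't see the car."

print(nltk.word_tokenize(sample))     # ['I', 'did', "n't", 'see', 'the', 'car', '.']
print(sample.split())                 # ['I', "didn't", 'see', 'the', 'car.']
print(re.findall(r"[\w']+", sample))  # ['I', "didn't", 'see', 'the', 'car']
```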

Let’s move on to Word2vec; I will use the Word2vec implementation from gensim to demonstrate. What Word2vec does is map words into an embedding space so that similar words lie closer to each other.

Word2vec expects the input to be a list of tokenized sentences, so we can write a function that reads the txt, tokenizes it and builds the list. gensim.utils.simple_preprocess together with a generator (yield) does this nicely. We can also adjust thresholds to keep only words above a minimum or below a maximum length; for example, with min_len=2 the word “a” will not be kept in the model.
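A sketch of such a generator, assuming the book is stored as plain txt (the file name is a placeholder):

```python
import gensim

def read_corpus(path):
    """Yield one tokenized sentence per line of the txt file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            # min_len / max_len control which token lengths are kept,
            # e.g. min_len=2 drops single-letter words like "a"
            tokens = gensim.utils.simple_preprocess(line, min_len=2, max_len=15)
            if tokens:
                yield tokens

train_sentences = list(read_corpus("hp1.txt"))
```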

Then we pass the sentences into gensim.models.Word2Vec, where we can fine-tune hyperparameters such as the window size and embedding size. Generally speaking, the larger the embedding size, the more capacity the model has to learn semantic relationships between word vectors. Training Word2Vec is very fast, much faster than training an LSTM, because of its much simpler structure.
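Training might look like this; the hyperparameter values are illustrative, not the exact ones I used (older gensim versions call vector_size simply size):

```python
from gensim.models import Word2Vec

model = Word2Vec(
    train_sentences,
    vector_size=100,  # embedding size
    window=5,         # context window size
    min_count=2,      # ignore rare words
    workers=4,
)
```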

See the cosine similarity of words:

We can see which words are most similar to a target word. Being similar does not necessarily mean having a similar meaning; rather, we can regard them as having some sort of relationship.
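For example:

```python
# top 10 words closest to "fight" by cosine similarity
print(model.wv.most_similar("fight", topn=10))
```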

top10 most similar words to fight

We try the word “fight” in our Word2vec model and get the 10 words above. The top 3 results “kill”, “die” and “control” are obviously related to “fight”. Initially I had no idea what “kedavra” and “avada” were, but after googling I found that “Avada Kedavra” is the spell that causes instant death in the Harry Potter world. Well, these two words are definitely related to “fight” for sure.

Find the odd one out of a list of words:

Besides finding the most similar words, we can also find the odd word out.
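Something like the following (the word list is just an example):

```python
# returns the word least similar to the rest of the list
print(model.wv.doesnt_match(["harry", "ron", "hermione", "wand"]))
```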

I am glad the model doesn’t return “harry” for me.

Get the cosine similarity between words:

We can also get the similarity score between pairs of words for comparison. When I was young, people always hoped Hermione would end up with Harry. You know what, according to this model maybe Harry should marry Ron instead, since they are more closely related to each other.
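For example:

```python
print(model.wv.similarity("harry", "hermione"))
print(model.wv.similarity("harry", "ron"))
```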

Data visualization:

Since the embeddings have a very high dimension, we have to use PCA or t-SNE to reduce the dimensionality for visualization.

PCA with n=2

Since there are 15,865 word vectors in total in this model, it is impractical to visualize every word vector in a single plot. Therefore I select [‘harry’,’ron’,’hermione’,’wand’,’car’,’train’,’water’] as an example. We can see that the character names ‘harry’, ‘ron’ and ‘hermione’ tend to cluster together, while the transportation words ‘train’ and ‘car’ form another cluster. This is the essence of word2vec: similar words stay closer to each other.
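A sketch of the 2-component PCA plot for those selected words:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words_to_plot = ["harry", "ron", "hermione", "wand", "car", "train", "water"]
vectors = [model.wv[w] for w in words_to_plot]

coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words_to_plot, coords):
    plt.annotate(word, (x, y))
plt.show()
```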

Word Cloud:

Visualization with a word cloud is easy and eye-catching: it shows the top k most frequent words in the corpus. I ran into a problem the first time I generated the word cloud because there were many junk words; as you can see below, even “said” and “n’t” are highlighted.

Wordcloud before
This is the sorted counter of word frequency

Although I had already used nltk.corpus.stopwords, a lot of stopwords and junk words remained. I had to extend the custom stopword list to filter out more useless words. Here is the word cloud after the second round of filtering.

After second filtering
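A sketch of the word cloud generation from the filtered tokens (the size and colour settings are arbitrary):

```python
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt

freq = Counter(filtered)  # filtered is the stopword-cleaned token list from earlier

wc = WordCloud(width=800, height=400, background_color="white")
wc.generate_from_frequencies(freq)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```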

Now let’s move on to the demonstration of Jieba on Chinese content. Jieba handles the tokenization, so that we can then feed the corpus into the gensim word2vec model.

Data:

Wikipedia provides database dumps at different points in time and in different languages. I use the Chinese dump from 1st Jan 2019 as the raw data. We can make use of gensim.corpora.WikiCorpus to read and extract the corpus easily.
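A sketch of the extraction step; the dump file name and output path are placeholders, and the exact WikiCorpus arguments vary slightly between gensim versions:

```python
from gensim.corpora import WikiCorpus

# passing an empty dictionary skips the (slow) dictionary-building step
wiki = WikiCorpus("zhwiki-20190101-pages-articles.xml.bz2", dictionary={})

with open("wiki_zh.txt", "w", encoding="utf-8") as out:
    for tokens in wiki.get_texts():  # each article comes back as a list of tokens
        out.write(" ".join(tokens) + "\n")
```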

Raw extracted content from the Chinese Wikipedia dump

As you can see, even the Chinese dump contains English. Initially I thought of filtering out the English to avoid potential noise in the word2vec model, but the result turned out fine, so in the end I kept it.

Convert into traditional Chinese:

I am a Hongkonger and we use traditional Chinese rather than simplified Chinese, so I have to convert the content into traditional Chinese. Any conversion method will do, but since the corpus is large (~1.2GB), I suggest using a Python package such as opencc. Also load the big5/traditional dictionary from the jieba GitHub repo for better segmentation. There is actually a Cantonese version of the Wikipedia dump available, but I am not sure whether it uses proper written or spoken language, so I chose the standard Chinese version.

converted into traditional Chinese
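A sketch of the conversion with the opencc package (the configuration name and file paths are assumptions):

```python
import opencc

converter = opencc.OpenCC("s2t")  # simplified -> traditional; some builds expect "s2t.json"

with open("wiki_zh.txt", "r", encoding="utf-8") as fin, \
     open("wiki_zh_tw.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(converter.convert(line))
```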

Apply stopwords and tokenization:

This part is similar to the word2vec example on Harry Potter, but this time we use Jieba for stopword removal and tokenization instead of NLTK/gensim.

words after tokenization
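A sketch of the segmentation step; the stopword file and all paths are placeholders:

```python
import jieba

jieba.set_dictionary("dict.txt.big")  # traditional-Chinese dictionary from the jieba repo

# one stopword per line in the stopword file
with open("stopwords_zh.txt", "r", encoding="utf-8") as f:
    zh_stopwords = set(line.strip() for line in f)

with open("wiki_zh_tw.txt", "r", encoding="utf-8") as fin, \
     open("wiki_zh_seg.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        tokens = [t for t in jieba.cut(line) if t.strip() and t not in zh_stopwords]
        fout.write(" ".join(tokens) + "\n")
```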

Put into gensim word2vec model:

This step is the same as in the English version:
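Roughly (the hyperparameters are again illustrative):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams the space-separated, pre-segmented corpus line by line
zh_model = Word2Vec(LineSentence("wiki_zh_seg.txt"),
                    vector_size=200, window=5, min_count=5, workers=4)
```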

Cosine similarity:

Let’s test the word “China”: words such as the Chinese government, a TV station and Shanghai are returned, which makes sense.

Let’s have some fun and test “China”, “Hong Kong” and “Taiwan”.

According to this word2vec model trained on the 2019 wiki dump, Hong Kong is more closely related to China than Taiwan is, which is true.
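The comparison can be reproduced with something like this (zh_model is the Chinese model trained above, and the exact query terms are my assumption):

```python
print(zh_model.wv.most_similar("中國", topn=10))
print(zh_model.wv.similarity("中國", "香港"))
print(zh_model.wv.similarity("中國", "台灣"))
```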

When I query the word “deep learning” or “machine learning”, the model throws an error because there is no such word in the vocabulary.

If I query “artificial intelligence” I get the following result. I see a strange term “機器人學” (robotics), which I guess “maybe” is meant to be our machine learning. I find it hard to believe that Chinese Wikipedia contains so little information about deep learning and machine learning.

Why does this error happen?

  1. (Most likely) During tokenization, the term “深度學習” (deep learning) is probably cut into separate words. This can be solved by adding a user dictionary with jieba.load_userdict(txt_file); see the sketch after this list.
  2. There may be encoding errors during the conversion from simplified to traditional Chinese (less likely).
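A minimal sketch of the userdict fix from point 1 (the file name and entries are placeholders):

```python
import jieba

# userdict.txt: one entry per line in the format "word freq pos", e.g.
# 深度學習 10 n
# 機器學習 10 n
jieba.load_userdict("userdict.txt")

print(list(jieba.cut("深度學習與機器學習")))  # the custom terms should now stay intact
```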

The PCA, word cloud, etc. are the same as in the English version, so I am not going to repeat them here.

I have to say Jieba is the best Chinese NLP tool I have ever used. It supports both simplified and traditional Chinese; people report that Jieba does a better job on simplified Chinese than on traditional Chinese, but I have not experimented with that myself.
