LDA Topic Modeling and pyLDAvis Visualization

Xuan Qi
4 min read · Jun 4, 2018


Topic models are a suite of algorithms/statistical models that uncover the hidden topics in a collection of documents. For example, ‘romantic’, ‘scary’, and ‘family’ will appear more often in documents about movies, while ‘technology’, ‘computer’, and ‘algorithm’ will appear more often in computer science documents.

Popular topic modeling algorithms include latent semantic analysis (LSA), hierarchical Dirichlet process (HDP), and latent Dirichlet allocation (LDA), among which LDA has shown excellent results in practice and therefore been widely adopted.

The data comes from the famous American TV show Friends. I scraped the scripts of all six main characters across 224 episodes using Beautiful Soup. The characters are Ross Geller, Rachel Green, Monica Geller, Phoebe Buffay, Joey Tribbiani, and Chandler Bing. This post uses LDA to model the topics in Friends.
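The scraping step looks roughly like the sketch below. The URL and HTML structure here are hypothetical stand-ins for the transcript site I used, which renders each spoken line as a "Name: dialogue" paragraph; adapt the selectors to whatever site you scrape.

```python
# Minimal scraping sketch; the URL pattern and page structure are illustrative.
import requests
from bs4 import BeautifulSoup

MAIN_CAST = {"Ross", "Rachel", "Monica", "Phoebe", "Joey", "Chandler"}

def scrape_episode(url):
    """Return a list of (character, line) pairs for one episode page."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        # Transcript lines typically look like "Ross: We were on a break!"
        if ":" in text:
            speaker, _, dialogue = text.partition(":")
            if speaker.strip() in MAIN_CAST:
                lines.append((speaker.strip(), dialogue.strip()))
    return lines
```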

pyLDAvis

pyLDAvis is an interactive LDA visualization Python package. What do my results look like? Figure 1 shows a screenshot of the pyLDAvis output. The area of each circle represents the importance of that topic over the entire corpus, and the distance between circle centers indicates the similarity between topics. For each topic, the bar chart on the right side lists the top 30 most relevant terms. LDA helped me extract 6 main topics (Figure 1). Take topic one for example: the most relevant terms I saw are hanukkah, fossil, guru, etc. This is very likely a topic for our paleontologist, professor, and Dr. Geller. I have saved my pyLDAvis analysis results into an .html file, which you can download from my GitHub repo. How did I get this cool visual? I will explain the process step by step.

Figure 1.
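Generating and saving the visual takes only a few lines, as in this sketch. It assumes a trained Gensim model (lda_model), its bag-of-words corpus, and dictionary already exist (see the NLP section below); note that in pyLDAvis >= 3.4 the Gensim bridge lives in pyLDAvis.gensim_models, while older versions used pyLDAvis.gensim.

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # pyLDAvis.gensim in old versions

# Build the interactive visualization from the trained model,
# then save it as a standalone .html file you can open in any browser.
vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, "friends_lda.html")
```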

NLP

Before processing the sentences, I first separate the staging directions from the actual dialogue and store them separately in a pandas DataFrame.

Figure 2.
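A sketch of this separation step is below. It assumes (as in most Friends transcripts) that staging directions appear inside parentheses, e.g. "Ross: (entering) Hi everyone.", and that `lines` is the list of (character, line) pairs from the scraping sketch above.

```python
import re
import pandas as pd

def split_line(raw):
    """Return (directions, speech) extracted from one raw dialogue line."""
    directions = re.findall(r"\(([^)]*)\)", raw)       # text inside parens
    speech = re.sub(r"\([^)]*\)", " ", raw).strip()    # everything else
    return " ".join(directions), speech

# `lines` comes from the scraper; one DataFrame row per spoken line.
rows = [(char, *split_line(line)) for char, line in lines]
df = pd.DataFrame(rows, columns=["character", "directions", "speech"])
```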

Now I have six documents, each containing all the sentences spoken by one character. After tokenization and lemmatization, I filtered out stop words (e.g. a, one, and) and kept only words with more than three letters. I then counted the occurrences of each word in each document; this representation is called bag-of-words. I used the Python package Gensim for the LDA analysis. The notebook and code can be found in my repo.
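Here is a sketch of the preprocessing and training pipeline, assuming the six per-character documents are built from the DataFrame above; the 10 topics and 50 iterations match the numbers mentioned later in this post.

```python
import gensim
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Tokenize, lemmatize, drop stop words and words of 3 letters or fewer."""
    tokens = simple_preprocess(text)
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return [t for t in tokens if t not in STOPWORDS and len(t) > 3]

# One document per character: join all of that character's speech.
documents = df.groupby("character")["speech"].apply(" ".join)
texts = [preprocess(doc) for doc in documents]

dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words counts

lda_model = gensim.models.LdaModel(
    corpus, num_topics=10, id2word=dictionary, iterations=50
)
```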

More interesting results

I realized that even within a single Friends character, there are dynamic personalities to explore. Rachel’s personality developed as the show progressed. In season 1, Rachel was just a spoiled rich girl trying to explore the world, but by season 6 she had gradually grown into a strong, independent woman. The words she uses change as her character changes, and this can be reflected in the LDA topic models. I generated 10 × 6 = 60 documents by parsing the .txt files by season: there are 10 seasons and 6 main characters in Friends, so the documents are Rachel in season 1, season 2, and so on; Chandler in season 1, season 2, and so on.
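Building the 60 documents is straightforward, as in this sketch; the file layout and names (one .txt file per character per season, e.g. "rachel_season01.txt") are hypothetical placeholders for however you store the scraped scripts.

```python
from pathlib import Path

characters = ["ross", "rachel", "monica", "phoebe", "joey", "chandler"]

# (character, season) -> raw script text, 6 * 10 = 60 documents in total
season_docs = {}
for char in characters:
    for season in range(1, 11):
        path = Path(f"scripts/{char}_season{season:02d}.txt")
        season_docs[(char, season)] = path.read_text()
```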

Figure 3. Topic models for Rachel by season

After 50 iterations, the Rachel LDA model helped me extract 8 main topics (Figure 3). There is some overlap between topics, but generally the LDA topic model helps me grasp the trend. If you are interested in my results, you can download them from my repo.

Closing Thoughts

How many topics should you choose? Or: how do you evaluate an LDA model?

Latent Dirichlet allocation is trained on unlabeled documents, so one can’t help wondering: how do you evaluate an unsupervised model? LDA is typically evaluated either by measuring performance on some secondary task, such as document classification or information retrieval, or by estimating the probability of unseen held-out documents given some training documents. A better model will, on average, assign a higher probability to held-out documents. There is a very good paper introducing various approaches to evaluating LDA models.
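Gensim exposes two common quantitative checks, sketched below: a held-out log-perplexity bound (related to the held-out probability described above) and topic coherence. The train/held-out split here is purely illustrative; use a proper split in practice.

```python
from gensim.models import CoherenceModel

# Held-out probability: lower perplexity (higher per-word bound) is better.
held_out = corpus[-1:]  # illustrative "held-out" slice, not a real split
print("held-out log-perplexity bound:", lda_model.log_perplexity(held_out))

# Topic coherence: higher c_v scores generally mean more interpretable topics.
coherence = CoherenceModel(
    model=lda_model, texts=texts, dictionary=dictionary, coherence="c_v"
)
print("coherence (c_v):", coherence.get_coherence())
```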

Another way to “evaluate” whether your topic model is good is to follow your instinct: you can usually tell a story about the generated topics when you have a decent model. Take my topic model for example. I assigned 10 topics and, after 50 iterations, got 6 main topics; as I went into each topic, I found that the majority of the words do relate to one character. For example, in topic 4 I saw Gavin, who used to be Rachel’s colleague and happened to like her; Kim, who is Rachel’s ex-boss; Bloomingdale’s, where Rachel used to work; and gossip, well, she does gossip a lot. As you can see, the majority of the words in topic 4 are centered around the character Rachel. I can tell a story about the generated topics, so I would say my model is pretty good.

Figure 4.
