Topic Modeling — LDA Mallet Implementation in Python — Part 2

Senol Kurt · Published in The Startup · Jun 29, 2020 · 5 min read

In Part 1, we created our dictionary and corpus, and now we are ready to build our model. Let’s start by installing the Mallet package:

!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip

Unzip the zip file:

!unzip mallet-2.0.8.zip

We should define the path to the Mallet binary to pass to the LdaMallet wrapper:

mallet_path = '/content/mallet-2.0.8/bin/mallet'

There is just one thing left before we can build our model: we should specify the number of topics in advance. Although there isn’t an exact method for deciding on the number of topics, in the last section we will compare models with different numbers of topics based on their coherence scores.

For now, build the model for 10 topics (this may take some time based on your corpus):

import gensim

ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=10, id2word=id2word)

Let’s display the 10 topics formed by the model. For each topic, we will print its top 10 terms and their relative weights, in descending order (using pretty print for a better view).

from pprint import pprint
# display topics
pprint(ldamallet.show_topics(formatted=False))

Note that the model returns only the clustered terms, not labels for those clusters; labeling the topics is up to us.

We can calculate the coherence score of the model to compare it with others.

from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_ready, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('Coherence Score: ', coherence_ldamallet)

It’s a good practice to pickle our model for later use.

import pickle
pickle.dump(ldamallet, open("drive/My Drive/ldamallet.pkl", "wb"))

You can load the pickle file as below:

ldamallet = pickle.load(open("drive/My Drive/ldamallet.pkl", "rb"))

We can get the topic modeling results (the distribution of topics for each document) by passing the corpus to the model. You can also pass in a specific document; for example, ldamallet[corpus[0]] returns the topic distribution for the first document. For all documents, we write:

tm_results = ldamallet[corpus]

We can get the dominant topic of each document as below:

# for each document, keep the (topic_id, probability) pair with the highest probability
corpus_topics = [sorted(topics, key=lambda record: -record[1])[0] for topics in tm_results]

To get the most probable words for a given topic id, we can use the show_topic() method. It returns a sequence of probable words as a list of (word, word_probability) pairs for the specified topic. You can get the top 20 significant terms and their probabilities for each topic as below:

topics = [[(term, round(wt, 3)) for term, wt in ldamallet.show_topic(n, topn=20)] for n in range(0, ldamallet.num_topics)]

We can create a dataframe for term-topic matrix:

import pandas as pd

# term-topic matrix: one column per topic, top 20 terms as rows
topics_df = pd.DataFrame([[term for term, wt in topic] for topic in topics],
                         columns=['Term'+str(i) for i in range(1, 21)],
                         index=['Topic '+str(t) for t in range(1, ldamallet.num_topics+1)]).T
topics_df.head()

Another option is to display all the terms for a topic in a single row as below:

# widen the column display so the full term lists are visible (newer pandas versions use None instead of -1)
pd.set_option('display.max_colwidth', -1)
topics_df = pd.DataFrame([', '.join([term for term, wt in topic]) for topic in topics], columns=['Terms per Topic'], index=['Topic '+str(t) for t in range(1, ldamallet.num_topics+1)])
topics_df

WordClouds

Visualizing the terms as word clouds is also a good option for presenting topics. Below we create a word cloud for each topic. The font size of a word shows its relative weight in the topic.
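Here is a minimal sketch using the third-party wordcloud package (pip install wordcloud); it assumes the 10-topic model built above and feeds each topic’s (term, weight) pairs from show_topic() into generate_from_frequencies:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# one subplot per topic: a 2x5 grid for the 10-topic model
fig, axes = plt.subplots(2, 5, figsize=(20, 8))
for topic_id, ax in enumerate(axes.flatten()):
    # the weights from show_topic() drive the relative font sizes
    word_freqs = dict(ldamallet.show_topic(topic_id, topn=20))
    wc = WordCloud(background_color='white').generate_from_frequencies(word_freqs)
    ax.imshow(wc, interpolation='bilinear')
    ax.set_title('Topic ' + str(topic_id + 1))
    ax.axis('off')
plt.tight_layout()
plt.show()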

Visualization with pyLDAvis

“pyLDAvis” is also a visualization library for presenting topic models. To use this library, you need to convert the LdaMallet model to a gensim model. Below is the conversion method that I found on Stack Overflow.
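The snippet builds an empty gensim LdaModel with the same vocabulary and hyperparameters and copies Mallet’s word–topic counts into its state. This is a reconstruction of that widely shared answer; gensim 3.x also ships an equivalent helper, malletmodel2ldamodel, in gensim.models.wrappers.ldamallet:

from gensim.models.ldamodel import LdaModel

def convertldaMalletToldaGen(mallet_model):
    # build an empty gensim LDA model with matching vocabulary and hyperparameters
    model_gensim = LdaModel(
        id2word=mallet_model.id2word,
        num_topics=mallet_model.num_topics,
        alpha=mallet_model.alpha,
        eta=0,
    )
    # copy Mallet's word-topic counts into the gensim model's state
    model_gensim.state.sstats[...] = mallet_model.wordtopics
    model_gensim.sync_state()
    return model_gensim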

After defining the function, we call it, passing in our “ldamallet” model:

ldagensim = convertldaMalletToldaGen(ldamallet)

Then, we need to transform the topic model distributions and related corpus data into the data structures needed for the visualization, as below:

import pyLDAvis
import pyLDAvis.gensim as gensimvis

vis_data = gensimvis.prepare(ldagensim, corpus, id2word, sort_topics=False)
pyLDAvis.display(vis_data)

You can hover over the bubbles and see the 30 most relevant words for each topic on the right.
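If you want to share the interactive plot outside the notebook, pyLDAvis can also write it to a standalone HTML file (the filename here is just an example):

# save the interactive visualization as a standalone HTML file
pyLDAvis.save_html(vis_data, 'lda_vis.html')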

Dominant Topics for Each Document

We can create a dataframe that shows the dominant topic for each document and that topic’s percentage contribution to the document.

# create a dataframe
corpus_topic_df = pd.DataFrame()
# get the Titles from the original dataframe
corpus_topic_df['Title'] = df.Title
corpus_topic_df['Dominant Topic'] = [item[0]+1 for item in corpus_topics]
corpus_topic_df['Contribution %'] = [round(item[1]*100, 2) for item in corpus_topics]
corpus_topic_df['Topic Terms'] = [topics_df.iloc[t[0]]['Terms per Topic'] for t in corpus_topics]
corpus_topic_df.head()

We can use the pandas groupby function on the “Dominant Topic” column to get the document counts for each topic and, by chaining the agg function, each topic’s percentage share of the corpus.
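A minimal sketch of that aggregation (the column names are my own) looks like this:

# document count per dominant topic, plus each topic's share of the corpus
topic_stats_df = (corpus_topic_df.groupby('Dominant Topic')
                  .agg(Doc_Count=('Dominant Topic', 'size'))
                  .reset_index())
topic_stats_df['% of Corpus'] = round(topic_stats_df['Doc_Count'] * 100 / len(corpus), 2)
topic_stats_df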

We can also get which document makes the highest contribution to each topic:

corpus_topic_df.groupby('Dominant Topic').apply(lambda topic_set: topic_set.sort_values(by=['Contribution %'], ascending=False).iloc[0]).reset_index(drop=True)

That’s it for Part 2. In the next part, we will analyze topic distributions over time. I’d like to hear your feedback and comments. You can also contact me on LinkedIn.

Happy coding!
