Topic Modeling — LDA Mallet Implementation in Python — Part 3

Senol Kurt
Published in The Startup · 6 min read · Jul 1, 2020

In Part 2, we ran the model and started to analyze the results. Here, we will look at how topic distributions change over time. Let’s get started!

As you may recall, we defined a variable tm_results to store the topic distributions for each document. Let’s pprint the first item of tm_results to see what it looks like:

pprint(tm_results[0])

As seen above, the first item is a list of tuples giving the topic distribution for the first document. Since tm_results stores the topic distributions for all documents, we can create a dataframe to analyze the data. We convert each record to a dictionary and then use pandas.DataFrame.from_records to transform tm_results into a dataframe:

df_weights = pd.DataFrame.from_records([{topic: weight for topic, weight in row} for row in tm_results])
df_weights.columns = ['Topic ' + str(i) for i in range(1, 11)]
df_weights

We can add a “Year” column and compute the yearly average of the topic weights:

df_weights['Year'] = df.Year
df_weights.groupby('Year').mean()

As you can see, the yearly average topic weights are quite close to each other. In my experience, Mallet generally produces close probabilities across topics. Therefore, I prefer to find the dominant topic for each document and then do the analysis based on dominant topics.

To find which topic is dominant in each document, we can use the pandas.DataFrame.idxmax() function. It returns the index of the maximum value over the requested axis. But first, we need to drop the ‘Year’ column so that only the 10 topic columns remain. Then we can take the column label of the maximum value in each row and assign it to a ‘Dominant’ column.

df_weights['Dominant'] = df_weights.drop('Year', axis=1).idxmax(axis=1)
df_weights.head()

Now, we can get the percentage of dominant topics in a given year by grouping the dataframe by the ‘Year’ column and calling value_counts(normalize=True) on the ‘Dominant’ column. This chain returns a multi-index pandas Series. To convert it to a dataframe where rows are “years” and columns are “topics”, we chain unstack() at the end.

df_dominance = df_weights.groupby('Year')['Dominant'].value_counts(normalize=True).unstack()
df_dominance

As you can see from the above output, the trends are much clearer now. We can also get trends for each “journal”. First, we add a “Journal” column to the df_weights dataframe (a short sketch of this step follows below) and then perform similar steps to the ones above.
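Here is a minimal sketch of that step; it assumes df, just like with the ‘Year’ column earlier, has a ‘Journal’ column whose rows line up with df_weights:

# Assumption: df has a 'Journal' column aligned row-by-row with df_weights.
df_weights['Journal'] = df.Journal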

df_journals = df_weights.groupby(['Journal', 'Year'])['Dominant'].value_counts(normalize=True).unstack()
df_journals.head(15)

We can also plot topic distributions over time for each journal. To do this, we first need to reset the index of df_journals. Then we need the topic prevalence values for each journal and each year. We can get these values with the pandas melt() function, which unpivots a dataframe from wide format to long format. The parameters of melt() are:

  • id_vars: the ‘Journal’ and ‘Year’ columns are used as identifier variables.
  • value_vars: Topic 1 to Topic 10.
  • var_name: ‘Topic’ is the name used for the ‘variable’ column.
  • value_name: ‘Prevalence’ is the name used for the ‘value’ column.

df_journals.reset_index(inplace=True)
df_melted = df_journals.melt(id_vars=['Journal', 'Year'], value_vars=['Topic ' + str(i) for i in range(1,11)], var_name='Topic', value_name='Prevalence')
df_melted

As you can see above, there is a prevalence value per journal per year per topic (15 journals x 10 years x 10 topics = 1500 rows). Now we can visualize the prevalence values for each journal with the seaborn relplot() function.

sns.relplot(x='Year', y="Prevalence", col="Journal", col_wrap=3, hue='Topic', data=df_melted, kind="line", height=10, style="Topic", dashes=False, ci=None)

Since we use fictitious publication years and journal names, it’s unlikely we will get real insights from the graph, but plots like these are useful for interpreting how topics change over time. It’s also difficult to distinguish 10 lines in a single plot; line plots of a smaller subset of topics are easier to read (a quick sketch follows below).
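For example, here is a quick sketch of such a reduced plot; the journal and topic choices below are only placeholders:

import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical example: focus on one journal and three topics so the lines stay distinguishable.
journal = df_melted['Journal'].iloc[0]
subset = df_melted[(df_melted['Journal'] == journal) & (df_melted['Topic'].isin(['Topic 1', 'Topic 2', 'Topic 3']))]
sns.lineplot(x='Year', y='Prevalence', hue='Topic', data=subset)
plt.title(journal)
plt.show()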

We can also use the pandas DataFrame.plot.area() function to produce stacked area plots for easier interpretation. It plots each column of the dataframe as an area on the chart. So we first set_index on the “Journal” and “Year” columns so that only Topic 1 to Topic 10 remain as columns. Then we loop over the 15 journals, take a cross-section from the dataframe with the pandas DataFrame.xs() function, and plot area charts over the 10-year period, as sketched below.
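Here is a minimal sketch of these steps, assuming the df_journals dataframe built above (with its index already reset); the fillna(0) call treats topics that were never dominant in a given year as a zero share:

import matplotlib.pyplot as plt

# Re-index on 'Journal' and 'Year' so that only the topic columns remain as data.
df_area = df_journals.set_index(['Journal', 'Year']).fillna(0)

# One stacked area plot per journal: take the journal's cross-section
# (years as rows, topics as columns) and plot each topic column as an area.
for journal in df_area.index.get_level_values('Journal').unique():
    ax = df_area.xs(journal, level='Journal').plot.area(figsize=(12, 6))
    ax.set_title(journal)
    ax.set_ylabel('Prevalence')
    plt.show()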

You will get one plot for each journal, and I find it much easier to interpret topic distributions over time with stacked area plots. That’s it for this part of the analysis; now we can move on to finding the optimal number of topics, as promised earlier.

Optimal Topic Numbers

To find the optimal number of topics, we run the model for several numbers of topics, compare the coherence score of each model, and then pick the model with the highest coherence score. Below you can find a function for computing coherence scores over a specified range of topic numbers.
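Here is a minimal sketch of that function, reconstructed from its call signature; it assumes gensim’s LdaMallet wrapper (gensim < 4.0) and a mallet_path variable that points to your local Mallet installation, as set up earlier in the series:

import multiprocessing
from tqdm import tqdm
from gensim.models import CoherenceModel
from gensim.models.wrappers import LdaMallet

def topic_model_coherence_generator(corpus, texts, dictionary, start_topic_count=2, end_topic_count=10, step=1, cpus=1):
    models, coherence_scores = [], []
    # Interpret cpus=-1 as "use all available cores" (an assumption of this sketch).
    workers = multiprocessing.cpu_count() if cpus == -1 else cpus
    for topic_nums in tqdm(range(start_topic_count, end_topic_count + 1, step)):
        # Train a Mallet LDA model with the current number of topics.
        mallet_lda = LdaMallet(mallet_path=mallet_path, corpus=corpus, num_topics=topic_nums, id2word=dictionary, workers=workers)
        # Score it with the c_v coherence measure.
        cv = CoherenceModel(model=mallet_lda, corpus=corpus, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_scores.append(cv.get_coherence())
        models.append(mallet_lda)
    return models, coherence_scores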

We run the function for up to 50 topics in steps of 2. Note that this will take some time; tqdm provides a progress bar that you can follow. For my corpus, it takes approximately 20 minutes.

lda_models, coherence_scores = topic_model_coherence_generator(corpus=corpus, texts=data_ready, dictionary=id2word, start_topic_count=2, end_topic_count=50, step=2, cpus=-1)

We can plot the coherence scores per topic numbers as below:

x_ax = range(2, 51, 2)  # topic counts must match the run above (2 to 50 in steps of 2)
y_ax = coherence_scores
plt.rcParams['figure.facecolor'] = 'white'
plt.figure(figsize=(12, 6))
plt.plot(x_ax, y_ax, c='r')
plt.axhline(y=0.43, c='k', linestyle='--', linewidth=2)
plt.xlabel('Number of Topics')
plt.ylabel('Coherence Score')
plt.show()

As seen in the plot, the coherence score rises rapidly and then flattens out at around 12 topics. Therefore, 12 topics seems like a good choice for this corpus.
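If you would rather reuse the already-trained 12-topic model instead of retraining it, you can pick it out of lda_models; this small sketch assumes the models are stored in the same order as the topic counts passed to the generator:

# The models were trained for 2, 4, ..., 50 topics, so index the list
# by the position of the chosen topic count.
topic_counts = list(range(2, 51, 2))
best_model = lda_models[topic_counts.index(12)]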

Conclusion

We have come a long way in performing a topic modeling analysis. First, we did some exploratory data analysis to understand the dataset. Then we preprocessed the data to make it ready for the model. We ran the model and interpreted the results. Later, we looked at topic distributions over time, and finally, we tried to find an optimal number of topics.

Thank you very much for reading this far. I hope you enjoyed the series and found it helpful. I would be glad to hear any feedback and comments. You can also contact me on LinkedIn.

Happy coding!

References:

  1. Sarkar, D. (2016). Text Analytics with Python.
  2. https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#14computemodelperplexityandcoherencescore
  3. https://jeriwieringa.com/2017/06/21/Calculating-and-Visualizing-Topic-Significance-over-Time-Part-1/
  4. https://www.tutorialspoint.com/gensim/gensim_creating_lda_mallet_model.htm
