Topic Modeling with LDA and NMF on the ABC News Headlines dataset

Topic modeling is an unsupervised learning approach to clustering documents in order to discover topics based on their contents. It is similar in spirit to how K-Means and Expectation-Maximization work. Because we are clustering documents, we have to process the individual words in each document to discover topics, and assign values to each document based on the distribution of those words. This greatly increases the amount of data we work with, so to handle the processing required for clustering documents, we will use efficient sparse data structures.
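To make the sparsity point concrete, here is a minimal sketch using SciPy's csr_matrix on a made-up toy count matrix (the real document-word matrix we build later is far larger, but just as sparse):

```python
import numpy as np
from scipy import sparse

# A document-word count matrix is mostly zeros: each headline uses only a
# handful of the vocabulary. A sparse format stores just the non-zero entries.
# Toy example: 3 documents, 5 vocabulary words.
dense = np.array([[1, 0, 0, 2, 0],
                  [0, 0, 3, 0, 0],
                  [0, 1, 0, 0, 1]])
A = sparse.csr_matrix(dense)
print(A.nnz, 'stored values instead of', dense.size)  # 5 stored values instead of 15
```

With millions of headlines and thousands of vocabulary words, the dense form would not even fit in memory, which is why the scikit-learn vectorizers used below return sparse matrices by default.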
In this post, we will walk through two different approaches for topic modeling, and compare their results. These approaches are LDA (Latent Dirichlet Allocation) and NMF (Non-negative Matrix Factorization). Let’s talk about each of these before we move on to code. We will look at their definitions, and some basic math that describes how they work.
LDA
LDA, or Latent Dirichlet Allocation, is a probabilistic model. To obtain cluster assignments, it uses two probability values: P( word | topic ) and P( topic | document ). These values start from an initial random assignment, and are then re-estimated for each word in each document to decide its topic assignment. In an iterative procedure, these probabilities are recalculated many times, until the algorithm converges. Let’s look at an example and walk through one step of the algorithm:
We will look at three small documents with two prominent topics: food and pets.

Document 1 is primarily about food (Topic A), Document 2 is about pets (Topic B), and Document 3 is evenly split between A and B, with one unclassified word, ‘Fish’. Using the existing assignments, we will determine the topic to which ‘Fish’ in Document 3 should be assigned.
First, we will calculate the probability of the word appearing under each topic:
P( ‘Fish’ | topic A) = 0.75
P( ‘Fish’ | topic B) = 0.25
Now, we need the probability of the topics in the document with the word in it, which is going to be:
P( topic A | Document 3) = P( topic B | Document 3) = 0.5
because they are evenly split.
Weighing the conclusions from both probabilities, we will assign the word ‘Fish’ in Document 3 to topic A.
The next step of the algorithm would be to repeat this step for all words in each document, and then repeat the entire classification multiple times. With multiple iterations, we will obtain better and better topic classifications each time because we will have more updated information for each document.
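The single re-assignment step described above can be sketched in a few lines of Python. The probabilities here are hard-coded from the ‘Fish’ example, not computed from real data:

```python
# Toy sketch of one LDA re-assignment step for the word 'Fish' in Document 3.
p_word_given_topic = {'A': 0.75, 'B': 0.25}  # P('Fish' | topic)
p_topic_given_doc = {'A': 0.5, 'B': 0.5}     # P(topic | Document 3)

# Weigh both probabilities and assign the word to the most likely topic.
scores = {t: p_word_given_topic[t] * p_topic_given_doc[t] for t in ('A', 'B')}
assignment = max(scores, key=scores.get)
print(assignment)  # 'A'
```

In the full algorithm this step runs for every word in every document, and the counts behind both probability tables are updated after each re-assignment.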
NMF
Non-negative Matrix Factorization is a linear-algebraic model that factors high-dimensional vectors into a low-dimensional representation. Like Principal Component Analysis (PCA), it performs dimensionality reduction; unlike PCA, NMF requires the input vectors to be non-negative, and by factoring them into the lower-dimensional form, it forces the coefficients to also be non-negative.
Given the original matrix A, we can obtain two matrices W and H, such that A= WH. NMF has an inherent clustering property, such that W and H represent the following information about A:
A (Document-word matrix) — input that contains which words appear in which documents.
W (Basis vectors) — the topics (clusters) discovered from the documents.
H (Coefficient matrix) — the membership weights for the topics in each document.
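The shapes involved can be sketched with NumPy, following scikit-learn's convention of documents as rows (the sizes here are made up for illustration):

```python
import numpy as np

# Hypothetical sizes: 3 documents, 6 vocabulary words, 2 topics.
n_docs, n_words, n_topics = 3, 6, 2
rng = np.random.default_rng(0)

A = rng.random((n_docs, n_words))    # document-word matrix
W = rng.random((n_docs, n_topics))   # per-document topic weights
H = rng.random((n_topics, n_words))  # per-topic word weights

# NMF looks for non-negative W and H whose product approximates A.
print((W @ H).shape)  # same shape as A
```

The factorization compresses the (documents x words) matrix into (documents x topics) and (topics x words) pieces, which is where the clustering interpretation comes from.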
We calculate W and H by optimizing an objective function (much like the EM algorithm does), updating both W and H iteratively until convergence. The objective measures the reconstruction error between A and the product of its factors W and H, based on the Euclidean distance:

min ‖A − WH‖² = Σᵢⱼ ( Aᵢⱼ − (WH)ᵢⱼ )², subject to W ≥ 0, H ≥ 0
Using this objective function, the multiplicative update rules for W and H can be derived, and we get:

H ← H ⊙ (WᵀA) / (WᵀWH)
W ← W ⊙ (AHᵀ) / (WHHᵀ)

where ⊙ and the division are element-wise. The updated values are calculated in parallel element-wise operations, and using the new W and H, we re-calculate the reconstruction error, repeating this process until convergence.
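The whole update loop fits in a few lines of NumPy. This is a bare-bones sketch of the multiplicative-update scheme, not a replacement for the scikit-learn implementation used later:

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-9, seed=0):
    # Minimal NMF via the multiplicative updates above (Euclidean objective).
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k)) + eps   # positive init keeps all updates non-negative
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # update H
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # update W
    return W, H

A = np.random.default_rng(1).random((6, 4))
W, H = nmf_multiplicative(A, k=2)
error = np.linalg.norm(A - W @ H)  # reconstruction error after the iterations
```

Because every factor starts positive and each update multiplies by a non-negative ratio, W and H stay non-negative throughout, which is what gives NMF its interpretability.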
Code
Now, I will show how to implement these two algorithms. We will apply topic modeling to the A Million News Headlines dataset from ABC News (published on Kaggle: https://www.kaggle.com/therohk/million-headlines)
Imports:
The main libraries we are using in this project are Pandas and Numpy for their data structures, Scipy for sparse operations, Gensim (an open-source library that has different Topic Modeling modules) for LDA, and SKLearn (an open-source Machine Learning library) for NMF.
import pandas as pd
import numpy as np
import scipy as sp
import sklearn
import sys
from nltk.corpus import stopwords
import nltk
from gensim.models import ldamodel
import gensim.corpora
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
import pickle
Data Loading and Pre-processing
We are using the ABC News headlines dataset. Some lines are badly formatted (very few), so we are skipping those.
data = pd.read_csv('data/abcnews-date-text.csv', error_bad_lines=False)

# We only need the headline_text column from the data
data_text = data[['headline_text']]
We need to remove stopwords first. Casting all values to strings will make them easier to iterate over, because it removes most edge cases:
data_text = data_text.astype('str')
for idx in range(len(data_text)):
    # go through each word in each data_text row, remove stopwords, and set them on the index
    data_text.iloc[idx]['headline_text'] = [word for word in data_text.iloc[idx]['headline_text'].split(' ') if word not in stopwords.words()]
    # print logs to monitor output
    if idx % 1000 == 0:
        sys.stdout.write('\rc = ' + str(idx) + ' / ' + str(len(data_text)))

# save data because it takes very long to remove stopwords
pickle.dump(data_text, open('data_text.dat', 'wb'))

# get the words as an array for lda input
train_headlines = [value[0] for value in data_text.iloc[0:].values]
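As a side note, one reason the stopword pass is slow is that stopwords.words() rebuilds its list on every call; caching it in a set makes membership tests O(1). A sketch with a small illustrative stop list (the actual code uses NLTK's full list):

```python
# Illustrative stop list; the real loop above uses nltk's stopwords.words().
stop_set = {'the', 'a', 'an', 'to', 'of', 'in', 'on'}

def remove_stopwords(headline):
    # Set membership is O(1), so this stays fast over a million headlines.
    return [w for w in headline.split(' ') if w not in stop_set]

print(remove_stopwords('police investigate the fire'))  # ['police', 'investigate', 'fire']
```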
Implementing LDA:
Initialize the number of Topics we need to cluster:
num_topics = 10
We will use the gensim library for LDA. First, we obtain an id-to-word dictionary. For each headline, we use this dictionary to obtain a mapping of word ids to their word counts. The LDA model uses both of these mappings.
id2word = gensim.corpora.Dictionary(train_headlines)
corpus = [id2word.doc2bow(text) for text in train_headlines]
lda = ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics)
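To see what the dictionary and corpus hold, here is what doc2bow does, sketched in plain Python on two made-up headlines:

```python
headlines = [['police', 'probe', 'fire'], ['fire', 'crews', 'fire']]

# Assign each token an integer id in order of first appearance,
# like gensim's Dictionary does.
id2word = {}
for doc in headlines:
    for tok in doc:
        id2word.setdefault(tok, len(id2word))

# doc2bow maps each document to a list of (word id, count) pairs.
corpus = [sorted((id2word[t], doc.count(t)) for t in set(doc)) for doc in headlines]
print(corpus)  # [[(0, 1), (1, 1), (2, 1)], [(2, 2), (3, 1)]]
```

This sparse (id, count) representation is exactly what the LDA model consumes, one list per headline.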
Generating LDA topics:
We will iterate over the number of topics, get the top words in each cluster, add them to a DataFrame, and then print these words.
def get_lda_topics(model, num_topics):
    word_dict = {}
    for i in range(num_topics):
        words = model.show_topic(i, topn=20)
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = [word for word, prob in words]
    return pd.DataFrame(word_dict)
Call the function and obtain the topics:
get_lda_topics(lda, num_topics)
Implementing NMF
For NMF, we need to obtain a design matrix. To improve results, I am going to apply a TfIdf transformation to the counts.
# the count vectorizer module needs string inputs, not arrays, so I join them with a space.
# This is a very quick operation.
train_headlines_sentences = [' '.join(text) for text in train_headlines]
Now, we obtain a Counts design matrix, for which we use SKLearn’s CountVectorizer module. The transformation will return a matrix of size (Documents x Features), where the value of a cell is going to be the number of times the feature (word) appears in that document.
To reduce the size of the matrix and speed up computation, we will set the maximum feature size to 5000, which keeps only the 5000 most frequent words as features.
vectorizer = CountVectorizer(analyzer='word', max_features=5000)
x_counts = vectorizer.fit_transform(train_headlines_sentences)
Next, we set up a TfIdf transformer, and transform the counts with the model.
transformer = TfidfTransformer(smooth_idf=False)
x_tfidf = transformer.fit_transform(x_counts)
And now we normalize the TfIdf values to unit length for each row:
xtfidf_norm = normalize(x_tfidf, norm='l1', axis=1)
And finally, obtain an NMF model, and fit it with the sentences.
# obtain an NMF model
model = NMF(n_components=num_topics, init='nndsvd')
# fit the model
model.fit(xtfidf_norm)
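For intuition, the TfIdf weighting applied above can be sketched by hand on a toy count matrix. With smooth_idf=False, scikit-learn uses idf = ln(n/df) + 1; the row normalization it applies afterwards is omitted here for clarity:

```python
import numpy as np

# Toy count matrix: 2 documents, 3 words.
counts = np.array([[2., 0., 1.],
                   [1., 1., 0.]])
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)      # document frequency of each word
idf = np.log(n_docs / df) + 1.0    # rarer words get a higher weight
tfidf = counts * idf
print(tfidf[0])  # word 0 appears in every document, so its weight stays at its raw count
```

The effect is that words appearing in nearly every headline are down-weighted relative to rarer, more topic-specific words, which is why TfIdf tends to sharpen the NMF topics.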
Generating NMF topics:
We are going to iterate over each topic, obtain the most important scoring words in each cluster, and add them to a Dataframe, as before.
def get_nmf_topics(model, n_top_words):
    # the word ids obtained need to be reverse-mapped to the words, so we can print the topics
    feat_names = vectorizer.get_feature_names_out()
    word_dict = {}
    for i in range(num_topics):
        # for each topic, obtain the largest values, and add the words they map to into the dictionary
        words_ids = model.components_[i].argsort()[:-n_top_words - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = words
    return pd.DataFrame(word_dict)
Call the function and obtain the topics:
get_nmf_topics(model, 20)
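The argsort slice in the function above is a compact way to take the indices of the n largest values in descending order; a quick illustration on a toy score vector:

```python
import numpy as np

scores = np.array([0.1, 0.7, 0.3, 0.9])
n_top_words = 3
# argsort sorts ascending; the reversed slice walks it backwards,
# stopping after n_top_words entries.
top_ids = scores.argsort()[:-n_top_words - 1:-1]
print(top_ids)  # indices of the three largest scores, largest first
```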
Results
The two tables above, in each section, show the results from LDA and NMF on the headlines dataset. There is some coherence between the words in each clustering. For example, Topic #02 in LDA shows words associated with shootings and violent incidents, as evidenced by words such as “attack”, “killed”, “shooting”, “crash”, and “police”. Other topics show different patterns.
On the other hand, comparing the results of LDA to NMF also shows that NMF performs better. Looking at NMF’s Topic #01, we can see many first names clustered into the same category, along with the word “interview”. This type of headline is very common in news articles, with wording similar to “Interview with John Smith”, or “Interview with James C. on …”.
We also see two topics related to violence. First, Topic #03 focuses on police-related terms, such as “probe”, “missing”, “investigate”, “arrest”, and “body”. Second, Topic #08 focuses on assault terms, such as “murder”, “stabbing”, “guilty”, and “killed”. This is an interesting split between the topics: although the terms in each are very closely related, one focuses more on police-related activity, and the other more on criminal activity. Along with the first cluster, which groups first names together, these results show that NMF (using TfIdf) performs much better than LDA.
Final code
You can view my final code on GitHub or Kaggle, or below:

