Topic Modeling with Scikit Learn
Latent Dirichlet Allocation (LDA) is an algorithm used to discover the topics that are present in a corpus. A few open source libraries exist, but if you are using Python then the main contender is Gensim. Gensim is an awesome library and scales really well to large text corpora. Gensim, however, does not include Non-negative Matrix Factorization (NMF), which can also be used to find topics in text. The mathematical basis underpinning NMF is quite different from LDA. I have found it interesting to compare the results of both algorithms and have found that NMF sometimes produces more meaningful topics for smaller datasets. NMF has been included in Scikit Learn for quite a while but LDA has only recently (late 2015) been included. The great thing about using Scikit Learn is that it brings API consistency, which makes it almost trivial to perform Topic Modeling using both LDA and NMF. Scikit Learn also includes seeding options for NMF, which greatly help with algorithm convergence, and offers both online and batch variants of LDA.
How do LDA and NMF work?
I won’t go into any lengthy mathematical detail; there are many blog posts and academic journal articles that do. While LDA and NMF have differing mathematical underpinnings, both algorithms are able to return the documents that belong to a topic in a corpus and the words that belong to a topic. LDA is based on probabilistic graphical modeling while NMF relies on linear algebra. Both algorithms take as input a bag of words matrix (i.e., each document is represented as a row, and each column contains the count of one vocabulary word in that document). The aim of each algorithm is then to produce 2 smaller matrices: a document to topic matrix and a word to topic matrix that, when multiplied together, reproduce the bag of words matrix with the lowest error.
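To make the two-matrix idea concrete, here is a minimal sketch using NMF on a tiny hand-made count matrix. The matrix values, the choice of 2 topics and the variable names are assumptions made purely for illustration; they are not taken from the 20 Newsgroups experiment.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy bag of words matrix (assumed values):
# 4 documents (rows) x 6 vocabulary terms (columns).
X = np.array(
    [
        [2, 1, 1, 0, 0, 0],
        [1, 2, 0, 0, 0, 1],
        [0, 0, 0, 2, 1, 1],
        [0, 0, 1, 1, 2, 0],
    ],
    dtype=float,
)

model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(X)  # document to topic weights, shape (4, 2)
H = model.components_       # topic to word weights, shape (2, 6)

# Multiplying the two smaller matrices approximates the original matrix.
reconstruction = W @ H
error = np.linalg.norm(X - reconstruction)
print(W.shape, H.shape, error)
```

Scikit Learn stores the second matrix with topics as rows and words as columns, so each row of `H` can be read as one topic.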
No magic here — you need to specify the number of topics!
How many topics? Well, that is the question! Neither NMF nor LDA is able to automatically determine the number of topics, so this must be specified.
I searched far and wide for an exciting dataset and finally selected the 20 Newsgroups dataset. I’m just being sarcastic: I selected a dataset that is both easy to interpret and easy to load in Scikit Learn. The dataset is easy to interpret because the 20 newsgroups are known, so the generated topics can be compared to the known topics being discussed. Headers, footers and quotes are excluded from the dataset.
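A minimal sketch of loading the dataset with headers, footers and quotes removed might look like the following. The helper name `load_newsgroup_documents` and the `random_state` value are assumptions for illustration, and the first call downloads the data, so it needs network access.

```python
from sklearn.datasets import fetch_20newsgroups


def load_newsgroup_documents(random_state=1):
    """Return the 20 Newsgroups posts as a list of strings.

    Hypothetical helper: the name and default seed are illustrative.
    By default fetch_20newsgroups returns the training subset.
    """
    dataset = fetch_20newsgroups(
        shuffle=True,
        random_state=random_state,
        remove=("headers", "footers", "quotes"),
    )
    return dataset.data


# Usage (downloads the dataset on first call):
# documents = load_newsgroup_documents()
```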
The creation of the bag of words matrix is very easy in Scikit Learn; all the heavy lifting is done by the feature extraction functionality provided for text datasets. For NMF, the TfidfVectorizer is used so that the bag of words matrix is weighted by tf-idf. LDA, on the other hand, being a probabilistic graphical model (i.e., dealing with probabilities), only requires raw counts, so a CountVectorizer is used. In both cases stop words are removed and the number of terms included in the bag of words matrix is restricted to the top 1000.
NMF and LDA with Scikit Learn
As mentioned previously, the algorithms are not able to automatically determine the number of topics, so this value must be set when running each algorithm. Comprehensive documentation of the available parameters exists for both NMF and LDA. Initialising the W and H matrices in NMF with ‘nndsvd’ rather than random initialisation reduces the time it takes for NMF to converge. LDA can be set to run in either batch or online mode.
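Fitting both models might look like the sketch below. The toy corpus, the choice of 2 topics and the specific values for `max_iter` and `random_state` are assumptions for illustration; the parameter names (`n_components`, `init`, `learning_method`) follow recent Scikit Learn releases.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (assumed) so the example runs without a download.
documents = [
    "The hockey team won the game last season",
    "The team played a great hockey game",
    "The disk drive uses a scsi controller",
    "My hard drive and floppy disk controller failed",
]

no_topics = 2  # must be chosen up front for both algorithms

tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)
tf = CountVectorizer(stop_words="english").fit_transform(documents)

# NMF on tf-idf features, seeded with NNDSVD instead of random values.
nmf = NMF(n_components=no_topics, init="nndsvd", random_state=1).fit(tfidf)

# LDA on raw counts, using the online variant; "batch" is the alternative.
lda = LatentDirichletAllocation(
    n_components=no_topics,
    learning_method="online",
    max_iter=5,
    random_state=0,
).fit(tf)

print(nmf.components_.shape, lda.components_.shape)
```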
Displaying and Evaluating Topics
The structure of the resulting matrices returned by both NMF and LDA is the same, and the Scikit Learn interface to access the returned matrices is also the same. This is great and allows for a common Python function that is able to display the top words in a topic. Topics are not labeled by the algorithm; only a numeric index is assigned.
The derived topics from NMF and LDA are displayed below. From the NMF derived topics, Topic 0 and 8 don’t seem to be about anything in particular but the other topics can be interpreted based upon their top words. LDA for the 20 Newsgroups dataset produces 2 topics with noisy data (i.e., Topic 4 and 7) and also some topics that are hard to interpret (i.e., Topic 3 and Topic 9). I’d say that NMF was able to find more meaningful topics in the 20 Newsgroups dataset.
NMF topics:
Topic 0: people don think like know time right good did say
Topic 1: windows file use dos files window using program problem card
Topic 2: god jesus bible christ faith believe christian christians church sin
Topic 3: drive scsi drives hard disk ide controller floppy cd mac
Topic 4: game team year games season players play hockey win player
Topic 5: key chip encryption clipper keys government escrow public use algorithm
Topic 6: thanks does know mail advance hi anybody info looking help
Topic 7: car new 00 sale price 10 offer condition shipping 20
Topic 8: just like don thought ll got oh tell mean fine
Topic 9: edu soon cs university com email internet article ftp send
LDA topics:
Topic 0: government people mr law gun state president states public use
Topic 1: drive card disk bit scsi use mac memory thanks pc
Topic 2: said people armenian armenians turkish did saw went came women
Topic 3: year good just time game car team years like think
Topic 4: 10 00 15 25 12 11 20 14 17 16
Topic 5: windows window program version file dos use files available display
Topic 6: edu file space com information mail data send available program
Topic 7: ax max b8f g9v a86 pl 145 1d9 0t 34u
Topic 8: god people jesus believe does say think israel christian true
Topic 9: don know like just think ve want does use good
In my next blog post, I’ll discuss topic interpretation and show how top documents within a theme can also be displayed.
Full Code Listing
It’s amazing how much can be achieved with just 36 lines of Python code and some Scikit Learn magic. The full code listing is provided below: