Improving the Interpretation of Topic Models

In my last blog post, I used the Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) algorithms implemented in the amazing Scikit Learn machine learning Python library to find topics in a document collection. The output for each derived topic was a numeric label and a printout of the topic's top words. Printing only the top words for each topic is common practice; many of the topic model browsers, visualisations and examples on the open web do this! Displaying only the top words, however, may not help a user understand what each topic is about or determine the context in which the words are used. It also fails to take advantage of all of the data returned by both algorithms. In this blog post I’ll explain the matrices that both NMF and LDA return, include the code to print out the top documents in a topic, and discuss ideas I have for improving the interpretation of derived topics, especially when lengthy documents are included in the dataset.

Matrices Returned by NMF and LDA

Both NMF and LDA take a bag of words matrix (number of documents * number of words) as input. In the bag of words matrix, documents are represented as rows and words are represented as columns. Both algorithms also require the number of topics (k) to derive as a parameter. The output produced by the topic modelling algorithms is then 2 matrices: a documents to topics matrix (number of documents * k topics) and a topics to words matrix (k topics * number of words). Most topic model output only uses the topics to words matrix and displays the words with the highest weights in a topic. A better understanding of a topic can be gained by also displaying its top documents, which is relatively simple to do with Scikit Learn.
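To make the shapes concrete, here is a minimal sketch that builds a bag of words matrix by hand for three made-up documents and works out the shapes of the two matrices a topic model would return. The documents, the variable names and the use of NumPy are assumptions for illustration only; in practice a vectorizer builds the matrix for you.

```python
import numpy as np

# Three made-up documents, for illustration only.
docs = ["graph of trees", "user interface", "trees and graph minors"]

# Vocabulary: one column per distinct word, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

# Bag of words matrix: rows are documents, columns are words.
bow = np.zeros((len(docs), len(vocab)), dtype=int)
for row, doc in enumerate(docs):
    for word in doc.split():
        bow[row, vocab.index(word)] += 1

k = 2  # number of topics requested

# Shapes of the two matrices a topic model returns for this input.
doc_topic_shape = (bow.shape[0], k)    # documents x topics
topic_word_shape = (k, bow.shape[1])   # topics x words

print(bow.shape, doc_topic_shape, topic_word_shape)
# (3, 7) (3, 2) (2, 7)
```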

Displaying both Top Words and Documents in a Topic

Let’s start by writing a new display_topics() function, which takes both the topics to words matrix (H) and the documents to topics matrix (W) as arguments. The function also needs to take the document collection (documents) and number of top documents (no_top_documents) to display, in addition to the words (feature_names) and number of top words (no_top_words) to display. The display_topics() function prints out a numerical index as the topic name, prints the top words in the topic and then prints the top documents in the topic. The top words and top documents are those with the highest weights in the returned matrices. The argsort() method returns the indexes that would sort a row or column of the matrix in ascending order of weight, so the result is reversed to list the cells with the highest weights first.
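A minimal sketch of such a function is shown here. The argument order, the top_indices() helper and the small hand-made H and W matrices are assumptions made for illustration, not output from a fitted model.

```python
import numpy as np

def top_indices(weights, n):
    """Return the indexes of the n largest weights, largest first."""
    # argsort sorts ascending, so reverse it and keep the first n.
    return np.argsort(weights)[::-1][:n]

def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    # Each row of H holds the word weights for one topic.
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % topic_idx)
        print(" ".join(feature_names[i] for i in top_indices(topic, no_top_words)))
        # Column topic_idx of W holds every document's weight for this topic.
        for doc_idx in top_indices(W[:, topic_idx], no_top_documents):
            print("- " + documents[doc_idx])

# Hand-made matrices: 2 topics, 3 words, 3 documents.
H = np.array([[0.9, 0.1, 0.5],
              [0.0, 0.8, 0.2]])
W = np.array([[0.7, 0.0],
              [0.1, 0.9],
              [0.4, 0.3]])

display_topics(H, W, ["trees", "user", "graph"], ["doc a", "doc b", "doc c"], 2, 2)
# Topic 0:
# trees graph
# - doc a
# - doc c
# Topic 1:
# user graph
# - doc b
# - doc c
```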

We need to obtain the topics to words matrix (H) and the documents to topics matrix (W) from both the NMF and LDA algorithms. The topics to words matrix (H) can be obtained from the components_ attribute of the model after .fit() is called. Getting the documents to topics matrix is a little tricky, but after reading the Scikit Learn API documentation for each algorithm it will all make sense. Calling the transform() method on the fitted model returns the documents to topics matrix (W). The complete code that obtains the H and W matrices from NMF and LDA and then calls the display_topics() function is included below:
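As a sketch of just the matrix-retrieval step, the snippet here fits both models on a small made-up count matrix and reads H and W back. The count matrix, the variable names and the init and random_state settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Made-up count matrix: 4 documents (rows) x 5 words (columns).
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 1],
])
k = 2  # number of topics

# NMF: fit, then read H from components_ and W from transform().
nmf_model = NMF(n_components=k, init="nndsvd", random_state=0).fit(X)
nmf_H = nmf_model.components_   # topics x words
nmf_W = nmf_model.transform(X)  # documents x topics

# LDA: the same two calls apply.
lda_model = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
lda_H = lda_model.components_   # topics x words
lda_W = lda_model.transform(X)  # documents x topics

print(nmf_W.shape, nmf_H.shape)
print(lda_W.shape, lda_H.shape)
```

Both models should report a W of shape (4, 2) and an H of shape (2, 5) for this input.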

An Illustrative Yet Toy Example

A tiny, and I mean really tiny, dataset is used to illustrate printing the top words and documents in a topic. The dataset has 2 topics, which are about user interfaces and graphs/trees, and is made up of 9 really short sentences. The code below uses NMF and LDA to find 2 topics and print the 4 top words and 4 top documents in each topic:

The topics derived from NMF and LDA are displayed below. Both NMF and LDA do a good job of finding what we know are the topics, although the LDA graph/trees topic also picks up the word "human" and one user interface document.

NMF Topics:
Topic 0:
trees graph minors survey
- Graph minors IV: Widths of trees and quasi-ordering
- The intersection graph of paths in trees
- The generation of random, binary, unordered trees
- Graph minors: A survey

Topic 1:
user time response interface
- A survey of user opinion of computer system response time
- Relation of user-perceived response time to error measurement
- The EPS user interface management system
- Human machine interface for Lab ABC computer applications

LDA Topics:
Topic 0:
user response time computer
- A survey of user opinion of computer system response time
- Relation of user-perceived response time to error measurement
- The EPS user interface management system
- Human machine interface for Lab ABC computer applications

Topic 1:
trees graph human minors
- Graph minors IV: Widths of trees and quasi-ordering
- Graph minors: A survey
- The intersection graph of paths in trees
- Human machine interface for Lab ABC computer applications

Ideas for a Topic Modelling Browser

Displaying full documents when printing out a topic is not practical when the documents are long. The technique is only directly useful for short documents (e.g., tweets) or single paragraph documents. A solution would be to display only the snippets within a document that include any of the top words. In a future blog post, I’ll post some Python code that implements this idea and provides a keywords-in-context view of a derived topic. Combining this idea with a visualization tool like LDAvis would make a really useful Topic Model Browser that helps users interpret and explore derived topics.
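As a rough illustration of the snippet idea (not the implementation promised above), the sketch here returns a small window of words around each occurrence of a top word. The keyword_snippets() name, the window parameter and the whitespace tokenisation are all assumptions; overlapping windows are not merged.

```python
import re

def keyword_snippets(document, top_words, window=3):
    """Return short word windows around each occurrence of a top word."""
    tokens = document.split()
    wanted = {word.lower() for word in top_words}
    snippets = []
    for i, token in enumerate(tokens):
        # Strip leading and trailing punctuation before comparing.
        if re.sub(r"^\W+|\W+$", "", token).lower() in wanted:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            snippets.append(" ".join(tokens[start:end]))
    return snippets

text = "Graph minors IV: Widths of trees and quasi-ordering"
print(keyword_snippets(text, ["trees", "graph"], window=2))
# ['Graph minors IV:', 'Widths of trees and quasi-ordering']
```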