Credits: Mario Tama, News Getty Images

Topic Modeling for The New York Times News Dataset

An Nonnegative Matrix Factorization Approach for Classifying News Topics

Why topic modeling?

We live in a world where streams of data are continuously collected. Searching for insights from the collected information can therefore become very tedious and time-consuming. Topic modeling was designed as a tool to organize, search, and understand vast quantities of textual information.

What is topic modeling?

In machine learning, a topic model is specifically defined as a natural language processing technique used to discover hidden semantic structures of text in a collection of documents, usually called corpus. In general, each document refers to a continuous set of words, like a paragraph or an article, where each article contains a set of words.

Let’s take The New York Times dataset for this example, where each article represents a single document. The question we’re interested in solving is how often these words come up in a single document. This will allow us to categorize each document to a particular topic or a theme.

How does NMF work?

There are two major topic modeling techniques that are generally used, Latent Dirichlet Distribution (LDA) and Nonnegative Matrix Factorization (NMF). Lets focus on the latter implementation of topic modeling to cluster similar words in The New York Times news dataset.

The NMF technique examines documents and discovers topics in a mathematical framework through probability distributions. As an illustration, we first have matrix X as our data. This matrix is represented by two smaller matrices W and H, which, when multiplied, approximately reconstruct X.

There are some prerequisites prior to implementing NMF. First, data X has to have nonnegative entries. None missing, but likely many zeros. Second, the learned factorization W and H have to have nonnegative entries.

How to implement NMF?

The data used in this problem consists of 8,447 documents from The New York Times (download here). The vocabulary size is 3,012 words (download here). We first use this data to construct the matrix X, where Xij is the number of times word i appears in document j. Therefore, X is 30,128,447 and most values in X will equal zero.

Here, we factorize an N×M matrix X into a rank-K approximation W H, where W is N×K, H is K×M and all values in the matrices are nonnegative. Each value in W and H can be initialized randomly to a positive number, here I used Uniform(0,1) distribution.

We implement and run the NMF algorithm on this data by minimizing the following divergence penalty, where W and H contain nonnegative values:

For data preprocessing and matrix calculations, you can find my original codes in Coding Reference at the bottom of this article.

Lets now pick a number of topics that we want to rank and number of iterations where we can expect the objective function would diverge. Say we set the rank to 25 and run for 100 iterations, which also corresponds to learning 25 topics.

We can see the objective plot as a function of iteration below. By looking at the divergence of the objective function, we make sure that our model for clustering similar words is stable.

After running the algorithm, we want to normalize the columns of W so they sum to one. This is also to ensure that we get probabilistic distributions with no values greater than zero.

Now, say we pick 10 words for each topic. For each column of W, we list the 10 words having the largest weight and show the weight to show our expected probabilistic distributions. The i-th row of W corresponds to the i-th word in the “dictionary” provided with the data.

Topic modeling result

The following table captures a set of documents with 25 topics, which is the result that we expected. Taking a closer look of the result, Topic 7 is referring to topic in finance, Topic 13 in medical field, Topic 14 in entertainment, and Topic 24 in business. These are only 25 topics among hundreds, if not millions, other possibilities!

Challenge: Can you guess what each topic above represents? Comment down below!

Coding Reference

Note: Codes were implemented based on the raw mathematical concepts and intuitive calculations behind NMF, but a more convenient alternative could be done using Python libraries like scikit-learn.

Hope you enjoy!


Moorissa is a graduate student currently studying machine learning at Columbia University, with a hope that someday she can leverage these skills for making the world a better place, one day at a time.