Computational Linear Algebra: Topic Modelling with NMF and SVD

Getting started with Matrix factorization and topic modelling

Monit Sharma
10 min read · Jun 21, 2023

Topic Modeling with NMF and SVD

Topic Modeling is a great way to get started with matrix factorizations. We’ll start with term-document matrix:

We can decompose this into one tall thin matrix times one wide short matrix.

This representation does not take into account word order or sentence structure. It’s an example of a bag-of-words approach.
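To make this concrete, here is a minimal sketch (with a made-up three-sentence corpus, not from the original lecture) that builds a small document-term matrix, the transpose of the term-document picture above, using scikit-learn’s CountVectorizer. Word order is discarded; only counts survive.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",            # toy corpus (hypothetical)
        "the dog ate my homework",
        "the cat ate the dog food"]

vec = CountVectorizer()
X = vec.fit_transform(docs)                  # shape: (documents, vocabulary)

print(vec.get_feature_names_out())           # the vocabulary
print(X.toarray())                           # each row is a bag-of-words count vector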

Motivation

Consider the most extreme case — reconstructing the matrix using an outer product of two vectors. Clearly, in most cases, we won’t be able to reconstruct the matrix exactly. But if we had one vector with the relative frequency of each vocabulary word out of the total word count, and one with the average number of words per document, then that outer product would be as close as we can get.
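As a rough NumPy illustration of that rank-1 idea (toy numbers, not from the lecture): take a small matrix of pretend term counts, form the outer product of the overall word-frequency vector with the per-document word totals, and measure how far that single outer product is from the original matrix.

rng = np.random.default_rng(0)
A = rng.poisson(2.0, size=(8, 5)).astype(float)   # pretend (vocab, documents) counts

word_freq = A.sum(axis=1) / A.sum()   # relative frequency of each word overall
doc_totals = A.sum(axis=0)            # total number of words in each document

rank1 = np.outer(word_freq, doc_totals)           # one column times one row
print(np.linalg.norm(A - rank1))                  # how close a rank-1 reconstruction gets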

Now consider increasing those matrices to two columns and two rows. The optimal decomposition would now be to cluster the documents into two groups whose word distributions are as different from each other as possible, but as similar as possible among the documents within each group. We will call those two groups “topics”. And we would cluster the words into two groups, based on which words appear most frequently in each of the topics.

In this lecture,

We’ll take a dataset of documents in several different categories, and find topics for them. Knowing the actual categories helps us evaluate whether the topics we find make sense.

We’ll try this with Singular Value Decomposition and Non-Negative Matrix Factorization

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt
%matplotlib inline
np.set_printoptions(suppress=True)

Set Up Data

Scikit Learn comes with a number of built-in datasets, as well as loading utilities to load several standard external datasets.

Newsgroups are discussion groups on Usenet, which were popular in the 80s and 90s before the web really took off. This dataset includes 18,000 newsgroup posts with 20 topics.



categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)
newsgroups_train.filenames.shape, newsgroups_train.target.shape

((2034,), (2034,))

Look at some of the data

Can you guess which categories they are in?

print("\n".join(newsgroups_train.data[:4]))

Which categories do these posts belong to?



np.array(newsgroups_train.target_names)[newsgroups_train.target[:4]]

array(['comp.graphics', 'talk.religion.misc', 'sci.space', 'alt.atheism'],
      dtype='<U18')

Each post is labelled with a target index corresponding to its category:

newsgroups_train.target[:10]

array([1, 3, 2, 0, 2, 0, 2, 1, 2, 1], dtype=int64)

num_topics, num_top_words = 6, 8

Scikit-learn has a method that will extract all the word counts for us:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vectorizer = CountVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(newsgroups_train.data).toarray() # (documents, vocab), as a dense array
vectors.shape #, vectors.nnz / vectors.shape[0], row_means.shape

(2034, 26576)

print(len(newsgroups_train.data), vectors.shape)

2034 (2034, 26576)

vocab = np.array(vectorizer.get_feature_names_out())
vocab[7000:7050]

Singular Value Decomposition

“SVD is not nearly as famous as it should be.” — Gilbert Strang

see my previous article on SVD

We would clearly expect that the words that appear most frequently in one topic would appear less frequently in the other — otherwise, that word wouldn’t make a good choice to separate out the two topics. Therefore, we expect the topics to be orthogonal.

The SVD algorithm factorizes a matrix into one matrix with orthogonal columns and one with orthogonal rows (along with a diagonal matrix, which contains the relative importance of each factor).

SVD is an exact decomposition since the matrices it creates are big enough to fully cover the original matrix. SVD is extremely widely used in linear algebra, and specifically in data sciences, including:

  • semantic analysis
  • collaborative filtering/recommendations (winning entry for Netflix Prize)
  • calculate Moore-Penrose pseudoinverse
  • data compression
  • principal component analysis (will be covered later in course)


%time U, s, Vh = linalg.svd(vectors, full_matrices=False)

CPU times: user 27.2 s, sys: 812 ms, total: 28 s
Wall time: 27.9 s

print(U.shape, s.shape, Vh.shape)

(2034, 2034) (2034,) (2034, 26576)
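We can sanity-check the claims above directly, assuming the U, s, Vh computed in the cell just above: the columns of U and the rows of Vh are orthonormal, and multiplying the factors back together recovers the original matrix up to floating-point error.

print(np.allclose(U.T @ U, np.eye(U.shape[1])))      # columns of U are orthonormal
print(np.allclose(Vh @ Vh.T, np.eye(Vh.shape[0])))   # rows of Vh are orthonormal
print(np.allclose(U @ np.diag(s) @ Vh, vectors))     # exact reconstruction, up to float error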

What can we say about the singular values?

# plot for the singular values
plt.plot(s);
plt.plot(s[:10])


num_top_words=8

def show_topics(a):
    top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
    topic_words = [top_words(t) for t in a]
    return [' '.join(t) for t in topic_words]



show_topics(Vh[:10])

We get topics that match the kinds of clusters we would expect! This is despite the fact that this is an unsupervised algorithm — which is to say, we never actually told the algorithm how our documents are grouped.

Non-negative Matrix Factorization (NMF)

Motivation

Idea

Rather than constraining our factors to be orthogonal, another idea would be to constrain them to be non-negative. NMF is a factorization of a non-negative data set V:

V = WH

into non-negative matrices W, H. Often positive factors will be more easily interpretable.

NMF is a non-exact factorization that factors into one skinny positive matrix and one short positive matrix. NMF is NP-hard and non-unique. There are a number of variations on it, created by adding different constraints.

Applications of NMF

NMF shows up in many areas, including facial feature decomposition, collaborative filtering (e.g. movie recommendations), audio source separation, gene expression analysis in bioinformatics, and, as here, topic modelling.

NMF from Sklearn

First, we will use scikit-learn’s implementation of NMF:



m,n = vectors.shape
d = 5 # number of topics

clf = decomposition.NMF(n_components  = d, random_state = 135)
W1 = clf.fit_transform(vectors)
H1 = clf.components_
show_topics(H1)

TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a way to normalize term counts by taking into account how often a term appears in a document, how long the document is, and how common or rare the term is across the whole corpus.
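To see what this weighting does, here is a small sketch on a made-up corpus (not from the lecture) comparing raw counts with tf-idf weights. Words that occur in every document, like “the”, get down-weighted; note that scikit-learn’s TfidfVectorizer also applies smoothing and L2 normalization by default.

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "the cat ate the dog food"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

print(counts.toarray())            # raw term counts
print(tfidf.toarray().round(2))    # tf-idf weights: ubiquitous words shrink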



vectorizer_tfidf = TfidfVectorizer(stop_words='english')
vectors_tfidf = vectorizer_tfidf.fit_transform(newsgroups_train.data)

W1 = clf.fit_transform(vectors_tfidf)
H1 = clf.components_
show_topics(H1)
plt.plot(clf.components_[0])


clf.reconstruction_err_

43.7129260580871

NMF in summary

The Good: fast and easy to use

The Bad: took years of research and expertise to create

Notes:
1. For NMF, the matrix needs to be at least as tall as it is wide, or we get an error with fit_transform.
2. We can use min_df in CountVectorizer to only look at words that appear in at least k of the documents (see the sketch below).
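For example, a quick sketch (the names vectorizer_min and vectors_min are just for illustration): min_df=5 keeps only words that appear in at least 5 of the training documents, shrinking the vocabulary considerably.

vectorizer_min = CountVectorizer(stop_words='english', min_df=5)
vectors_min = vectorizer_min.fit_transform(newsgroups_train.data)
vectors_min.shape   # far fewer columns than the full 26,576-word vocabulary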

NMF from scratch in numpy, using SGD

Gradient Descent

The key idea of standard gradient descent:

  1. Randomly choose some weights to start.
  2. Loop:
  • Use the weights to calculate a prediction.
  • Calculate the derivative of the loss.
  • Update the weights.
  3. Repeat step 2 lots of times. Eventually, we end up with some decent weights.

Key: We want to decrease our loss and the derivative tells us the direction of the steepest descent.
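Here is a minimal sketch of that loop on a toy problem (not from the lecture): minimize f(w) = (w - 3)^2, whose derivative is 2(w - 3).

w = 0.0                    # 1. start from an arbitrary weight
lr = 0.1                   # step size
for _ in range(100):       # 2./3. repeat the loop many times
    grad = 2 * (w - 3)     # derivative of the loss at the current weight
    w -= lr * grad         # step in the direction of steepest descent
print(w)                   # converges to 3, the minimizer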

Stochastic Gradient Descent (SGD)

It is an incredibly useful optimization method (it is also at the heart of deep learning, where it drives training via backpropagation).

We evaluate the loss using all our data for standard gradient descent, which is slow.

In stochastic gradient descent, we evaluate our loss function on just a sample of our data. We would get different loss values on different samples of the data. It turns out this is still an effective way to optimize.
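Here is a rough sketch of the difference on a toy least-squares problem (all names and numbers made up): each update uses the gradient computed on a small random minibatch rather than on the full data set, yet the weights still converge.

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # toy inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)   # toy targets

w = np.zeros(3)
lr, batch = 0.1, 32
for _ in range(500):
    idx = rng.integers(0, len(y), size=batch)   # sample a random minibatch
    err = X[idx] @ w - y[idx]
    grad = 2 * X[idx].T @ err / batch           # gradient of the mean squared error on the batch
    w -= lr * grad
print(w)                                        # close to true_w despite only seeing 32 rows per step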

Applying SGD to NMF

lam = 1e3
lr = 1e-2
m, n = vectors_tfidf.shape

W1 = clf.fit_transform(vectors)
H1 = clf.components_
show_topics(H1)

mu = 1e-6

def grads(M, W, H):
    R = W@H - M
    return R@H.T + penalty(W, mu)*lam, W.T@R + penalty(H, mu)*lam  # dW, dH

def penalty(M, mu):
    # elementwise: zero where entries are at least mu, otherwise the (negative) shortfall M - mu
    return np.where(M>=mu, 0, np.minimum(M - mu, 0))


def upd(M, W, H, lr):
    dW, dH = grads(M, W, H)
    W -= lr*dW; H -= lr*dH

def report(M, W, H):
    print(np.linalg.norm(M-W@H), W.min(), H.min(), (W<0).sum(), (H<0).sum())


W = np.abs(np.random.normal(scale=0.01, size=(m,d)))
H = np.abs(np.random.normal(scale=0.01, size=(d,n)))

report(vectors_tfidf, W, H)

44.42710988660707 1.4206941402729504e-06 6.220086649529336e-08 0 0


upd(vectors_tfidf,W,H,lr)
report(vectors_tfidf, W, H)

44.419184642449935 -0.0007765716529699559 -7.190500098920772e-05 134 284



for i in range(50):
    upd(vectors_tfidf,W,H,lr)
    if i % 10 == 0: report(vectors_tfidf,W,H)

show_topics(H)

This is painfully slow to train! It takes a lot of parameter fiddling and is still slow.

PyTorch

It is a Python framework for tensors and dynamic neural networks with GPU acceleration. It is similar to NumPy, but with the added parallelism of running on a GPU.

If you are using a GPU, you can move the tensors below onto it with .cuda(); otherwise the code runs as-is on the CPU.

import torch
import torch as tc
from torch.autograd import Variable

def V(M): return Variable(M, requires_grad=True)

v = vectors_tfidf.toarray()                      # dense (documents, vocab) array
t_vectors = torch.Tensor(v.astype(np.float32))

mu = 1e-5

def grads_t(M, W, H):
    R = W.mm(H)-M
    return (R.mm(H.t()) + penalty_t(W, mu)*lam,
            W.t().mm(R) + penalty_t(H, mu)*lam)  # dW, dH

def penalty_t(M, mu):
    # penalize entries that fall below mu, pushing them back towards non-negativity
    return (M<mu).type(tc.FloatTensor)*torch.clamp(M - mu, max=0.)

def upd_t(M, W, H, lr):
    dW,dH = grads_t(M,W,H)
    W.sub_(lr*dW); H.sub_(lr*dH)

def report_t(M,W,H):
    print((M-W.mm(H)).norm(2), W.min(), H.min(), (W<0).sum(), (H<0).sum())

t_W = tc.FloatTensor(m,d)
t_H = tc.FloatTensor(d,n)
t_W.normal_(std=0.01).abs_();
t_H.normal_(std=0.01).abs_();

d=6; lam=100; lr=0.05
for i in range(1000):
    upd_t(t_vectors,t_W,t_H,lr)
    if i % 100 == 0:
        report_t(t_vectors,t_W,t_H)
        lr *= 0.9

# run the same loop again to keep training
for i in range(1000):
    upd_t(t_vectors,t_W,t_H,lr)
    if i % 100 == 0:
        report_t(t_vectors,t_W,t_H)
        lr *= 0.9

plt.plot(t_H.cpu().numpy()[0])
t_W.mm(t_H).max()

tensor(0.3829)



t_vectors.max()

tensor(1.)

PyTorch Autograd Introduction

Above, we used our knowledge of what the gradient of the loss function was to do SGD from scratch in PyTorch. However, PyTorch has an automatic differentiation package, autograd which we could use instead. This is really useful, in that we can use autograd on problems where we don’t know what the derivative is.

The approach we use below is very general and would work for almost any optimization problem.

In PyTorch, Variables have the same API as tensors, but Variables remember the operations used to create them, which lets us take derivatives. (In recent versions of PyTorch, Variable has been merged into Tensor: creating a tensor with requires_grad=True gives the same behaviour.)

x = Variable(torch.ones(2, 2), requires_grad=True)
print(x)

tensor([[1., 1.],
[1., 1.]], requires_grad=True)

print(x.data)

tensor([[1., 1.],
[1., 1.]])

print(x.grad)

None



y = x + 2
print(y)

tensor([[3., 3.],
[3., 3.]], grad_fn=<AddBackward0>)

z = y * y * 3
out = z.sum()
print(z, out)

tensor([[27., 27.],
[27., 27.]], grad_fn=<MulBackward0>) tensor(108., grad_fn=<SumBackward0>)


out.backward()
print(x.grad)

tensor([[18., 18.],
[18., 18.]])

Using Autograd for NMF



lam = 1e6

pW = Variable(tc.FloatTensor(m,d), requires_grad=True)
pH = Variable(tc.FloatTensor(d,n), requires_grad=True)
pW.data.normal_(std=0.01).abs_()
pH.data.normal_(std=0.01).abs_();

def report():
    W, H = pW.data, pH.data
    print((M-pW.mm(pH)).norm(2).item(), W.min(), H.min(), (W<0).sum(), (H<0).sum())

def penalty(A):
    return torch.pow((A<0).type(tc.FloatTensor)*torch.clamp(A, max=0.), 2)

def penalize(): return penalty(pW).mean() + penalty(pH).mean()

def loss(): return (M-pW.mm(pH)).norm(2) + penalize()*lam

M = Variable(t_vectors)

opt = torch.optim.Adam([pW,pH], lr=1e-3, betas=(0.9,0.9))
lr = 0.05
#report()
for i in range(1000):
    opt.zero_grad()
    l = loss()
    l.backward()
    opt.step()
    if i % 100 == 99:
        report()
        lr *= 0.9  # learning rate annealing

h = pH.data.cpu().numpy()
show_topics(h)
plt.plot(h[0]);

Comparing Approaches

Scikit-Learn’s NMF

1. Fast
2. No Parameter tuning
3. Relies on decades of academic research

Using PyTorch and SGD

  1. Took an hour to implement, didn’t have to be NMF experts
  2. Parameters were fiddly
  3. Not as fast

Truncated SVD

We saved a lot of time when we calculated NMF by only calculating the subset of columns we were interested in. Is there a way to get this benefit with SVD? Yes, there is! It’s called truncated SVD. We are just interested in the vectors corresponding to the largest singular values.
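For instance, scikit-learn exposes this as TruncatedSVD (a sketch, not part of the original lecture; tsvd and tsvd_vectors are just illustrative names). It computes only the top components, here on the tf-idf matrix, and by default uses a randomized solver:

tsvd = decomposition.TruncatedSVD(n_components=5, algorithm='randomized', random_state=0)
tsvd_vectors = tsvd.fit_transform(vectors_tfidf)   # (documents, 5)
show_topics(tsvd.components_)                      # topics from the top 5 right singular vectors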

Shortcomings of classical algorithms for decomposition:

  • Matrices are “stupendously big”
  • Data are often missing or inaccurate. Why spend extra computational resources when imprecision of input limits the precision of the output?
  • Data transfer now plays a major role in the time of algorithms. Techniques that require fewer passes over the data may be substantially faster, even if they require more flops (flops = floating point operations).
  • Important to take advantage of GPUs.

Advantages of randomized algorithms:

  • inherently stable
  • performance guarantees do not depend on subtle spectral properties
  • needed matrix-vector products can be done in parallel.

Randomized SVD

Reminder: full SVD is slow. This is the calculation we did above using SciPy’s linalg.svd:

vectors.shape

(2034, 26576)

%time U, s, Vh = linalg.svd(vectors, full_matrices=False)

CPU times: user 27.2 s, sys: 812 ms, total: 28 s
Wall time: 27.9 s

print(U.shape, s.shape, Vh.shape)

(2034, 2034) (2034,) (2034, 26576)

Fortunately, there is a faster way:

# in current scikit-learn, randomized_svd lives in sklearn.utils.extmath
from sklearn.utils.extmath import randomized_svd
%time u, s, v = randomized_svd(vectors, 5)

CPU times: user 144 ms, sys: 8 ms, total: 152 ms
Wall time: 154 ms
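Assuming the u, s, v returned by the cell above, we can confirm that the truncated factors still give sensible topics:

print(u.shape, s.shape, v.shape)   # (2034, 5) (5,) (5, 26576)
show_topics(v)                     # topics from the 5 strongest singular vectors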

Conclusion

In this article, we covered topic modelling with NMF and SVD, implemented NMF with SGD in NumPy and PyTorch, and saw how randomized methods speed up SVD. In the next article, we will look at another real-world application of linear algebra.

Stay Tuned.
