Computational Linear Algebra: Topic Modelling with NMF and SVD
Getting started with Matrix factorization and topic modelling
Topic Modeling with NMF and SVD
Topic Modeling is a great way to get started with matrix factorizations. We’ll start with term-document matrix:
We can decompose this into one tall thin matrix times one wide short matrix.
This representation does not take into account word order or sentence structure. It’s an example of a bag of words-approach.
Motivation
Consider the most extreme case — reconstructing the matrix using an outer product of two vectors. Clearly, in most cases, we won’t be able to reconstruct the matrix exactly. But if we had one vector with the relative frequency of each vocabulary word out of the total word count, and one with the average number of words per document, then that outer product would be as close as we can get.
Now consider increasing those matrices to two columns and two rows. The optimal decomposition would now be to cluster the documents into two groups, each of which has as different distribution of words as possible to each other, but as similar as possible amongst the documents in the cluster. We will call those two groups “topics”. And we would cluster the words into two groups, based on those which most frequently appear in each of the topics.
In this lecture,
We’ll take a dataset of documents in several different categories, and find topics for them. Knowing the actual categories help us evaluate if the topics we find make sense.
We’ll try this with Singular Value Decomposition and Non-Negative Matrix Factorization
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn import decomposition
from scipy import linalg
import matplotlib.pyplot as plt
%matplotlib inline
np.set_printoptions(suppress=True)
Additional Resources
- Data source: Newsgroups are discussion groups on Usenet, which was popular in the 80s and 90s before the web really took off. This dataset includes 18,000 newsgroup posts with 20 topics.
- Chris Manning’s book chapter on matrix factorization and LSI
- Scikit learn truncated SVD LSI details
Other Tutorials
- Scikit-Learn: Out-of-core classification of text documents: uses Reuters-21578 dataset (Reuters articles labelled with ~100 categories), HashingVectorizer
- Text Analysis with Topic Models for the Humanities and Social Sciences: uses British and French Literature dataset of Jane Austen, Charlotte Bronte, Victor Hugo, and more
Set Up Data
Scikit Learn comes with a number of built-in datasets, as well as loading utilities to load several standard external datasets.
Newsgroups are discussion groups on Usenet, which were popular in the 80s and 90s before the web really took off. This dataset includes 18,000 newsgroup posts with 20 topics.
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
remove = ('headers', 'footers', 'quotes')
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)
newsgroups_train.filenames.shape, newsgroups_train.target.shape
((2034,), (2034,))
Look at some of the data
Can you guess which categories are they in?
print("\n".join(newsgroups_train.data[:4]))
In what categories does the data belong?
np.array(newsgroups_train.target_names)[newsgroups_train.target[:4]]
array([‘comp.graphics’, ‘talk.religion.misc’, ‘sci.space’, ‘alt.atheism’],
dtype=’<U18')
We assign a target index to each of them:
newsgroups_train.target[:10]
array([1, 3, 2, 0, 2, 0, 2, 1, 2, 1], dtype=int64)
num_topics, num_top_words = 6, 8
Scikit Learn has a method that will extract all the words count for us
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vectorizer = CountVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(newsgroups_train.data).todense() # (documents, vocab)
vectors.shape #, vectors.nnz / vectors.shape[0], row_means.shape
(2034, 26576)
print(len(newsgroups_train.data), vectors.shape)
2034 (2034, 26576)
vocab = np.array(vectorizer.get_feature_names_out())
vocab[7000:7050]
Singular Value Decomposition
“SVD is not nearly as famous as it should be.” — Gilbert Strang
We would clearly expect that the words that appear most frequently in one topic would appear less frequently in the other — otherwise, that word wouldn’t make a good choice to separate out the two topics. Therefore, we expect the topics to be orthogonal.
The SVD algorithm factorizes a matrix into one matrix with orthogonal columns and one with orthogonal rows (along with a diagonal matrix, which contains the relative importance of each factor).
SVD is an exact decomposition since the matrices it creates are big enough to fully cover the original matrix. SVD is extremely widely used in linear algebra, and specifically in data sciences, including:
- semantic analysis
- collaborative filtering/recommendations (winning entry for Netflix Prize)
- calculate Moore-Penrose pseudoinverse
- data compression
- principal component analysis (will be covered later in course)
%time U, s, Vh = linalg.svd(vectors, full_matrices=False)
CPU times: user 27.2 s, sys: 812 ms, total: 28 s
Wall time: 27.9 s
print(U.shape, s.shape, Vh.shape)
(2034, 2034) (2034,) (2034, 26576)
What can be say about singular values?
# plot for the singular values
plt.plot(s);
plt.plot(s[:10])
num_top_words=8
def show_topics(a):
top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]
topic_words = ([top_words(t) for t in a])
return [' '.join(t) for t in topic_words]
show_topics(Vh[:10])
We get topics that match the kinds of clusters we would expect! This is despite the fact that this is an unsupervised algorithm — which is to say, we never actually told the algorithm how our documents are grouped.
Non-negative Matrix Factorization (NMF)
Motivation
Idea
Rather than constraining our factors to be orthogonal, another idea would be to constrain them to be non-negative. NMF is a factorization of a non-negative data set V:
V = WH
into non-negative matrices W, H. Often positive factors will be more easily interpretable.
NMF is a non-exact factorization that factors into one skinny positive matrix and one short positive matrix. NMF is NP-hard and non-unique. There are a number of variations on it, created by adding different constraints.
Applications of NMF
- Face Decompositions
- Collaborative Filtering, eg movie recommendations
- Audio source separation
- Chemistry
- Bioinformatics and Gene Expression
- Topic Modeling (our problem!)
NMF from Sklearn
First, we will use scikit-learn’s implementation of NMF:
m,n = vectors.shape
d = 5 # number of topics
clf = decomposition.NMF(n_components = d, random_state = 135)
W1 = clf.fit_transform(vectors)
H1 = clf.components_
show_topics(H1)
TF-IDF
Topic Frequency-Inverse Document Frequency is a way to normalize term counts by taking into account how often they appear in a document, how long the document is and how common/rare they are in a document.
vectorizer_tfidf = TfidfVectorizer(stop_words='english')
vectors_tfidf = vectorizer_tfidf.fit_transform(newsgroups_train.data)
W1 = clf.fit_transform(vectors_tfidf)
H1 = clf.components_
show_topics(H1)
plt.plot(clf.components_[0])
clf.reconstruction_err_
43.7129260580871
NMF in summary
The Good: fast and easy to use
The Bad: took years of research and expertise to create
Notes:
1. For NMF, the matrix needs to be at least as tall as it is wide, or we get an error with fit_transform
2. Can use df_min
in CounterVectorizer
to only look at words that were in at least k of the split texts
NMF from scratch in numpy, using SGD
Gradient Descent
The key idea of standard gradient descent:
- Randomly choose some weights to start
- Loop:
- Use weights to calculate a prediction
- Calculate the derivative of the loss
- Update the weights
- Repeat step 2 lots of times. Eventually, we end up with some decent weights.
Key: We want to decrease our loss and the derivative tells us the direction of the steepest descent.
Stochastic Gradient Descent (SGD)
It is an incredibly useful optimization method ( it is also the heart of deep learning since used for backpropagation)
We evaluate the loss using all our data for standard gradient descent, which is slow.
In stochastic gradient descent, we evaluate our loss function on just a sample of our data. We would get different loss values on different samples of the data. It turns out this is still an effective way to optimize.
Applying SGD to NMF
lam = 1e3
lr = 1e-2
m, n = vectors_tfidf.shape
W1= clf.fit_transform(vectors)
H1 = clf.components_
show_topics(H1)
mu = 1e-6
def grads(M, W, H):
R = W@H-M
return R@H.T + penalty(W, mu)*lam, W.T@R + penalty(H, mu)*lam # dW, dH
def penalty(M, mu):
return np.where(M>=mu,0, np.min(M - mu, 0))
def upd(M, W, H, lr):
dW,dH = grads(M,W,H)
W -= lr*dW; H -= lr*dH
def report(M,W,H):
print(np.linalg.norm(M-W@H), W.min(), H.min(), (W<0).sum(), (H<0).sum())
W = np.abs(np.random.normal(scale=0.01, size=(m,d)))
H = np.abs(np.random.normal(scale=0.01, size=(d,n)))
report(vectors_tfidf, W, H)
44.42710988660707 1.4206941402729504e-06 6.220086649529336e-08 0 0
upd(vectors_tfidf,W,H,lr)
report(vectors_tfidf, W, H)
44.419184642449935 -0.0007765716529699559 -7.190500098920772e-05 134 284
for i in range(50):
upd(vectors_tfidf,W,H,lr)
if i % 10 == 0: report(vectors_tfidf,W,H)
show_topics(H)
This is painfully slow to train! Lots of parameter fiddling and still slow to train
PyTorch
It is a Python framework for tensors and dynamic neural networks with GPU acceleration. Similar to NumPy, only with increased parallelization of using GPU
If you are not using a GPU, you will need to remove the .cuda() from this method below.
import torch
import torch as tc
from torch.autograd import Variable
def V(M): return Variable(M, requires_grad=True)
v=vectors_tfidf.todense()
t_vectors = torch.Tensor(v.astype(np.float32))
mu = 1e-5
def grads_t(M, W, H):
R = W.mm(H)-M
return (R.mm(H.t()) + penalty_t(W, mu)*lam,
W.t().mm(R) + penalty_t(H, mu)*lam) # dW, dH
def penalty_t(M, mu):
return (M<mu).type(tc.FloatTensor)*torch.clamp(M - mu, max=0.)
def upd_t(M, W, H, lr):
dW,dH = grads_t(M,W,H)
W.sub_(lr*dW); H.sub_(lr*dH)
def report_t(M,W,H):
print((M-W.mm(H)).norm(2), W.min(), H.min(), (W<0).sum(), (H<0).sum())
t_W = tc.FloatTensor(m,d)
t_H = tc.FloatTensor(d,n)
t_W.normal_(std=0.01).abs_();
t_H.normal_(std=0.01).abs_();
d=6; lam=100; lr=0.05
for i in range(1000):
upd_t(t_vectors,t_W,t_H,lr)
if i % 100 == 0:
report_t(t_vectors,t_W,t_H)
lr *= 0.9
for i in range(1000):
upd_t(t_vectors,t_W,t_H,lr)
if i % 100 == 0:
report_t(t_vectors,t_W,t_H)
lr *= 0.9
plt.plot(t_H.cpu().numpy()[0])
t_W.mm(t_H).max()
tensor(0.3829)
t_vectors.max()
tensor(1.)
PyTorch Autograd Introduction
Above, we used our knowledge of what the gradient of the loss function was to do SGD from scratch in PyTorch. However, PyTorch has an automatic differentiation package, autograd which we could use instead. This is really useful, in that we can use autograd on problems where we don’t know what the derivative is.
The approach we use below is very general and would work for almost any optimization problem.
In PyTorch, Variables have the same API as tensors, but Variables remember the operations used to create them. This lets us take derivatives.
x = Variable(torch.ones(2, 2), requires_grad=True)
print(x)
tensor([[1., 1.],
[1., 1.]], requires_grad=True)
print(x.data)
tensor([[1., 1.],
[1., 1.]])
print(x.grad)
None
y = x + 2
print(y)
tensor([[3., 3.],
[3., 3.]], grad_fn=<AddBackward0>)
z = y * y * 3
out = z.sum()
print(z, out)
tensor([[27., 27.],
[27., 27.]], grad_fn=<MulBackward0>) tensor(108., grad_fn=<SumBackward0>)
out.backward()
print(x.grad)
tensor([[18., 18.],
[18., 18.]])
Using Autograd for NMF
lam=1e6
pW = Variable(tc.FloatTensor(m,d), requires_grad=True)
pH = Variable(tc.FloatTensor(d,n), requires_grad=True)
pW.data.normal_(std=0.01).abs_()
pH.data.normal_(std=0.01).abs_();
def report():
W,H = pW.data, pH.data
print((M-pW.mm(pH)).norm(2).data[0], W.min(), H.min(), (W<0).sum(), (H<0).sum())
def penalty(A):
return torch.pow((A<0).type(tc.FloatTensor)*torch.clamp(A, max=0.), 2)
def penalize(): return penalty(pW).mean() + penalty(pH).mean()
def loss(): return (M-pW.mm(pH)).norm(2) + penalize()*lam
M = Variable(t_vectors)
opt = torch.optim.Adam([pW,pH], lr=1e-3, betas=(0.9,0.9))
lr = 0.05
#report()
for i in range(1000):
opt.zero_grad()
l = loss()
l.backward()
opt.step()
if i % 100 == 99:
report()
lr *= 0.9 # learning rate annealling
h = pH.data.cpu().numpy()
show_topics(h)
plt.plot(h[0]);
Comparing Approaches
Scikit-Learns’s NMF
1. Fast
2. No Parameter tuning
3. Relies on decades of academic research
Using PyTorch and SGD
- Took an hour to implement, didn’t have to be NMF experts
2. Parameters were fiddly
3. Not as fast
Truncated SVD
We saved a lot of time when we calculated NMF by only calculating the subset of columns we were interested in. Is there a way to get this benefit with SVD? Yes, there is! It’s called truncated SVD. We are just interested in the vectors corresponding to the largest singular values.
Shortcomings of classical algorithms for decomposition:
- Matrices are “stupendously big”
- Data are often missing or inaccurate. Why spend extra computational resources when imprecision of input limits the precision of the output?
- Data transfer now plays a major role in the time of algorithms. Techniques that require fewer passes over the data may be substantially faster, even if they require more flops (flops = floating point operations).
- Important to take advantage of GPUs.
Advantages of randomized algorithms:
- inherently stable
- performance guarantees do not depend on subtle spectral properties
- needed matrix-vector products can be done in parallel.
Randomized SVD
Reminder: full SVD is slow. This is the calculation we did above using scipy's Linalg SVD:
vectors.shape
(2034, 26576)
%time U, s, Vh = linalg.svd(vectors, full_matrices=False)
CPU times: user 27.2 s, sys: 812 ms, total: 28 s
Wall time: 27.9 s
print(U.shape, s.shape, Vh.shape)
(2034, 2034) (2034,) (2034, 26576)
Fortunately, there is a faster way:
%time u, s, v = decomposition.randomized_svd(vectors, 5)
CPU times: user 144 ms, sys: 8 ms, total: 152 ms
Wall time: 154 ms
Conclusion
In this article, we talked about Topic Modelling and revisited SVD. In the next article, we will talk about yet another real-world implementation of Linear algebra.
Stay Tuned.