Singular Value Decomposition for developers — Selecting what matters from text data.

Willian valle
AI Topics and discussions
10 min read · Jan 10, 2023
Cover photo: https://www.pexels.com/pt-br/foto/lupa-em-papel-branco-5561913/

The aim of this article is to give developers an introduction to representing text data as matrices, pre-processing it, and generating insights about the patterns of words used in it. For this, we will use a mathematical operation called SVD, or Singular Value Decomposition.

The SVD can be used in the context of natural language processing to identify the latent semantic structure of text documents, which can be used for tasks such as document classification and information retrieval.

Index

  1. Preface
  2. What can we do with SVD?
  3. Pre-processing texts
  4. Transforming texts into matrices
  5. Applying the SVD
  6. Background Knowledge
  7. The SVD Process
  8. Conclusion

Preface

When I was in high school, I didn’t enjoy math very much. Memorizing the matrix multiplication rules, which seemed so arbitrary at the time, was probably among my “most uninteresting things on planet earth”.

However, as I continued my programming studies, I encountered these same concepts again and discovered more interesting ways to learn and apply them. Mathematics and statistics opened up new possibilities for me.

What can we do with SVD?

SVD is a technique that breaks down a matrix into smaller matrices (with fewer dimensions), allowing us to discard some of the information that has little impact on the final result.

We often need to separate meaningful information from noise when working with digital data. This article mainly focuses on texts, but we can use SVD for all kinds of data. For example, we can use it for image/video compression, as a pre-processing technique for machine learning algorithms, or in genome analysis.

This article focuses on one application of SVD: extracting knowledge from text data (such as online user reviews, social posts, etc.). We can use SVD to find how similar two documents are, group related documents together, identify a specific pattern, and so on. This type of problem-solving with SVD is often called latent semantic analysis.

Pre-processing texts

This is an important but tedious part of the job. Real-world communication contains many words that do not affect the meaning of sentences. Similarly, each data source has its own quirks and noise that need to be removed.

We will discuss some of the common techniques to pre-process text data for better analysis. Since this is a broad topic, we will not go into much detail, but provide some references for further reading.

Removing stop words

Stop words are common words in a language (such as “the” or “a” in English). They can reduce the effectiveness of the analysis and take up processing time.

Several libraries can automatically remove these words in different languages or you can download word lists and remove them manually.

This article shows some Python options for this task.

It is important to note that, besides removing generic frequent words, each data source should be examined to identify words or patterns that do not convey the message’s idea.
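
As an illustration, here is a minimal sketch using NLTK's stop-word list (one option among several; it assumes the corpus has been downloaded with nltk.download("stopwords")):

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

text = "the product arrived on time and the quality is great"
filtered = [word for word in text.split() if word not in stop_words]
# filtered -> ['product', 'arrived', 'time', 'quality', 'great']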

Stemming

Stemming is a technique that reduces a word to its “root form” by stripping off suffixes. For example, the word “loving” becomes “lov”, just like “loved” or “love”. The rules for this algorithm depend on the language.

Stemming is helpful when working with statistical data about how often a word appears in a context, as it increases the chances of finding word matches.

You can learn more about Stemming and some tools to do it here.
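
For illustration, here is a minimal sketch with NLTK's Porter stemmer (other stemmers exist and may produce slightly different roots):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connection", "connected", "connecting"]
print([stemmer.stem(word) for word in words])
# -> ['connect', 'connect', 'connect']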

Transforming texts into matrices

The goal of this article is to gain insights from the importance that each word has in a list of documents, based on the number of occurrences of that word. We can then infer the similarity between documents by analyzing how similar these patterns of importance are.

Of course, we'll explain this better later. For now, it's just important to understand that we have a group of documents and we need a way to transform the number of occurrences of each word (the information we have) into a matrix that maps the importance of each word for each document. That matrix is called a "term-document matrix", and it's our starting point.

There are a few ways to generate such a matrix:

Term Frequency Matrix

Each row of such a matrix represents a word and each column represents a document.

In this specific case, we choose the simplest way to represent the importance of a word in a document: each entry is simply the count of occurrences of that word in that document. This way of representing text data is also known as a term-frequency matrix.
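
As a sketch (with toy documents of my own, not from the original), scikit-learn can build this matrix. Note that CountVectorizer returns a documents x terms matrix, so we transpose it to match the words-as-rows convention used here:

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "cats and dogs are great pets",
    "dogs love to play fetch",
    "cats prefer to sleep all day",
]

vectorizer = CountVectorizer()
# Transpose so that rows are words and columns are documents.
term_document_matrix = vectorizer.fit_transform(documents).toarray().T
print(vectorizer.get_feature_names_out())
print(term_document_matrix)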

TF-IDF

A more common way to calculate the importance of the words is the TF-IDF (Term Frequency — Inverse Document Frequency) algorithm. It takes into account not only the term frequency but also how unique the term is to that document. Terms that appear across many documents have their weights reduced; so if the dataset comes from a context where the word "shoes" is common, that word will count for less.

There are many libraries and tools out there for generating a term-document matrix using TF-IDF. You can learn more about that here.

The output is the same: a matrix relating the importance of each word to each document.
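
As a sketch, the TF-IDF version with scikit-learn looks almost identical, using the same toy documents as above (this is just one of many available implementations):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
# Rows are words, columns are documents, entries are TF-IDF weights.
term_document_matrix = vectorizer.fit_transform(documents).toarray().T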

Applying the SVD

We'll follow a reverse order to present SVD: first, we'll write some code, showing how to use a Python library to calculate the decomposition of a term-document matrix (generated by an algorithm like TF-IDF). After that, we'll try to understand some concepts involved in the results.

The last two sections of this article will finally cover the mathematical and intuitive explanations of how SVD works.

I believe this order may make the later content more interesting for developers.

Decomposing the matrix

Let's assume we have a term-document matrix, where the columns represent 32 documents and the rows represent the 1024 unique words contained in them. This matrix was generated using an algorithm like TF-IDF.

The following code imports a pre-built SVD function from the linear algebra module of the SciPy library. Then, we'll use our term-document matrix to obtain three different matrices.

from scipy.linalg import svd

U, s, VT = svd(term_document_matrix)

These results are factors of the original matrix: multiplying them back together (U times the diagonal matrix built from s times VT) reproduces the same term_document_matrix.

Later in this article, these results will gain complementary definitions. For now, we can say that:

  • U => a matrix containing information about the words, where each row represents one of the 1024 words analysed. So, we can think of each word as a vector of n dimensions. The number of dimensions (columns) will vary according to the size of the analysed data.
  • s (or Σ) => the singular values, which SciPy returns as a 1-D array but which conceptually form the diagonal of a matrix. Each value represents the strength of the relation between one dimension and the data. This information is important in order to discard dimensions that are less relevant.
  • VT (or V transposed) => a matrix containing information about the documents, where each column represents a document. So, by slicing a single column, we can recover each document as a vector of n dimensions.

Both the U and VT matrices are ordered from the most important dimension to the least important one, so the values along the diagonal of the Σ matrix never increase.
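
As a quick sanity check (my own sketch, not part of the original code), we can rebuild the original matrix from the three factors. SciPy returns s as a 1-D array of singular values, so we first place it on a diagonal:

import numpy as np

# Build the rectangular Σ matrix from the 1-D array of singular values.
Sigma = np.zeros((U.shape[1], VT.shape[0]))
np.fill_diagonal(Sigma, s)

# U @ Σ @ VT reproduces the original term-document matrix.
assert np.allclose(U @ Sigma @ VT, term_document_matrix)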

Finding similarity

Let's say that we want to analyse the similarity between two documents. The first step is to transform the columns associated with these documents into separate vectors. After that, we should cut off the dimensions that are less relevant, truncating the vectors at some index. Commonly, this index is chosen based on the Σ matrix (finding the dimensions of Σ whose values fall below some threshold).
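
One simple way to pick that cutoff is to keep only the dimensions whose singular values are above some fraction of the largest one. The sketch below uses an arbitrary 10% threshold (my own choice, not prescribed here); the example that follows simply uses a fixed index instead.

import numpy as np

# Keep only dimensions whose singular value is at least 10% of the largest one.
threshold = 0.10 * s[0]
k = int(np.sum(s >= threshold))  # number of dimensions to keep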

import pandas as pd

VT = pd.DataFrame(VT)

# Getting the first 50 dimensions of documents number 10 and 11.
# Each column of VT corresponds to a document, so we slice columns 10 and 11
# (with only 32 documents there are at most 32 dimensions, so the slice stops there).
vector_1 = list(VT.iloc[:50, 10])
vector_2 = list(VT.iloc[:50, 11])

Then, we can calculate the cosine similarity between the two vectors. Cosine similarity is a way to measure the similarity between vectors, based on the cosine of the angle between them.

Picture, for example, two vectors in just two dimensions (easier to visualize): the angle between them tells us how aligned they are.

The cosine similarity tells us how much the vectors are "pointing in the same direction", while largely ignoring their magnitudes. It is a widely used method to compare vectors in text analytics solutions.

import numpy as np
from numpy.linalg import norm

# comparing vectors
similarity = np.dot(vector_1,vector_2)/(norm(vector_1)*norm(vector_2))

The result is a floating-point number between -1 and 1: the closer the value is to 1, the more similar the documents are (values near 0 indicate unrelated documents).

Background knowledge

First, I'll list some topics that are important in order to understand the SVD. It won't be possible to cover each one in depth, so I encourage you to visit the linked content on topics you're not familiar with.

Also, I strongly recommend watching the great linear algebra series from the 3Blue1Brown YouTube channel, and more specifically the linear transformations episode.

That said, I'll try to summarize the most important concepts here.

Geometric interpretation of matrices

You're probably familiar with the following notation for a linear equation:

ax + by + c = 0

And know that a set of linear equations can be described as a matrix.
For example:

2x + 3y = 8

5x −y = −2

In matrix format, this would be:

| 2  3 |   | x |   |  8 |
| 5 -1 | . | y | = | -2 |

But any matrix can also be interpreted from a geometric perspective, which makes it much easier to understand the effects of multiplication or addition operations.

First, each column of an m x n matrix can be interpreted as a vector of m dimensions (one entry per row). So, the following 2x2 matrix..

| 2  7 |
| 2  4 |

..can be visualized as two vectors in R2: (2, 2) and (7, 4).

When we multiply two matrices, we're applying a linear transformation to each vector of the original matrix, i.e., transforming that vector space.

Specific types of matrices make specific movements in vector space. For example, orthogonal matrices produce rotations and diagonal matrices produce a stretch distortion of the vectors.
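
To make this concrete, here is a small NumPy illustration (my own example): an orthogonal rotation matrix and a diagonal matrix applied to the same vector.

import numpy as np

v = np.array([2.0, 1.0])

theta = np.pi / 2  # rotate 90 degrees
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
stretch = np.diag([3.0, 0.5])  # scale x by 3, shrink y by half

print(rotation @ v)  # approximately [-1, 2]: same length, new direction
print(stretch @ v)   # [6, 0.5]: each axis scaled independently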

Linear Transformations

For us developers, a transformation means the same as a function: there are input vectors and output vectors that are modified in some way.

Usually, we'll use vectors in two dimensions as examples. But in real-world data science / analytics problems, the data usually has a large number of dimensions.

A formal definition would say that a linear transformation is a transformation T: Rn → Rm, satisfying:

T(u+v) = T(u) + T(v)
T(cu) = cT(u)

The first line means that the transformation of a sum of two vectors is the same as the sum of the individual transformations of those vectors.

The second line means that transforming a vector multiplied by a scalar gives the same result as multiplying the transformed vector by that scalar.

These definitions mean that linear transformations preserve vector addition and scalar multiplication, as expected in linear systems.
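
A quick numerical check of these two properties for a matrix transformation (a sketch, using the 2x2 matrix from the previous section):

import numpy as np

A = np.array([[2.0, 7.0],
              [2.0, 4.0]])
u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
c = 5.0

print(np.allclose(A @ (u + v), A @ u + A @ v))  # T(u+v) == T(u) + T(v)
print(np.allclose(A @ (c * u), c * (A @ u)))    # T(cu) == cT(u)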

Eigenvectors and Eigenvalues

An eigenvector of a linear transformation is a non-zero vector that, when a transformation is applied to it, changes only in scale (the direction remains the same). The scale factor is known as the eigenvalue associated with the eigenvector.

The eigenvectors tell us valuable information about the transformation itself. For instance, if a transformation has an eigenvector with a large eigenvalue, this tells us that the transformation has a strong tendency to stretch vectors along that direction. On the other hand, if a transformation has an eigenvector with a small eigenvalue, this tells us that the transformation has a weak tendency to stretch vectors along that direction.
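
With NumPy we can compute the eigenvectors of a matrix and verify the defining property A·v = λ·v (a sketch, again with the example matrix from before):

import numpy as np

A = np.array([[2.0, 7.0],
              [2.0, 4.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Each column of `eigenvectors` is an eigenvector: A only scales it.
for i, lam in enumerate(eigenvalues):
    v = eigenvectors[:, i]
    print(np.allclose(A @ v, lam * v))  # True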

The SVD process

An eigendecomposition is the factorization of a matrix whereby the matrix is represented in terms of its eigenvectors and eigenvalues. This result is also known as the canonical form of the matrix.

This factorization takes the form..

A = V * D * V^(-1)

..where V is a matrix whose columns are the eigenvectors of A, and D is a diagonal matrix whose diagonal entries are the corresponding eigenvalues.

Representing a matrix in terms of its eigenvectors and eigenvalues better describes the actual movement of vectors when they are transformed by that matrix. Also, the eigenvalue associated with each eigenvector tells us which eigenvectors are more relevant to the data itself, enabling us to discard information or find strong relationships in the data.

The problem is that the above decomposition only works for square matrices, which makes it hard to apply in real-world scenarios. The SVD gives us a way to generalize the decomposition of a matrix in terms of eigenvectors and eigenvalues to every matrix. 🙂

AA^T and A^TA are very special matrices: they are always square and symmetric, regardless of the shape of A. Therefore, one of these forms is often used to represent a matrix as a square matrix.

If we come back to the SVD definition, we can see that we are searching for the following decomposition:

A = U * Σ * V^T

The U matrix is also known as the matrix of left singular vectors; its columns can be calculated as the eigenvectors of the AA^T matrix. It is often used to represent the rows of the original matrix in a reduced form.

The V matrix is known as the matrix of right singular vectors; its columns can be calculated as the eigenvectors of the A^TA matrix. It usually represents the columns of the original matrix.

Σ is a diagonal matrix containing the singular values: the square roots of the corresponding eigenvalues of A^TA (or AA^T).
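
We can check this relationship numerically (my own sketch, not from the original): the singular values of any matrix A are the square roots of the eigenvalues of A^TA.

import numpy as np

A = np.random.rand(6, 4)  # any rectangular matrix

U, s, VT = np.linalg.svd(A)

# A^T A is square and symmetric, so eigh applies;
# it returns its eigenvalues in ascending order.
eigenvalues, _ = np.linalg.eigh(A.T @ A)

# Singular values (sorted ascending) match the square roots of those eigenvalues.
print(np.allclose(np.sort(s), np.sqrt(np.clip(eigenvalues, 0, None))))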

Conclusion

This article aimed to provide an overview of a single branch of SVD applications: latent semantic analysis. If you want to fully explore the mathematics of SVD, I suggest checking out this content:

  1. This great MIT class by Gilbert Strang: Link
  2. The Steve Brunton's series about SVD: Link

If you have suggestions or corrections, please send them to: williandovalle@icloud.com
