Co-occurrence matrix & Singular Value Decomposition (SVD)

Apar Garg
Published in Analytics Vidhya
Mar 13, 2020

What are Word Embeddings?

Word embeddings are texts converted into numbers. The same text can have many different numerical representations.

Why do we need Word Embeddings?

Many machine learning algorithms and almost all deep learning architectures are incapable of processing strings or plain text in their raw form. They require numbers as input to perform any sort of job, be it classification, regression, or anything else.

What are the different types of Word Embeddings?

The different types of word embeddings can be broadly classified into two categories:

Frequency-based Embedding

There are generally three types of vectors that we encounter under this category: Count Vector, TF-IDF Vector, and Co-Occurrence Matrix with a fixed context window.

Prediction-based Embedding

There are generally two types of vectors that we encounter under this category: Continuous Bag of Words (CBOW) and Skip-Gram.

In this article, we’ll be focusing on the Co-Occurrence Matrix with a fixed context window (one of the frequency-based techniques).

Before diving into the Co-Occurrence Matrix with a fixed context window, let’s discuss the disadvantages of the other two Frequency-based techniques.

Count Vector and TF-IDF do not capture word order, semantics, or the co-occurrence of words within a document.

Co-Occurrence Matrix with a fixed context window

The big idea: similar words tend to occur together and have similar contexts. For example: "Apple is a fruit. Mango is a fruit."

Apple and mango tend to have a similar context, i.e. fruit.

  • Co-occurrence: For a given corpus, the co-occurrence of a pair of words, say w1 and w2, is the number of times they appear together within a context window.
  • Context window: A context window is specified by a number (its size) and a direction (left, right, or both).

How to form the Co-occurrence matrix:

  • The matrix A stores co-occurrences of words.
  • In this method, we count the number of times each word appears inside a window of a particular size around the word of interest.
  • Calculate this count for all the words in the corpus.

Let us understand all of this with the help of an example.

Let our corpus contain the following three sentences:

  1. I enjoy flying
  2. I like NLP
  3. I like deep learning

Let window size = 1. This means that the context words for each word are the one word to its left and the one word to its right. Context words for:

  • I = enjoy (1 time), like (2 times)
  • enjoy = I (1 time), flying (1 time)
  • flying = enjoy (1 time)
  • like = I (2 times), NLP (1 time), deep (1 time)
  • NLP = like (1 time)
  • deep = like (1 time), learning (1 time)
  • learning = deep (1 time)

Therefore, the resultant co-occurrence matrix A with fixed window size 1 looks like:

              I  enjoy  flying  like  NLP  deep  learning
    I         0    1      0      2     0    0       0
    enjoy     1    0      1      0     0    0       0
    flying    0    1      0      0     0    0       0
    like      2    0      0      0     1    1       0
    NLP       0    0      0      1     0    0       0
    deep      0    0      0      1     0    0       1
    learning  0    0      0      0     0    1       0

Co-occurrence matrix A
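
To make this procedure concrete, here is a minimal Python/NumPy sketch that builds the matrix above from the three-sentence corpus. The names (corpus, window_size, word2idx) are illustrative choices, not from the original article.

    import numpy as np

    corpus = ["I enjoy flying", "I like NLP", "I like deep learning"]
    window_size = 1

    # Build the vocabulary in order of first appearance.
    tokenized = [sentence.split() for sentence in corpus]
    vocab = []
    for sentence in tokenized:
        for word in sentence:
            if word not in vocab:
                vocab.append(word)
    word2idx = {word: i for i, word in enumerate(vocab)}

    # For every word, count the words appearing within window_size positions.
    A = np.zeros((len(vocab), len(vocab)), dtype=int)
    for sentence in tokenized:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window_size), min(len(sentence), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    A[word2idx[word], word2idx[sentence[j]]] += 1

    print(vocab)  # ['I', 'enjoy', 'flying', 'like', 'NLP', 'deep', 'learning']
    print(A)      # the symmetric 7x7 matrix shown above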

Problem: For a huge corpus, this co-occurrence matrix becomes very large (its dimension grows with the vocabulary size).

Solution: Singular value decomposition (SVD) and principal component analysis (PCA) are two eigenvalue-based methods used to reduce a high-dimensional dataset to fewer dimensions while retaining the important information.

Singular values of a matrix

Let A be an m×n matrix. The product AᵀA is a symmetric n×n matrix.

Thus AᵀA has n linearly independent eigenvectors v₁, v₂, … vₙ and real eigenvalues λ₁, λ₂, … λₙ.

We know that the eigenvalues of AᵀA are also all non-negative.

Let λ be an eigenvalue of AᵀA with corresponding unit eigenvector v. Then,

λ = λvᵀv = vᵀ(AᵀAv) = (Av)ᵀ(Av) = ‖Av‖² ≥ 0.

Label the eigenvectors v₁, v₂, … vₙ so that λ₁ ≥ λ₂ ≥ … ≥ λₙ. Let σᵢ = √λᵢ.

Thus, σ₁ ≥ σ₂ ≥ … ≥ σₙ ≥ 0. The numbers σ₁, σ₂, … σₙ are called the singular values of matrix A.
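
As a quick sanity check of this definition, the NumPy snippet below (on an arbitrary example matrix of my choosing) computes the singular values as square roots of the eigenvalues of AᵀA and compares them with NumPy's built-in SVD.

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 3.0],
                  [0.0, 2.0]])

    # Eigenvalues of the symmetric matrix AᵀA, sorted in decreasing order.
    eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]
    print(np.sqrt(eigvals))                    # singular values via the definition
    print(np.linalg.svd(A, compute_uv=False))  # NumPy's singular values agree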

Singular Value Decomposition (SVD)

Let A be an m×n matrix. Then there exists a factorization of A,

A = U Σ Vᵀ,

where U is an m×m orthogonal matrix, V is an n×n orthogonal matrix, and Σ is an m×n matrix of the form

Σ = ⎡ D  0 ⎤
    ⎣ 0  0 ⎦ ,  with D = diag(σ₁, σ₂, … σᵣ),

where σ₁ ≥ σ₂ ≥ … ≥ σᵣ > 0

for r ≤ min(m, n).

Any such factorization is called a Singular Value Decomposition of matrix A. The matrices U and V are not unique. However, Σ is unique. The elements on the diagonal of D, namely σ₁, σ₂, … σᵣ, are the non-zero singular values of A. The columns of U are called the left singular vectors of A, and the columns of V are called the right singular vectors of A.
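
These properties are easy to verify numerically. The sketch below runs NumPy's SVD on an arbitrary 2×3 matrix and checks the factorization and the orthogonality of U and V.

    import numpy as np

    A = np.array([[1.0, 0.0, 2.0],
                  [0.0, 3.0, 0.0]])  # arbitrary 2x3 example

    U, s, Vt = np.linalg.svd(A)      # s holds σ₁ ≥ σ₂ ≥ ... ≥ 0
    Sigma = np.zeros(A.shape)
    Sigma[:len(s), :len(s)] = np.diag(s)  # embed the diagonal D into an m×n Σ

    print(np.allclose(A, U @ Sigma @ Vt))               # True: A = U Σ Vᵀ
    print(np.allclose(U.T @ U, np.eye(U.shape[0])))     # True: U is orthogonal
    print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))  # True: V is orthogonal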

You can find examples for SVD here.

Apply SVD on co-occurrence matrix X

Here X = U Σ Vᵀ, and the dimensions of X, U, Σ, and Vᵀ are all |V|×|V|, where |V| is the size of the vocabulary.

We reduce dimensionality by keeping only the first k singular values, together with the first k columns of U and the first k rows of Vᵀ.

Now,

  • The dimension of X is |V|×|V|.
  • The dimension of U is |V|×k.
  • The dimension of Σ is k×k.
  • The dimension of Vᵀ is k×|V|.

Let’s understand this by taking the previous example. We have the co-occurrence matrix A of dimension 7×7. Initially, just after the decomposition, the dimensions of U, Σ, and Vᵀ are also 7×7.

Matrix U (before selecting k singular values)

Now, after reducing dimensionality by selecting the first k = 3 singular values,

  • The dimension of A is 7×7.
  • The dimension of U is 7×3.
  • The dimension of Σ is 3×3.
  • The dimension of Vᵀ is 3×7.

Matrix U (after selecting k singular values)
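
Here is a sketch of this truncation in NumPy. It reuses the matrix A and the list vocab from the co-occurrence sketch earlier; the cosine helper is an illustrative addition, not part of the original article.

    import numpy as np

    # A and vocab come from the earlier co-occurrence sketch.
    U, s, Vt = np.linalg.svd(A.astype(float))

    k = 3
    U_k = U[:, :k]            # 7x3: one k-dimensional vector per word
    Sigma_k = np.diag(s[:k])  # 3x3
    Vt_k = Vt[:k, :]          # 3x7

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    # Words with similar contexts end up with similar rows of U_k.
    i, j = vocab.index("NLP"), vocab.index("deep")
    print(cosine(U_k[i], U_k[j]))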

Advantages of Co-occurrence Matrix

  1. It preserves the semantic relationship between words, e.g. man and woman tend to be closer than man and apple.
  2. It uses SVD at its core, which tends to produce more accurate word vector representations than raw frequency counts.
  3. It is based on matrix factorization, a well-defined problem that can be solved efficiently.
  4. The matrix has to be computed only once and can then be reused at any time. In this sense, it is faster than methods that must be retrained.

Disadvantages of Co-Occurrence Matrix

  1. It requires huge memory to store the co-occurrence matrix.
    However, this problem can be mitigated by factorizing the matrix outside the main system, for example on a Hadoop cluster, and saving the result.

Problems with SVD method

  1. The dimensions of the matrix change very often (new words are added frequently and the corpus changes in size).
  2. The matrix is extremely sparse, since most words do not co-occur.
  3. The matrix is generally very high-dimensional (on the order of 10⁶ × 10⁶).
  4. Training (i.e. performing the SVD) has quadratic cost.
  5. It requires incorporating some hacks on X to account for the drastic imbalance in word frequency.

You can find the implementation of the Co-occurrence matrix and SVD here.
