Application of Principal Component Analysis on a Sparse n-dimensional matrix

Problem Statement: Standard Principal Component Analysis does not apply directly to a sparse matrix, so what approach should be taken to reduce computational cost and memory usage?

A sparse matrix is a matrix in which the number of zero-valued components is much higher than the number of non-zero components; only a small fraction of its entries are non-zero.

Approach:

Let us take the example of data X consisting of n-dimensional vectors. Matrix X is decomposed into a product of smaller matrices such that the squared reconstruction error is minimized.

X ≈ AS

Matrix Decomposition into a product of smaller matrices
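
As a concrete illustration, the best rank-k factorization in the squared-error sense can be obtained from the truncated SVD. The sketch below is illustrative only: the matrix sizes, the rank k and the factor names A and S are assumptions, not taken from the article.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # toy data matrix
k = 5                               # target rank

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
A = U[:, :k] * sigma[:k]            # 200 x k factor
S = Vt[:k, :]                       # k x 50 factor

# By the Eckart-Young theorem, no other rank-k factorization X ~ A S has a
# smaller squared reconstruction error than this one.
error = np.sum((X - A @ S) ** 2)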

The different algorithms that can be used for sparse PCA are:

· Eigenvalue Decomposition

· EM Algorithm

· Newton's Method

· GRQI (Generalized Rayleigh Quotient Iteration)

· sSVD (Sparse Singular Value Decomposition), the generalization of the sparse eigenvector problem to an arbitrary rectangular (m x n) matrix

Eigenvalue Decomposition: This method computes the covariance matrix of the data and its eigenvectors; the leading eigenvectors give the principal components.

EM Algorithm: This method iterates between an E-step, which estimates the low-dimensional representation of the data given the current principal subspace, and an M-step, which updates the subspace to best reconstruct the data, so the full covariance matrix never needs to be formed explicitly. A minimal sketch is shown below.
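
The following numpy sketch follows Roweis' EM algorithm for PCA; the data handling, the number of components k and the iteration count are illustrative assumptions, not code from the article.

import numpy as np

def em_pca(X, k, n_iter=200):
    """Estimate the top-k principal subspace of X (rows = observations)."""
    Xc = (X - X.mean(axis=0)).T                                    # d x n centered data
    C = np.random.default_rng(0).normal(size=(Xc.shape[0], k))     # random initial subspace
    for _ in range(n_iter):
        # E-step: latent coordinates of each point in the current subspace
        Y = np.linalg.solve(C.T @ C, C.T @ Xc)                     # k x n
        # M-step: subspace that best reconstructs the data from those coordinates
        C = Xc @ Y.T @ np.linalg.inv(Y @ Y.T)                      # d x k
    Q, _ = np.linalg.qr(C)                                         # orthonormal basis of the subspace
    return Q

# Usage (illustrative): principal_subspace = em_pca(X, k=5)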

Newton's Method is popular for its fast convergence, but it is complex and costly: it requires the Hessian matrix, whose computation for a high-dimensional problem is too expensive. In such a scenario one can still proceed with only the diagonal of the Hessian, together with a control parameter that interpolates between standard gradient descent and the diagonal Newton's method.

I have used a modified version of whuber's code and the irlba package in R to deal with a high-dimensional matrix. I am sharing an example below of how we can apply Principal Component Analysis to a sparse matrix in R.

R Code:

library('Matrix')

library('irlba')

set.seed(42)

p <- 500000

q <- 100

i<- unlist(lapply(1:p, function(i)rep(i,sample(25:50,1))))

j<- sample(1:q, length(i), replace = TRUE)

x <- sparseMatrix(i,j,x=runif(length(i)))

t_comp<-50

system.time({

xt.x<-crossprod(x)

x.means<-colMeans(x)

xt.x <- (xt.x - p*tcrossprod(x.means))/(p-1)   # p = number of rows of x

svd.0<-irlba(xt.x, nu =0, nv = t_comp, tol = 1e-10)

})

#user system elapsed

# 0.20 0.030 2.923

system.time(pca<-prcomp(x,center=TRUE))

#user system elapsed

#32.178 2.702 12.322

max(abs(pca$center - x.means))

max(abs(xt.x - cov(as.matrix(x))))

max(abs(abs(svd.0$v/pca$rotation[,1:t_comp])- 1))

Using a stub function get.col to reduce RAM usage

The R code below demonstrates the approach of using a stub function, get.col, that reads one column of X at a time and thereby reduces RAM usage. The principal components are then computed in two different ways:

1. Using SVD (Singular Value Decomposition) on the centered cross-product matrix

2. Using prcomp directly (Principal Component Analysis)

R-code:

p <- 500000

q <-100

library('Matrix')

x <- as(matrix(pmax(0, rnorm(p*q, mean=-2)), nrow=p), "sparseMatrix")

# Compute centered version of x'x by having at most two columns of x in memory at any time

get.col<-function(i) x[,i] # — Emulates reading a column — #

system.time({

xt.x <- matrix(NA_real_, q, q)

x.means <- numeric(q)

for (i in 1:q) { i.col <- get.col(i)

x.means[i] <- mean(i.col)

xt.x[i,i] <- sum(i.col*i.col)

if(i < q) {

for(j in (i+1):q) {

j.col <- get.col(j)

xt.x[i,j] <- xt.x[j,i] <- sum(j.col*i.col)

}

}

}

xt.x <- (xt.x - p * outer(x.means, x.means, `*`)) / (p-1)

svd.0 <- svd(xt.x / p)

}

)

system.time(pca <- prcomp(x, center=TRUE))

#

# Checks: Ensure all are essentially zero

#

max(abs(pca$center - x.means))

max(abs(xt.x - cov(as.matrix(x))))

max(abs(abs(svd.0$v / pca$rotation) - 1))

# (This last check can be numerically unstable.)

When the number of columns is increased to 12,000, the computation takes about 19 minutes to calculate the first 50 principal components, and RAM consumption is around 6 GB.

The advantage of using the irlba package in R is that you can specify nu/nv to limit the algorithm to the first n principal components, which greatly increases its efficiency and bypasses the explicit calculation of the covariance (X'X) matrix.

Functionality of irlba package in R:

Sample:

irlba(A, nv = 5, nu = nv, maxit = 100, work = nv + 7, reorth = TRUE, tol = 1e-05, v = NULL, right_only = FALSE, verbose = FALSE, scale = NULL, center = NULL, shift = NULL, mult = NULL, fastpath = TRUE, svtol = tol, smallest = FALSE, ...)

Some of the key arguments are described below:

A = a numeric real- or complex-valued matrix, or a real-valued sparse matrix

nv = number of right singular vectors to estimate

nu = number of left singular vectors to estimate (defaults to nv)

maxit = maximum number of iterations

work = working subspace dimension, larger values can speed convergence at the cost of memory usage

reorth = if TRUE, apply full reorthogonalization to both SVD bases, otherwise only apply reorthogonalization to the right SVD basis vectors.

verbose = logical value; when TRUE, status messages are printed during the computation.
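
Before turning to the Python examples, note that Python has a close analogue of irlba: scipy.sparse.linalg.svds likewise computes only the leading singular vectors of a sparse matrix with an iterative method, without ever forming X'X. The sketch below is an illustration under assumed inputs; the matrix, its density and the choice of 50 components simply mirror the R example above and are not from the original project.

import scipy.sparse as sp
from scipy.sparse.linalg import svds

x = sp.random(500000, 100, density=0.3, format='csc', random_state=42)  # random sparse matrix
u, s, vt = svds(x, k=50)   # first 50 singular triplets; X'X is never formed explicitly

# Note: for a true PCA the columns of x would first need to be centered (or the
# centering handled implicitly), as discussed later in the article.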

Python programming: conducting Principal Component Analysis on a sparse matrix using an SVD approach for feature selection

Sample code for Principal Component Analysis in Python on 2-D data

import numpy as np
import matplotlib.pyplot as plt

# Note: matplotlib.mlab.PCA has been removed from recent matplotlib releases,
# so the PCA function defined below is used instead.

data = np.random.randint(10, size=(10, 3)).astype(float)

def PCA(data, dims_rescaled_data=2):
    # Returns the data transformed into dims_rescaled_data dimensions/columns,
    # plus the eigenvalues and eigenvectors needed to regenerate the original data.
    # Pass in: data as a 2-D NumPy array.
    import numpy as NP
    from scipy import linalg as LA
    m, n = data.shape
    data -= data.mean(axis=0)
    # Calculate the covariance matrix
    R = NP.cov(data, rowvar=False)
    # Calculate the eigenvectors and eigenvalues of the covariance matrix;
    # we use eigh instead of eig because the covariance matrix is symmetric
    evals, evecs = LA.eigh(R)
    # Sort the eigenvalues in decreasing order
    idx = NP.argsort(evals)[::-1]
    # Sort the eigenvectors according to the same index
    evecs = evecs[:, idx]
    evals = evals[idx]
    # Select the first dims_rescaled_data eigenvectors (the desired dimension of the rescaled data)
    evecs = evecs[:, :dims_rescaled_data]
    return NP.dot(evecs.T, data.T).T, evals, evecs

# In this scenario we are recovering the original data array from the eigen vectors of its covariance matrix and comparing that ‘recovered’ array with the original data

def test_PCA(data):
    original = data.copy()
    means = data.mean(axis=0)
    # Keep all components so the reconstruction is exact up to rounding error
    data_rescaled, evals, evecs = PCA(data.copy(), dims_rescaled_data=data.shape[1])
    data_recovered = np.dot(data_rescaled, evecs.T) + means
    assert np.allclose(original, data_recovered)

def plot_pca(data):
    from matplotlib import pyplot as MPL
    clr1 = '#2026B2'
    fig = MPL.figure()
    ax1 = fig.add_subplot(111)
    # PCA returns the rescaled data, the eigenvalues and the eigenvectors
    data_resc, evals, evecs = PCA(data)
    ax1.plot(data_resc[:, 0], data_resc[:, 1], '.', mfc=clr1, mec=clr1)
    MPL.show()
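
To tie the pieces together, here is a short usage example on the sample data created above (this snippet is an addition for illustration; since PCA() centers its input in place, copies are passed in):

data_rescaled, evals, evecs = PCA(data.copy())   # project onto the first two principal components
test_PCA(data.copy())                            # sanity check: the original data can be recovered
plot_pca(data.copy())                            # scatter plot of the projected data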

PCA Graph

If we are not able to center the data, the geometric interpretation of PCA tells us that the first principal component will lie close to the vector of means, and all subsequent principal components will be orthogonal to it, which prevents them from approximating any principal component that happens to lie close to that first vector.
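
A small numpy illustration of this effect (the synthetic data here is purely illustrative): with a large uncentered mean, the leading singular vector locks onto the mean direction rather than the direction of largest variance.

import numpy as np

rng = np.random.default_rng(0)
# Largest variance along the 2nd coordinate, but a large mean along the 1st
X = rng.normal(size=(2000, 5)) * np.array([1.0, 5.0, 1.0, 1.0, 1.0]) + np.array([50.0, 0, 0, 0, 0])
mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))

_, _, vt_raw = np.linalg.svd(X, full_matrices=False)                   # no centering
_, _, vt_ctr = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)  # centered

print(abs(vt_raw[0] @ mean_dir))   # close to 1: first component hugs the mean vector
print(abs(vt_ctr[0] @ mean_dir))   # close to 0: first component reflects the real variance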

In the above example we also face the difficulty of dealing with an n-dimensional sparse matrix, where explicit centering would destroy the sparsity. So how should we proceed with PCA?

In this scenario, the principal component analysis proceeds with an SVD (Singular Value Decomposition) approach, which is explained below:

SVD stands for Singular Value Decomposition. It factors a single matrix into the product of a left orthogonal matrix, a diagonal matrix and a right orthogonal matrix. A Python sketch of the approach is shared below, with the number of components taken as 200 in the original project.

Since standard principal component analysis does not apply directly to a sparse matrix, we take the truncated SVD approach in the example below.

The approach below was used on text data from a Natural Language Processing project, where the column features were created using Truncated SVD.

Usage of Truncated SVD for feature selection with Tfidf and Count Vectorizer

Count Vectorizer takes each comment in the data set and creates a column for each pair of adjacent words, assigning a frequency to each combination of words. count_vectorized is therefore the matrix that is transformed into the Tf-idf format using tfidf_vect, and the same process was repeated on the test data set. Tf-idf stands for term frequency-inverse document frequency.

How does this happen?

➢ Count Vectorizer takes each text and creates a column for each word that exists in the corpus, recording the number of times that word occurs in a given text.

➢ Count Vectorizer tokenizes the documents and counts the occurrences of tokens, returning them as a sparse matrix.

➢ Tfidf Transformer applies inverse-document-frequency normalization to that sparse count matrix. The same result is achieved by converting the raw text directly into Tf-idf features with the Tfidf Vectorizer, which is equivalent to Count Vectorizer followed by Tfidf Transformer.

By using this approach we ensure that the matrix is centered and that all three factors of the decomposition, the left orthogonal, the diagonal and the right orthogonal matrix, are utilized.
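
Since the project's own code is not reproduced here, the snippet below is a minimal sketch of the pipeline just described, using scikit-learn's TfidfVectorizer and TruncatedSVD. The toy corpus, variable names and the small component count are illustrative assumptions; the original project used 200 components on a much larger corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "never jump over the lazy dog quickly",
    "a quick brown dog outpaces a quick fox",
]

# TF-IDF features: equivalent to Count Vectorizer followed by Tfidf Transformer
tfidf_vect = TfidfVectorizer(ngram_range=(1, 2))   # single words plus adjacent word pairs
tfidf = tfidf_vect.fit_transform(corpus)           # sparse document-term matrix

# Truncated SVD keeps only the leading components; unlike PCA it works directly
# on the sparse matrix without densifying it (200 components in the original project,
# only 2 here because the toy corpus is tiny)
svd = TruncatedSVD(n_components=2, random_state=42)
features = svd.fit_transform(tfidf)                # dense (n_documents, n_components) features

# For a test set, reuse the fitted objects: svd.transform(tfidf_vect.transform(test_corpus))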

Mathematical Structure of SVD (Singular Value Decomposition):

C = U(Sigma)V^T

C^TC = V(Sigma)^T(Sigma)V^T

CV = U(Sigma)

The components U and V are orthogonal matrices, which means their columns form orthonormal sets, and Sigma is a diagonal matrix containing the singular values.
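
A quick numpy check of these identities (the matrix C here is an arbitrary random example):

import numpy as np

rng = np.random.default_rng(1)
C = rng.normal(size=(6, 4))

U, sigma, Vt = np.linalg.svd(C, full_matrices=False)
Sigma = np.diag(sigma)

print(np.allclose(C, U @ Sigma @ Vt))                      # C = U Sigma V^T
print(np.allclose(C.T @ C, Vt.T @ Sigma.T @ Sigma @ Vt))   # C^T C = V Sigma^T Sigma V^T
print(np.allclose(C @ Vt.T, U @ Sigma))                    # C V = U Sigma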

Challenges: Principal component analysis on a sparse, high-dimensional data set faces two major challenges:

→ Standard algorithms are computationally inefficient.

→ An over-fitted model does not generalize well to new data.

Conclusion:

  1. In R programming, the best way to conduct Principal Component Analysis on a sparse matrix while limiting RAM usage is to use the irlba package. The package also lets you specify scaling and centering of the data. The convergence time for a high-dimensional matrix on half a million records was around 19-20 minutes, with RAM consumption of about 6 GB.
  2. The system elapsed time observed in R, on average, was around 0.25 to 0.30 seconds.
  3. Singular Value Decomposition works well with sparse matrix data and centers the data according to the dimensional requirement.
  4. For Python programming, the truncated Singular Value Decomposition approach is the most practical way to deal with a sparse matrix at lower cost and higher efficiency.

Thank you for reading through the article. Please provide your valuable feedback.

Sources:

scikit-learn.org

stackoverflow.com

Data Science with Python, Coursera

R Programming, Coursera

(irlba Package source: https://cran.r-project.org/web/packages/irlba/irlba.pdf)

Sapanjeet Chatwal

I am a part-time researcher and a full-time business analytics student at the University of Connecticut, USA. I love traveling and nature photography.