PCA Disadvantages & how to resolve them

A dive into Incremental, Randomized, Kernel & Sparse PCA

Mehul Gupta
Data Science in your pocket
5 min read · Aug 5, 2020


If you have been in Data Science for some time, you must have heard of Principal Component Analysis (PCA), which is used for dimensionality reduction.

If not, just go through Post_1 & Post_2. These posts should clear every doubt you might have.

PCA can be summarized as follows:

courtesy: https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html

Were you able to figure out a few cons of standard PCA? Let me point out a few of them:

  • Standard PCA struggles with Big Data when we need out-of-core computation (i.e. when the data is too big to fit in RAM).
  • Also, standard PCA can detect only linear relationships between variables/features. What if the relationships are non-linear?
  • If I have many features in my dataset (say 400,000), & I have a rough idea that only ~400 of them would be useful (the others are just noise, zeros, or anything else, without knowing which ones), is there a way to shortlist these 400 features before applying PCA to save computation?
  • Sometimes, the transformed data we generate after applying PCA should ideally be sparse (why? this requires some explanation, which will be discussed later in the post), but standard PCA always generates dense expressions.

Once you are able to digest the above 4 points, let's figure out the solutions one by one.

Incremental PCA

Before starting off, we need to know a few terms:

  • Incremental Learning: It refers to training over continuous input data (which can be a stream of data flowing in continuously). How can this help? This approach helps us in 2 ways: A) when the training data is huge & can't fit in RAM at once, and B) it resolves issues due to data drift & concept drift.
  • np.memmap(): This function creates a memory-mapped array backed by a file on disk, using which we can access a small segment of the dataset without loading the entire file into memory.

Incremental PCA helps us resolve our 1st problem, i.e. PCA over big data where the entire dataset can't be accommodated in memory at once.

  • It follows the ideology of incremental learning by running SVD decomposition on multiple batches (say m batches made out of the dataset), each comprising ’n’ samples, to get to the final principal components.
  • These batches are loaded into memory using memory-mapped files generated with np.memmap(). Hence, a batch of data can be accessed without loading the entire dataset into memory.
  • As the algorithm used is out of scope for this post, you can read it here if you are excited enough. A minimal usage sketch is shown below.
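To make this concrete, here is a minimal sketch (the file name, dataset shape & batch size are assumptions for illustration) of out-of-core PCA using sklearn's IncrementalPCA fed from a NumPy memory-mapped file:

import numpy as np
from sklearn.decomposition import IncrementalPCA

n_samples, n_features = 100_000, 500  # assumed dataset shape
batch_size = 1_000

# Memory-map the dataset stored on disk ("data.dat" is a placeholder file)
X = np.memmap("data.dat", dtype="float32", mode="r",
              shape=(n_samples, n_features))

ipca = IncrementalPCA(n_components=50, batch_size=batch_size)

# Feed the data batch by batch; only one batch lives in RAM at a time
for start in range(0, n_samples, batch_size):
    ipca.partial_fit(X[start:start + batch_size])

# Transform any batch on demand, again without loading the full dataset
X_reduced = ipca.transform(X[:batch_size])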

Kernel PCA

  • As mentioned earlier, standard PCA can capture only linear relationships & hence, correct dimensionality reduction is possible only if the data has linear relations.
  • But when the relationships between different features in the dataset are non-linear, kernel PCA can be used.
  • It applies a transformation function (can be RBF/Gaussian, polynomial, sigmoid, etc.; similar to kernels in SVM) & then all the steps are similar to standard PCA. A small sketch follows this list.
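As a small sketch (toy two-circles data assumed for illustration), kernel PCA with an RBF kernel can recover non-linear structure that standard PCA cannot:

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: a non-linear structure standard PCA cannot untangle
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_linear = PCA(n_components=2).fit_transform(X)
X_kernel = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# In X_kernel, the two circles become separable along the leading component,
# while X_linear is essentially just a rotation of the original data.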

Randomized/Approximate PCA

A few terms to know:

  • The rank of a matrix:

It is the number of linearly independent rows/columns present in the matrix.

As row rank (number of independent rows) = column rank (number of independent columns), calculating either the row or the column rank won't make a difference.

By linearly independent, I mean that a row/column can't be produced as a linear combination of the other rows/columns. Do see this example.

  • Low-rank approximation:

It is a minimization problem involving the transformation of an A x B matrix of rank R into another A x B matrix of rank r, where r < R.

This new matrix is an optimized version of the original matrix (with the noise removed).

It must be noted that ‘r’ is decided by the user. A good resource for a deeper dive. A tiny SVD-based sketch follows below.
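As a tiny illustration (the matrix and the chosen rank here are arbitrary), a rank-r approximation can be built by keeping only the top r singular values/vectors:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 80))  # original matrix, full rank R = 80
r = 10                              # target rank, chosen by the user

# Truncated SVD: keep the top r singular values/vectors and rebuild
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

print(np.linalg.matrix_rank(A_r))   # 10: the approximation has rank r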

Consider we have a face recognition task with images of dimension 2056 x 2056 = 4,227,136 features on flattening. Also, we know there exist very few key features that contribute the most to recognizing a face, namely eyes, lips, ears, etc. (cheeks, chin, forehead, eyebrows, etc. won't play a massive role, as most humans have quite similar facial features), which accounts for, say, 400 features out of the 4,227,136.

Can we somehow ignore the remaining 4,226,736 features, as this would save a lot of computation?

We might ignore the computation cost if we have a dataset of about 1k-2k images, but when the dataset has about 10,000k-20,000k images, the problem becomes quite evident.

Here comes Randomized PCA to help us out. How?

It applies a low-rank approximation to the dataset. For the above case, we would keep ‘r’ = 400.

But how does it actually help us with improved computation for PCA?

  • The rank of a matrix determines the number of non-zero eigenvalues/singular values of the matrix.
  • So, when we transform our original data using a low-rank approximation, we are rejecting the unimportant eigenvalues & hence eliminating the least important principal components automatically.
  • Now, if it was earlier possible to get, say, ’n’ eigenvalues, we might get ‘m’ eigenvalues for the transformed dataset where m < n, hence the reduction in computation. The rest of the steps remain the same as in standard PCA. A quick sklearn sketch follows below.
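As a quick sketch (random data & sizes assumed), sklearn exposes this idea through the randomized SVD solver of PCA, which approximates only the leading components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((2_000, 5_000))  # assumed wide, noisy dataset

# svd_solver="randomized" computes only the top components (here r = 400),
# which is far cheaper than a full SVD when n_components << n_features
rpca = PCA(n_components=400, svd_solver="randomized", random_state=0)
X_reduced = rpca.fit_transform(X)

print(X_reduced.shape)  # (2000, 400)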

Sparse PCA

Reconsider the face recognition problem discussed under Randomized PCA. It can be the case that the data is sparse (non-zero values for just some parts of the face like lips, eyes, nose, etc., & 0s for the others).

Hence, after applying PCA, shouldn't the output generated also be sparse?

I feel yes, but standard PCA will generate dense expressions only.

How to resolve this?

Sparse PCA aims at extracting sparse components (loadings with many exact zeros) that best represent the data. To know how this is done, do read here. A brief sklearn sketch is shown below.
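As a brief sketch (toy random data assumed), sklearn's SparsePCA yields components containing many exact zeros, unlike the dense components of standard PCA:

import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 50))  # assumed dataset

dense = PCA(n_components=5).fit(X)
sparse = SparsePCA(n_components=5, alpha=1.0, random_state=0).fit(X)

# Fraction of exactly-zero loadings: ~0 for standard PCA, large for SparsePCA
print(np.mean(dense.components_ == 0))
print(np.mean(sparse.components_ == 0))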

It must be noted that all these variants are available in sklearn!! So no from-scratch code is required for implementation.

And that’s it!! Time to wrap this up.
