PCA — Demystified.

Jaya Surya M · Published in Analytics Vidhya · 7 min read · Oct 3, 2020


Often in machine learning, datasets have many features from which predictions are to be made. Principal Component Analysis (PCA) is a technique used to reduce the number of dimensions. It is often misinterpreted as a feature selection technique, but it is actually feature extraction. In this article, I will give you an idea of PCA, the mathematics involved, and why and when we use it.

Introduction

The curse of dimensionality is a situation where the number of attributes in the data is large relative to the number of observations. Machine learning models can benefit from a high number of features, but too many features result in sparsity of the data points. To illustrate the problem, consider the distance between 0 and 1 on a line, which is 1. If we add another dimension, the points become (0,0) and (1,1) and the distance increases to the square root of 2. In three dimensions, the distance increases to the square root of 3. If there are many points in this range, the similarities or differences between them are still captured, but if there are few, the points become sparse and more distant from one another. The underlying statement is: "The quality of the features is more important than the number of features used for training the model."
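To make that growth concrete, here is a quick sketch (the range of dimensions shown is an arbitrary choice) of how the corner-to-corner distance of the unit hypercube grows as dimensions are added:

```python
import numpy as np

# Distance between the all-zeros corner and the all-ones corner of a unit hypercube
for d in [1, 2, 3, 10, 100]:
    origin = np.zeros(d)
    corner = np.ones(d)
    print(f"{d:>3} dimensions: distance = {np.linalg.norm(corner - origin):.3f}")  # equals sqrt(d)
```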

As a rule of thumb, each feature should have at least 5 observations, and it also depends on whether the data can capture the various possible combinations. In simple words, the data frame should not be wide; it should be tall. If the data has very high dimensionality, the feature space will have empty gaps and the data points will be scattered. Simply dropping features is not a wise idea, as they may contain important information.

The table on the left is prone to the curse of dimensionality; it can be transformed into the table on the right, which captures the different instances. (Source)

Domain knowledge can help transform the data frame to a lower dimension, as shown in the above example. This isn't the situation in most cases, which is where PCA comes to the rescue. The objective of PCA is to capture the maximum variance in the data set, but in fewer dimensions.

Perceptual Art Analogy

Let us see an example to understand how PCA works. Consider a perceptual art piece like the image below: the art takes a recognizable form only when viewed from a particular angle. In all other views, it appears to be random pieces scattered around the space (the curse of dimensionality). The best view is the one where the art collapses into a 2D image and all the white empty space is omitted. In the same way, PCA tries to find the best direction with which it can reduce the sparsity in the feature space. Throughout this article, I will use this analogy, so phrases like "the best view" refer to the ideal features of a dataset.

Perceptual art by Michael Murphy

Let's dive into the mathematics behind PCA and understand how it finds the best direction to view from. In the code snippet below, two correlated random variables x1 and x2 are created. A random data point from the data set is taken and repeatedly multiplied (dot product) with the covariance matrix. The resultant vectors from the dot product converge to the same slope (try it out for any values of x1 and x2). Effectively, the covariance matrix is projecting the data points onto one line.
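A minimal NumPy sketch of that experiment follows; the exact data generation and the number of iterations are illustrative assumptions, but the converging slope is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated random variables (x2 depends on x1 plus some noise)
x1 = rng.normal(0, 1, 500)
x2 = 0.8 * x1 + rng.normal(0, 0.3, 500)
data = np.column_stack([x1, x2])

cov = np.cov(data, rowvar=False)              # 2x2 covariance matrix

# Take a random data point and repeatedly multiply it by the covariance matrix
v = data[rng.integers(len(data))]
for i in range(10):
    v = cov @ v
    v = v / np.linalg.norm(v)                 # keep only the direction
    print(f"iteration {i + 1}: slope = {v[1] / v[0]:.4f}")

# The slope converges to that of the leading eigenvector of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)
lead = eigvecs[:, np.argmax(eigvals)]
print("leading eigenvector slope:", lead[1] / lead[0])
```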

Converging slopes of the vectors after repeated multiplication by the covariance matrix.

Now, what if the vector (a randomly selected data point) already has that converging slope? The resultant vector from the dot product would have the same direction as the initial vector but differ in magnitude. Consider 'e' to be the vector and [cov] to be the covariance matrix; the equation below represents this criterion. Lambda is a scalar value, which determines the change in magnitude. This is the standard eigenvector equation: the vector 'e' is called an eigenvector and lambda is called an eigenvalue.
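In symbols, with C denoting the covariance matrix:

$$C\,e = \lambda\,e$$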

The Magnitude changes without any change in direction for eigenvectors upon Linear Transformation. (Source)

This looks good, but how will this help with dimensionality reduction? In this two-dimensional data, if we project all points onto one line (the principal axis) while retaining the maximum variance (spread) in the data, dimensionality reduction is achieved. We make that line capture the maximum variability of the data points, transforming 2D data into 1D data. In the perceptual art interpretation, we are trying to see the 2D image of the art, which is of lower dimension.
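Continuing the earlier sketch, projecting each 2D point onto the leading eigenvector gives a single coordinate per point (variable names carried over from that snippet):

```python
# `data` and `lead` come from the earlier snippet (the 2D points and the
# leading eigenvector of their covariance matrix).
centered = data - data.mean(axis=0)        # PCA projects the centred points
projected = centered @ lead                # one number per point: the 1D representation
back_in_2d = np.outer(projected, lead)     # the same points, now all lying on one line
```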

The Best Possible Direction. (source)

Why is the eigenvector the best possible view?

The criterion for the best vector is that it should capture the maximum variance. Let "d" represent the data points and "d1" be the projection of "d" onto an arbitrary vector 'j'. Maximizing the variance of d1 gives the best 'j'. Since only the direction matters, 'j' is constrained to unit length; a Lagrange multiplier is used to enforce this normalization constraint during the maximization.

Equating the derivative of the variance of d1 to zero, we get the eigenvector equation for the covariance matrix of the data. That implies 'j', the best view, is actually an eigenvector. I am keeping the math simple, but for a deeper understanding you can refer to the link "Maximum Variance".
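Sketched in symbols, with C the covariance matrix of the data and j the candidate unit-length direction:

$$\max_{j}\; \operatorname{Var}(d_1) = j^{T}C\,j \quad \text{subject to} \quad j^{T}j = 1$$

$$\mathcal{L}(j,\lambda) = j^{T}C\,j - \lambda\,(j^{T}j - 1), \qquad \frac{\partial \mathcal{L}}{\partial j} = 2Cj - 2\lambda j = 0 \;\Longrightarrow\; Cj = \lambda j$$

Substituting back, the captured variance j^T C j equals λ itself, so the direction with the largest eigenvalue captures the most variance.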

Orthogonality of Eigenvectors

A property of eigenvectors is that when the matrix is symmetric, the corresponding eigenvectors are orthogonal to each other. Consider an n-by-n matrix C, with eigenvectors e1 and e2 (each n-by-1) and λ1 and λ2 as their respective eigenvalues.

Since the covariance matrix C is symmetric, we can write the following equations:
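In matrix notation, with e1 and e2 written as column vectors:

$$\lambda_1\, e_1^{T}e_2 = (C e_1)^{T}e_2 = e_1^{T}C^{T}e_2 = e_1^{T}C\,e_2$$

$$\lambda_2\, e_1^{T}e_2 = e_1^{T}(C e_2) = e_1^{T}C\,e_2$$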

From the above two equations, we get:
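Both right-hand sides are the same scalar, so subtracting one equation from the other gives:

$$(\lambda_1 - \lambda_2)\, e_1^{T}e_2 = 0$$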

If λ1 and λ2 are distinct, the dot product of e1 and e2 must be zero. This indicates that the eigenvectors are orthogonal. If they were not orthogonal, the new feature space would have coordinate axes at an angle other than 90 degrees. Hence, these vectors become the new coordinate system.

The flow of PCA.

  • Extract the independent variables. Note that the target attribute should not be included in the PCA transformation.
  • Normalize the data; centre the values around the origin.
  • Extract the eigenvectors and eigenvalues from the covariance matrix.
  • Sort the eigenvalues and choose the top eigenvectors (principal axes) that capture the desired amount of variance (95% in general).
  • Project the data points onto the selected eigenvectors, creating the new dimensions. The dot product returns the transformed coordinates.
  • Build the model on these new dimensions.

Each new dimension that is formed has a contribution from every single old attribute. A new dimension D can be seen as a linear combination of the original dimensions, and the beta values in that combination denote the influence of a particular original feature on the newly extracted feature. In this way, n new dimensions are created, up to D(n). Until this point, there is no reduction in dimensionality. After analyzing the variance captured by the new features, the unwanted dimensions are dropped, thereby achieving dimensionality reduction.
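Written out for the first new dimension (the x's are the original features and the betas are its loadings):

$$D_1 = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$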

Here is a Python implementation to see PCA in action.
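Below is a minimal NumPy sketch of the steps listed above; the synthetic data frame, the number of features, and the 95% variance threshold are illustrative assumptions, not the original dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: 200 observations, 13 correlated features (illustrative only)
latent = rng.normal(size=(200, 4))
X = latent @ rng.normal(size=(4, 13)) + 0.3 * rng.normal(size=(200, 13))

# 1. Standardize: centre each feature around the origin and scale to unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigen-decomposition of the covariance matrix
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: the covariance matrix is symmetric

# 3. Sort eigenvalues (and their eigenvectors) in decreasing order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Keep the top components that capture ~95% of the variance
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95) + 1)
print(f"{k} components capture {explained[k - 1]:.0%} of the variance")

# 5. Project the data onto the selected eigenvectors (the new dimensions)
X_pca = X_std @ eigvecs[:, :k]
print("original shape:", X.shape, "-> reduced shape:", X_pca.shape)
```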

In the example, the first 10 principal components capture 96% of the total variance, and the data frame is reduced by 3 features.

When to Use PCA?

One thing to keep in mind is that PCA is effective only when most of the features exhibit good correlation. To see this, take a look at the image below. When there is a strong correlation, a best-fit line can be found that leaves little sparsity around it. In the third graph, the data points lie far from any best-fit line, so more dimensions would be required to capture a good share of the variance.

Correlated Data (Source)
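The same effect can be checked numerically. In this quick sketch (the data generation is an arbitrary assumption), the leading principal axis of a strongly correlated pair captures far more of the total variance than that of an uncorrelated pair:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# A strongly correlated pair of features vs. an uncorrelated pair
a1 = rng.normal(size=n)
a2 = 0.9 * a1 + rng.normal(scale=0.2, size=n)
b1 = rng.normal(size=n)
b2 = rng.normal(size=n)

def first_component_share(X):
    """Fraction of total variance captured by the leading principal axis."""
    X = X - X.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    return eigvals.max() / eigvals.sum()

print("correlated pair:  ", round(first_component_share(np.column_stack([a1, a2])), 3))
print("uncorrelated pair:", round(first_component_share(np.column_stack([b1, b2])), 3))
```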

The new dimensions that are extracted do not have any direct meaning. If interpretability isn't a requirement, then PCA is a good option. For images and video, PCA can be very effective, as the meaning of the individual features is irrelevant. In addition to PCA, explore SVD, another widely used dimensionality reduction technique.
