Principal Component Analysis

Ankit Biswas
Vision and Language Group
9 min read · Jun 2, 2020

Principal component analysis (PCA) is one of a family of techniques for taking high-dimensional data and using the dependencies between the variables to represent it in a more tractable, lower-dimensional form, without losing too much information. It is one of the simplest and most robust ways of achieving dimensionality reduction.

We start with p-dimensional vectors and want to summarize them by projecting them down onto a q-dimensional subspace. There are several equivalent ways of deriving the principal components mathematically. The simplest is to find the projections that maximize the variance, because the higher the variance, the easier it is to distinguish individual entities even after reducing the feature dimensions.
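As a quick preview (a minimal sketch assuming scikit-learn and NumPy; the random data, sizes, and variable names are purely illustrative), reducing p = 10 features to q = 3 looks like this:

```python
import numpy as np
from sklearn.decomposition import PCA

# toy data: 200 samples in p = 10 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))

# summarize by projecting down onto a q = 3 dimensional subspace
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (200, 3)
```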

Reducing Dimensions

The first principal component is the direction in space along which projections have the highest variance. The second principal component is the direction that maximizes variance among all directions orthogonal to the first. The kth component is the variance-maximizing direction orthogonal to the previous k − 1 components. There are p principal components in all.

Principal Components are orthogonal.

Important

Geometrically, PCA corresponds to “centering the dataset” and then rotating it to align the axis of highest variance with the principal axis.

Assumptions

  1. Sample size: A minimum of 150 observations and ideally a 5:1 ratio of observations to features.
  2. Correlations: The feature set is correlated, so the reduced feature set effectively represents the original data space.
  3. Linearity: The relationships between all variables are assumed to be linear, and the principal components are a linear combination of the original features.
  4. Outliers: No significant outliers in the data as these can have a disproportionate influence on the results.
  5. Large variance implies more structure: high variance axes are treated as principal components, while low variance axes are treated as noise and discarded.

Limitations

  1. Model performance: PCA can reduce model performance on datasets with little or no feature correlation, or on datasets that do not meet the assumption of linearity.
  2. Classification accuracy: The variance-based PCA framework does not consider the differentiating characteristics of the classes. Moreover, the information that distinguishes one class from another might lie in the low-variance components and may be discarded.
  3. Interpretability: Each principal component is a combination of the original features, which makes it difficult to attribute importance to any individual feature.

Graphical Justification

Consider a distribution of data in two dimensions as given below -

Random distribution of data

Let’s consider that x and y in this distribution are correlated. A new property can then be constructed by drawing a line through the center of the distribution and projecting every point onto it.

This new property is given by the linear combination w₁x + w₂y, where each candidate line corresponds to a particular pair of values w₁ and w₂.

The reconstruction error is defined as the average mean-squared distance between the original white dots and their projections (yellow dots) onto the principal components.

See how the reconstruction error varies.

Observe that the spread (variance) of the yellow projected dots reaches its maximum when the rotating black line becomes parallel to the direction marked by the pc1 vector. At the same instant, the reconstruction error reaches its minimum. Hence we can interpret that “the maximum variance” and “the minimum reconstruction error” are achieved at the same time.
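This can also be checked numerically. The sketch below (synthetic correlated 2-D data; the slope and noise level are arbitrary choices) sweeps the direction of the line and reports the angle of maximum projected variance and the angle of minimum reconstruction error, which come out the same:

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic correlated 2-D data, centred
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)
data = np.column_stack([x, y])
data -= data.mean(axis=0)

angles = np.linspace(0, np.pi, 360)
variances, errors = [], []
for theta in angles:
    w = np.array([np.cos(theta), np.sin(theta)])   # candidate unit direction
    proj = data @ w                                # scalar projections onto w
    recon = np.outer(proj, w)                      # projected ("yellow") points
    variances.append(proj.var())
    errors.append(((data - recon) ** 2).sum(axis=1).mean())

# both extrema occur at (essentially) the same angle
print(np.degrees(angles[np.argmax(variances)]),
      np.degrees(angles[np.argmin(errors)]))
```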

Note

In Linear Regression, the mean squared error that we minimize comes from projecting a point vertically onto the line. In PCA, by contrast, the mean squared error comes from projecting the point perpendicularly onto the line, as indicated in the figure above.

Mathematical Justification

Covariance

Covariance is a measure of the extent to which corresponding elements from two sets of ordered data move in the same direction. We use the following formula to compute covariance.
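For two variables X and Y observed over n samples, with means \bar{X} and \bar{Y}:

\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})

(Some texts divide by n − 1 to obtain the unbiased estimate.)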

Covariance Matrix

Variance and covariance are often displayed together in a variance-covariance matrix (also known as a covariance matrix). The variances appear along the diagonal, and the covariances appear in the off-diagonal elements, as shown below.
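For p variables X_1, …, X_p the covariance matrix v looks like

v = \begin{pmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) & \cdots & \mathrm{Cov}(X_1, X_p) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) & \cdots & \mathrm{Cov}(X_2, X_p) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(X_p, X_1) & \mathrm{Cov}(X_p, X_2) & \cdots & \mathrm{Var}(X_p) \end{pmatrix}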

Proof

For simplicity, we can assume that the data has been “centered” so that every variable has mean zero. Hence, if we write the centered data in a matrix x, where rows are objects and columns are variables, then the covariance matrix v is given by -
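v = \frac{1}{n} x^{T} x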

Suppose we have p-dimensional vectors and we want to project them onto a line through the origin. The mean of the projections will be zero, because the mean of the vectors is zero:
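\frac{1}{n} \sum_{i=1}^{n} (w \cdot x_i) = \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) \cdot w = 0

where x_i denotes the i-th (centered) data vector and w is a unit vector along the line.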

If we try to use our projected or image vectors instead of our original vectors, there will be some error. The difference is the error or residual of the projection given by the following expression -
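x_i - (w \cdot x_i)\, w, \qquad \| x_i - (w \cdot x_i)\, w \|^2 = \|x_i\|^2 - 2\,(w \cdot x_i)^2 + (w \cdot x_i)^2\,(w \cdot w)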

Since w is a unit vector, w·w = 1. Adding those residuals up across all the vectors gives -
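MSE(w) = \frac{1}{n} \sum_{i=1}^{n} \left( \|x_i\|^2 - (w \cdot x_i)^2 \right) = \frac{1}{n} \sum_{i=1}^{n} \|x_i\|^2 \;-\; \frac{1}{n} \sum_{i=1}^{n} (w \cdot x_i)^2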

The first summation doesn’t depend on w and hence has no role in minimizing the mean squared residual. To make the MSE small, we can make the second sum big, i.e., we can maximize -
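\frac{1}{n} \sum_{i=1}^{n} (w \cdot x_i)^2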

which we can see is the sample mean of the square of the dot product of w and x. The mean of a square is always equal to the square of the mean plus the variance -
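\frac{1}{n} \sum_{i=1}^{n} (w \cdot x_i)^2 = \left( \frac{1}{n} \sum_{i=1}^{n} w \cdot x_i \right)^2 + \mathrm{Var}\left[ w \cdot x_i \right]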

Since we’ve just seen that the mean of the projections is zero, minimizing the residual sum of squares turns out to be equivalent to maximizing the variance of the projections.

In general, we project onto multiple principal components. If those components are described by k orthogonal unit vectors, then the image of x is its projection onto the space spanned by these vectors. Hence this image of x can be represented by -
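\sum_{j=1}^{k} (w_j \cdot x)\, w_j

where w_1, …, w_k are the orthonormal unit vectors defining the components.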

The mean of the projection on each component is still zero. If we go through the same algebra for the mean squared error, it turns out that the cross-terms between different components all cancel out, and we are left with trying to maximize the sum of the variances of the projections onto the components.
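Concretely, using w_i \cdot w_j = 0 for i ≠ j and w_j \cdot w_j = 1,

\left\| x - \sum_{j=1}^{k} (w_j \cdot x)\, w_j \right\|^2 = \|x\|^2 - \sum_{j=1}^{k} (w_j \cdot x)^2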

Maximizing Variance

To maximize the variance, we can do our algebra in matrix form so that the calculations are not too tedious. If we stack our n data vectors into a matrix x (n × p) and write the weight vector as w (p × 1), then the projections are given by xw (n × 1). The variance is
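\sigma^2_{w} = \frac{1}{n} \sum_{i=1}^{n} (x_i \cdot w)^2 = \frac{1}{n} (xw)^{T} (xw) = \frac{1}{n}\, w^{T} x^{T} x\, w = w^{T}\, \frac{x^{T} x}{n}\, w = w^{T} v\, w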

We need to choose a unit vector w so as to maximize the variance. To do this, we must restrict the search to unit vectors, i.e., we need to constrain the maximization. The constraint is that the squared modulus of the vector w must equal 1. This can be handled by introducing a new variable, the Lagrange multiplier λ. Adding λ times the constraint equation to our objective function and then performing an unconstrained optimization maximizes the variance while taking the constraint into account. For our projection problem -
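L(w, \lambda) = w^{T} v\, w - \lambda\, (w^{T} w - 1), \qquad \frac{\partial L}{\partial \lambda} = -(w^{T} w - 1), \qquad \frac{\partial L}{\partial w} = 2\, v w - 2\, \lambda w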

Setting the derivatives to zero at the optimum, we get -
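w^{T} w = 1, \qquad v w = \lambda w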

Thus, the desired vector w is an eigenvector of the covariance matrix v, and the maximizing vector is the one associated with the largest eigenvalue λ.

We know that the covariance matrix v is a p × p matrix, so -

  • It will have p different eigenvectors.
  • It is symmetric, and hence the eigenvectors are orthogonal to one another.

Projecting onto a lower dimension

The eigenvectors of v are the principal components of the data. Since they are all orthogonal to one another, together they span the whole p-dimensional space.

The first principal component, i.e., the eigenvector, which corresponds to the largest value of λ, is the direction along which the data have the most variance. The second principal component, i.e., the second eigenvector, is the direction orthogonal to the first component with the most variance. Because it is orthogonal to the first eigenvector, their projections will be uncorrelated. In fact, projections on to all the principal components are uncorrelated with each other.
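To see why the projections are uncorrelated, note that for two distinct eigenvectors w_i and w_j (i ≠ j), using v w_j = \lambda_j w_j and orthogonality:

\mathrm{Cov}(x w_i,\, x w_j) = w_i^{T} v\, w_j = \lambda_j\, w_i^{T} w_j = 0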

Computing Covariance

If we need to use q principal components, we sort the eigenvectors by decreasing eigenvalue and choose the q eigenvectors with the largest eigenvalues to form a p × q matrix w, where each column is a different eigenvector (p × 1) of the covariance matrix v.

Using this weight matrix w (p × q), we can project our x matrix (n × p) into lower dimensions as xw (n × q), essentially reducing the dimensions of the data.
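Putting the whole recipe together, here is a minimal NumPy sketch of this projection step (the helper name pca_project and the toy data are illustrative, not from any particular library):

```python
import numpy as np

def pca_project(x, q):
    """Project the rows of x (n x p) onto the top q principal components."""
    x = x - x.mean(axis=0)                 # centre the data
    v = (x.T @ x) / len(x)                 # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(v)   # eigh: symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # indices of eigenvalues, largest first
    w = eigvecs[:, order[:q]]              # p x q weight matrix of top eigenvectors
    return x @ w                           # n x q projected data

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))
print(pca_project(x, 2).shape)             # (100, 2)
```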

Projecting onto lower dimensions

Applications of PCA

Image Compression

Image compression is a frequently occurring application of this dimensionality reduction technique. Principal component analysis can be used to reduce the dimensions of the matrix (image) and then project those reduced dimensions back to reform an image that retains most of its quality while being smaller in size.

As the number of principal components used to reconstruct the data increases, the quality of the recovered image and its fidelity to the original improve.
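As a rough sketch of the idea (a random array stands in for a real greyscale image, and keeping 32 components is an arbitrary choice), compression and recovery with scikit-learn might look like this:

```python
import numpy as np
from sklearn.decomposition import PCA

# a random 256 x 256 array stands in for a greyscale image;
# each row of the image is treated as one sample
image = np.random.rand(256, 256)

pca = PCA(n_components=32)                     # keep 32 of 256 components
compressed = pca.fit_transform(image)          # 256 x 32 compressed representation
restored = pca.inverse_transform(compressed)   # approximate 256 x 256 reconstruction

# a full accounting of the saving would also count pca.components_ and pca.mean_
print(compressed.shape, restored.shape)
```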

Recovering compressed images

Reducing Redundancies in data

Sometimes people gather tons of data with 20, 30, or more variables. They may think that they’re measuring 20 or 30 things, but they may be measuring fewer underlying factors (often called “latent traits”) than the number of measurements. Now, what happens when you have large amounts of features or if you keep on increasing the number of features in your data? A phenomenon known as the Curse Of Dimensionality begins to occur.

So many dimensions !!

The gist is that the more features you have, the more complicated your numerical computations become. This means that the performance of machine learning algorithms, which are mathematical computations on the data, will be affected. Each added variable can result in an exponential decrease in predictive power.

performance drops with a rise in features

Principal components are used to separate out these essential underlying factors. When you run a PCA, the first component points in the direction that accounts for the most variation in the n-variable space. The second component finds the next direction, perpendicular to the first, with the second-most variation, and so on for as many variables as you have. Having n variables means that we get n components, but it’s likely that the first few will account for almost all of the variation.
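As an illustration (the synthetic data below, with 3 latent factors behind 20 measured variables, is my own construction), the cumulative explained variance ratio saturates after the first few components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 3 latent factors expressed through 20 measured variables, plus a little noise
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(500, 20))

pca = PCA().fit(X)
print(np.cumsum(pca.explained_variance_ratio_)[:5])
# the first three components account for nearly all of the variance
```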


Bonus

For an interactive visualization of Principal Component Analysis, visit here
