From Projections to Principal Component Analysis

Paarth Bhatnagar
9 min read · Apr 27, 2022


The occurrence of an event might not always be explained by all the factors associated with it. The event might have occurred due to only a few core reasons, or it might have occurred due to several reasons, each contributing its share. When we try to model real-life problems with data, the data might be unusually large, with a lot of variables (also called features).

The core idea behind principal component analysis is to combine these variables into a smaller set of new variables in such a way that the “information” provided by the original set of variables is not drastically reduced. This in turn reduces the dimensionality of the data and leads to an improvement in overall computation time.

Although we will not get into serious math in this article, the reader is still expected to have a basic understanding of vectors, matrix multiplication, eigenvalues, and eigenvectors. We start by exploring the ideas behind vector projections, variance, and covariance, and then build up to principal component analysis.

What are projections?

Let X (yellow) and Y (blue) be two vectors in a vector space, as shown in the animation below. The projection of X onto Y can be thought of as a perpendicular shadow of the vector X on Y. Why the perpendicular shadow? If we take a close look, we will find that the perpendicular projection is the one that minimizes the “error” that occurs when projecting.

Vector Projections

This error is represented by the white line in the animation. The resultant vector Z (in pink) is the projection of X onto Y. The vector Z is just a scaled version of the vector Y. One thing to note is that if we try to project a vector onto itself, the projection is that vector itself.

The general formula for the projection of a vector onto another vector is given by:

Eq1: Projection of a vector X onto another vector Y:
proj_Y(X) = ((XᵀY) / (YᵀY)) Y
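As a quick sanity check, here is a minimal NumPy sketch of this formula; the helper function and the example vectors are arbitrary illustrative choices:

```python
import numpy as np

def project(x, y):
    """Project vector x onto vector y: ((x^T y) / (y^T y)) * y."""
    return (np.dot(x, y) / np.dot(y, y)) * y

# made-up example vectors
x = np.array([3.0, 1.0])
y = np.array([2.0, 2.0])

z = project(x, y)          # the "shadow" of X on Y, a scaled copy of y
error = x - z              # the white line in the animation

print(z)                   # [2. 2.]
print(np.dot(error, y))    # ~0.0, the error is perpendicular to y
```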

Variance and Covariance

The variance can be thought of as a measure of how much the data is spread around its mean. If we have a set of observations, say [-1, 0, 1], this dataset has a mean of 0 and a (population) variance of 2/3: the squared deviations from the mean are 1, 0, and 1, and their average is 2/3. The variance is only defined in one dimension and can be calculated using the following formula:

Eq2: Formula to calculate variance:
Var(x) = (1/N) Σᵢ (xᵢ − μ)², where the sum runs over the N observations xᵢ and μ is their mean.

Note: The above formula defines what we call the population variance. There is also the sample variance, which is essentially the variance of a sample of the data and uses N-1 in the denominator instead of N. In this article, we will be dealing with the population variance and population covariance everywhere.
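We can check the [-1, 0, 1] example with NumPy, whose np.var divides by N by default and switches to the sample variance with ddof=1:

```python
import numpy as np

data = np.array([-1.0, 0.0, 1.0])

print(np.var(data))           # 0.666..., population variance (divides by N)
print(np.var(data, ddof=1))   # 1.0, sample variance (divides by N - 1)
```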

In higher dimensions, to capture the co-variability of the data, we define something called the covariance. The covariance captures the relationship between each pair of dimensions in our data. In the example given below, to capture how the data in the x-component jointly varies with the data in the y-component, we need to calculate the covariance of x with y.

Example: Data

If Cov(x, y) is a large positive number, x and y have a strong positive covariance, which means that as x increases, y tends to increase too. If Cov(x, y) is a large negative number, they have a strong negative covariance, which means that as x increases, y tends to decrease. If the covariance is very close to 0, there is not much evidence of a linear relationship between x and y.

The covariance of a dataset can be calculated using the following formula:

Eq3: Formula to calculate covariance:
Cov(x, y) = (1/N) Σᵢ (xᵢ − μ_x)(yᵢ − μ_y), where the sum runs over the N observations and μ_x, μ_y are the means of x and y.

The variance is just the covariance of a component with itself. This notion helps us define the covariance matrix: a square, symmetric matrix that contains the covariance of each pair of dimensions present in the data. Below is the covariance matrix for 2-dimensional data; the matrix has the variance of each dimension along its diagonal.

Example: 2D covariance matrix:

C = | Var(x)     Cov(x, y) |
    | Cov(y, x)  Var(y)    |

In the general case, the covariance matrix is a D×D matrix, where D is the number of dimensions in our data, and it can be calculated using the formula below.

Eq4: Formula to calculate the covariance matrix:
C = (1/N) Σᵢ (xᵢ − μ)(xᵢ − μ)ᵀ, where the sum runs over the N data points, xᵢ is the i-th data point as a column vector, and μ is the mean vector.
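Here is a minimal NumPy sketch of this formula on some made-up 2-dimensional data; np.cov with rowvar=False and bias=True computes the same population covariance matrix, so we can compare against it:

```python
import numpy as np

# made-up 2-dimensional data: one row per point, one column per dimension
data = np.array([[2.1,  8.0],
                 [2.5, 10.0],
                 [3.6, 13.0],
                 [4.0, 15.0]])

mean = data.mean(axis=0)
centered = data - mean

# Eq4: C = (1/N) * sum over points of (x_i - mu)(x_i - mu)^T
cov_manual = centered.T @ centered / len(data)

# NumPy's built-in (bias=True -> divide by N, matching the population convention)
cov_numpy = np.cov(data, rowvar=False, bias=True)

print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))   # True
```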

What is dimensionality reduction?

If you were to transport a file so large that no ordinary storage device could hold it, you would naturally have to find a way to reduce its size while losing as little information as possible. The idea behind dimensionality reduction follows the same line. Data can be represented as a matrix, and this matrix can be multiplied by another matrix that projects our data onto a lower dimension. From this lower-dimensional representation, we try to reconstruct our original data. The difference between the original data and the data reconstructed from the lower-dimensional representation is called the reconstruction error. The goal is to minimize this reconstruction error.

The idea behind PCA

The idea behind PCA can be formulated in two ways: minimizing the reconstruction error, or maximizing the variance of our data points in the lower dimension we project them onto.

1. Minimizing the reconstruction error

Our data lies in a higher dimension than the one we project it onto. We want to project in such a way that when we try to get back to the original data, there is as little error as possible between the original representation and the reconstructed one. For example, in the figure below we have 2-dimensional data and we try to project it onto a 1-dimensional line.

If we choose the line in the first case, the data points lie really far away from their projected counterparts, which means that when we try to reconstruct the original data from the 1-dimensional representation, there will be a lot of error. If we choose the line in the second case, the data points lie comparatively close to the line we are projecting onto, so the reconstruction error will be smaller than in the first case. That is exactly what we want: to minimize this error.

Eq5: Minimizing the reconstruction error:
minimize (1/N) Σᵢ ‖xᵢ − x̃ᵢ‖², where x̃ᵢ is the reconstruction of the data point xᵢ from its lower-dimensional representation.

This then turns into an optimization problem in which we minimize the function that outputs the reconstruction error. Although we won't get into the math here, the reconstruction error turns out to be minimized when we project along the eigenvectors of the covariance matrix corresponding to the largest eigenvalues. We explore this idea of projecting along the eigenvectors with the largest eigenvalues in the next part.
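A small sketch of this idea on made-up 2-dimensional data, comparing the reconstruction error for two arbitrary candidate directions (stand-ins for the two lines in the figure, not the exact ones shown):

```python
import numpy as np

def reconstruction_error(data, direction):
    """Project centered data onto a unit direction, reconstruct, and
    return the mean squared distance to the original points."""
    u = direction / np.linalg.norm(direction)
    scores = data @ u                    # 1-D representation (one scalar per point)
    reconstructed = np.outer(scores, u)  # back to 2-D along the chosen line
    return np.mean(np.sum((data - reconstructed) ** 2, axis=1))

# made-up data lying roughly along the y = x direction, then centered
data = np.array([[-2.0, -1.9], [-1.0, -0.8], [0.0, 0.1],
                 [1.0, 1.1], [2.0, 1.5]])
data = data - data.mean(axis=0)

print(reconstruction_error(data, np.array([1.0, -1.0])))  # "bad" line, large error
print(reconstruction_error(data, np.array([1.0, 1.0])))   # "good" line, small error
```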

2. Maximizing the variance

When we have a dataset with a lot of features, it is not possible to explore all those features visually and pick out the most important ones.

Reducing the dimensions of the data can also be thought of as combining certain features of the data to make a new feature along a lower dimension that preserves as much information as possible about the original data. These new combined features are called the principal components of the data. The goal is to find the principal components that preserve the most information.

Looking at the example below, there are a bunch of 2-dimensional points, and the data tends to follow an increasing trend. We project those points onto a 1-dimensional line to give us a new feature. In the first case, when we project the data onto the 1-D line and look at that line on its own, we learn almost nothing about the original data: the information that the data was following an increasing trend is lost.

In the second case, when we project the data onto the respective 1-dimensional line, the projected data is well spread out on it, and if we look at the 1-D line on its own, we can still see that the projected data follows an increasing trend, but now we need just one dimension instead of two to explain it. The principal component in the second case captures the “variance” in our original data much better than the principal component in the first case.

How do we find these principal components? We want to project the data onto a direction “u_1”, a unit vector, such that the variance of the projected data is maximized. First, we center our data around the origin by subtracting the mean vector from each data point.

Example: Mean-centering the data (shown for the 3D case; extends to any number of dimensions).

Now we use the formula for projections we saw earlier to project data onto a hypothetical principal component “u_1”.

Eq6: Projection of v onto u_1:
proj_u1(v) = ((u_1ᵀv) / (u_1ᵀu_1)) u_1

Since u_1 is a unit vector, the term u_1ᵀu_1 is the squared norm of that vector, which for a unit vector is 1, so the projection simplifies to (u_1ᵀv) u_1. Now that the data has been projected, we want to maximize the variance of the projections. The quantity whose variance we maximize is the scalar u_1ᵀv, the magnitude by which the unit vector u_1 gets scaled during projection. The variance of the projected data is given by:

Var(u_1ᵀv) = (1/N) Σᵢ (u_1ᵀvᵢ)² = u_1ᵀ ( (1/N) Σᵢ vᵢvᵢᵀ ) u_1

If we look closely, the term in the middle, (1/N) Σᵢ vᵢvᵢᵀ, is exactly the covariance matrix, since our data already has a mean of 0! Call this matrix C. This then turns into an optimization problem in which we maximize the variance u_1ᵀCu_1 subject to the constraint that the norm of u_1 stays equal to 1. It turns out that solving this constrained maximization problem gives us the following equation, which tells us along which directions the variance is maximized.

Maximization problem: C u_1 = λ u_1

The lambda here is an eigenvalue, and this equation tells us that u_1 must be an eigenvector of the covariance matrix; in other words, the variance is maximized along eigenvectors.

If we left-multiply both sides by u_1ᵀ, we see that the variance u_1ᵀCu_1 is equal to the eigenvalue λ, and hence we choose the eigenvector corresponding to the largest eigenvalue to maximize the variance.

Here we have found the first principal component; likewise, we can find more principal components by extracting the subsequent eigenvalue and eigenvector pairs. Since the covariance matrix is always symmetric, these principal components will always be orthogonal to each other.
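We can check this numerically: on made-up data, the variance of the projections onto the top eigenvector of the covariance matrix equals the largest eigenvalue, and an arbitrary unit direction never does better. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up correlated 2-D data, then centered
data = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
data = data - data.mean(axis=0)

cov = data.T @ data / len(data)          # population covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices, ascending order
u1 = eigvecs[:, -1]                      # eigenvector of the largest eigenvalue

projected = data @ u1                    # scalar projections u_1^T v
print(np.var(projected))                 # variance along u_1 ...
print(eigvals[-1])                       # ... equals the largest eigenvalue

# a random unit direction never gives a larger projected variance
u_random = rng.normal(size=2)
u_random /= np.linalg.norm(u_random)
print(np.var(data @ u_random) <= eigvals[-1] + 1e-9)   # True
```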

Although we don't explore the math behind minimizing the reconstruction error in this article, the formulation of PCA that minimizes the reconstruction error and the formulation that maximizes the variance are one and the same: solving the minimization problem and solving the maximization problem give the same result.

Conclusion

In this article, we started by exploring the ideas behind projections, variance, covariance, and dimensionality reduction. We then saw how PCA reduces the dimensions of our data while preserving as much information as possible by finding the “principal components” of the data. Finally, we saw that PCA can be formulated in two ways: minimizing the error between the dimensionally reduced data and the original data, and maximizing the variance of the data when projecting it onto a lower-dimensional representation, both of which give the same result.

The algorithm behind PCA can be summed up in the following points (a short NumPy sketch of these steps follows the list):

  1. Centering the data: Finding out the mean vector of the data and subtracting it from each data point.
  2. Finding out the covariance matrix of the data.
  3. Eigen Decomposition: Finding out the largest eigenvalues and the corresponding eigenvectors of the covariance matrix.
  4. Projection: Projecting the data onto the eigenvector corresponding to the largest eigenvalue. This gives us the corresponding principal component of the data.
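Putting these four steps together, here is a minimal from-scratch PCA sketch in NumPy; the dataset and the choice of keeping one component are arbitrary illustrative assumptions:

```python
import numpy as np

def pca(data, n_components):
    """A bare-bones PCA: center, covariance, eigendecomposition, projection."""
    # 1. Centering the data
    mean = data.mean(axis=0)
    centered = data - mean

    # 2. Covariance matrix (population version: divide by N)
    cov = centered.T @ centered / len(data)

    # 3. Eigen decomposition (eigh is for symmetric matrices and
    #    returns eigenvalues in ascending order)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]             # largest eigenvalue first
    components = eigvecs[:, order[:n_components]]

    # 4. Projection onto the chosen principal components
    projected = centered @ components
    return projected, components, mean

# made-up 2-dimensional data reduced to 1 dimension
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
projected, components, mean = pca(data, n_components=1)

# reconstruction from the 1-D representation
reconstructed = projected @ components.T + mean
print(np.mean(np.sum((data - reconstructed) ** 2, axis=1)))  # reconstruction error
```

In practice you would typically reach for a library implementation, but the steps above are all there is to the basic algorithm.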

Further Readings and videos

  1. The math behind minimizing reconstruction error.
  2. The principal component analysis explained visually.
  3. Interactive article explaining PCA.
