Understanding Principal Component Analysis Once And For All

Daniel Bestard Delgado · Published in bluekiri · Feb 13, 2018

Requisites

  • Basic concepts of statistics such as variance and covariance.
  • Basic knowledge of linear algebra such as matrix transposition, matrix inversion and diagonal matrices.

Introduction

Thanks to the astonishing advances in data analysis software over the last few years, data scientists can use highly complex statistical methods by just typing the .fit() command at the prompt. This is very helpful for reducing the time needed to develop a project, but it can also have a dangerous drawback: fully understanding the statistical method behind .fit() no longer seems that important. However, what if the output of that function does not seem to make sense? What tools do we have to deal with this situation? From my perspective there is only one way to deal with this scenario: understanding the basics of the statistical method that sits behind the programming command.

This was me admiring the “magic” of the .fit() command when I first started using PCA without understanding its mathematical reasoning

More concretely, in this article I would like to explain a commonly used mathematical transformation in the Data Science field, called Principal Component Analysis or PCA, which is not transparent to those who do not understand the mathematical reasoning behind it. I will keep the technicality to a minimum in order to make this article understandable to as many readers as possible.

Notation

Let’s start with some notation before going into the explanation of PCA. Let X be a matrix of dimension n x p, where n is the number of observations of a given data set and p is the number of predictors. Now let S denote the covariance matrix of X, that is,

S = Cov(X) = (1/(n-1)) XᵀX,

where each column of X is assumed to have been centered by subtracting its mean.

Remember that a covariance matrix is a p x p matrix where the diagonal elements correspond to the variance (a dispersion measure) of the covariates in X and the off-diagonal elements correspond to the covariance (a similarity measure) between two specific covariates. Keep in mind this straightforward detail: the covariance between a covariate X1 and a covariate X2 is the same as the covariance between X2 and X1. Therefore, the covariance matrix S is symmetric, which means that S is equal to its transpose,

S = Sᵀ.
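
As a quick sanity check, here is a minimal NumPy sketch (the data and variable names are purely illustrative) that builds the covariance matrix of a toy X and verifies that it is p x p and symmetric:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 4                     # n observations, p predictors
    X = rng.normal(size=(n, p))       # toy predictor matrix

    S = np.cov(X, rowvar=False)       # p x p covariance matrix of X
    print(S.shape)                    # (4, 4)
    print(np.allclose(S, S.T))        # True: S equals its transpose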

Motivating PCA

Imagine a scenario where p is very large and you would like to perform some kind of dimensionality reduction, that is, reduce the number of predictors, in order to do some exploratory analysis or fit a machine learning algorithm. We all know that reducing the dimensionality of the problem can decrease the variance of any statistical model, which might compensate for the increase in bias (the bias-variance trade-off). However, one drawback of performing dimensionality reduction is that we might lose important information. At the end of the day, there is no way that reducing, say, from 1,000 covariates to 10 does not imply some kind of information loss. So the next question that arises is: how do we measure the amount of information that each variable carries? Well, the variance of the covariate seems to be a good measure of information. The larger the variance, the larger the amount of information the variable contains.

Let’s make sure all of us understand this point, because it is critical to understanding PCA. Why is a high variance of a covariate good? Assume, for example, that you want to understand the effect of salary on a given outcome. Do you prefer a set of observations whose values for salary range from 1,000 to 2,000, or from 0 to 10,000? The larger the range of the variable salary, the more information we have about how that variable affects the outcome, right? Note that in the first scenario we have no way of understanding the effect of salary on the outcome when the salary is smaller than 1,000 or larger than 2,000, simply because there is no data in that part of the space. Therefore, the larger the range of the covariate, the more information. And what is the effect of the range of a variable on its variance? Exactly: the larger the range, the larger the variance (the dispersion increases).
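
To put rough numbers on this intuition, here is a tiny sketch comparing the variance of a narrow and a wide salary sample (the salary figures are made up purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    narrow = rng.uniform(1_000, 2_000, size=500)   # salaries between 1,000 and 2,000
    wide = rng.uniform(0, 10_000, size=500)        # salaries between 0 and 10,000

    print(narrow.var(ddof=1))   # around (2000 - 1000)**2 / 12, i.e. roughly 83,000
    print(wide.var(ddof=1))     # around (10000 - 0)**2 / 12, i.e. roughly 8,300,000

The wider sample has roughly a hundred times more variance, which is exactly the sense in which it carries more information about the effect of salary.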

Some people tend to associate the word variance with something negative, which makes sense because there are several contexts where it can mean something bad. For example, not only does the data have variance; estimators such as the sample mean also have a dispersion measure, called the standard error, which is defined as the square root of the variance of the estimator. Of course, in this case the variance is unwanted, because the larger it is, the more uncertain the estimate is. Another example where variance might be something bad is when it is present in the outcome variable. That is, if the variable that we want to predict has very high variance, it might be harder to make precise predictions (under some circumstances that will not be covered in this article). But remember that, as explained before, predictors with high variance provide more information than predictors with low variance!

So far we have come up with a very intuitive explanation of why we want the variance of the original predictors to be kept when performing dimensionality reduction. But wait… is the variance of the predictors all we care about when measuring the amount of information? No! Think of this: what if you have two covariates that both have a high variance, but are extremely similar to each other (say their correlation is close to 1)? In this case, what is the additional information that the second predictor provides with respect to the first one? The answer is that the additional information is almost null. Therefore, when performing dimensionality reduction we not only want the transformed predictors to keep the variance of the original ones, but also to be uncorrelated with each other.

Once we understand what a good dimensionality reduction should look like, let me introduce the definition of PCA: PCA is a mathematical approach that transforms the matrix of predictors X into another one of the same dimension, call it Y, such that

  • the covariance matrix of Y is diagonal, meaning that all the transformed predictors are uncorrelated, and
  • the transformed predictors are sorted in decreasing order of information, meaning that the diagonal entries of the covariance matrix of Y, which contain their variances (their amount of information, as explained earlier), decrease as we move down the diagonal (a short numerical check of both properties follows this list).
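
One quick way to see both properties in practice is to let scikit-learn do the transformation and inspect the covariance matrix of the result. This is only a hedged sketch on synthetic data (names and numbers are illustrative), not a prescription:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n = 500
    x1 = rng.normal(size=n)
    x2 = x1 + 0.1 * rng.normal(size=n)   # strongly correlated with x1
    x3 = rng.normal(size=n)
    X = np.column_stack([x1, x2, x3])

    Y = PCA().fit_transform(X)           # transformed predictors (principal components)
    S_Y = np.cov(Y, rowvar=False)

    print(np.round(S_Y, 3))              # off-diagonal entries are (numerically) zero
    print(np.diag(S_Y))                  # variances sorted in decreasing order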

Is there a mathematical way to achieve such a transformation? Well, there is! Let’s dig a bit into how PCA works mathematically without getting too technical.

Simplified Mathematical Development

Some of you might feel like this from this moment on. Stay with me, read it a second time if necessary and you will see it is not that hard

As we said before, let X be the original predictor matrix and Y the transformed predictor matrix, both of dimension n x p. Remember the ideal properties that Y should have (uncorrelated predictors, ordered in decreasing order of information). In order to transform X, we perform what is known in linear algebra as a change of variable, which means multiplying X by another, unknown matrix P of dimension p x p in order to come up with Y. That is:

Y = XP.

So far we know nothing about P besides that it has to be invertible in order to reconstruct X. Hence, the goal is to find a matrix P that performs a change of variable with the ideal characteristics that we are looking for.

In order to move forward we have to make use of a linear algebra result called the Diagonalization Theorem, which will not be proven here (do not worry, this is the only moment where I use something without explaining where it comes from; you can look up a mathematical proof, but it is not necessary to understand PCA). It says that a symmetric matrix, like the covariance matrix of X, also written as S, can be diagonalized as follows:

S = PDP⁻¹,

where D is a diagonal matrix of dimension p x p and P is a p x p invertible matrix, which we take to be precisely the matrix used in the change of variable above.
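
Although the theorem is not proven here, it is easy to check numerically. For a symmetric matrix, NumPy’s np.linalg.eigh returns a factorization of exactly this form (under the hood these are the eigenvalues and eigenvectors that this article deliberately leaves for a follow-up). A minimal sketch with illustrative data, whose second check anticipates the orthogonality property derived next:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    S = np.cov(X, rowvar=False)                       # symmetric covariance matrix

    d, P = np.linalg.eigh(S)                          # S = P D P⁻¹, with D = diag(d)
    D = np.diag(d)

    print(np.allclose(S, P @ D @ np.linalg.inv(P)))   # True: S is diagonalizable
    print(np.allclose(P.T @ P, np.eye(4)))            # True: P is orthogonal (Pᵀ = P⁻¹)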

Note that we still do not know much about P. Let’s show an important property of this matrix that will be used later on to derive PCA. It turns out that P is an orthogonal matrix, which means that its transpose is equal to its inverse. That is,

Pᵀ = P⁻¹,   or equivalently   PᵀP = PPᵀ = I.

Why does the previous equation hold? To show this, we need to remember that the covariance matrix is symmetric and, therefore, equal to its transpose:

S = Sᵀ.

Let’s see that the condition needed to make the previous equation hold is that P has to be orthogonal:

Sᵀ = (PDP⁻¹)ᵀ = (P⁻¹)ᵀ Dᵀ Pᵀ = (P⁻¹)ᵀ D Pᵀ,

using that D is diagonal and hence Dᵀ = D. For this to equal S = PDP⁻¹ we need (P⁻¹)ᵀ = P, that is, Pᵀ = P⁻¹.

Hence, P has to be orthogonal, since that is the only way the diagonalization can be consistent with the symmetry of the covariance matrix of X. But how does this finding help us create uncorrelated predictors? To see this, let’s derive the covariance matrix of the transformed predictors:

S_Y = Cov(Y) = (1/(n-1)) YᵀY = (1/(n-1)) (XP)ᵀ(XP) = Pᵀ ((1/(n-1)) XᵀX) P = PᵀSP.

In order to prove that the covariance matrix of Y is diagonal, which would mean that the transformed predictors are uncorrelated, I am going to demonstrate that the last expression of the previous derivation is equal to a diagonal matrix. Let’s start:

PᵀSP = Pᵀ(PDP⁻¹)P = Pᵀ(PDPᵀ)P = (PᵀP) D (PᵀP) = D,

where we used that Pᵀ = P⁻¹ and hence PᵀP = I. Therefore, the covariance matrix of Y is exactly the diagonal matrix D.
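
The whole chain of equalities can be checked numerically. In the following hedged sketch (same synthetic setup as before), the covariance of Y = XP computed directly from the data coincides with PᵀSP and with D:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    S = np.cov(X, rowvar=False)

    d, P = np.linalg.eigh(S)                 # diagonalization S = P D Pᵀ
    D = np.diag(d)

    Y = X @ P                                # change of variable Y = XP
    S_Y = np.cov(Y, rowvar=False)            # covariance of the transformed predictors

    print(np.allclose(S_Y, P.T @ S @ P))     # True: S_Y = PᵀSP
    print(np.allclose(S_Y, D))               # True: S_Y is the diagonal matrix D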

The last mathematical point to prove is that the total information of the principal components is the same as the total information of the original predictors. That is, let’s show that by applying PCA we have not lost information (we have just allocated it differently). Remember the change of variable from the beginning. Given that we have proven that P is orthogonal, we can write the change of variable as:

Y = XP,   with PᵀP = PPᵀ = I   (equivalently, X = YPᵀ).

This is a special case called an orthogonal change of variable, which keeps the total variance of the data unchanged, because multiplication by an orthogonal matrix changes neither the lengths of the vectors nor their angles (the columns of an orthogonal matrix are orthonormal, which means they have length one and are perpendicular to each other). Therefore, the total variance (or total amount of information) of the original predictors can be written as:

trace(S) = trace(PDPᵀ) = trace(DPᵀP) = trace(D),

where trace is simply the sum of the diagonal entries of a given matrix. Given that D contains all the variances of the principal components, their sum measures the total amount of information contained in X.
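
In code, this conservation of total variance is a one-line check (again a hedged sketch on synthetic data):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    S = np.cov(X, rowvar=False)

    d, P = np.linalg.eigh(S)
    Y = X @ P                                      # principal components
    D = np.cov(Y, rowvar=False)                    # (numerically) diagonal covariance of Y

    print(np.trace(S), np.trace(D))                # same total variance
    print(np.allclose(np.trace(S), np.trace(D)))   # True: no information has been lost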

Last Clarifications

Done! We have just proven that PCA leads to uncorrelated predictors! This way, every transformed covariate contains unique information that the others do not. Each of these transformed covariates is called a principal component. The first principal component corresponds to the first column of Y, which is also the one that carries the most information, because we order the transformed matrix Y in decreasing order of the amount of contained information (the first diagonal entry of the covariance matrix of Y is the largest one). Likewise, the second column of Y is called the second principal component, and so on.

The next and final step is to decide how many principal components to use in our analysis. This decision will be different in each scenario. A common approach is to create a barplot of the amount of information that each principal component carries and check whether there is a point where including more principal components leads to a very small increase in information that does not compensate for the increase in dimensionality. This barplot is often called a scree plot; a sketch for producing one follows.
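
Here is a hedged sketch of such a barplot using scikit-learn’s explained_variance_ratio_; the data set is synthetic and only meant to illustrate the shape of the plot:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    # synthetic correlated data: 300 observations, 10 predictors
    X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))

    pca = PCA().fit(X)
    plt.bar(range(1, 11), pca.explained_variance_ratio_)
    plt.xlabel("Principal component")
    plt.ylabel("Proportion of total variance (information)")
    plt.show()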

Conclusion

The goal of this article was not to explain all the technicalities of PCA. In fact, several critical concepts, such as the role of eigenvectors and eigenvalues, have barely been mentioned even though they play an important role in PCA. The goal instead was to help beginners get an intuitive overview of what PCA does without getting lost in mathematical details.

In a future article I intend to:

  • go deeper into the mathematical development by showing the role of eigenvectors and eigenvalues;
  • interpret the linear combinations that make up the principal components;
  • talk about Principal Component Regression (PCR);
  • explain the interpretation of biplots, which are simply a graphical representation of the most important principal components.
