Principal Component Analysis (PCA)

Conceptual deep dive with step-by-step implementation in numpy and sklearn.

Andrea Yoss
Analytics Vidhya
9 min read · Apr 12, 2020


“Finding patterns is easy in any kind of data-rich environment… the key is in determining whether the patterns represent noise or signal.”

— Nate Silver

Introduction

Bias-Variance Tradeoff

A typical issue students run into when fitting a model is balancing the model’s bias with its variance, known as the bias-variance tradeoff.

Bias is essentially a measure of “badness” — the higher the bias, the worse your model does when using the very data it was trained on. Typically, a model has high bias and is considered “underfit” when it performs poorly on the training data because it does not have enough information to work with. A good way to combat an underfit model is by adding more features to the model.

Unfortunately, adding additional features to the model increases its complexity, also known as its variance. While increasing the complexity of a model often increases its performance on the training data, it can be at the expense of how it performs on data it has never seen before (i.e., testing data). When a model performs much better on the training data compared to the testing data, the model is said to be “overfit.”

Another way of thinking about the bias-variance tradeoff is to think about it with respect to dimensionality. We can reduce the complexity of a model through dimensionality reduction — that is, through removing features from your model. While dimensionality reduction causes you to lose some information about your training data, thus increasing the bias, the process allows you to separate the “signal” from the “noise” and combat overfitting your model. Additionally, dimensionality reduction speeds up training and makes your data easier to visualize.

Types of Dimensionality Reduction

There are two types of dimensionality reduction: feature elimination and feature extraction. In feature elimination, you drop weak features from your model. By eliminating features that are not strong contributors to the overall success of your model, you are left with a reduced subset of your original features. LASSO regularization is an example of feature elimination.

In feature extraction, you create a set of new features, each a linear combination of your original features, and keep only the most important of them. Principal Component Analysis (PCA) is a form of feature extraction.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an unsupervised learning method that finds linear combinations of your existing features, called principal components, based on the directions of the largest variance in the input space. These combinations are constructed so that the principal components are linearly independent of one another. By keeping only the most “important” principal components, we can transform the original features onto a new, lower-dimensional subspace without significant information loss.

In PCA, a principal component’s importance is determined by how much information (variance) is captured by that relationship. So, the 1st principal component explains the largest amount of variance, followed by the 2nd principal component which explains the second highest amount of variance, followed by the 3rd principal component…through the kth principal component.

To identify the principal components, we compute the eigenvectors and eigenvalues of the covariance matrix.

Did I lose you there? Okay- let us take a step back…

Computing Eigenvalues and Eigenvectors

An eigenvector of a square, nxn matrix, A, is a non-zero vector whose direction remains unchanged when the linear transformation A is applied to it. When this mapping occurs, the eigenvector is simply scaled by a factor, known as its eigenvalue. Eigenvectors of A that correspond to distinct eigenvalues are linearly independent of one another.

To understand this concept, let’s say we have a non-zero, nxn matrix, A. There exists a non-zero vector, v, such that:

Av = λv     (1)

where A is an nxn transformation matrix, v is the eigenvector, and λ (lambda) is the corresponding eigenvalue.

To solve for the eigenvalues and eigenvectors, we first simplify the above expression:

Av − λv = 0     (2)

(A − λI)v = 0     (3)

Note that I is the identity matrix, which is an nxn matrix of 1’s on the diagonal, and 0’s everywhere else. A property of the identity matrix is that the product of any nxn matrix, A, and the nxn identity matrix, I, is the matrix, A.

Because v is non-zero, the only way for (3) to be true is if the matrix (A − λI) is singular, meaning its determinant, the factor by which the transformation scales area (or volume), equals 0:

det(A − λI) = 0

This is known as the characteristic equation. Once we solve the characteristic equation for the possible eigenvalues, we can solve for their corresponding eigenvectors by plugging each eigenvalue back into (3).

Let’s look at a simple 2x2 matrix to show this in action!

1. Solve for the eigenvalues of matrix A.

2. Use these eigenvalues to solve for their corresponding eigenvectors.

3. We can check to make sure these are correct (see the numpy sketch below).
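
Here is a minimal numpy sketch of the same three steps. The 2x2 matrix A below is just an illustrative choice, not necessarily the one used in the handwritten calculations.

```python
import numpy as np

# An illustrative 2x2 matrix (assumed for demonstration)
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Steps 1 & 2: eigenvalues and their corresponding eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # [5. 2.] for this matrix (ordering is not guaranteed)
print(eigenvectors)   # each column is a (unit-length) eigenvector

# Step 3: check that A @ v equals lambda * v for each eigenpair
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))   # True
```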

And that’s all there is to eigenvalues and eigenvectors! Not too bad, right?

Step-by-Step PCA in Python

Now let’s apply this to PCA using a dataset that most of you are familiar with: the Iris flower dataset from sklearn.

Before we start, we need to import some libraries and put the iris data into a dataframe.
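
A minimal version of that setup might look like the following, assuming pandas and sklearn’s built-in copy of the dataset:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Load the Iris flower dataset and put the features into a dataframe
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print(df.head())
```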

STEP 0: Standardize features.

Because PCA is sensitive to the magnitudes of the features, it is important to standardize your features so everything is on the same scale. We can standardize features by removing the mean and scaling each feature to unit variance.
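
For example, one way to do this, assuming sklearn’s StandardScaler (a plain numpy version works just as well):

```python
from sklearn.preprocessing import StandardScaler

# Remove the mean and scale each feature to unit variance
X_scaled = StandardScaler().fit_transform(df)

# Equivalently, by hand:
# X_scaled = (df.values - df.values.mean(axis=0)) / df.values.std(axis=0)
```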

METHOD 1: numpy

STEP 1: Calculate the covariance matrix.

Recall that the covariance between two features measures how those features vary together. Also, the covariance of a feature with itself is just the variance of that feature.

cov(x, y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1),   and   cov(x, x) = Σ (x_i − x̄)² / (n − 1) = var(x)

To construct the covariance matrix, you can think of the columns as a list of the features and the rows as the same list of features. Each element of the matrix is the covariance between the column feature and row feature. Think of what this would look like for features x, y, and z.

EX: Covariance Matrix for Three Features: x, y, z

          x           y           z
  x   var(x)     cov(x, y)   cov(x, z)
  y   cov(y, x)  var(y)      cov(y, z)
  z   cov(z, x)  cov(z, y)   var(z)

Remember, the covariance between any feature and itself is just the variance of that variable.

While I am sure we could now easily calculate the covariance matrix by hand, fortunately, we do not have to, as we can just use numpy’s covariance function as follows:
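
A sketch of that call; note that np.cov treats rows as variables by default, hence the transpose:

```python
# Covariance matrix of the standardized features (np.cov expects variables as rows)
cov_matrix = np.cov(X_scaled.T)
print(cov_matrix.shape)   # (4, 4): one row and column per feature
```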

STEP 2: Calculate eigenvectors/ eigenvalues from covariance matrix.
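
Something along these lines, using numpy’s linalg module:

```python
# Eigen-decomposition of the 4x4 covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
# eigenvectors[:, i] is the eigenvector paired with eigenvalues[i]
```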

STEP 3: Sort eigenvectors by their corresponding eigenvalues in decreasing order.

The eig() function from numpy’s linalg module does not actually guarantee any particular ordering of the eigenvalues (even if they happen to come out largest-first here), so it is safest to sort the eigenvalue/eigenvector pairs explicitly and print them out to check.
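
A small sketch of that explicit sort:

```python
# Sort the eigenvalue/eigenvector pairs in decreasing order of eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)
print(eigenvectors)
```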

STEP 4: Choose your eigenvectors using the proportion of explained variance.

The explained variance tells us how much of the total variance is captured by each eigenvalue/principal component. You can calculate the explained variance of the ith principal component by dividing its eigenvalue by the sum of all of the eigenvalues:

explained variance of PC i = λ_i / (λ_1 + λ_2 + … + λ_p)

where p is the dimensionality of your original features.
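
In code, this is just each eigenvalue divided by the sum of all of them; for example:

```python
# Proportion of variance explained by each principal component,
# plus the running (cumulative) total
explained_variance = eigenvalues / eigenvalues.sum()
cumulative_variance = np.cumsum(explained_variance)

print(explained_variance)    # roughly [0.73, 0.23, 0.04, 0.01] for the iris data
print(cumulative_variance)
```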

The explained variances and cumulative explained variances for each principal component are summarized in the table below.

  Principal Component   Explained Variance   Cumulative Explained Variance
  PC 1                  73%                  73%
  PC 2                  23%                  96%
  PC 3                  4%                   100%
  PC 4                  1%                   100%

*explained variances sum to 101% in the table due to rounding

From the table, we see that the first principal component (PC 1) explains 73% of the variance in our original data, our second (PC 2) explains 23% of the variance in our data, our third (PC 3) explains only 4% of the variance in our original data, and finally, our fourth (PC 4) explains a mere 1% of the variance in our original data. Given this information, we should feel comfortable setting k=2, and keeping the first two principal components and dropping the last two, as together, the first two principal components explain 96% of the variance in our original data.

STEP 5: Transform original p dimensional data into k dimensions.

Since we are only keeping the first two principal components, we can transform the original 4-dimensional data into 2 dimensions by multiplying the transpose of our feature vector (the matrix whose columns are the two eigenvectors we kept) by the transpose of our scaled, original data.
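
A minimal sketch of that projection (the variable names here are my own; W holds the two kept eigenvectors as its columns):

```python
# Feature vector: the eigenvectors of the top k = 2 principal components
W = eigenvectors[:, :2]          # shape (4, 2)

# Project the scaled data onto the new 2-dimensional subspace
X_pca = (W.T @ X_scaled.T).T     # shape (150, 2)

# Equivalently: X_pca = X_scaled @ W
```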

METHOD 2: sklearn
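
A rough sketch of the sklearn version, assuming the same X_scaled array from Step 0 (the signs of individual components may be flipped relative to the numpy eigenvectors, which does not change the subspace):

```python
from sklearn.decomposition import PCA

# Fit PCA on the standardized data, keeping all 4 components for now
pca = PCA()
pca.fit(X_scaled)

print(pca.explained_variance_)                    # the eigenvalues
print(pca.components_)                            # rows are the eigenvectors
print(pca.explained_variance_ratio_)              # explained variance per component
print(np.cumsum(pca.explained_variance_ratio_))   # cumulative explained variance
```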

Once you have fit PCA on your scaled data, you can investigate the explained variance to determine how many components to keep, using some of the attributes of sklearn’s PCA as in the code above.

Looks familiar, eh? These are the same values we got in steps 3 and 4 above for the eigenvalues, eigenvectors, and explained cumulative variance!

We can now fit and transform PCA on the scaled data, keeping only our first two principal components.
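
For example:

```python
# Refit with only the first two principal components and transform the data
pca_2 = PCA(n_components=2)
X_pca_sklearn = pca_2.fit_transform(X_scaled)

print(X_pca_sklearn.shape)   # (150, 2)
```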

And that’s all there is to it!

Last update: April 14, 2020

Resources

  • Strang, Gilbert. Linear algebra and its applications. Belmont, CA: Thomson, Brooks/Cole, 2006. (This was the textbook I used in Linear Systems Theory (ESE 500), a class I took in 2013 at the University of Pennsylvania while pursuing my MSE.)
  • Gilbert Strang. 18.06 Linear Algebra. Spring 2010. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA. (This is a great resource from the same author as the textbook above, Gilbert Strang. As a visual learner — as you might have gathered from my handwritten formulas and calculations in this article — I found his video lectures to be very thorough.)
  • Sanderson, Grant [3Blue1Brown]. Playlist [Essence of linear algebra]. Retrieved from https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab (These youtube videos are fantastic. Sanderson does an excellent job explaining key linear algebra concepts like 3-D linear transformations and eigenvectors/ eigenvalues, for example.)
