Introduction to PCA (Principal Component Analysis)

James Im
9 min read · Dec 6, 2018


PCA is a commonly used dimensionality-reduction technique: it reduces the dimensionality of an n-dimensional data-set while retaining as much information as possible.

Imagine you are working on a data science project. You’ve downloaded your data-set, visualized it, cleaned it up, engineered new features, and are now going to fit your ML algorithm on the data-set.

Days pass. The algorithm is still not done. You’re getting frustrated and want a solution to this issue. One technique utilized for situations like these is PCA.

Why use it?

PCA is known as a Dimensionality-Reduction technique. If you haven’t heard of dimensionality with regard to data, here is the low-down:

Dimensionality just refers to how many features (a.k.a. columns or attributes) are in the data. We think of data in terms of dimensional space, like the 3rd dimension in which we reside or the 2nd dimension, which is any flat shape.

The more features you have → The more dimensions your data has

Now what happens when you have a large number of features, or if you keep increasing the number of features in your data? A phenomenon known as the Curse of Dimensionality begins to occur.

The gist is that the more features that you have, the more difficult your numerical computations become.

This means that the performance of machine learning algorithms, which are numerical computations on the data, will be affected. Loosely speaking, each added feature makes the space your data lives in exponentially larger, so the same amount of data covers it more and more sparsely and predictive power suffers.


Analogy Example

One Dimension: To give you a better idea of the Curse of Dimensionality, imagine you are on a grass field. Let’s say there is a single white line in the middle going from the top to the bottom of the field. Someone drops a penny on that white line. Finding the penny is a pretty trivial task: all you need to do is walk along the white line, and it will probably take under 5 minutes.

Two Dimensions: But now, let’s say that there is a football field to the right and another to the left of that white line. Someone drops the penny somewhere on those 2 fields. Now finding the penny has become a lot harder. It might take hours, or even half the day.

Three Dimensions: Now, let’s build a 100-story building using those 2 fields as the foundation. Someone drops a penny somewhere in that building. You have to search each floor, which inconveniently spans the combined area of those 2 football fields. It will probably take you more than a month to search all the floors.

You can see that the more dimensions we added to our problem, the harder it became to solve. This is essentially the concept of the Curse of Dimensionality.

So what does PCA have to do with any of this? Well, PCA is a Dimensionality-Reduction technique that aims to make your data simpler to work with, whether numerically or visually.

What is it?

Before we start talking more about PCA, let’s talk about Dimensionality-Reduction techniques.

In general, these techniques aim to reduce the number of feature columns in your data-set. They do this by combining and reorganizing your feature columns, or by outright removing them.

If you have worked on data with many feature columns, you know that it raises many questions. Do you understand the relationship between each pair of features (very difficult if the number is high)? Does the number of features cause over-fitting in your models? Will it violate your model assumptions? When will your model finish training?

Due to this slew of issues, you begin to ask yourself: “How can I reduce the number of features so that I won’t have these problems?”

There are 2 ways to do this:

  • Feature Elimination
  • Feature Extraction (PCA, t-SNE, LLE, etc.)

Feature Elimination is just that: you eliminate/drop the features you don’t want. However, whatever information those features carried is lost to your model.

Feature Extraction doesn’t have that problem. Let’s say you have 100 features. In feature extraction, we drop the old features and create 100 “new” independent features, where each “new” feature is a combination of the old ones. The “new” features are determined by the algorithm you use, such as PCA, and are ordered by how much of the variance in the original data they capture. Finally, you drop the least important “new” columns based on how much variance you want to maintain: e.g. let’s say the first 80 “new” features from the example above explain 95% of the variance, and we want to keep that percentage. This means we would drop the last 20 “new” features.
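
If that feels abstract, here is a minimal sketch (mine, not part of the original walkthrough) of what “keep enough new features to retain 95% of the variance” looks like with scikit-learn. The synthetic data-set and the variable names (latent, X) are purely illustrative; the point is that PCA accepts a fraction for n_components and picks the number of components for you.

# Hypothetical example: let PCA keep just enough "new" features to retain
# ~95% of the variance in a 100-feature data-set with low-dimensional structure.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 10))              # 10 underlying signals
X = latent @ rng.normal(size=(10, 100))          # expanded into 100 correlated features
X += 0.1 * rng.normal(size=X.shape)              # plus a little noise

pca = PCA(n_components=0.95, svd_solver="full")  # keep components until 95% variance is explained
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1])                        # number of "new" features actually kept
print(pca.explained_variance_ratio_.sum())       # >= 0.95 by construction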

In the end, your data-set should look something like this


Now, let’s finally talk a bit more about PCA.

PCA is a feature extraction technique, where the principal components are the “new” independent features we discussed above. The goal is to keep as much of the information as possible while dropping the least important “new” features. This is easy to determine, since the new features are ordered by how much of the variance in the original data they explain. You might now be asking: “Well, we reduced the number of features, but didn’t we lose data in the process, just like in feature elimination?” It is true that we dropped a portion, but every “new” feature is a combination of our old features. This means we are still keeping the most valuable parts of the old features, even if we drop some of the “new” ones.
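
One way to convince yourself that the kept components still carry most of the old features’ information is to project the data down and then back up. Below is a small sketch of mine using scikit-learn’s inverse_transform; I load the Iris features through sklearn here purely to keep the snippet self-contained.

# Compress the 4 Iris features down to 2 principal components, reconstruct
# them, and measure how much information was lost along the way.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)             # 4 old features -> 2 "new" features
X_back = pca.inverse_transform(X_2d)    # 2 "new" features -> 4 reconstructed features

# Mean squared reconstruction error per value; it is small because the 2 kept
# components already explain roughly 96% of the variance.
print(np.mean((X - X_back) ** 2))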

Another interesting tidbit to note: all these “new” features are uncorrelated with one another, because the principal components are mutually orthogonal axes. That property is convenient for linear models such as Linear Regression, Logistic Regression, or a linear SVM, which can behave badly when their inputs are strongly correlated. Standardizing your data-set before applying PCA is still important, but for a different reason: it keeps features measured on large scales from dominating the components.
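
A quick sketch (again my own, and again using the Iris features only for self-containment) to back that claim up: the correlation matrix of the PCA-transformed data is essentially the identity matrix.

# Verify that the "new" features are uncorrelated: off-diagonal correlations ~ 0.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)
components = PCA().fit_transform(X)      # keep all 4 components

corr = np.corrcoef(components, rowvar=False)
print(np.round(corr, 6))                 # ~ identity matrix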

How do we get the principal components?

PCA finds the first principal component by identifying the axis that accounts for the largest amount of variance in the training set.

Next, it finds the axis that accounts for the next largest amount of variance, which must be orthogonal to the previous axis. And so on and so forth, until you have as many principal components as there are dimensions in the space you are working in.
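
Under the hood, those axes are the eigenvectors of the data’s covariance matrix, sorted by eigenvalue (the variance along each axis). Here is a bare-bones NumPy sketch of that idea; it is not how scikit-learn implements PCA internally (which uses an SVD), but it arrives at the same components. The 2-D synthetic data is just for illustration.

# Compute principal axes by hand: eigenvectors of the covariance matrix,
# ordered from largest to smallest variance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=300)

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)    # 2x2 covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]         # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvecs[:, 0])                      # PC1: direction of largest variance
print(eigvecs[:, 1])                      # PC2: orthogonal to PC1
print(eigvals / eigvals.sum())            # explained variance ratio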

An example in a 2-D space can be seen below.

We choose the first principal component, PC1, as the axis along which the projected data has the largest variance (and, equivalently, the smallest projection error). We can see this visually on the plots to the right: the solid line, PC1, captures the largest variance with the smallest projection error, while the dotted line, PC2, is orthogonal to PC1 and captures the largest amount of the remaining variance. The dashed line with no label is not a principal component; it simply represents an axis that mixes some of the variance captured by PC1 and PC2.

In the current scenario, where we are working in a 2-D space, two principal components explain ~100% of the variance. Remember, earlier I explained that the principal components are combinations of the old features. Therefore, if I have 2 principal components in a 2-D space, they essentially contain the same information as the original 2 features, just reorganized. If we were working in a 3-D space and only used 2 principal components, then those components would explain < 100% of the variance.

Below is an animation showing what different projections onto a lower-dimensional hyperplane look like. The purple tick represents where PC1 is, and the dots on the line represent the spread (variance) of the data projected onto that line.

For machine learning, you will usually want to keep the first n components whose combined explained variance is around 85% or higher. For visualizing high-dimensional data, this matters less; you can just keep 2 principal components and look at your data in a 2-dimensional space. You will see what I mean in the code example below.
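
Picking that n usually comes down to looking at the cumulative explained variance. Here is one small sketch of a common way to do it; the 0.85 threshold is just the rule of thumb mentioned above, and the Iris features are used only to make the snippet runnable on its own.

# Choose the smallest number of components whose cumulative explained
# variance crosses the threshold (85% here).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)

pca = PCA().fit(X)                                    # fit with all components first
cumulative = np.cumsum(pca.explained_variance_ratio_)

n_components = int(np.argmax(cumulative >= 0.85)) + 1
print(np.round(cumulative, 2))                        # e.g. [0.73 0.96 0.99 1.  ]
print(n_components)                                   # 2 components already clear 85% here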

Practical Coding Example

PCA Visualization

I will be using the famous Iris data-set to visualize the different types of irises.

First, I import the packages I am going to use.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
%matplotlib inline

Next, I will read in the data from a URL and store it as a pandas DataFrame.

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
column = ['sepal length', 'sepal width', 'petal length', 'petal width', 'target']
iris = pd.read_csv(url, names=column)
iris.head()

Now, I standardize the values (zero mean, unit variance) and implement PCA. The reason we use the standard scaler is that PCA is driven by variance: without scaling, features measured on larger scales would dominate the principal components. Many other ML algorithms likewise benefit from features being on comparable scales.

# the libraries required to implement PCA
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# We put our feature data in X_feat and the target feature in y_feat
X_feat = iris.drop('target', axis=1)
y_feat = iris['target']

# Create a scaler object, then fit and transform the feature data
scaler = StandardScaler()
X_feat = scaler.fit_transform(X_feat)

# n_components=2 because we want to view the data in 2 dimensions
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_feat)

Let’s put the principal component data in a pandas DataFrame.

principal_Df = pd.DataFrame(data=principal_components,
                            columns=['principal component 1', 'principal component 2'])
principal_Df.head()

It’s time to visualize the data in 2 Dimensions.

final_df = pd.concat([principal_Df, iris['target']], axis=1)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('2 Component PCA', fontsize=20)

targets = iris['target'].unique()
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    ax.scatter(x=final_df.loc[final_df.target == target, 'principal component 1'],
               y=final_df.loc[final_df.target == target, 'principal component 2'],
               c=color)
ax.legend(targets, loc=4)
ax.grid()

Finally, let’s see how much variance was preserved, since there will be some loss of data when we move from 4 to 2 dimensions.

# This tells us how much of the variance can be attributed to
# each of the principal components
pca.explained_variance_ratio_  # array([0.72770452, 0.23030523])

It seems the combined explained variance of PC1 and PC2 is about 96%, so very little information was lost by going from 4 dimensions down to 2.

Final Thoughts

When to use PCA

  1. Data Compression: This is probably the most common use for PCA. When you are working with large data, you can run into Out-Of-Memory issues or astronomically long computation times with your machine learning algorithms. A common solution is to decompose your feature data into a lower dimension while preserving most of the variance (see the small pipeline sketch after this list).
  2. Visualization: It is a good tool if you want to visualize your data-set in 2 or 3 dimensions. I personally find it useful for unsupervised tasks: PCA is a pretty simple way of spotting different clusters/classes in a high-dimensional data-set.
  3. If feature interpretation doesn’t matter: Principal components usually complicate understanding how each original feature affects the Target Class. Let’s say we have 3 features: age, income, and height. We use PCA and get 2 principal components, PC1 and PC2. Both principal components are now combinations of the 3 original features. This means that if we fit a linear regression model by regressing Y on PC1 and PC2, we can interpret what a one-unit increase in PC1 or PC2 does. However, that does not translate into what a one-unit increase in age, income, or height will do.
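
To make point 1 concrete, here is a minimal pipeline sketch of the data-compression use case. It reuses the Iris features so the snippet stays self-contained; the 0.95 variance target and the logistic-regression model are illustrative choices, not recommendations from the article.

# Hypothetical compression pipeline: standardize -> PCA -> classifier.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),              # keep enough components for ~95% of the variance
    LogisticRegression(max_iter=1000),
)

print(cross_val_score(model, X, y, cv=5).mean())      # accuracy with the compressed features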

When not to use PCA

  1. Just to fix over-fitting: Over-fitting is often associated with having too many features. A large number of features usually means your cross-validation error will be high, because your model has high variance. PCA reduces the number of dimensions of your data while trying to preserve most of the information, which means you can still over-fit. Better first resorts are feature elimination and regularization.

One extra side-note: make sure the data your ML algorithm ingests is good in the first place. Trash Data → Trash Results.
