Principal Component Analysis: what is it really? (using Python)

Kelvin Li
Lambda School Machine Learning
4 min read · Apr 26, 2018
Wondering what it is

The Millennial Question

Remember back in the day, when you were in college, sitting in a math class learning about all that beautiful knowledge of the universe (or not so beautiful, to some people). Then you asked yourself the most famous question in mathematics: “When will I ever need this in life?”

“Equations written in chalk on a worn-out blackboard” by Roman Mager on Unsplash

Now that question I can’t really answer, but I am about to show you one truly mind-blowing application of Linear Algebra that I came across in machine learning.

Principal Component Analysis (PCA)

In Linear Algebra, there is a topic called Eigenvalues and Eigenvectors.

The basic gist of finding the eigenvalues

The calculations for finding the Eigenvalues and Eigenvectors are relatively simple, but what are they exactly, and what can we use them for? A great example of their usage is Principal Component Analysis (PCA).
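As a quick illustration (a made-up 2×2 matrix, not part of the data we will use later), numpy computes Eigenvalues and Eigenvectors in a single call:

import numpy as np
# a small symmetric matrix, just to show the mechanics
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
# np.linalg.eig returns the eigenvalues and the eigenvectors (as columns)
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # the eigenvalues of this matrix are 3 and 1
print(eigenvectors)  # each column v satisfies A @ v = eigenvalue * v

Each Eigenvector is a direction that the matrix only stretches (by its Eigenvalue) rather than rotates.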

In Machine Learning, when we work with data that has multiple features, we run into the issue of being unable to visualize that many dimensions at once. This is where PCA comes in handy.

This is just a piece of the titanic data set with a total of 9 features!

In this titanic data set, we are dealing with a total of 9 dimensions. To a human with below-average intelligence like mine, it is simply too hard to visualize data in 9 dimensions. Hell, even 3 dimensions were hard enough for me. With PCA, we are able to reduce the number of dimensions needed to help us visualize the data.

On a matrix it would look something like this:

Notice how big this matrix is and how it’s only a piece of our data 😑

Now what PCA tells us to do is the following:

  1. Calculate the Covariance Matrix of this data set.
  2. Calculate the Eigenvalues and Eigenvectors of the resulting Covariance Matrix.
  3. The resulting Eigenvectors that correspond to the largest Eigenvalues can then be used to reconstruct a large fraction of the variance of the original data.
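For the curious, here is a minimal sketch of those three steps done by hand with numpy. The 100×9 random matrix is just a stand-in for a numeric feature matrix like the one we build from the titanic data below:

import numpy as np
# stand-in data: rows are samples (passengers), columns are the 9 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))
# Step 1: the covariance matrix of the mean-centered data
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
# Step 2: the Eigenvalues and Eigenvectors of that covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# Step 3: keep the Eigenvectors with the largest Eigenvalues and project onto them
order = np.argsort(eigenvalues)[::-1]   # largest Eigenvalue first
top3 = eigenvectors[:, order[:3]]       # the 3 principal components
X_reduced = X_centered @ top3           # the data squashed down to 3 dimensions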

Luckily this sort of calculation can be done with the sklearn package, which saves us some tedious work:

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
titanic = sns.load_dataset('titanic')
# drop columns that duplicate other columns or aren't useful as features
titanic = titanic.drop(['alive','adult_male','who','class','embark_town'], axis=1)
titanic['embarked'] = titanic['embarked'].fillna(method='ffill')
titanic = titanic.drop(['deck'], axis=1)
titanic['age'] = titanic['age'].fillna(method='ffill')
# encode the remaining categorical columns as integers
for label in ['embarked','sex']:
    titanic[label] = LabelEncoder().fit_transform(titanic[label])
labels = titanic['survived']
features = titanic.drop(['survived'], axis=1)
# PCA finds the directions (principal components) along which the data varies the most
model = PCA(n_components=3)
model.fit(features)
X_3D = model.transform(features)

After doing PCA on the dataset, getting the Covariance matrix becomes child’s play:

print('Covariance Matrix')
print(np.round(model.get_covariance(),2).astype(int))

And we get:

I rounded the entire matrix so that it’s more pleasant to the eye

We can also obtain the Eigenvalues and Eigenvectors:

print('Eigenvalues')
print(np.round(model.explained_variance_,decimals=1))
print('Eigenvectors')
print(np.round(model.components_,decimals=2))

And we get:

I also rounded these so that it is more pleasant to the eye

The Eigenvectors point in the directions of greatest variance in the data, while the Eigenvalues are the variances along those directions. So the combination of the two describes the overall shape of the data.
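If you are curious how much of the total variance our 3 Eigenvectors actually capture, sklearn exposes that directly (continuing with the model fitted above; explained_variance_ratio_ reports each Eigenvalue as a fraction of the total variance):

# fraction of the total variance captured by each of the 3 components
print(np.round(model.explained_variance_ratio_, decimals=2))
# summing them tells us how much of the 9-dimensional spread
# survives in our 3-dimensional picture
print(np.round(model.explained_variance_ratio_.sum(), decimals=2))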

Now the final step is to plot the graph, which I did using Plotly:

import IPython
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from IPython.display import display
def configure_plotly_browser_state():
    # tell the Colab notebook frontend where to load plotly.js from
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-1.5.1.min.js?noext',
            },
          });
        </script>
        '''))
configure_plotly_browser_state()
init_notebook_mode(connected=False)
# split the 3D coordinates into survivors and casualties
live_x, live_y, live_z = X_3D[labels == 1].T
dead_x, dead_y, dead_z = X_3D[labels == 0].T
alive = go.Scatter3d(
    x=live_x,
    y=live_y,
    z=live_z,
    mode='markers',
    name='Survivors',
    marker=dict(
        color='blue',
        size=3,
        opacity=0.9
    )
)
dead = go.Scatter3d(
    x=dead_x,
    y=dead_y,
    z=dead_z,
    mode='markers',
    name='Casualties',
    marker=dict(
        color='red',
        size=3,
        opacity=0.9
    )
)
layout = go.Layout(title='Titanic 3D')
fig = go.Figure(data=[alive, dead], layout=layout)
iplot(fig)

Lo and behold:

The colors show who survived and who did not

Voilà! We were able to plot a 3D graph from the 9-dimensional titanic dataset that we had. Impressive, isn’t it?

Most of the time, the new graph may tell you something, or it may tell you absolutely nothing. It’s up to you to decide which technique you want to use to analyze the data.

Layman Summary

Now to someone who has never taken a Linear Algebra or Statistics course, this may seem like mumbo jumbo. In short, PCA was able to distill the huge dataset into a 3-dimensional graph based on a certain measure that we call the Covariance matrix. This allows us to visualize the data and see whether any patterns emerge.

Reference and Links

  1. Special thanks to Jasper Jenkins for providing the code for the graph’s legend.
