Principal Component Analysis: The Dragon (Data Scientist) Warrior’s Secret Technique

John Newcomb
8 min read · Nov 4, 2021


As a neophyte data scientist, you typically work with datasets that are (perhaps surprisingly) relatively small. For example, a 20-column dataset with 40,000 rows sounds quite large to the novice; however, it is a grain of sand on the beach compared to what is possible. In more senior data positions, it is not uncommon to find yourself dealing with hundreds of columns and hundreds of millions of rows.

On a small scale, it is possible to inspect your data column-by-column, and make intelligent decisions about which data to include in and drop from your project. But at a larger scale, there is no efficient way to inspect each column individually, to prioritize columns by information value, or to understand the relationships amongst the features in your dataset. This would take days at a minimum, if not weeks.

Additionally, the time cost of model training and maintenance climbs steeply with each added column: the 21st column in a 20,000-point dataset might cost another 5–10 minutes of training time, and extrapolated over the aforementioned hundreds of millions of rows, the 101st column could cost an extra twelve hours.

With bigger datasets, it becomes critical that we eliminate as many unnecessary columns as possible, not only for timeliness, but also to minimize backend electricity bills and storage costs. This obviously comes at the cost of some information loss, but we can afford to lose a little information as long as we cut computational costs while maintaining some baseline standard of model performance, right? What if there were a way to make this sacrifice intelligently? Enter PCA: Principal Component Analysis.

Principal Component Analysis is what we call a ‘dimensionality reduction technique’: a strategy for reducing the number of columns in your dataset by compressing many columns down into a few. At a lower level, PCA builds each new column as an optimized linear combination of the original columns; it applies a linear transformation to the dataset, measures how much information each original column carries, and draws on each column in proportion to the information it contains.

Now in English: a dragon is formed of parts from nine animals: the horns of a deer, the head of a camel, the eyes of the devil, the neck of a snake, the abdomen of a large cockle, the scales of a carp, the claws of an eagle, the paws of a tiger, and the ears of an ox. Now imagine we have a nine-column dataset, and each column represents one of these animals. PCA comes in and processes these animal-columns, taking only the most important parts from each one (the paws from the tiger-column, the scales from the carp-column, and so on) and combining them to form a new dragon-column from the original nine. Yes, we did lose some parts of each animal-column in the synthesis of our dragon-column, but we retained and combined the most influential and powerful pieces, and we now casually have a fire-breathing, super-charged dragon-column. (I should also note here that you can choose to return multiple dragon-columns; each additional column is weaker than the one before it, but still composed of pieces from every original column, so each one adds some information.)

On the back of our dragon-column(s), we may now fly through the model training process. Our backend storage and electricity bills shrink significantly, and our modeling timeline accelerates. Remember, though, that this comes at the cost of some predetermined amount of information loss that we have decided we are okay with.
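If you are curious what “taking the most important parts” looks like in actual math, here is a bare-bones NumPy sketch of the classic recipe behind PCA: center the columns, eigendecompose their covariance matrix, and project onto the strongest directions. The toy data and variable names are mine, purely for illustration; scikit-learn will handle all of this for us in a moment.

import numpy as np
# toy data, purely for illustration: 100 rows, 4 columns
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 4))
# 1. center each column about its mean
X_centered = X_toy - X_toy.mean(axis=0)
# 2. covariance matrix of the columns
cov = np.cov(X_centered, rowvar=False)
# 3. eigendecomposition: eigenvectors are the directions PCA combines columns along
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# 4. sort the directions by eigenvalue (information content), largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
# 5. project onto the top k directions to get k new columns
k = 2
X_reduced = X_centered @ eigenvectors[:, :k]
print(X_reduced.shape)  # (100, 2)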

If you want to know more about the mathematics behind PCA, I suggest the Wikipedia page. For the sake of implementation, you need only know a few steps, as scikit-learn makes the process incredibly easy. We will now go through the process of PCA in Python! For this, we will be using the little-known and rarely-used iris dataset.

import pandas as pd
from sklearn import datasets
# load iris dataset into notebook; convert it to DataFrame
iris = datasets.load_iris()
df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
df['target'] = iris['target']
# preview DataFrame
df.head()

First, we’ll split our DataFrame into our target and feature columns, then perform a train-test split, like always in machine learning preprocessing.

# split DataFrame into X (features) and y (target)
X = df.drop('target', axis=1)
y = df['target']
# perform train-test split on dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

After the split, we scale our data by standardizing each column with a StandardScaler object. There are rare occasions where standard scaling your data ahead of time may not be advantageous or necessary, such as with a sparse dataset, but usually we will want to scale first. Note also that scikit-learn’s PCA automatically centers each column about its mean, so you only need a StandardScaler if you also want to bring every column to the same scale (unit variance).

# standard scale our data
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)
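As a quick, optional check on the note above (purely illustrative, not a required step), a PCA fit directly on the unscaled features records the per-column means it subtracts in its fitted mean_ attribute, which you can compare to the DataFrame’s own column means:

# optional check: PCA records and subtracts per-column means on its own
from sklearn.decomposition import PCA
centering_check = PCA(random_state=42).fit(X_train)
print(centering_check.mean_)          # the means PCA subtracts internally
print(X_train.mean(axis=0).values)    # the same column means, straight from the DataFrame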

Next, we instantiate our PCA object. The main parameter that matters on a PCA object is n_components. When set to an integer, it tells the PCA object how many columns (known as ‘principal components’, or ‘PCs’) to return after a transformation. Perhaps more usefully, we can set n_components to a float between 0 and 1, which tells the PCA object to return however many columns are needed to preserve that fraction of the original information. If we want a fine-grained approach to choosing how many PCs to keep, we can look at a graph of the diminishing returns in information gain for each additional PC in order to determine the optimal number of PCs to include. Let’s take a look at that first.

# instantiate a PCA object and fit it to our train data
from sklearn.decomposition import PCA
pca = PCA(random_state=42)
pca.fit(X_train_scaled)
num_features = [1,2,3,4]
explained_variance = pca.explained_variance_ratio_.cumsum()
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(num_features, (100*explained_variance))
plt.xticks(ticks=[1,2,3,4]);
plt.xlabel('Number of PCs')
plt.ylabel('Percent Variance Explained')
plt.title('Percent Explained Variance by Number of PCs', fontsize=14)
plt.show()
PCA Calibration Curve

Explained variance can be thought of as the amount of information we preserve from the original dataset after applying the PCA transformation. We capture 6 percentage points of variance when we go from 1 PC to 2 PCs, but only about 1.5 percentage points from 2 to 3 PCs; and using 4 PCs, i.e. the original dataset, defeats the purpose without offering any massive information gain either. So it looks like 2 PCs is our sweet spot, where we greatly reduce the number of input columns while still maintaining a high degree of explained variance. Let’s go ahead and transform our X_train_scaled and X_test_scaled using n_components=2. Note that as a general starting point, you’ll want to preserve at least 80% of the information from your original dataset as a baseline.
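If you’d rather pick the cutoff programmatically instead of eyeballing the curve, a small sketch like the following (using the pca object we just fit and an assumed 80% threshold) finds the smallest number of PCs that clears the bar:

# pick the smallest number of PCs whose cumulative explained variance clears 80%
import numpy as np
threshold = 0.80
cumulative = pca.explained_variance_ratio_.cumsum()
n_needed = int(np.argmax(cumulative >= threshold)) + 1
print(n_needed)  # index of the first PC clearing the threshold, plus one

This is essentially what passing a float to n_components does for you behind the scenes.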

# transform original columns to PCs
# option 1: choose the number of components to return
pca = PCA(n_components=2, random_state=42)
# option 2: choose the fraction of explained variance to capture
# pca = PCA(n_components=.8, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
# convert PCs to DataFrames to see what a PC looks like
X_train_pca = pd.DataFrame(X_train_pca, columns=['PC1', 'PC2'])
X_test_pca = pd.DataFrame(X_test_pca, columns=['PC1', 'PC2'])
# preview PCA dataset
X_train_pca.head()
PC columns built from linear combinations of original four columns

For the sake of example, today we will be using a kNN classifier (a simple, non-parametric model). Import and instantiate one kNN for regular modeling and one for PCA modeling so that we can compare the two training and scoring times.

# import and instantiate kNNs
from sklearn.neighbors import KNeighborsClassifier
knn_regular = KNeighborsClassifier(n_neighbors=5)
knn_pca = KNeighborsClassifier(n_neighbors=5)

Now, we will fit and score the model on the raw, non-PCA data, and find the time taken as well as the accuracy.

import time
start = time.time()
knn_regular.fit(X_train, y_train)
accuracy_regular = knn_regular.score(X_test, y_test)
end = time.time()
elapsed_regular = end - start
print('Regular Model Train and Test Time: ' + str(elapsed_regular))
print('Regular Model Accuracy: ' + str(accuracy_regular))
Output from above code snippet

Looks good!

Now, we will do the same for the PCA data.

start = time.time()
knn_pca.fit(X_train_pca, y_train)
accuracy_pca = knn_pca.score(X_test_pca, y_test)
end = time.time()
elapsed_pca = end - start
print('PCA Model Train and Test Time: ' + str(elapsed_pca))
print('PCA Model Accuracy: ' + str(accuracy_pca))
Output from above code snippet

Wow — with no loss in accuracy, we were able to reduce our training and validation times significantly.

time_decrease_pct = (elapsed_regular - elapsed_pca) / elapsed_regular
print('We reduced time in training and validation by: '
      + str(round((100*time_decrease_pct), 2)) + '%')
Output from above code snippet

23.43%!! Now imagine if we applied this to a model that required several hours or even days to train… that’s a serious amount of time saved.
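One caveat: on a dataset as tiny as iris, a single wall-clock measurement is noisy, so your exact percentage will bounce around from run to run. If you want a steadier comparison, a sketch using the standard-library timeit module (repeating each fit-and-score several times and keeping the best run) might look like this; the repeat count is an arbitrary choice of mine:

# a steadier timing comparison: best of several repeats with timeit
import timeit
regular_best = min(timeit.repeat(
    lambda: knn_regular.fit(X_train, y_train).score(X_test, y_test),
    number=1, repeat=10))
pca_best = min(timeit.repeat(
    lambda: knn_pca.fit(X_train_pca, y_train).score(X_test_pca, y_test),
    number=1, repeat=10))
print('Best regular time: ' + str(regular_best))
print('Best PCA time: ' + str(pca_best))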

Finally, you might want to know how the PCA transformed your columns. After all, we still have no idea what happened when we applied PCA. We can take a look under the hood with the following code:

# look at the weights of PCA components
pca.components_
Output from above code snippet

We can interpret the numbers in each row of the array as the weights applied to each of the (scaled) original columns to create the corresponding PC. So, the equation used to calculate the first PC is something like:

PC1 = 0.52*col1 - 0.24*col2 + 0.58*col3 + 0.56*col4

We can read each of these numbers as the relative information importance of the corresponding column. In this case, because the coefficient on the third column has the largest absolute value, we may say that the third column contributes the most information.
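To make these weights easier to read, you can label them with the original feature names, and you can also confirm that PCA really is just applying these weights to the centered, scaled data. The loadings DataFrame below is purely for display and is my own construction, not something scikit-learn requires:

# label the PC weights with the original feature names
import numpy as np
loadings = pd.DataFrame(pca.components_, columns=X.columns, index=['PC1', 'PC2'])
print(loadings)
# confirm the weights reproduce the PCs: subtract PCA's stored means, then project
print(np.allclose((X_train_scaled - pca.mean_) @ pca.components_.T, X_train_pca))  # True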

Before we close, it should be noted that PCA is not very important, powerful, or necessary unless you are dealing with larger datasets, because there is no good reason to sacrifice information if you’re not pressed for time. That said, as you move forward in your destiny as a data scientist, use this technique sparingly: as a true dragon warrior, you must rise to the occasion, but never go too far.

Originally published at http://github.com.
