Dimension reduction with PCA for everyone
Brief Introduction
Objective: The objective of this article is to explain dimension reduction as a useful preprocessing technique before fitting a model, and to show the workflow in Python.
Use case: While building a predictive model you can come across datasets with a large number of columns/features, also known as dimensions. Finding out which columns have the most predictive power can be difficult in these circumstances. This is where dimension reduction is useful.
Intuition behind dimension reduction: The best way to explain the concept is via an analogy. When we build a house we use blueprints on paper. The blueprints are a 2-dimensional representation of a 3-dimensional house, yet they convey all the important information about the house. Mathematically speaking, we have projected 3 dimensions onto 2 dimensions while preserving the most important information. We can extend this operation to n dimensions. PCA (Principal Component Analysis) can take a dataset with n dimensions (e.g. 50 columns) and reduce it to 2-3 new dimensions known as principal components. Each principal component is a combination of all n columns represented as a single column. Using PCA we essentially create 2, 3 or 4 new features that capture most of the information contained in the high-dimensional dataset; typically the first 2 or 3 components capture most of the variance. This is a simple explanation; more mathematical explanations using linear algebra are available on Youtube.
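To give the idea a quick numerical taste before the main example, here is a minimal sketch (jumping ahead slightly to scikit-learn, with made-up numbers purely for illustration) that projects a tiny 3-column dataset down to 2 principal components:
# Toy example: 5 observations with 3 correlated columns, reduced to 2 components
# (the values below are made up for illustration only)
import numpy as np
from sklearn.decomposition import PCA
toy=np.array([[2.5,2.4,1.2],[0.5,0.7,0.3],[2.2,2.9,1.4],[1.9,2.2,1.1],[3.1,3.0,1.5]])
toy_pca=PCA(n_components=2)
reduced=toy_pca.fit_transform(toy)
print(reduced.shape)                      # (5, 2): each row is now described by 2 columns
print(toy_pca.explained_variance_ratio_)  # share of the variance each component keeps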
Implementation in Python
Now that we know the intuition behind dimension reduction, let's see how we would apply it in a practical setting. We will use the breast cancer dataset. The dataset has 30 features/measurements like radius, concavity, compactness etc., and the variable we want to predict is whether the tumor is cancerous (malignant) or benign.
We start with a few pre-processing steps.
- First we encode the response variable using a LabelEncoder
- We then create a features array that includes only the features, not the response variable or the id/unnamed columns
- We also do some exploratory data analysis on the response variable
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
cancer=pd.read_csv('https://assets.datacamp.com/production/repositories/1796/datasets/0eb6987cb9633e4d6aa6cfd11e00993d2387caa4/wbc.csv')
# Encoding the response variable
le=LabelEncoder()
cancer['diagnosis']=le.fit_transform(cancer.diagnosis)
y=cancer['diagnosis']
# Dropping response variable and the unnamed column
features=cancer.drop(columns=['Unnamed: 32','diagnosis','id'])
# Checking the distribution of the response variable
cancer.diagnosis.value_counts(normalize=True)
As we can see above, roughly 62% of the cases in our dataset are benign and 38% are cancerous. This will be useful when we build a model.
Now our first step is to find the number of Principal Components that explain most of the variance in the dataset. The dataset currently has 30 features/dimensions. The PCA algorithm will create new features which are combinations of these 30 features. As a first step we create 30 Principal Components. Generally the first 2 to 5 Principal Components explain most of the variance in the data. Python makes the process simple because a fitted PCA object exposes an explained_variance_ attribute. This process is also known as finding the intrinsic dimension of the data.
# Building a pipeline with the pca and scaling steps
scaler=StandardScaler()
pca=PCA()
pipeline=make_pipeline(scaler,pca)
# Fitting to the features array
pipeline.fit(features)
# Now let's look at the Principal Components
pd.DataFrame(pca.components_,columns=features.columns).head()
The output above shows the first 5 Principal Components (rows) and how they are constructed from the existing features. You can think of the numbers as roughly analogous to the weights that each component puts on a feature.
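If you want to see which original measurements a given component leans on most heavily, you can sort its weights. This is a small optional sketch that reuses the pca and features objects defined above:
# Features with the largest absolute weight in the first Principal Component
pc1=pd.Series(pca.components_[0],index=features.columns)
print(pc1.abs().sort_values(ascending=False).head())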
Now let's plot all the Principal Components against their explained variance to find the intrinsic dimension of the dataset.
# Using the explained variance method and plotting the Principal Components vs
# explained variance
feat=range(pca.n_components_)
explained_variance=pca.explained_variance_
fig,ax=plt.subplots()
ax.bar(x=feat,height=explained_variance)
ax.set_xlabel('Number of components')
ax.set_ylabel('Explained Variance')
plt.show()
It is clear from the figure above that the first 5 components are responsible for most of the variance in the data, so in this case the intrinsic dimension of the data is 5. We can therefore reduce the number of dimensions of this data from 30 down to 5.
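As a quick numeric check on the plot, the fitted PCA object also exposes explained_variance_ratio_; its cumulative sum shows what fraction of the total variance the first few components retain (a short sketch reusing the pca object from above):
# Cumulative fraction of the total variance captured by the leading components
cumulative=np.cumsum(pca.explained_variance_ratio_)
print(cumulative[:5])   # variance retained by the first 1 to 5 components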
Now that we know the intrinsic dimension of the dataset is 5, we will transform the features into these 5 Principal Components. Every observation will now have 5 features rather than 30, each of the 5 components being a combination of the original 30 features.
# Fitting pca with 5 components (intrinsic dimension)
pca=PCA(n_components=5)
scaler=StandardScaler()
pipeline=make_pipeline(scaler,pca)
pipeline.fit(features)
# Now transforming to have the PCA columns for each observation
transformed=pipeline.transform(features)
# Looking at the PCA dataframe
X=pd.DataFrame(transformed,columns=['PCA1','PCA2','PCA3','PCA4','PCA5'])
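As a quick sanity check, the transformed data should now have only 5 columns, and plotting the first two components gives a visual sense of how well the two diagnosis classes separate (an optional sketch reusing X and y from above):
# Each observation is now described by 5 Principal Components
print(X.shape)
# Optional: scatter the first two components, coloured by diagnosis
fig,ax=plt.subplots()
ax.scatter(X['PCA1'],X['PCA2'],c=y,alpha=0.5)
ax.set_xlabel('PCA1')
ax.set_ylabel('PCA2')
plt.show()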
Now we will fit a Decision Tree model to the transformed features and the diagnosis variable, and then evaluate the fit.
# Doing a train test split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)
# Importing modules and fitting the model to the training set
from sklearn.tree import DecisionTreeClassifier
dt=DecisionTreeClassifier(max_depth=5)
dt.fit(X_train,y_train)
# Generating Model predictions for the diagnosis variable
y_pred=dt.predict(X_test)
# Generating the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
A few metrics in the classification report on the test data are important:
- Accuracy: The accuracy of this model is 92%, meaning that 92% of the time the model predicts the correct diagnosis. When we looked at the distribution of the diagnosis variable, benign tumors made up 62% of the cases and cancerous ones 38%, so a naive model that always predicts benign would be right about 62% of the time. This model performs much better than that baseline.
- Precision: The precision with which this model identifies cancer is 91%. In other words, about 9% of the cases the model flags as cancerous are actually benign (false positives). This is good for a starting model, but we would want to push precision towards 99%, because a false positive on a cancer test could mean that a patient endures needless, expensive and painful cancer treatment. The confusion matrix sketch below makes these counts explicit.
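To see where these numbers come from, a confusion matrix lays out the raw counts behind the report (a short sketch reusing y_test and y_pred from above; your exact counts will vary with the random train/test split):
# Raw counts behind the accuracy and precision figures
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,y_pred))
# Rows are true classes, columns are predicted classes:
# [[benign predicted benign,    benign flagged as cancerous],
#  [cancerous missed as benign, cancerous predicted cancerous]]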
We could also try ensemble learning methods like Random Forests to further improve the accuracy of this model.
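As a sketch of that idea, a RandomForestClassifier can be swapped in for the decision tree with only a couple of changed lines (the hyperparameters below are illustrative, not tuned):
# A random forest trained on the same 5 Principal Components (illustrative settings)
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=100,max_depth=5,random_state=42)
rf.fit(X_train,y_train)
print(classification_report(y_test,rf.predict(X_test)))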