Unraveling PCA (Principal Component Analysis) in Python

Sambit Mahapatra
Journey 2 Artificial Intelligence


Principal Component Analysis (PCA) is a simple yet powerful linear transformation and dimensionality reduction technique used in many applications, ranging from image processing to stock market prediction. Here, we are going to unravel the black box hidden behind the name PCA.

PCA is an unsupervised technique. The main goal of a PCA analysis is to identify patterns in data; PCA aims to detect the correlation between variables, and reducing the dimensionality only makes sense if strong correlations between variables exist. In a nutshell, this is what PCA is all about: finding the directions of maximum variance in high-dimensional data and projecting it onto a smaller-dimensional subspace while retaining most of the information. So the main advantages of PCA are data compression (reduced memory, faster learning) and visualization.

Steps for PCA:

1. Standardize the n-dimensional data.

2. Obtain the eigenvectors and eigenvalues (from the covariance matrix or correlation matrix, or by performing Singular Value Decomposition).

3. Choose the k eigenvectors that correspond to the top k eigenvalues, where k is the number of dimensions of the new feature subspace (k <= n).

4. Construct the projection matrix from the selected k eigenvectors.

5. Transform the original dataset X via the projection matrix to obtain a k-dimensional feature subspace X_new (see the compact sketch after this list).
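
Before moving to the dataset, here is a compact NumPy sketch of the five steps above (the function name and arguments are placeholders of my own, not from the article's code):

import numpy as np

def pca(X, k):
    # 1. standardize the data (zero mean, unit variance per feature)
    X_sd = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. eigenvalues and eigenvectors of the covariance matrix
    e_vals, e_vecs = np.linalg.eig(np.cov(X_sd.T))
    # 3. indices of the k largest eigenvalues
    top_k = np.argsort(e_vals)[::-1][:k]
    # 4. projection matrix: the corresponding eigenvectors as columns (n x k)
    W = e_vecs[:, top_k]
    # 5. project the standardized data onto the k-dimensional subspace
    return X_sd.dot(W)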

Now, let’s build the PCA model from scratch. The source code is available in the GitHub link —

The dataset used here is the banknote authentication dataset, publicly available in the UCI Machine Learning Repository.

https://archive.ics.uci.edu/ml/datasets/banknote+authentication#

The attributes present in the dataset are: variance of the Wavelet Transformed image (continuous), skewness of the Wavelet Transformed image (continuous), curtosis of the Wavelet Transformed image (continuous), entropy of the image (continuous), and class (integer; 0 = not authentic, 1 = authentic). Before starting the Principal Component Analysis, first import all the required dependencies.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Now load the dataset into a DataFrame using the read_csv function from pandas.

columns = ["var","skewness","curtosis","entropy","class"]
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/00267/\
data_banknote_authentication.txt",index_col=False, names = columns)
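
A quick sanity check on the loaded frame (a small snippet of my own, not in the original post) confirms the shape and the class counts discussed below:

print(df.shape)                       # (1372, 5)
print(df["class"].value_counts())     # 762 rows of class 0, 610 rows of class 1
df.head()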

The dataset contains a total of 1372 instances, out of which 762 are non-authentic notes and 610 are authentic notes. The distribution of the attributes can be inspected using both univariate and multivariate plots:

# univariate distribution of each attribute
# (distplot is deprecated in recent seaborn versions; histplot/displot is the replacement)
f, ax = plt.subplots(1, 4, figsize=(10,3))
vis1 = sns.distplot(df["var"], bins=10, ax=ax[0])
vis2 = sns.distplot(df["skewness"], bins=10, ax=ax[1])
vis3 = sns.distplot(df["curtosis"], bins=10, ax=ax[2])
vis4 = sns.distplot(df["entropy"], bins=10, ax=ax[3])
f.savefig('subplot.png')
# pairwise (multivariate) relationships, colored by class
sns.pairplot(df, hue="class")

Now the very first step of PCA is standardizing the data. Whether to standardize the data prior to a PCA on the covariance matrix depends on the measurement scales of the original features. Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if it was measured on different scales. Here we transform the data to unit scale (mean = 0 and variance = 1), which is also a requirement for the optimal performance of many machine learning algorithms. Before that, we first need to split the data frame into the attribute set X and the class labels y.

# split data table into data X and class labels y
# (.ix has been removed from recent pandas, so .iloc is used instead)
X = df.iloc[:, 0:4].values
y = df.iloc[:, 4].values
from sklearn.preprocessing import StandardScaler
X_sd = StandardScaler().fit_transform(X)

The next step is to calculate the eigenvalues and eigenvectors of the input data. They can be obtained in three ways: from the covariance matrix, from the correlation matrix, or from singular value decomposition. The covariance matrix is a more general form of the correlation matrix, or you can say correlation is a scaled version of covariance. Basically, you tend to use the covariance matrix when the variable scales are similar and the correlation matrix when the variables are on different scales. Singular value decomposition is another, computationally more efficient way of getting the eigenvalues and eigenvectors.
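
To see the 'scaled version' relationship concretely, the correlation matrix of the raw attributes equals the covariance matrix of the standardized attributes (a small check of my own, reusing the X and X_sd defined above):

# population-normalized covariance of the standardized data is the correlation matrix
print(np.allclose(np.corrcoef(X.T), np.cov(X_sd.T, bias=True)))   # True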

Here, I am showing the eigenvalue and eigenvector calculation from the covariance matrix. The other methods are covered in the GitHub link provided.

# covariance matrix of the standardized data and its eigen-decomposition
cov_mat = np.cov(X_sd.T)
print('NumPy covariance matrix: \n%s' % cov_mat)
e_vals, e_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' % e_vecs)
print('\nEigenvalues \n%s' % e_vals)
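
For completeness, here is a sketch of the SVD route mentioned above, reusing X_sd (the rows of vt are the principal directions, possibly with flipped signs compared to np.linalg.eig, and the squared singular values divided by n-1 are the eigenvalues of the covariance matrix):

# singular value decomposition of the standardized data matrix
u, s, vt = np.linalg.svd(X_sd, full_matrices=False)
print('Eigenvectors (as rows) \n%s' % vt)
print('\nEigenvalues \n%s' % (s**2 / (X_sd.shape[0] - 1)))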

In order to decide which eigenvector(s) can be dropped without losing too much information when constructing the lower-dimensional subspace, we need to inspect the corresponding eigenvalues: the eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data, and those are the ones that can be dropped. The common approach is to rank the eigenvalues from highest to lowest and choose the top k eigenvectors.

# percentage of variance explained by each principal component (in decreasing order)
var_exp = [(i / sum(e_vals)) * 100 for i in sorted(e_vals, reverse=True)]
np.cumsum(var_exp)
output -
[ 54.49760184, 86.82647434, 95.61103519, 100.]

That means that by considering the eigenvector corresponding to the largest eigenvalue we retain 54.5% of the variance, and by considering the eigenvectors corresponding to the top 2 eigenvalues we retain 86.8% of the variance. For data compression purposes, we generally go for 99% variance retention, while for visualization we reduce the dimension to 2 or 3.
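
As a small illustration of the 99% rule (a sketch of my own; the threshold variable is not part of the original code), the smallest k reaching a given retention level can be read off the cumulative sum:

threshold = 99.0                      # desired % of variance to retain
cum_var = np.cumsum(var_exp)
k = int(np.searchsorted(cum_var, threshold) + 1)
print(k)                              # 4 here, since 3 components only reach ~95.6%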

Now a projection matrix of the k eigenvectors corresponding to the top k eigenvalues is formed. Here, k is set to 2 for better visualization. Then the n-dimensional feature space is transformed into a k-dimensional feature subspace via this projection matrix.

# pair each eigenvalue with its eigenvector and sort by eigenvalue, descending
e_pairs = [(np.abs(e_vals[i]), e_vecs[:, i]) for i in range(len(e_vals))]
e_pairs.sort(key=lambda x: x[0], reverse=True)
# projection matrix: the top-2 eigenvectors as columns
matrix_w = np.hstack((e_pairs[0][1].reshape(4,1),
                      e_pairs[1][1].reshape(4,1)))
print('Matrix W:\n', matrix_w)
X_new = X_sd.dot(matrix_w)
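
As a sanity check (not part of the original walkthrough), scikit-learn's PCA should reproduce the same projection; each principal direction is only defined up to sign, so absolute values are compared:

from sklearn.decomposition import PCA

sk_pca = PCA(n_components=2)
X_sk = sk_pca.fit_transform(X_sd)
print(np.allclose(np.abs(X_sk), np.abs(X_new)))   # True: same projection up to sign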

Now the data distribution in the two-dimensional feature subspace can be visualized by plotting the two principal components against each other, colored by class.
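
A minimal plotting sketch for that visualization (the marker and color choices are my own):

plt.figure(figsize=(6, 4))
for label, marker, color in zip((0, 1), ('o', '^'), ('tab:blue', 'tab:orange')):
    plt.scatter(X_new[y == label, 0], X_new[y == label, 1],
                marker=marker, color=color, alpha=0.5, label='class %d' % label)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()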

Using PCA to avoid over-fitting is a very bad practice: PCA throws away information without ever looking at the class labels, so regularization is the right tool for addressing over-fitting.

Another dimensionality reduction method is LDA (Linear Discriminant Analysis), which is also linear but supervised, and a very effective transformation technique.
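
For comparison, a minimal scikit-learn LDA sketch on the same data (my own illustration; with only two classes, LDA can produce at most one discriminant component):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=1)   # at most n_classes - 1 components
X_lda = lda.fit_transform(X_sd, y)                 # supervised: uses the class labels
print(X_lda.shape)                                 # (1372, 1)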

For further study:

http://sebastianraschka.com/Articles/2014_pca_step_by_step.html
