It is very common in datascience tasks involving large number of features , that one is advised to PCA aka Principal component analysis . We is will start with a brief introduction to what and why of PCA . Then we will look into implementing PCA with explanation.
The What , whys of PCA
When there are lot of variables aka features n(> 10) , then we are advised to do PCA. PCA is a statistical technique which reduces the dimensions of the data and help us understand, plot the data with lesser dimension compared to original data. As the name says PCA helps us compute the Principal components in data. Principal components are basically vectors that are linearly uncorrelated and have a variance with in data. From the principal components top p is picked which have the most variance.
In the above plot what PCA does is it draws a line across the data which has maximum variance , 2nd maximum variance , so on … But Wait , why is an axes with maximum variance important ? Lets consider our problem to be classification problem (Similar arguments can also be drawn for other problems too) . Our goal separate data by drawing a line (or a plane) between data. If we find out the dimension which has maximum variance, then it solves part of the problem, now all we have to use suitable algorithm to draw the line or plane which splits the data.
Lets implement PCA
Lets begin by generating random data with 3 dimensions with 40 samples . We will have two class with 20 samples per class.
import numpy as np
vec1 = np.array([0, 0, 0])
mat1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
sample_for_class1 = np.random.multivariate_normal(vec1, mat1, 20).Tassert sample_for_class1.shape == (3, 20), "The dimension of the sample_for_class1 matrix is not 3x20"
vec2 = np.array([1, 1, 1])
mat2 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
sample_for_class2 = np.random.multivariate_normal(vec2, mat2, 20).Tassert sample_for_class2.shape == (3, 20), "The dimension of the sample_for_class2 matrix is not 3x20"
all_data = np.concatenate((sample_for_class1, sample_for_class2), axis=1)assert all_data.shape == (3, 40), "The dimension of the all_data matrix is not 3x20"
Now we will compute the scatter matrix or covariance of this matrix. Why scatter matrix ? scatter matrix records the relationship between variables , which is important to find a dimension of maximum variance . More on scatter martrix , covariance here . Either of scatter matrix or covariance can be used as covariance is just scaled version of scatter matrix.
mean_dim1 = np.mean(all_data[0, :])
mean_dim2 = np.mean(all_data[1, :])
mean_dim3 = np.mean(all_data[2, :])
mean_vector = np.array([[mean_dim1], [mean_dim2], [mean_dim3]])
print('The Mean Vector:\n', mean_vector)
scatter_matrix = np.zeros((3,3))
for i in range(all_data.shape):
scatter_matrix += (all_data[:, i].reshape(3, 1) - mean_vector).dot((all_data[:, i].reshape(3, 1) - mean_vector).T)
print('The Scatter Matrix is :\n', scatter_matrix)output : ('The Mean Vector:\n', array([[0.41667492],
('The Scatter Matrix:\n', array([[38.4878051 , 10.50787213,11.13746016],
[10.50787213, 36.23651274, 11.96598642],
[11.13746016, 11.96598642, 49.73596619]]))
Now we know how each of variables are related to each other. We are just one step away from computing the principal components . Before that we will review a basic concept from Linear algebra : Eigenvalues and Eigenvectors.
Eigenvalues and Eigenvectors
Eigenvectors and Eigenvalues are a property of a matrix which satisfies the following equation .
Where A denotes the matrix , x denotes the eigenvector and lambda denotes the eigenvalues. Now to understand the significance of eigenvector and eigenvalues , lets observe into the following gif .
Here the original matrix is multiplied by a vector ( aka. undergoing a linear transformation), what we can observe is for lines colored blue and purple do not change direction , it only scales up , while the line colored red changes direction in addition it scaling up.
A Eigenvector of a matrix (which can be seen as a linear transformation of matrix) is the condensed vector which summarizes one axes of the matrix. Another property of eigen vectors is that even if I scale the vector by some amount before I multiply it, I still get the same multiple of it as a result.We might also say that eigenvectors are axes along which linear transformation acts, stretching or compressing input vectors.They are the lines of change that represent the action of the larger matrix, the very “line” in linear transformation. Eigenvectors and Eigenvalues are defined for a square matrix. We prefer our eigenvectors to be always unit length one .
If we have a eigenvector
We compute the length of the vector by
Then to make the eigenvector length one we do
Since this is just the scalar operation , it is not going to affect the direction of the vector.A simple reason we do this is that we need all of our eigenvectors to be comparable ,so that we can choose the most valuable among them (the one having maximum variance). A matrix can have more than one eigenvector , and at max d eigenvectors if matrix is of dimention d X d.
Back to PCA ….
Lets continue on where we left of PCA , we have the scatter matrix which has the information about how one variable is related to the other variable.
Now we use the numpy library to compute the Eigenvectors and eigenvalues
eig_val, eig_vec = np.linalg.eig(scatter_matrix)
By default numpy or any stats library gives out eigenvectors of unit length. Lets verify it
for ev in eig_vec:
Okay we could endup having upto 3 eigenvectors , since the size of our scatter matrix is 3 X 3 .Choosing k important eigenvectors from d vectors is same as droping d-k vectors . We now sort the eigenvectors by their eigenvalues and drop the least one.
# We Make a list of tuple containing (eigenvalue, eigenvector)
eig_pairs = [(np.abs(eig_val_sc[i]), eig_vec_sc[:,i]) for i in range(len(eig_val_sc))]
# We then Sort list of tuples by the eigenvalue
eig_pairs.sort(key=lambda x: x, reverse=True)
# verify that the list is correctly sorted by decreasing eigenvalues
for i in eig_pairs:
Now we choose k the largest eigenvectors :
matrix_w = np.hstack((eig_pairs.reshape(3,1), eig_pairs.reshape(3,1)))
print('Matrix W:\n', matrix_w)('Matrix W:\n', array([[-0.49210223, -0.64670286],
Finally we have the new axes across which we can project our samples , we just multiply the original vector with our chosen eigenvectors and plot.
transformed = matrix_w.T.dot(all_samples)
assert transformed.shape == (2,40), "The matrix is not 2x40 dimensional."
Given that we generated raw data , the plot may vary when we attempt to recreate.
Comparing for the same data the PCA from sklearn
Bingo !!! With this understanding by our side , we can define PCA as a process of finding the axes in our features space , viewed from which each samples in our data is separable in maximum way.