The Mathematics Behind Principal Component Analysis (PCA)

Madhav Samariya · Published in Analytics Vidhya · Jul 11, 2020

In this article, I'll mathematically derive how PCA works and implement it in Python, both from scratch and using scikit-learn.


Dimensionality reduction refers to techniques that reduce the number of features/variables in our data. We do dimensionality reduction to:

  • Visualise higher dimensional data on 2D plots.
  • Get rid of correlated and redundant features.
  • Prevent overfitting the data.
  • Tackle the CURSE OF DIMENSIONALITY.

Principal component analysis (PCA) is one of the most popular techniques for dimensionality reduction. The key idea behind PCA is that it creates new features that are linear combinations of the original features and projects the original data onto them, with the objective of retaining as much of the total variation in the data as possible, i.e. preserving as much information as possible.

Geometric Interpretation of PCA

Let’s try to understand PCA in 2 dimensions. We will try to reduce 2 dimensional data to 1 dimensional data. The same idea can be extended to higher dimensions as well. Our objective is to preserve the direction with maximal spread/variance. Let’s start with a simple case.

NOTE: Always STANDARDIZE your data before applying PCA.

Suppose we have standardized data in which the points are spread out much more along feature X1 than along feature X2.

If you had to choose one of X1 and X2 to represent the data, which feature would you pick? The one that explains the maximum variation in the data, which here is X1. This is exactly what PCA does: it keeps the directions with maximum spread and drops the others, with the aim of minimizing information loss.

Let's take a slightly more complex example where we cannot simply drop one feature.

Here, both features X1 and X2 have equal spread, so we can't tell which feature is more important. But if we look for a direction (or axis) that explains the variation in the data, we can find a line that fits the data very well. If we rotate our axes by some angle theta, we get a new axis f1 along that line and f2 perpendicular to it. We can then drop f2 and say that f1 is the most important feature.

This is exactly what PCA does. It takes the data and finds a direction f1 such that the variance of the points projected onto f1 is maximized.

Note that we are transforming our data by creating new features.

Okay, this sounds cool, but how does PCA find such a direction? PCA uses concepts from linear algebra and optimization theory to find the direction with maximum spread. Let's look into it.

Mathematics Behind PCA

Our aim is to find the direction with maximum spread and project the data points onto that direction.

Since the data is standardized (zero mean), the variance of the projections is just the mean squared distance of the projected points from the origin. So let's find the line that maximizes the variance of the points projected onto it.
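In symbols (a sketch, writing x_1, …, x_n for the standardized data points and u for a unit vector along the candidate direction), the problem is:

    maximize over u:   (1/n) Σ_{i=1}^{n} (u^T x_i)^2     subject to   u^T u = 1

The constraint u^T u = 1 simply fixes the length of u, since only its direction matters.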

This approach is called the variance maximization approach. There is another way to construct the optimization problem for PCA.

Another way to think of PCA is that it fits the line that best passes through our data, with the aim of minimizing the projection error d of each point onto that line. This approach is called the distance minimization approach.
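To see this (a sketch in the same notation as above), note that by the Pythagorean theorem the squared projection error of a point x_i onto the line spanned by the unit vector u is:

    d_i^2 = x_i^T x_i − (u^T x_i)^2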

Notice that the two optimization problems, though they look different, are the same. Since the x_i^T x_i term is independent of u, minimizing the total projection error Σ_i d_i^2 is the same as maximizing Σ_i (u^T x_i)^2, which is exactly our first optimization problem.

Before solving this optimization problem using Lagrange multipliers, let's understand some terms.

Suppose X is our data matrix with n observations and d features.

  1. Covariance Matrix

The covariance matrix (or variance-covariance matrix) is a square matrix whose size equals the number of features; its diagonal elements are the variances of the individual features and its off-diagonal elements are the covariances between pairs of features.
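As a quick sketch, for a column-standardized data matrix X with n rows (observations) and d columns (features), the covariance matrix can be computed as:

    S = (1/n) X^T X,   a d × d matrix with S_jj = var(X_j) and S_jk = cov(X_j, X_k)

(Some texts divide by n − 1 instead of n; this rescaling does not change the eigenvectors, and hence does not change the principal directions.)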

  2. Eigenvalue and Eigenvector

An eigenvector of a square matrix is a non-zero vector whose direction is unchanged when the matrix is applied to it; the factor by which it is scaled is the corresponding eigenvalue. Corresponding to each eigenvalue there exists an eigenvector, and since the covariance matrix is symmetric, its eigenvectors can always be chosen to be mutually orthogonal.
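In symbols, λ is an eigenvalue of the covariance matrix S with eigenvector v if:

    S v = λ v,   with v ≠ 0

A d × d covariance matrix has d such eigenvalue/eigenvector pairs (counted with multiplicity).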

Solving the optimization problem

Let’s look at our optimization problem once again.
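Writing it out once more (a sketch, with S the covariance matrix of the standardized data): the objective (1/n) Σ_i (u^T x_i)^2 is exactly u^T S u, so the problem becomes:

    maximize   u^T S u     subject to   u^T u = 1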

Now let's try to solve our modified optimization problem using Lagrange multipliers.

Let λ (lambda) be the Lagrange multiplier.
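A sketch of the derivation:

    L(u, λ) = u^T S u − λ (u^T u − 1)
    ∂L/∂u = 2 S u − 2 λ u = 0   ⟹   S u = λ u

At such a stationary point the objective value is u^T S u = λ u^T u = λ, i.e. the variance along u equals the eigenvalue itself, so the maximum is attained at the largest eigenvalue.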

Thus u is an eigenvector of the covariance matrix S corresponding to the largest eigenvalue lambda.

Let’s write all the steps once again:

Suppose our data matrix X has d features and we want to reduce them to k features.

  1. Column standardize your data.
  2. Find the covariance matrix.
  3. Find all eigenvalues and eigenvectors of the covariance matrix.
  4. Then the eigenvector v1 corresponding to the largest eigenvalue λ1 is the direction with maximum variance, v2 corresponding to λ2 is the direction with the second-largest variance, and so on.
  5. To get k features, multiply the original data matrix with the matrix of eigenvectors corresponding to the k largest eigenvalues.

The resultant matrix is the matrix with reduced features.
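In matrix form (a sketch), if V_k denotes the d × k matrix whose columns are the eigenvectors corresponding to the k largest eigenvalues, then the reduced data is:

    X_reduced = X V_k,   an n × k matrix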

Let’s understand the interpretation of eigenvalues.
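A standard interpretation, consistent with the code below: each eigenvalue λ_i equals the variance of the data along its eigenvector v_i, so the fraction of the total variance retained by the top k components is:

    (λ_1 + λ_2 + … + λ_k) / (λ_1 + λ_2 + … + λ_d)

This is exactly the "percentage of variation explained" printed in the implementations below.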

PCA in Python

I'll implement PCA on the Iris dataset.

From Scratch

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from scipy.linalg import eigh

# Importing the data
data = pd.read_csv("Iris.csv")
data.head()
X = data.iloc[:, 0:4]  # feature columns (adjust the slice if your CSV has an Id column)
y = data.species       # species labels (the column name may differ in your CSV)

# Step 1: Scale the data.
X_scaled = StandardScaler().fit_transform(X)

# Step 2: Find the covariance matrix.
covar_matrix = (1 / X_scaled.shape[0]) * np.matmul(X_scaled.T, X_scaled)
print(covar_matrix.shape)

# Step 3: Find the eigenvalues and eigenvectors.
# eigh returns the eigenvalues in ascending order, so we ask for the top 2
# (indices 2 and 3 out of 0, 1, 2, 3). On newer SciPy versions this argument
# is called subset_by_index.
values, vectors = eigh(covar_matrix, eigvals=(2, 3))
print(vectors.shape)
print(vectors)

# Step 4: Project the original data onto the eigenvectors.
pca_components = np.matmul(X_scaled, vectors)
pca_data = pd.DataFrame(pca_components, columns=["Component 1", "Component 2"])
pca_data["Species"] = y.to_numpy()
pca_data.head()

# Percentage of variation explained by the top 2 components.
print(values.sum() / eigh(covar_matrix)[0].sum() * 100)

# Plot the principal components.
sns.scatterplot(x="Component 1", y="Component 2", hue="Species", data=pca_data)

Using scikit learn

from sklearn.decomposition import PCA

# Using the n_components argument to keep the top 2 components
pca = PCA(n_components=2)
pca_components = pca.fit_transform(X_scaled)
pca_data = pd.DataFrame(pca_components, columns=["Component 1", "Component 2"])
pca_data["Species"] = y.to_numpy()
pca_data.head()

Note that the components come out swapped relative to the from-scratch implementation: eigh returned the eigenvalues in ascending order, so our "Component 1" was actually the direction of second-largest variance, whereas sklearn orders components by decreasing explained variance. (Individual components may also differ by a sign flip, which does not affect the result.)

print(pca.explained_variance_ratio_.sum() * 100)
sns.scatterplot(x="Component 1", y="Component 2", hue="Species", data=pca_data)

Thank you for reading.

You can connect with me on LinkedIn: www.linkedin.com/in/madhav-samariya-7a636b17a
