Build a Principal Component Analysis (PCA) Algorithm from Scratch

Understand the Concept and Build PCA from Scratch using NumPy

Ayo Akinkugbe
Technological Singularity
8 min read · Apr 16, 2024


Introduction

The Principal Component Analysis (PCA) algorithm cannot be discussed without first diving into the core problem it solves.

A data point can be represented by one or more dimensions. The number of dimensions making up a data point determines its dimensionality, i.e., points composed of a high number of dimensions are referred to as having high dimensionality. An example of this in machine learning is making a prediction from a dataset with a high number of features. The problem is that high dimensionality is not handled well by machines. This is referred to as the Curse of Dimensionality: the numerous problems faced when analyzing or working with high-dimensional datasets. These range from difficulty creating and training machine learning models with the data, to problems interpreting, explaining, or visualizing the data, to expensive computation.

Enter PCA, a dimensionality reduction algorithm that derives new dimensions or features (called principal components) from the existing ones, packing as much information from the dataset as possible into the first principal component, then the next, and so on. The PCA algorithm produces as many principal components as there are features. However, because the components are ranked by how much information they retain, the dominant components can be kept and the rest discarded (choosing the components with the most retained information), thereby reducing the dimension of the data.

Chart from the psynlig documentation

The chart above shows 13 principal components, with the length of each bar highlighting the amount of information retained by each component.

This project walks through creating PCA from the ground up using a sample dataset. The project:

  • Explains the theory and algebra behind the PCA algorithm
  • Implements the PCA algorithm using NumPy
  • Compares the implementation’s accuracy to scikit-learn’s PCA implementation

Theory

Essentially, the PCA algorithm is a type of orthogonal linear transformation. An orthogonal linear transformation takes a geometrical object (which can be represented with a set of vectors) and rotates or flips it while preserving its geometry (all angles and lengths). A transformation where the destination coordinate system has fewer dimensions than the source coordinate system is referred to as a projection. This is the goal of PCA: taking the features in the dataset (existing in multiple dimensions) and projecting them onto new dimensions (vectors) chosen so that the projected data is as spread out as possible, preserving the variance of the original dataset.
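As a small illustration (an aside, not from the original article), a rotation is an orthogonal transformation that preserves lengths, while projecting onto a single unit vector drops a dimension. The matrix and vectors below are made up for demonstration:

import numpy as np

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a 2D rotation (orthogonal) matrix

x = np.array([3.0, 1.0])
print(np.linalg.norm(x), np.linalg.norm(R @ x))   # lengths are preserved by the rotation

u = np.array([1.0, 0.0])                          # a unit vector defining a new 1D axis
print(u @ x)                                      # the projection of x onto u: one number instead of two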

Reframing the problem this way helps surface important questions:

  • How can the direction or magnitude of the new dimension be determined?
  • How can the variance of the data, when projected onto the new dimension, be maximized?

To address these two questions:

Find and maximize the variance of the projection of the dataset mathematically.

First, project the data 𝑥 and its mean onto the new dimension.

Projecting a data point 𝑥ᵢ onto a unit vector 𝓊, the projection is:

u^T x_i

Projecting the mean, the mean projection is:

u^T \bar{x}

To find the variance of the projected data, subtract the mean projection from the projection of each data point, square the difference, and divide by the number of data points n in the dataset.

Subtracting the mean projection from the projection of each data point, squaring, and averaging gives the variance to be maximized:

\frac{1}{n} \sum_{i=1}^{n} \left( u^T x_i - u^T \bar{x} \right)^2

Factorizing and expanding the square:

\frac{1}{n} \sum_{i=1}^{n} u^T (x_i - \bar{x}) \, u^T (x_i - \bar{x})

Rearranging the last factor in the summation (if the positional order of two matrices is swapped, the transpose swaps with them; here u^T (x_i - \bar{x}) is a scalar, so it equals its own transpose (x_i - \bar{x})^T u):

\frac{1}{n} \sum_{i=1}^{n} u^T (x_i - \bar{x}) (x_i - \bar{x})^T u

Factorizing 𝓊 transpose and 𝓊 out of the summation:

u^T \left[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) (x_i - \bar{x})^T \right] u

The covariance matrix C of the dataset is exactly the bracketed part of the expression:

C = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) (x_i - \bar{x})^T

Substituting C, the expression becomes:

u^T C u

The goal is to find the maximum of the above expression:

\max_{u} \; u^T C u

such that:

u^T u = 1

Since this is a constrained optimization problem, the method of Lagrange multipliers, a mathematical technique for finding maximum and minimum values of a function subject to a constraint, can be leveraged here. Its theorem asserts that at the maximum, the partial derivative of the function to be maximized equals the partial derivative of its constraint function, multiplied by a multiplier 𝝺:

\frac{\partial}{\partial u} \left( u^T C u \right) = \lambda \, \frac{\partial}{\partial u} \left( u^T u - 1 \right)

Leveraging another mathematical tool, the rules for differentiating with respect to a vector (\frac{\partial}{\partial u} u^T C u = 2 C u and \frac{\partial}{\partial u} u^T u = 2 u), the above equation becomes:

C u = \lambda u

The above expression is a well-known mathematical construct: the eigenvalue equation. An eigenvector of a matrix is a vector whose direction remains unchanged when the matrix (a linear transformation) is applied to it; the corresponding eigenvalue is the factor by which that vector is scaled. Remember, the PCA algorithm needs to project the multiple dimensions in the dataset onto a new dimension to derive a principal component. To do this, it needs a composition of the existing dimensions in the new dimension. Eigenvectors provide this composition, and the eigenvalues ascribe a magnitude describing the variance captured along each eigenvector’s direction. This answers the first of the earlier questions (how can the direction or magnitude of the new dimension be determined?): find the eigenvectors and eigenvalues of C, the covariance matrix.
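As a concrete illustration (a small aside, not from the original article), NumPy’s np.linalg.eig returns the eigenvalues and eigenvectors of a matrix, and each pair satisfies the relationship above. The matrix here is made up for demonstration:

import numpy as np

# A small symmetric matrix standing in for a covariance matrix
example_matrix = np.array([[2.0, 0.8],
                           [0.8, 1.0]])

values, vectors = np.linalg.eig(example_matrix)
v = vectors[:, 0]                                       # eigenvectors are returned as columns
print(np.allclose(example_matrix @ v, values[0] * v))   # True: C v = lambda v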

However, the second question still remains: how can the variance or spread in the new dimension be maximized?

Multiplying both sides of the equation C u = \lambda u by the transpose of the unit vector:

u^T C u = \lambda \, u^T u

Remember the constraint stated earlier:

u^T u = 1

The equation becomes:

u^T C u = \lambda

Since the left-hand side is the same expression to be maximized, it follows that to maximize the variance or spread of the projection of the data, maximize the eigenvalue 𝝺 (i.e., choose the eigenvector with the largest eigenvalue).
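In other words, the first principal component is the eigenvector of C with the largest eigenvalue, and the variance of the data projected onto it equals that eigenvalue. A small numerical check of this claim (illustrative only, with randomly generated data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X = X - X.mean(axis=0)                       # center the data

C = (X.T @ X) / len(X)                       # covariance matrix (divided by n)
values, vectors = np.linalg.eig(C)
u = vectors[:, np.argmax(values)]            # eigenvector with the largest eigenvalue

projected = X @ u                            # project the data onto u
print(np.var(projected), values.max())       # the two numbers agree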

Dataset

This implementation uses the Iris flower dataset. The Iris dataset is a popular dataset containing sepal length, sepal width, petal length, and petal width for 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 array. Each type of iris has 50 instances in the dataset. The labels are removed for PCA since the goal is dimensionality reduction.

Code Implementation

A copy of the code and data files for this project can be found here.

Step 1 — Standardize dataset

Why — This step ensures the algorithm (especially a distance-based algorithm) can process the dataset properly. If the data is not standardized, features with larger values can have more weight or dominance on the algorithm. All features should weigh equally, which is why data standardization is important.

How — calculate the Z-score (standardized value) for each value in the dataset by subtracting the mean 𝝁 of the feature (column) from each value 𝑥 in the column and dividing by the column’s standard deviation 𝝈:

z = \frac{x - \mu}{\sigma}

# Import pandas and numpy libraries
import pandas as pd
import numpy as np

# Import data using pandas
iris = pd.read_csv('IRIS.csv')
iris

# Drop label (y) from the dataset
iris_features = iris.drop(columns=['species'])

# Standardize the dataset
iris_features = (iris_features - iris_features.mean()) / iris_features.std()

# Convert dataframe to numpy array
iris_features = iris_features.to_numpy()

iris_features
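As a quick sanity check (an illustrative aside, not part of the original walkthrough), each standardized column should now have a mean of roughly 0 and a standard deviation of roughly 1. Pandas’ .std() uses the sample standard deviation, hence ddof=1 below:

# Each column should have mean ~0 and standard deviation ~1 after standardization
print(iris_features.mean(axis=0).round(6))
print(iris_features.std(axis=0, ddof=1).round(6))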

Step 2 — Compute Covariance Matrix

def covariance_matrix(X, n_rows):
    """Compute the covariance matrix as the average of the outer products of each data point with itself.
    See the formula for C in the theory section. The mean is not subtracted here because it was
    already removed when standardizing the data."""
    covariance_matrix = sum([X[i].reshape(-1, 1) @ X[i].reshape(1, -1) for i in range(n_rows)]) / n_rows
    return covariance_matrix

covariance_matrix(iris_features, 150)
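As a sanity check (an illustrative aside, not part of the original article), the result should match NumPy’s built-in covariance function. np.cov defaults to the unbiased estimate, so ddof is set to 0 here to match the division by n above:

C = covariance_matrix(iris_features, 150)
print(np.allclose(C, np.cov(iris_features, rowvar=False, ddof=0)))  # expected: True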

Step 3 — Derive Eigenvectors and Eigenvalues of the Covariance matrix

def eigen_vectors(X, n_rows):
    C = covariance_matrix(X, n_rows)
    eigen_vectors = np.linalg.eig(C)
    return eigen_vectors

# the function returns two arrays: the first holds the eigenvalues, the second the eigenvectors (as columns)
eigen_values, eigen_vectors = eigen_vectors(iris_features, 150)
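Because the covariance matrix is symmetric, np.linalg.eigh could also be used; it is specialized for symmetric matrices and returns real eigenvalues in ascending order. A quick check (illustrative) that each returned pair satisfies C·v = 𝝺·v:

C = covariance_matrix(iris_features, 150)
for value, vector in zip(eigen_values, eigen_vectors.T):
    print(np.allclose(C @ vector, value * vector))  # expected: True for every pair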

Step 4 — Sort the Eigenvalues and Eigenvectors

# sort eigenvalues from greatest to smallest magnitude (argsort is ascending, so reverse it)
sort = np.argsort(eigen_values)[::-1]

# use the sort order to rearrange the eigenvector columns
principal_components = eigen_vectors[:, sort]

principal_components
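The sorted eigenvalues can also be turned into an explained variance ratio, showing the share of the total variance each principal component retains (an illustrative aside; the variable names below are hypothetical):

sorted_eigen_values = eigen_values[sort]
explained_variance_ratio = sorted_eigen_values / sorted_eigen_values.sum()
print(explained_variance_ratio)  # for the Iris data, the first component carries most of the variance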

Step 5 — Return the Principal Components

Putting all the mini-steps from above together, create one function that serves the overall purpose of deriving the principal components.

def PCA_from_Scratch(X, n_rows, n_components):
    C = sum([X[i].reshape(-1, 1) @ X[i].reshape(1, -1) for i in range(n_rows)]) / n_rows
    eigen_values, eigen_vectors = np.linalg.eig(C)
    sort = np.argsort(eigen_values)[::-1]
    principal_components = eigen_vectors[:, sort]
    # keep only the first n_components eigenvector columns, one column per principal component
    return principal_components[:, :n_components]

PCA_from_Scratch(iris_features, 150, 3)
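A quick check (illustrative) that the function returns one column per component and that the columns are orthonormal, as eigenvectors of a symmetric matrix should be:

W = PCA_from_Scratch(iris_features, 150, 2)
print(W.shape)                          # (4, 2): four features, two principal components
print(np.allclose(W.T @ W, np.eye(2)))  # expected: True, the components are orthonormal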

Step 6 — Project the data on the principal components

def transform(X, principal_components):
    X = X.copy()
    # project the data onto the principal component directions (one column per component)
    X_proj = X.dot(principal_components)
    return X_proj


# transforming the iris dataset from four dimensions to two dimensions
principal_components_2 = PCA_from_Scratch(iris_features, 150, 2)
X_transform = transform(iris_features, principal_components_2)

X_transform
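To see the effect of the reduction, the two-dimensional projection can be plotted and colored by species (an illustrative sketch, assuming matplotlib is installed and the original iris dataframe with its species labels is still in scope):

import matplotlib.pyplot as plt

# scatter the projected data, one color per iris species
for species in iris['species'].unique():
    mask = (iris['species'] == species).to_numpy()
    plt.scatter(X_transform[mask, 0], X_transform[mask, 1], label=species)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.legend()
plt.show()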

Compare Results with Scikit Learn Library

from sklearn.decomposition import PCA

# initialize scikit-learn PCA
pca = PCA(n_components=4)

# fit scikit-learn PCA on the standardized data
pca.fit(iris_features)

# components_ holds the principal components as rows, so transpose to get one column per component
SKL_PCA_components = pca.components_.T

SKL_PCA_components
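The two sets of components can be compared directly. An eigenvector is only defined up to sign, so some columns may differ by a factor of −1; comparing absolute values (an illustrative check) sidesteps this:

# Recompute all four components from scratch and compare with scikit-learn's, ignoring sign
scratch_components = PCA_from_Scratch(iris_features, 150, 4)
print(np.allclose(np.abs(scratch_components), np.abs(SKL_PCA_components), atol=1e-6))  # expected: True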

The principal components derived from scratch match those calculated with the scikit-learn library, up to the sign of each component.

Conclusion

The principal components can be used to reduce the dimensionality of the data while still representing it faithfully. The eigenvalues describe how much of the dataset’s information is retained in each new dimension, and how much loss is to be anticipated when a component is discarded. This means that the larger the eigenvalue, the better its principal component is for dimensionality reduction.
