In-Depth Guide to PCA

Laine Ridgway
Mar 25, 2024


Understand PCA from scratch, including a Python implementation.


Contents

  • Core Idea
  • Mathematics behind PCA
  • Eigendecomposition Approach
    - Eigendecomposition
    - Reconstruction
    - Whitening
    - PCA from scratch in Python, using eigendecomposition.
  • SVD Approach
    - Covariance
    - Note on scaling the data
    - Projecting the data
    - Whitening
  • Proportion of Variance Explained
  • Resources

Core Idea behind PCA

PCA is a dimensionality reduction tool. Simply put: it summarises data from many dimensions into just a few. How? By projecting the data onto a lower-dimensional space. Take, for instance, two variables: house size (m²) and the number of rooms. Instead of keeping both variables, we capture them in a single line, sacrificing some information. As visualised below:

There are a few ways to understand how PCA is choosing this line:

  1. Think of the residuals (red lines) as springs. The farther they stretch, the more tension they exert, consistent with Hooke’s Law. Each spring pulls the black line towards its data point until the line settles at an equilibrium.
  2. Maximising the variance of the projected data points (blue dots), encapsulating as much information as possible.
  3. Minimising the residuals (the distances between the original data points and their projections on the line).

Interestingly, these approaches all lead to the same result.
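To make the core idea concrete before the mathematics, here is a minimal sketch using scikit-learn’s off-the-shelf PCA (not the from-scratch implementation built later in this article), with made-up house data, reducing the two variables above to a single principal component:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: house size (m^2) and number of rooms, one row per house
X = np.array([[50, 2], [70, 3], [85, 3], [110, 4], [140, 5]], dtype=float)

# Keep a single principal component (2D -> 1D)
pca = PCA(n_components=1)
scores = pca.fit_transform(X)                    # projection of each house onto the line
X_reconstructed = pca.inverse_transform(scores)  # back-projection onto the original axes

print(scores.ravel())                 # one-number summary of each house
print(pca.explained_variance_ratio_)  # how much variance the single line captures

Note that scikit-learn expects observations as rows (n × d), whereas the from-scratch implementation later in this article uses the d × n convention.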

Mathematics behind PCA

Let’s focus on the first principal component:

Projection of a data point x onto the first principal component.

The key term here is the ‘score’. For an observation x, the score on the first principal component w₁ is given by

z_1 = w_1^\top x

This calculates the magnitude by which we scale the first principal component vector, w₁,

(w_1^\top x)\, w_1

to get the projection of our observation. Note: the principal components are unit vectors (length of 1), and they are orthogonal to each other, i.e.

w_i^\top w_j = 0 \;\; (i \neq j), \qquad w_i^\top w_i = 1

We repeat this process for all n observations, providing us with a score vector,

z_1 = (z_{11}, z_{21}, \dots, z_{n1})^\top, \qquad z_{i1} = w_1^\top x_i

Now we can calculate the variance of the scores (projected data) for the first principal component. Assuming the data has been centred, so the scores have zero mean:

\mathrm{Var}(z_1) = \frac{1}{n}\sum_{i=1}^{n} z_{i1}^2 = \frac{1}{n}\sum_{i=1}^{n} (w_1^\top x_i)^2 = w_1^\top S\, w_1, \qquad S = \frac{1}{n} X X^\top

where X = [x₁ … xₙ] is the d × n matrix of centred observations and S is its covariance matrix. Note: for simplicity, we are using a biased estimate of the variance. Now, as stated earlier, the goal is to maximise the variance of the projected data. So we use this variance in the optimisation problem:

\min_{w_1}\; -\,\mathrm{Var}(z_1) + \lambda_1\,(w_1^\top w_1 - 1)

The reason for the negative sign before the variance is that minimising the negative expression is the same as maximising the positive one. We also make use of a Lagrange multiplier to enforce the unit-norm constraint; if you are unfamiliar with Lagrange multipliers, see my visual guide on Lagrange multipliers.

Substituting our derivation of the variance, the objective function simplifies to:

L(w_1, \lambda_1) = -\,w_1^\top S\, w_1 + \lambda_1\,(w_1^\top w_1 - 1)

Now, to minimise this objective function, we take the partial derivative with respect to the first principal component (w₁) and set it equal to 0:

\frac{\partial L}{\partial w_1} = -2\,S\,w_1 + 2\,\lambda_1\, w_1 = 0 \;\;\Longrightarrow\;\; S\,w_1 = \lambda_1\, w_1

Hence this becomes an eigenvector/eigenvalue equation.

To get back to our variance, we left-multiply both sides by w₁ᵀ:

w_1^\top S\, w_1 = \lambda_1\, w_1^\top w_1 = \lambda_1

So the variance of the projected data equals the eigenvalue λ₁. To maximise it, we choose w₁ to be the eigenvector of S with the largest eigenvalue, which gives us our first principal component.

The other principal components are found in a similar fashion, where w₂ aligns with the second largest eigenvalue, and so on. Note however that the optimisation function is defined slightly differently, with an extra constraint: the principal components have to be orthogonal.

This leads us to the conclusion that these eigenvectors and eigenvalues can be extracted from the eigendecomposition of the covariance matrix:

S = Q\,\Lambda\,Q^\top

where Λ is the diagonal eigenvalue matrix, sorted from largest to smallest for convenience, and Q is an orthogonal matrix with each column representing a principal component vector. We will go into more detail in the next section on eigendecomposition.

Here is a simple illustration of the first two principal components in a 2D example:

Principal components in 2D

Notice how most of the variance is captured by projecting the data onto w₁.
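As a quick numerical check of the derivation, here is a minimal sketch with made-up 2D data (NumPy only; all names are illustrative). The variance of the data projected onto an eigenvector of the covariance matrix should equal the corresponding eigenvalue:

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2D data, shape d x n (features x observations)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=500).T

Xc = X - X.mean(axis=1, keepdims=True)   # centre the data
S = (Xc @ Xc.T) / Xc.shape[1]            # biased covariance estimate, d x d
eigvals, Q = np.linalg.eigh(S)           # eigenvalues in ascending order

w1 = Q[:, -1]                            # eigenvector with the largest eigenvalue
scores = w1 @ Xc                         # projections onto w1
print(scores.var(), eigvals[-1])         # the two numbers should match (up to floating-point error)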

This concludes the mathematics required for deriving PCA from first principles, but we still haven’t done any dimensionality reduction yet! This now leads us to the eigendecomposition approach for PCA.

Eigendecomposition Approach

Take another look at the eigendecomposition of the covariance matrix S:

S = Q\,\Lambda\,Q^\top

What does this look like visually?

Visual steps of the eigendecomposition (first, we centre the data)

Note that Λ represents the covariance matrix of the data expressed in the principal component basis, i.e. the projected scores are uncorrelated and their variances are the eigenvalues:

Q^\top S\, Q = \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)

Now we transition to the concept of dimensionality reduction, which is straightforward from here. Simply, dimensionality reduction involves limiting the number of principal components we choose to retain. By selecting only a subset of the principal components — essentially, a few columns from the matrix Q — we can project the original high-dimensional data onto a lower-dimensional space. This projection captures the most significant variance directions with fewer dimensions, simplifying the dataset while preserving as much of the original information as possible.

Mathematically, we select the first v columns of Q:

Q_v = [\,w_1 \;\; w_2 \;\cdots\; w_v\,] \in \mathbb{R}^{d \times v}

Let us now represent, and later reconstruct, the data using our v principal components. For an observation x, the low-dimensional representation is

z = Q_v^\top (x - \bar{x})

The mean is subtracted to centre the data. Here is what that looks like expanded:

z = \begin{pmatrix} w_1^\top (x - \bar{x}) \\ w_2^\top (x - \bar{x}) \\ \vdots \\ w_v^\top (x - \bar{x}) \end{pmatrix}

We see that each element of the vector z represents the observation projected onto one of the principal component vectors. Note that it is of dimension v, no longer d. The data can then be (approximately) reconstructed by mapping z back to the original space and adding the mean back:

\hat{x} = Q_v\, z + \bar{x} \approx x

Whitening

‘Whitening’ refers to normalizing the variance across each principal component to unity. Simply, this means adjusting the scale of each principal component such that its variance equals 1, ensuring that all components contribute equally to the variance.

To accomplish this, we multiply by the inverse square root of the eigenvalue matrix, Λ^(-1/2), ensuring that each principal component’s variance is normalized to 1.

Firstly, we adjust our eigendecomposition so that the covariance of the transformed components becomes the identity matrix:

\Lambda^{-1/2}\, Q^\top S\, Q\, \Lambda^{-1/2} = \Lambda^{-1/2}\, \Lambda\, \Lambda^{-1/2} = I

This changes the (v-dimensional) representation vector to:

z_{\text{white}} = \Lambda_v^{-1/2}\, Q_v^\top (x - \bar{x})

where Λ_v contains the v largest eigenvalues. This is equivalent to dividing each score by the standard deviation of its component:

z_{\text{white},\, j} = \frac{w_j^\top (x - \bar{x})}{\sqrt{\lambda_j}}, \qquad j = 1, \dots, v

Python Implementation

import numpy as np

class PCA_eigendecomp:
    def __init__(self, n_components=2, whitening=False):
        """Initialize the PCA parameters."""
        self.n_components = n_components
        self.whitening = whitening
        self.N = None
        self.D = None
        self.mean = None
        self.sigma = None
        self.Q = None

    def fit(self, X):
        """
        Fit the model with X using Eigendecomposition on the covariance matrix.
        Steps:
        1. Centre Data
        2. Covariance
        3. Eigendecomposition
        4. Re-order eigenvalues and eigenvectors
        5. Ensure principal components are aligned with data's axis directions
        6. Dimension Reduction
        """
        # Shape of X: d x n (features x observations)
        self.N = X.shape[1]

        # Centre Data (mean of each feature, i.e. each row)
        self.mean = np.mean(X, axis=1).reshape(-1, 1)
        self.D = X - self.mean

        # Covariance (biased estimate, d x d)
        cov = (1 / self.N) * (self.D @ self.D.T)

        # Eigendecomposition
        sigma, Q = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

        # Re-order from largest to smallest eigenvalue
        sigma = sigma[::-1]
        Q = Q[:, ::-1]

        # Fix the sign of each principal component (optional, but nice for visualisations)
        for i in range(Q.shape[1]):
            if Q[0, i] < 0:    # If the first element is negative
                Q[:, i] *= -1  # Flip the sign of the whole component

        # Dimension Reduction: keep only the top n_components
        self.sigma = sigma[:self.n_components]
        self.Q = Q[:, :self.n_components]

    def transform(self, X):
        """
        Project X onto the principal components Q.
        """
        # Centre the new data using the training mean
        X_centred = X - self.mean
        # (v x d) @ (d x n)
        D_projected = self.Q.T @ X_centred
        if self.whitening:
            # Sigma^(-1/2) Q.T D   # (v x v) @ (v x n)
            return np.diag(1 / np.sqrt(self.sigma)) @ D_projected
        return D_projected

    def inverse_transform(self, D_projected):
        """
        Reconstruct (an approximation of) the original data from projections.
        """
        if self.whitening:
            # Undo the whitening scaling first
            D_projected = np.diag(np.sqrt(self.sigma)) @ D_projected
        # (d x v) @ (v x n), then add the mean back
        return self.Q @ D_projected + self.mean
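A quick usage sketch (with made-up data; the variable names here are illustrative), showing the d × n convention the class expects:

import numpy as np

rng = np.random.default_rng(42)
# Hypothetical data: 5 features, 200 observations (d x n)
X = rng.normal(size=(5, 200))
X[1] += 0.8 * X[0]  # introduce some correlation so PCA has something to find

pca = PCA_eigendecomp(n_components=2, whitening=False)
pca.fit(X)

Z = pca.transform(X)              # 2 x 200 matrix of scores
X_hat = pca.inverse_transform(Z)  # 5 x 200 approximate reconstruction

print(Z.shape, X_hat.shape)
print(pca.sigma)                  # variances captured by the two retained components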

Singular Value Decomposition (SVD) Approach

Going forward, we will assume that X is a centred (at 0) data matrix with dimensions d × n, where d represents the number of features and n the number of observations, same as before.

Singular Value Decomposition (SVD) offers an alternative method for performing PCA by decomposing the (centred) data matrix directly into three key components:

X = U\,\Sigma\,V^\top

where U (d × d) and V (n × n) are orthogonal matrices and Σ (d × n) holds the non-negative singular values on its diagonal.

This approach sidesteps the covariance matrix computation, presenting a more efficient route for large datasets. Through SVD, the principal components are found in the columns of U, and the singular values in Σ are related to the standard deviations captured by each component (up to a scaling we address below), facilitating straightforward dimensionality reduction and feature extraction. We can see this here:

X X^\top = U\,\Sigma\,V^\top V\,\Sigma^\top U^\top = U\,(\Sigma\Sigma^\top)\,U^\top

Notice how this looks identical to the eigendecomposition:

S = \frac{1}{n} X X^\top = Q\,\Lambda\,Q^\top

The only difference is that we have not scaled the data by 1/n. To fix this, we can divide X by the square root of n:

\frac{X}{\sqrt{n}} = U\,\Sigma' V^\top, \qquad \Sigma' = \frac{\Sigma}{\sqrt{n}}, \qquad S = \frac{1}{n} X X^\top = U\,\Sigma'\Sigma'^\top U^\top

With this modification, Σ′ effectively contains the standard deviations of the principal components (Σ′Σ′ᵀ = Λ), reflecting the correct scaling for PCA.

Also, note that we have not performed dimensionality reduction yet, but it’s as simple as last time: we take the first v columns of the principal component matrix:

U_v = [\,u_1 \;\; u_2 \;\cdots\; u_v\,] \in \mathbb{R}^{d \times v}

Projecting the data onto the principal components

Projecting a single observation, same as before:

z = U_v^\top x

Projecting the entire dataset onto the principal components yields the score matrix Z, a v × n matrix. This matrix contains the scores for each observation across the selected v principal components:

Z = U_v^\top X

Here, Z encapsulates the principal component scores for all observations, effectively summarizing the dataset in the reduced-dimensional space.

Interestingly, we get the same result by left-multiplying the SVD by the transpose of U_v:

Z = U_v^\top X = U_v^\top (U\,\Sigma\,V^\top) \approx \Sigma_v V_v^\top

where Σ_v is the top-left v × v block of Σ and V_v holds the first v columns of V.

We keep the approximation sign here because we are performing dimension reduction, taking the first v principal components, therefore losing some information/variance.

Whitening

Whitening is quite straightforward when using the SVD approach. We simply left-multiply the above expression by the inverse of Σ_v (our standard deviations, up to the 1/√n scaling, hence standardising the data):

\Sigma_v^{-1} Z = \Sigma_v^{-1} U_v^\top X \approx \Sigma_v^{-1}\,\Sigma_v V_v^\top = V_v^\top

This is the beauty of this approach: V transposed represents our whitened data (up to a constant factor of √n, given the 1/n variance convention used earlier)!

The principal components can be conceptualized as aligning with the standard basis, where the eigenvalues (now singular values) are transformed to unity. This results in a space where the transformed components are uncorrelated with variances equal to one, i.e. an identity covariance matrix:

\frac{1}{n}\,\bigl(\sqrt{n}\,V_v^\top\bigr)\bigl(\sqrt{n}\,V_v\bigr) = V_v^\top V_v = I_v

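As a rough sketch of the SVD route in code (NumPy only, using the d × n layout from this article; the function and variable names are illustrative), the whole pipeline of centring, decomposing, projecting, and whitening might look like this:

import numpy as np

def pca_svd(X, v, whiten=False):
    """PCA via SVD of the d x n data matrix X, keeping v components."""
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)     # centre the data

    # Economy SVD: Xc = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    U_v = U[:, :v]                             # first v principal components
    sigma_prime = s[:v] / np.sqrt(n)           # standard deviations of the components

    Z = U_v.T @ Xc                             # v x n score matrix
    if whiten:
        Z = Z / sigma_prime[:, None]           # divide each row by its std -> unit variance
    return Z, U_v, sigma_prime

# Example with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 300))
Z, U_v, stds = pca_svd(X, v=2, whiten=True)
print(Z.shape)        # (2, 300)
print(Z.var(axis=1))  # approximately [1., 1.]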
Proportion of Variance Explained

Simply put, the proportion of variance explained (how much information PCA retains) is the variance we kept divided by the total variance:

\mathrm{PVE} = \frac{\sum_{j=1}^{v} \lambda_j}{\sum_{j=1}^{d} \lambda_j}

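In code, this is a one-liner given the full set of eigenvalues (here eigvals is assumed to be the array returned by np.linalg.eigh on the covariance matrix, and v the number of components kept):

import numpy as np

def proportion_of_variance_explained(eigvals, v):
    """Fraction of total variance captured by the v largest eigenvalues."""
    eigvals = np.sort(eigvals)[::-1]   # largest first
    return eigvals[:v].sum() / eigvals.sum()

# e.g. proportion_of_variance_explained(np.array([4.0, 1.0, 0.5]), v=1) -> ~0.73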
I hope you’ve found this guide useful. If you have any questions or feedback, or if there’s more you’d like to learn, please feel free to reach out. Thank you!

Resources

[1] Explaining PCA to your great-grandmother

[2] Eigendecomposition Visualised, Visual Kernel

[3] For LaTeX
