Principal Component Analysis (PCA)

Gajendra
5 min read · Jan 11, 2023

Principal Component Analysis, or PCA, is a dimensionality-reduction method often used on large data sets: it transforms a large set of variables into a smaller one that still contains most of the information in the original set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze the data faster without extraneous variables to process.

So to sum up, the idea of PCA is simple — reduce the number of variables of a data set, while preserving as much information as possible.

How do you do PCA?

Steps

Step 1: Standardize the range of continuous initial variables

Step 2: Compute the covariance matrix to identify correlations on the standardized dataset

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components

Step 4: Create a feature vector to decide which principal components to keep

Step 5: Recast the data along the principal components axes
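
To make these steps concrete, here is a minimal sketch using scikit-learn on a made-up dataset. The data, the variable names, and the choice of 2 components are illustrative assumptions, not part of this article's example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical dataset: 100 samples, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Step 1: standardize the features
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: PCA computes the covariance matrix, its Eigenvectors and
# Eigenvalues, keeps the top components, and projects the data onto them
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X_pca.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```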

How does PCA work?

Here we have a dataset with 5 features, X1 to X5. If we plot this data in a graph, we will need a 5-dimensional space, as shown below.

Dataset

PCA utilizes Eigenvalues and Eigenvectors to reduce the dimensions of the dataset. Let's say we calculated the Eigenvectors, sorted them by their Eigenvalues in descending order, and selected the top 2, v1 and v2.

Dataset

Now let’s plot these vectors on the original dataset.

PCA takes the N-dimensional data, in this case 5-dimensional, and converts it to M-dimensional data (5, 4, 3 or 2 dimensions) based on the number of Eigenvectors selected.

New Dimensions

This new 2-dimensional representation preserves most of the information in the original 5-dimensional representation.
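
As a rough sketch of that idea, the conversion is just a matrix multiplication of the data with the selected Eigenvectors stacked as columns; the numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # hypothetical 5-dimensional data
V = rng.normal(size=(5, 2))    # stand-in for the top 2 Eigenvectors as columns

# (100 x 5) @ (5 x 2) -> (100 x 2): 5-dimensional data reduced to 2 dimensions
X_2d = X @ V
print(X_2d.shape)  # (100, 2)
```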

Example

Dataset

Dataset Matrix

Here we have a dataset with 3 features (3D), and we will apply PCA to calculate the new features.

Steps

Step 1: Standardize the range of continuous initial variables

We can use any standardization technique here.

Calculate Mean and Standard Deviation

Standardization

Note: This step is optional. Doing it in Python is simpler than doing it manually, so we will skip it for this example.
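
For reference, here is a minimal sketch of the standardization in Python, using placeholder numbers rather than the article's actual dataset.

```python
import numpy as np

# Placeholder values; the article's dataset matrix is shown in the figure above
X = np.array([[4.0, 11.0, 5.0],
              [8.0, 4.0, 9.0],
              [13.0, 5.0, 14.0],
              [7.0, 14.0, 8.0]])

# Z-score standardization: subtract each feature's mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std)
```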

Step 2: Compute the covariance matrix to identify correlations on the standardized or original dataset

To calculate the covariance of the features in the dataset, follow the steps below -

Step 2.1: Calculate the Mean of each feature

Step 2.2: Calculate the product of the features you want to find the covariance for

Step 2.3: Calculate the Mean of the product of the features you want to find the covariance for

Step 2.4: Calculate the covariance using the formula below:

Cov(X, Y) = mean(X × Y) − mean(X) × mean(Y)

The Covariance for the example dataset will be -

Covariance

It’s actually the sign of the covariance that matters:

  • If positive: the two variables increase or decrease together (correlated)
  • If negative: one increases when the other decreases (inversely correlated)
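
A minimal sketch of computing the full covariance matrix with NumPy, again on placeholder numbers rather than the article's dataset:

```python
import numpy as np

# Placeholder values, not the article's actual numbers
X = np.array([[4.0, 11.0, 5.0],
              [8.0, 4.0, 9.0],
              [13.0, 5.0, 14.0],
              [7.0, 14.0, 8.0]])

# rowvar=False treats columns as features; bias=True divides by N, matching the
# "mean of the products minus product of the means" formula above
cov = np.cov(X, rowvar=False, bias=True)
print(cov)  # 3 x 3 covariance matrix
```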

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components

After calculating, we get the Eigenvalues and Eigenvectors below -

Eigenvectors
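
A minimal sketch of this step with NumPy, continuing with the same placeholder numbers:

```python
import numpy as np

# Placeholder values, not the article's actual numbers
X = np.array([[4.0, 11.0, 5.0],
              [8.0, 4.0, 9.0],
              [13.0, 5.0, 14.0],
              [7.0, 14.0, 8.0]])
cov = np.cov(X, rowvar=False, bias=True)

# eigh is suitable because the covariance matrix is symmetric; it returns the
# Eigenvalues in ascending order, with the matching Eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(cov)
print(eigenvalues)
print(eigenvectors)
```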

Step 4: Create a feature vector to decide which principal components to keep

If we rank the Eigenvalues in descending order, we get λ3 > λ2 > λ1, which means that the Eigenvector that corresponds to the first principal component (PC1) is v3 and the one that corresponds to the second component (PC2) is v2 and so on.

After having the principal components, to compute the percentage of variance (information) accounted for by each component, we divide the Eigenvalue of each component by the sum of the Eigenvalues. If we apply this to the example above, we find the following.

The Eigenvector that corresponds to λ1 carries 2.8% of the variance of the data. Similarly, the Eigenvectors that correspond to λ2 and λ3 carry 39.7% and 57.4% of the variance, respectively.

If we sort the Eigenvalues in descending order and pick the top 2 corresponding Eigenvectors, for a 2D representation, we get v3 and v2.

Putting v3 and v2 together as columns gives us the feature vector (the projection matrix).
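
A minimal sketch of this step with NumPy, still on placeholder numbers:

```python
import numpy as np

# Placeholder values, not the article's actual numbers
X = np.array([[4.0, 11.0, 5.0],
              [8.0, 4.0, 9.0],
              [13.0, 5.0, 14.0],
              [7.0, 14.0, 8.0]])
cov = np.cov(X, rowvar=False, bias=True)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Percentage of variance carried by each component
explained = eigenvalues / eigenvalues.sum()

# Sort the Eigenvalues in descending order and keep the top 2 Eigenvectors as
# columns; this 3 x 2 matrix is the feature vector (projection matrix)
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order[:2]]
print(explained[order])      # largest share of variance first
print(feature_vector.shape)  # (3, 2)
```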

Step 5: Recast the data along the principal components axes

Now we will recast the original dataset (standardized, if Step 1 was applied) onto these vectors. This can be done simply by multiplying the two matrices.
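
A minimal sketch of the projection with NumPy, using the same placeholder numbers:

```python
import numpy as np

# Placeholder values, not the article's actual numbers
X = np.array([[4.0, 11.0, 5.0],
              [8.0, 4.0, 9.0],
              [13.0, 5.0, 14.0],
              [7.0, 14.0, 8.0]])
cov = np.cov(X, rowvar=False, bias=True)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order[:2]]

# Step 5: multiply the (4 x 3) data matrix by the (3 x 2) feature vector to get
# a (4 x 2) dataset; standardization was skipped, so we project X directly
X_pca = X @ feature_vector
print(X_pca)  # final dataset with 2 features
```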

So our final dataset after PCA will be -

PCA

Using PCA, we have created two new features from the existing three while preserving as much information as possible.

I hope this article provides you with a basic understanding of Principal Component Analysis.

If you have any questions, or if you find anything misrepresented, please let me know.

Thanks!
