Unveil the math behind PCA using a simple 3-dimensional data set.

Krishanthi Weragalaarachchi
Aug 26, 2022 · 5 min read


Large data sets are extremely common these days but hard to interpret due to their high dimensionality (simply, the number of variables). To make sense of such data, we need to reduce the dimensionality to an interpretable level while preserving most of the information. Principal Component Analysis (PCA) is one of the most widely used dimension-reduction techniques among researchers and analysts.

Let’s say we have a data set with 20 variables. To understand how these variables are interconnected, a graphical representation may not be very helpful, as there would be 190 pairwise correlations (20 choose 2) between variables to consider. So we use PCA to reduce the number of variables we need to interpret the data set.

PCA transforms a large set of interrelated variables into a new set of variables, known as principal components (PCs), while retaining as much of the variation present in the original data as possible. PCs are orthogonal (uncorrelated with each other) and ordered so that the variation they capture, and hence their importance, decreases from PC1 to the last component. In other words, the first principal component (PC1) captures the maximum variation present in the original data, PC2 the second most, and so on.

As the topic suggests, let’s dive into the math!

Here I have a very simple 3-dimensional data set: the annual per-capita consumption of potato, rice and meat, in kilograms, for 5 different countries. M is the matrix representation of the data.
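
The exact values live in the figure and the accompanying notebook; as a minimal sketch, such a matrix could be set up in R as below. The country labels and numbers are placeholders for illustration only, not the article’s data.

```r
# Hypothetical stand-in for the data matrix M: annual per-capita consumption (kg)
# of potato, rice and meat for 5 countries. These numbers are placeholders only --
# substitute the actual values from the figure / notebook.
M <- matrix(
  c(90,  12, 70,
    85,  10, 65,
    30, 120, 20,
    25, 110, 15,
    60,  50, 40),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("Country", 1:5), c("Potato", "Rice", "Meat"))
)
M
```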

We can categorize these five countries by their preferences for potato, rice and meat. If we plot the countries along these 3 variables in 3-dimensional space, we can see that they separate into 3 different clusters.

Let’s do PCA and see whether we can retain the same information we see in this 3D space.

The per-capita consumption of these 3 foods (variables) in 5 countries (observations) can be described by a mean vector and a covariance matrix. The mean vector is often referred to as the centroid, and the covariance matrix is the dispersion matrix, which describes how the variables vary around the mean and in relation to each other. The first step is to compute the mean vector, consisting of the mean of each annual food-consumption variable.
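
Continuing with the placeholder matrix M above, the centroid is simply the vector of column means in R:

```r
# Mean vector (centroid): the mean of each food-consumption variable
mu <- colMeans(M)
mu
```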

Then, we need to compute the covariance matrix of M. The covariance of two variables x and y is defined as

cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

where x̄ and ȳ are the means of x and y, and n is the number of observations.

Substituting the values from M gives the covariance matrix Σ of our data.

This is also called the variance-covariance matrix, since the diagonal elements contain the variance of each variable and the off-diagonal elements contain the covariance of each pair of variables.
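
In R this is a one-liner; note that cov() uses the unbiased (n − 1) denominator, matching the covariance formula above (continuing with the placeholder M):

```r
# Variance-covariance matrix of M:
# diagonal elements = variances, off-diagonal elements = pairwise covariances
Sigma <- cov(M)
Sigma
```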

The total variance, which is the sum of the diagonal elements of the covariance matrix, is also equal to the sum of the eigenvalues of the covariance matrix.

In our case, annual rice consumption has a larger variance (diagonal element: 3299.3) whereas potato consumption has a smaller variance (diagonal element: 1200.7). Also, as potato consumption increases, meat consumption increases (off-diagonal element: 280.3), and as potato or meat consumption increases, rice consumption decreases, and vice versa (off-diagonal elements: -974.4 and -1585.6 respectively).

The next step is to find the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors of the covariance matrix represent the directions of the variation in the data, and they can be used to reorient the data from the original dimensions to the new dimensions represented by the principal components.

The eigenvalues are simply the coefficients attached to the eigenvectors; they represent the amount of variation along each of those directions.

If λ is an eigenvalue and v is an eigenvector associated with λ of the covariance matrix Σ, then Σv = λv.

The roots of det(Σ − λI) = 0 give the eigenvalues of the covariance matrix Σ.

The eigenvalues, in descending order, are:

λ1 = 4506.3822, λ2 = 1021.1097, λ3 = 429.7081

By substituting each λ into Σv = λv, we can find the eigenvector (v) associated with each eigenvalue (λ).
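
A quick sketch of this step in R, continuing with the Sigma computed above: eigen() returns the eigenvalues in decreasing order together with unit-length eigenvectors, and we can also verify the total-variance property mentioned earlier.

```r
# Eigen-decomposition of the covariance matrix
e <- eigen(Sigma)
e$values    # eigenvalues, already sorted: lambda_1 >= lambda_2 >= lambda_3
e$vectors   # one column per eigenvector (principal component direction)

# Check: the trace of Sigma equals the sum of the eigenvalues (total variance)
sum(diag(Sigma))
sum(e$values)
```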

The eigenvectors are a representation of the principal components. We use the corresponding eigenvalues to find the proportion of the total variance explained by each component. So,

Variance explained by PC = corresponding λ / Total variance (or sum of all λ)

This means that the eigenvectors with the lowest eigenvalues carry the least information. Therefore, we order the eigenvalues so that the highest is λ1, the next highest is λ2, and so on.
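
Plugging in the eigenvalues from above, the proportions work out like this (a quick check in R):

```r
# Proportion of total variance explained by each principal component
lambda <- c(4506.3822, 1021.1097, 429.7081)
round(lambda / sum(lambda), 4)
# 0.7565 0.1714 0.0721
```

So PC1 explains about 75.6% of the total variance and PC2 about 17.1%; together they account for roughly 93%, which is why dropping PC3 loses very little information.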

Next, we can calculate the principal components (scores) as M × V, where V is the matrix whose columns are the eigenvectors. This gives each country’s score on each principal component.

As discussed above, the idea behind PCA is dimensionality reduction. We can exclude the third principal component, which has the lowest contribution, and plot the first principal component against the second to see whether we retain the same information as before.
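
A minimal sketch of these last two steps in R, continuing with M and the eigen-decomposition e from above (prcomp would additionally center the data before projecting, as noted at the end of the article):

```r
# Principal component scores: project the data onto the eigenvectors
scores <- M %*% e$vectors
colnames(scores) <- c("PC1", "PC2", "PC3")

# Keep only the first two components and plot PC1 against PC2
plot(scores[, "PC1"], scores[, "PC2"], xlab = "PC1", ylab = "PC2", pch = 19)
text(scores[, "PC1"], scores[, "PC2"], labels = rownames(M), pos = 3)
```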

Voila! We can retain the same 3 clusters!

Important: We usually need to scale the data to zero mean and unit variance before the analysis. If you do, you will be able to replicate the results you would get from prcomp in R. Please refer to my notebook for more details.
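
As a sketch of what that looks like, assuming the same placeholder M as above: scale() standardizes each column to zero mean and unit variance, and prcomp() with center = TRUE and scale. = TRUE does the equivalent standardization internally.

```r
# Standardize to zero mean and unit variance, then repeat the eigen approach
Z <- scale(M)                 # centers and scales each column
e_z <- eigen(cov(Z))          # cov(Z) is the correlation matrix of M
scores_z <- Z %*% e_z$vectors

# The same analysis via prcomp
pca <- prcomp(M, center = TRUE, scale. = TRUE)
pca$sdev^2      # eigenvalues (variances of the principal components)
pca$rotation    # eigenvectors (loadings)
pca$x           # principal component scores
```

Individual eigenvectors may come out with flipped signs between eigen() and prcomp(); that is expected, since an eigenvector is only defined up to its sign.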

Thanks!
