Principal Component Analysis (PCA)

Dhanoop Karunakaran
Intro to Artificial Intelligence
7 min read · Jul 10, 2023
The figure shows the reduction of features from 3 to 1. Source: [7]

Dimensionality reduction is the process of transforming data from a high-dimensional space to a low-dimensional space[5]. In other words, it is a technique to reduce the number of features in a dataset while preserving as much important information as possible[6]. The curse of dimensionality is a challenge in ML because model performance decreases once the number of features grows beyond an optimal point[6]. Processing such data also requires high computational resources[2]. The problem of high dimensionality is handled in two ways: feature selection and feature extraction[3]. Feature selection is the process of selecting the important features while filtering out irrelevant ones[3]. Feature extraction, on the other hand, is a technique to create new and more relevant features from the original features[3]. Today, we discuss principal component analysis (PCA), which is one of the unsupervised feature extraction techniques.

PCA transforms the correlated features in the dataset into a set of linearly independent (orthogonal) features that retain the important information while discarding the irrelevant features or dimensions[3]. In other words, PCA identifies the set of orthogonal axes (principal components) that capture the maximum variance in the data[2]. We can also say that PCA draws straight lines through the data, much like linear regression, and each straight line represents a principal component[1]. In high-dimensional data, many such straight lines are possible, and the role of PCA is to identify and prioritize them[1].

The figure shows the transformation of the data from a 3D feature space to a 2D feature space. In the transformation, two new features, called principal components, are created in place of the three original features. Source: [8]

The first principal component always captures the most variation in the data, and the second component captures the maximum remaining variance along a direction orthogonal to the first component.

Two or more lines or line segments which are perpendicular are said to be orthogonal.

Variance is a measurement of the spread between numbers in a data set. In particular, it measures the degree of dispersion of data around the sample’s mean.

My intuition is that each of the new principal components becomes a new feature that is independent of the others. Orthogonality indicates that these new features are independent of one another.

Pseudocode of PCA

  1. Standardization.
  2. Compute the covariance matrix.
  3. Compute eigenvectors and eigenvalues from the covariance matrix.
  4. Compute the feature vector and the principal components.
  5. Project the data onto the selected principal components for dimensionality reduction.

We are going to explain the PCA steps with an example taken from [4].
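
Since the original numbers from [4] are not reproduced here, the short NumPy snippets in the following sections use a small synthetic dataset as a stand-in for the apple example; the feature names and values are purely illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in for the apple-quality example in [4]: 10 warehouse
# batches described by 4 illustrative features (the values are made up).
rng = np.random.default_rng(0)
X = rng.integers(low=0, high=100, size=(10, 4)).astype(float)
feature_names = ["large", "rotten", "damaged", "small"]

print(X.shape)  # (10, 4): 10 samples, 4 features
```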

Standardization

Firstly, we need to standardize the data using the mean and standard deviation[2]. This is because PCA works under the assumption that the data is normally distributed, and it is very sensitive to the variance of the features or variables[9]. Without standardization, features with large ranges will dominate over those with small ranges[9].

Z-score computation formula, where x is a data sample, mu is the mean, and sigma is the standard deviation. Source: [9]

The most common way of standardizing the data is by computing a z-score with the above formula. After standardization, all the features will be on the same scale[9]. For instance, consider the example described in [4], where we need to find patterns of good-quality apples in a warehouse. Each sample has four features: large apples, rotten apples, damaged apples, and small apples. The range of values varies from feature to feature, so it is important to standardize the data to the same scale.
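
A minimal sketch of the standardization step, using the synthetic stand-in data introduced above (reproduced here so the snippet runs on its own):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(low=0, high=100, size=(10, 4)).astype(float)

# Z-score standardization: subtract each feature's mean and divide by its
# standard deviation so that all four features end up on the same scale.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0).round(6))  # approximately 0 for every feature
print(X_std.std(axis=0))            # 1 for every feature
```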

Covariance matrix

The covariance matrix is a square matrix that contains the variances of the features and the covariances between pairs of features[1]. It measures how each variable is associated with the others[3]. In other words, it provides an empirical description of the data[1], or shows how the features in the data are correlated.

An example of a covariance matrix from [4].

As you may have noticed, it is a collection of variances and covariances. Strictly speaking, it is a collection of covariances between pairs of features, and the covariance of a feature with itself is its variance. For instance, covar(f1, f1) = var(f1).

Calculating the covariance of features f1 and f2. Source: [2]

The value of the covariance can be positive, negative, or zero. A positive value indicates that as f1 increases, f2 increases. A negative value is a sign of the reverse relationship: as f1 increases, f2 decreases. A value of zero means that there is no direct (linear) relationship between the two features.

Another important point is that if the data has n features, the covariance matrix will be an n x n square matrix. In this example, we have 4 features, so it is a 4x4 matrix.
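
Continuing the sketch, the covariance matrix of the standardized data can be computed with np.cov (note that np.cov uses the sample estimate with an n - 1 denominator, so the diagonal values are close to, but not exactly, 1 here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(low=0, high=100, size=(10, 4)).astype(float)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized data: 4 features -> 4 x 4 matrix.
# rowvar=False tells NumPy that the columns, not the rows, are the variables.
cov = np.cov(X_std, rowvar=False)

print(cov.shape)     # (4, 4)
print(np.diag(cov))  # the diagonal holds the variances: covar(fi, fi) = var(fi)
```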

Eigenvector and eigenvalues

The next task is to find the principal components that capture the highest variance in the data. In order to do that, we need to compute the eigenvectors and eigenvalues of the covariance matrix. An eigenvector gives a direction of spread, or variance, in the data[3]. They are also called right eigenvectors, as they are column vectors[3]. The eigenvalues give the relative importance of these directions[3]. Finding the eigenvectors and eigenvalues of the covariance matrix is equivalent to fitting those straight, principal-component lines to the variance of the data[1].

Source: [4]

The eigenvectors, v, and eigenvalues, lambda, are defined by the above equation (A v = lambda v), where A can be any square matrix. In our case, A is the covariance matrix we have computed, which is a square matrix. Solving the equation for this example gives four eigenvalues corresponding to the 4 features. We use each eigenvalue (lambda) to find its eigenvector, so in total there will be 4 eigenvectors. Each eigenvector is a column vector with one column and the same number of rows as the covariance matrix; in this example, each eigenvector has 4 rows.

The 4 eigenvectors based on the 4 eigenvalues. Together, we call this the feature vector.
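
A sketch of the eigendecomposition step, using NumPy's eigh, which is designed for symmetric matrices such as a covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(low=0, high=100, size=(10, 4)).astype(float)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)

# Eigendecomposition of the covariance matrix. eigh returns the eigenvalues
# in ascending order; each column of `eigenvectors` is one eigenvector.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

print(eigenvalues)         # 4 eigenvalues, one per feature
print(eigenvectors.shape)  # (4, 4): each eigenvector has 4 rows
```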

Feature vector & principal components

Once we compute the eigenvectors based on the eigenvalues, we can form the feature vector as shown in the above table. The eigenvectors are our principal components, and the eigenvalues give the relative importance of those components. Each eigenvector is perpendicular (orthogonal) to the ones calculated before it[10]. That is why we can say that each of the principal components is uncorrelated with, or independent of, the others.

An example of the eigenvectors and eigenvalues of a dataset with two variables, A and B, taken from [10]. In our case, the variables are features, with A and B corresponding to f1 and f2.

The principal components (eigenvectors) are sorted by descending eigenvalue. The component with the highest eigenvalue becomes our first principal component, as it accounts for the highest variance or spread of the data[3,10].

The new feature vector, which contains the first two principal components (eigenvectors) with the highest eigenvalues.

In this example, we have 4 features, so there will be four principal components. After sorting, we select the 2 out of 4 with the highest eigenvalues, which account for most of the spread of the data, so the feature vector is reduced to two columns, as shown in the above table. It means that we have two principal components that can describe most of the data.
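
A sketch of sorting the components by eigenvalue and keeping the top two as the feature vector:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(low=0, high=100, size=(10, 4)).astype(float)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort the components by descending eigenvalue and keep the top k = 2.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
feature_vector = eigenvectors[:, :k]  # shape (4, 2): one column per selected component

# Fraction of the total variance explained by the two selected components.
print(eigenvalues[:k].sum() / eigenvalues.sum())
```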

Visual representation of principal components one and two fitted to the dataset, taken from [10]. In our example, the variables are features.

Project the data onto the selected principal components for dimensionality reduction

We have now selected 2 principal components out of 4. We can reduce the dimensionality of the data by applying the following formula[4]:

Final Data Set = Standardized Original Data Set * Feature Vector

Example of the projected sample after applying the equation above.

Using the formula, we can project the data onto the dimensions of the feature vector, which is 2 in this case. In other words, it allows us to reduce the dimensionality of the data to the number of principal components we have selected. In this example, we have selected 2 components as the feature vector, which means we have reduced the dimensionality of the dataset from 4 to 2.
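
A sketch of the final projection step, applying the formula above to the synthetic stand-in data; the result has one row per sample and one column per selected principal component.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(low=0, high=100, size=(10, 4)).astype(float)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order][:, :2]

# Final Data Set = Standardized Original Data Set * Feature Vector
X_reduced = X_std @ feature_vector

print(X_std.shape)      # (10, 4): original dimensionality
print(X_reduced.shape)  # (10, 2): reduced to 2 principal components
```

For comparison, sklearn.decomposition.PCA(n_components=2).fit_transform(X_std) should recover the same two-dimensional projection, up to the sign of each component.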

If you like my write-up, follow me on GitHub, LinkedIn, and/or Medium.

References

  1. http://wiki.pathmind.com/eigenvector
  2. https://www.geeksforgeeks.org/ml-principal-component-analysispca/
  3. https://blog.clairvoyantsoft.com/eigen-decomposition-and-pca-c50f4ca15501
  4. https://www.turing.com/kb/guide-to-principal-component-analysis
  5. https://en.wikipedia.org/wiki/Dimensionality_reduction
  6. https://www.geeksforgeeks.org/dimensionality-reduction/
  7. https://www.pinecone.io/learn/dimensionality-reduction/
  8. https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e
  9. https://erdogant.github.io/pca/pages/html/Algorithm.html
  10. https://community.alteryx.com/t5/Data-Science/Tidying-up-with-PCA-An-Introduction-to-Principal-Components/ba-p/382557
