Understanding Principal Component Analysis (PCA) — Simplified with Examples and Code

Biswajit Rajguru Mohapatra
3 min read · Mar 7, 2023


[Image: an illustration of high-dimensional data]

Introduction

Principal Component Analysis (PCA) is a powerful statistical tool used to reduce the dimensionality of large datasets while retaining as much information as possible. It is widely used in fields like finance, economics, engineering, neuroscience, and genetics. In this blog post, we will explore the concept of PCA: why we need it, where it is used, the main types of PCA, and a simple implementation.

Why do we need PCA?

In many cases, we deal with high-dimensional data, which makes it difficult to analyse, visualize, or store. PCA helps to reduce the number of dimensions while retaining the important information. For instance, let’s say we have a dataset with hundreds of features, and we want to predict the price of a house. We can use PCA to reduce the number of features while retaining the most important ones like the number of rooms, location, and square footage.

Where do we use PCA?

PCA is widely used in various fields like finance, economics, engineering, neuroscience, and genetics. In finance, PCA is used to analyse the stock market and build portfolios. In neuroscience, it is used to analyse brain activity, and in genetics, it is used to analyse gene expression data.

Types of PCA

There are two types of PCA: standard PCA and incremental PCA.

Standard PCA:

Standard PCA is the most common type of PCA and works well with datasets that fit into memory. It calculates the eigenvectors and eigenvalues of the data's covariance matrix and sorts them in descending order of eigenvalue. The eigenvectors with the largest eigenvalues are the directions along which the data varies the most, and projecting the data onto them preserves the most information.
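The steps above can be sketched directly with NumPy. The data values below are made up purely for illustration; the point is the centre → covariance → eigen-decomposition → project pipeline:

```python
import numpy as np

# Toy data: 6 samples, 3 features (illustrative values only)
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4],
              [2.3, 2.7, 1.0]])

# 1. Centre the data so every feature has zero mean
X_centred = X - X.mean(axis=0)

# 2. Covariance matrix of the features
cov = np.cov(X_centred, rowvar=False)

# 3. Eigen-decomposition (eigh, since the covariance matrix is symmetric)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort components in descending order of eigenvalue
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 5. Project onto the top 2 principal components
X_reduced = X_centred @ eigenvectors[:, :2]
print(X_reduced.shape)  # (6, 2)
```

This is exactly what scikit-learn's `PCA` class does for you (with some extra numerical care, such as using an SVD instead of forming the covariance matrix explicitly).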

Incremental PCA:

Incremental PCA is used for large datasets that cannot fit into memory. Instead of computing the decomposition on the full dataset at once, it processes the data in mini-batches and incrementally updates its estimate of the principal components, so only one batch needs to be in memory at a time. After all batches have been seen, the result approximates what standard PCA would produce on the full dataset.
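Here is a small sketch using scikit-learn's `IncrementalPCA`. The random matrix stands in for a dataset too large to hold in memory, and the batch size of 200 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # stand-in for a dataset too big for memory

ipca = IncrementalPCA(n_components=5, batch_size=200)

# partial_fit lets us feed the data one batch at a time,
# updating the component estimates as we go
for start in range(0, len(X), 200):
    ipca.partial_fit(X[start:start + 200])

X_reduced = ipca.transform(X)
print(X_reduced.shape)  # (1000, 5)
```

In practice you would read each batch from disk (or a generator) instead of slicing an in-memory array, which is the whole point of the incremental approach.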

Implementation:

Let’s take a simple example of PCA implementation using the scikit-learn library in Python. We will use the famous Iris dataset, which consists of 150 samples with four features — sepal length, sepal width, petal length, and petal width.

First, we will import the necessary libraries and load the dataset.
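A minimal version of that setup looks like this (the original post's code block did not survive; this is a straightforward reconstruction using scikit-learn's built-in loader):

```python
from sklearn.datasets import load_iris

# Load the Iris dataset: 150 samples, 4 features
iris = load_iris()
X = iris.data    # sepal length, sepal width, petal length, petal width
y = iris.target  # species labels: 0, 1, 2

print(X.shape)  # (150, 4)
```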

Next, we will apply PCA to the dataset and reduce it to two dimensions using the PCA class from the scikit-learn library.
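This step can be written in a few lines. Note that the first two components capture the vast majority of the variance in the (unscaled) Iris data, which is why a 2D projection of it still separates the species well:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Reduce the four features down to two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # first component carries most of the variance
```

You could follow this with a scatter plot of `X_2d` coloured by species to visualise the separation, which is the usual payoff of reducing to two dimensions.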

Conclusion:

PCA is a powerful statistical tool that reduces the dimensionality of large datasets while retaining as much information as possible, and it is widely used in fields like finance, economics, engineering, neuroscience, and genetics. In this blog post, we discussed the concept of PCA, why we need it, where it is used, its two main types, and a simple implementation. PCA is a valuable tool for making sense of high-dimensional data and can improve the accuracy of machine learning models.

If you like the post, please consider following my blog account. Happy learning!



Biswajit Rajguru Mohapatra

A passionate data scientist marking his journey towards building an AI product.