Principal Component Analysis for Dimensionality Reduction

Nitin Chauhan
Published in Analytics Vidhya · 6 min read · Apr 28, 2020

Machine Learning is a field where DATA is considered a boon for the industry. But in Machine Learning, having too much data can sometimes also lead to bad results: beyond a point, having more features (dimensions) in your data can decrease the quality of your model. This is known as the curse of dimensionality in Data Science.

What is Dimensionality Reduction?

With the growth of the data industry and the number of internet users, a tremendous amount of data is generated every day. Multinational companies are interested in collecting more and more user data to improve the user experience and understand their users better. We can say that most companies are becoming data-driven in order to be the best in their industry, and for this they keep collecting data, which grows larger every day.

It is generally understood that not all of this data will be useful for training AI models. Many unimportant dimensions, which have no significant effect on the results, are eliminated in the early stages through visualization.

The dimensionality is then reduced further using dimensionality reduction techniques.

Why?

  1. Fewer dimensions lead to less training time.
  2. It takes care of the multicollinearity factor and removes redundant features.
  3. Lower-dimensional data is easier to visualize.

Two important and state-of-the-art dimensionality reduction techniques are PCA (Principal Component Analysis) and t-SNE (t-Distributed Stochastic Neighbor Embedding). In this blog, I will cover the theoretical concepts of PCA.

It’s easy. So hold your breath and let’s start 📖

Principal Component Analysis

PCA is an unsupervised linear transformation technique mainly used for feature extraction and dimensionality reduction. It is unsupervised because it finds patterns and regularities without any supervision, i.e. by itself.

Principal Component Analysis is built on standard statistical operations: mainly the mean, covariance, eigenvectors, and eigenvalues.

So before getting into the understanding and the calculations, let’s break down what we are going to do:

  1. Standardize the d-dimensional dataset and obtain the mean of every dimension.
  2. Calculate the covariance matrix.
  3. Evaluate the eigenvalues of the covariance matrix.
  4. Calculate the eigenvectors from the eigenvalues.
  5. Select the k eigenvectors that correspond to the k largest eigenvalues, where k is the dimension to which you want to reduce your dataset (k ≤ d).
  6. Build the principal component, or projection matrix, from the selected eigenvectors.
  7. Transform the d-dimensional input dataset using the projection matrix to obtain the new k-dimensional dataset.

That’s it, only this much 😎

I will explain all of this with a simple two-dimensional example, and you can then apply the same steps to any multidimensional dataset to reduce its dimensions.

So Let’s Begin Guys….📑

So, for example, let’s take this simple two-dimensional dataset with two columns, X and Y.

Simple Dataset with 2 dimensions.

So our first and most basic step is to calculate the mean of every column, as shown in the image above: mean_X = 1.81 and mean_Y = 1.91.

There are two options: either standardize the data first, or just start with the means, since the mean subtraction happens anyway inside the covariance matrix calculation.
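If you want to follow along in code, here is a minimal NumPy sketch of this first step. Note that the column values below are my own stand-in, chosen only so that the column means come out to the 1.81 and 1.91 quoted above; the original table is only shown as an image.

```python
import numpy as np

# Hypothetical stand-in for the two columns X and Y from the image above.
# These values are an assumption, picked so the means match the text.
X = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
Y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Step 1: the mean of every column.
mean_X, mean_Y = X.mean(), Y.mean()
print(mean_X, mean_Y)        # 1.81 and 1.91 (up to floating-point noise)

# Centre each column by subtracting its mean. (Full standardization would
# also divide each column by its standard deviation.)
X_centered = X - mean_X
Y_centered = Y - mean_Y
```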

Covariance Matrix formula

C is the covariance matrix that we need to calculate. Note: since we have two columns, its dimensions are 2 × 2. If we had more columns, it would grow accordingly.

For the complete covariance matrix, we need to calculate the covariance of every combination of columns. The formula for that, Cov, is shown above.

Example →

A small example of how we can calculate the covariance of a particular combination.

Like this, we calculate the complete covariance matrix, and the final matrix will look like this →

Final Covariance matrix
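As a sketch of this step in code, we can compute the same 2 × 2 covariance matrix either by applying the covariance formula directly or by calling NumPy’s np.cov; both should agree. The column values are the same hypothetical stand-ins used earlier.

```python
import numpy as np

# Same hypothetical stand-in columns as before.
X = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
Y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

def cov(a, b):
    # Sample covariance: sum((a_i - mean_a) * (b_i - mean_b)) / (n - 1)
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

# The 2 x 2 covariance matrix C = [[cov(X, X), cov(X, Y)],
#                                  [cov(Y, X), cov(Y, Y)]]
C = np.array([[cov(X, X), cov(X, Y)],
              [cov(Y, X), cov(Y, Y)]])
print(C)

# NumPy's built-in covariance function gives the same matrix.
print(np.cov(X, Y))
```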

We have now done half of our task of calculating the principal components. Next we need to calculate the eigenvectors, and for that we use the formula:

det(C − λI) = 0

where C is the covariance matrix that we calculated, I is the identity matrix, and det is the determinant. Solving this equation for λ gives us our eigenvalues.

Two eigenvalues obtained using the equation

Using the given formula, a matrix C − λI is formed, and calculating its determinant gives a quadratic equation in λ. Solving that equation gives us the two eigenvalues.
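For a 2 × 2 covariance matrix, the determinant expands to the quadratic λ² − trace(C)·λ + det(C) = 0, so one way to sketch this step in code is to hand those coefficients to NumPy’s polynomial root finder. The matrix entries below are rounded values produced by the hypothetical data above, not numbers taken from the original images.

```python
import numpy as np

# Rounded covariance matrix from the hypothetical data (an assumption).
C = np.array([[0.6166, 0.6154],
              [0.6154, 0.7166]])

# For a 2 x 2 matrix, det(C - lambda*I) = 0 expands to the quadratic
#   lambda**2 - trace(C) * lambda + det(C) = 0
coefficients = [1.0, -np.trace(C), np.linalg.det(C)]
eigenvalues = np.roots(coefficients)
print(eigenvalues)   # two eigenvalues, one much larger than the other
```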

After this comes the important step of calculating the eigenvectors from these eigenvalues.

Now we substitute each eigenvalue back into the equation (C − λI)v = 0 to calculate the corresponding eigenvector. Like this:

Calculating Eigenvectors

For each eigenvalue we obtain two equations like the ones above, from which we can calculate that particular eigenvector. In this way, we calculate two eigenvectors for the two eigenvalues in this example. The same eigenvectors are shown below in matrix form.

Final eigenvectors obtained
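In practice you rarely solve these equations by hand: NumPy’s np.linalg.eig returns both the eigenvalues and the eigenvectors in one call. A minimal sketch, again using the rounded covariance matrix from the hypothetical data:

```python
import numpy as np

# The same rounded covariance matrix from the hypothetical data.
C = np.array([[0.6166, 0.6154],
              [0.6154, 0.7166]])

# np.linalg.eig solves C v = lambda v directly: it returns the eigenvalues
# and a matrix whose columns are the corresponding unit-length eigenvectors.
eigenvalues, eigenvectors = np.linalg.eig(C)
print(eigenvalues)
print(eigenvectors)   # column i is the eigenvector for eigenvalues[i]
```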

Have we reduced our dimensions yet? 🤔

Obviously no, not yet!!

So far we have only calculated the eigenvectors. We still need to calculate our principal components to reduce the dimensions.

So how do we calculate the principal components?

For that, we first need to focus on the eigenvalues. The largest eigenvalue carries the most importance, so the eigenvector corresponding to that eigenvalue will be more important than the other one.

In this example we have only taken two dimensions, but in general, to reduce the dimensionality to k, we take only the top k most important eigenvectors. These form our principal components, and we leave out the rest.
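A short sketch of this selection step: sort the eigenvalues from largest to smallest, reorder the eigenvector columns to match, and keep only the top k columns as the projection matrix. (The covariance matrix entries are again the rounded stand-ins from earlier.)

```python
import numpy as np

# Eigen-decomposition of the rounded covariance matrix from earlier.
C = np.array([[0.6166, 0.6154],
              [0.6154, 0.7166]])
eigenvalues, eigenvectors = np.linalg.eig(C)

# Sort eigenvalues in descending order, reorder eigenvector columns to
# match, and keep the top k columns as the projection matrix.
k = 1                                    # reduce from d = 2 dimensions to k = 1
order = np.argsort(eigenvalues)[::-1]    # eigenvalue indices, descending
W = eigenvectors[:, order[:k]]           # d x k projection matrix
print(W)
```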

What do we do after obtaining the principal components?

With the k eigenvectors, we have obtained our principal components, also called the projection matrix. Now we just transform the d-dimensional input dataset X using this projection matrix to obtain the new k-dimensional feature subspace.
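Putting the whole projection together, a minimal end-to-end sketch with the hypothetical dataset looks like this: centre the data, decompose the covariance matrix, build the projection matrix, and multiply.

```python
import numpy as np

# Hypothetical dataset again, one row per sample and one column per feature.
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Centre the data, decompose the covariance matrix, and build the d x k
# projection matrix from the top k eigenvectors (here k = 1).
centered = data - data.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eig(np.cov(centered, rowvar=False))
W = eigenvectors[:, np.argsort(eigenvalues)[::-1][:1]]

# Project the centred d-dimensional data onto the k-dimensional subspace.
reduced = centered @ W
print(reduced.shape)   # (10, 1) instead of (10, 2)
```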

That’s all; we have completed our dimensionality reduction using PCA.

Where now?

Now that you have learned dimensionality reduction using PCA, I would advise you to get your hands a bit dirty and code it yourself. I will discuss the code in my next blog post.
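Until then, if you want a quick way to sanity-check a hand-rolled implementation, scikit-learn’s PCA class wraps the same centre, decompose, and project pipeline. (Using scikit-learn here is my own suggestion, not something covered in this post.)

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset, one row per sample.
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

pca = PCA(n_components=1)               # keep k = 1 principal component
reduced = pca.fit_transform(data)       # centres and projects in one call
print(pca.explained_variance_ratio_)    # share of variance the component keeps
```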

You should also know about the t-SNE algorithm for dimensionality reduction, because PCA is considered a somewhat older algorithm, while t-SNE is newer and also very widely used. I will cover t-SNE concepts in my future blogs.

Till then Happy Learning and Stay Safe !!

Please Clap if you like it 👏👏.

Also please have a look at my previous blog posts.

Yolo Object Detection Made Easy

P Value, T test, Chi Square test, ANOVA, When to use Which Strategy?

Understanding Hypothesis Testing for Data Science
