The mystery of n_components in PCA

Bhavesh Bhatt
Aug 25, 2017 · 3 min read
  • We often stop valuing things when they are given to us readily.
  • Something similar happens while solving a Machine Learning problem: scikit-learn pampers you so much that you stop thinking about how an algorithm really works.
  • One fine day, someone asked me: if you have a feature matrix consisting of 20 features and you plan to reduce its dimension, what n_components will you choose? In other words, how many components do you need to represent the 20-dimensional data without significant loss of information?

The best you can do when you are clueless is smile and try to escape the situation, which is exactly what I did :)

So the question remains: how do you find the right number of components to represent a data set consisting of 20 features?

So, I decided to get my hands dirty!

I thought of starting off with the famous Iris dataset!

Thank god those flowers exist and someone took so much effort to record the lengths and widths of their petals and sepals. Without the Iris dataset, machine learning would have been so much more difficult.

  • It has 4 features and 150 entries.

PCA is as simple as the number of letters in PCA

  1. Compute the covariance matrix of the (mean-centred) data.
  2. Compute its eigenvalues and eigenvectors.
  3. Project the data onto the top eigenvectors, as chosen by n_components.
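The three steps above can be sketched from scratch with NumPy on the Iris data (a minimal sketch; the variable names are my own, not from the linked notebook):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data  # 150 samples, 4 features

# 1. Mean-centre the data and compute the covariance matrix
X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)  # 4 x 4

# 2. Compute eigenvalues and eigenvectors (eigh, since cov is symmetric)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# 3. Project onto the top n_components eigenvectors (here, 2)
n_components = 2
order = np.argsort(eig_vals)[::-1]          # indices, largest eigenvalue first
W = eig_vecs[:, order[:n_components]]       # 4 x 2 projection matrix
X_reduced = X_centred @ W                   # 150 x 2
print(X_reduced.shape)  # (150, 2)
```

`np.linalg.eigh` returns eigenvalues in ascending order, hence the explicit descending sort before picking the top columns.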

That's it, the job is done! But wait,

How do we arrive at n_components?

So let me solve the mystery

  • Compute the eigenvalues and eigenvectors.
  • Arrange the eigenvalues in descending order.
  • Each eigenvalue captures the variance of the data along the direction of its eigenvector, so look at the percentage of the total variance contributed by each component; that is what drives the predictive power.
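The recipe above boils down to a few lines of NumPy (a sketch of my own; I standardise the features first, which is common practice before PCA and makes the percentages below line up with the ~96% figure for Iris):

```python
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data

# Standardise each feature, then compute the covariance matrix
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(X_std, rowvar=False)

# Eigenvalues in descending order
eig_vals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# Fraction of the total variance captured by each component
explained = eig_vals / eig_vals.sum()
cumulative = np.cumsum(explained)
for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {e:.2%} of variance (cumulative {c:.2%})")
```

The cumulative column is the one to watch: stop adding components once it crosses your chosen threshold.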

Clearly, when the eigenvalues are arranged in descending order, the first 2 principal components capture almost 96% of the total variance. Thus we can drop the remaining 2 components, as they contribute very little to the overall variance.

Thus, the mystery is solved.

n_components should equal the number of components that together capture most of the overall variance! How much counts as "most" depends on the business logic.

For the complete code, please visit:

https://github.com/bhattbhavesh91/pca_from_scratch_iris_dataset/blob/master/Principal_Component_Analysis_Iris_DataSet.ipynb


Written by Bhavesh Bhatt

Google Developer Expert for Machine Learning https://www.youtube.com/BhaveshBhatt8791/
