TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


How to Select the Best Number of Principal Components for the Dataset

8 min read · Apr 24, 2022


Photo by Randy Fath on Unsplash

Selecting the best number of principal components is a major challenge when applying Principal Component Analysis (PCA) to a dataset.

In technical terms, selecting the best number of principal components is a form of hyperparameter tuning: we choose the optimal value for the hyperparameter n_components in the Scikit-learn PCA() class.

from sklearn.decomposition import PCA
k = 2  # placeholder only: the right value of k is what we need to choose
pca = PCA(n_components=k)

In other words, when we apply PCA to an original dataset with p variables to get a transformed dataset with k variables (principal components), n_components is equal to k, where k is typically much smaller than p.
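To make the p-to-k reduction concrete, here is a minimal sketch. The data is synthetic, and the values p = 10 and k = 3 are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 100 rows, p = 10 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

k = 3  # arbitrary illustrative value for n_components
pca = PCA(n_components=k)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (100, 10): p = 10 original variables
print(X_reduced.shape)  # (100, 3): k = 3 principal components
```

The number of rows stays the same; only the number of columns drops from p to k.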

Since n_components is a hyperparameter, it is not learned from the data. We have to specify its value ourselves (tune the hyperparameter) before we fit the PCA() model.
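The difference between a hyperparameter and a learned quantity can be seen in a small sketch. The data below is synthetic and n_components=2 is an arbitrary value chosen for illustration: fitting leaves n_components exactly as we set it, while the component directions (components_) only exist after fitting.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration): 50 rows, 6 variables
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 6))

pca = PCA(n_components=2)  # we set the hyperparameter ourselves
before = pca.n_components

pca.fit(X)                 # fitting learns the component directions
after = pca.n_components

print(before, after)          # 2 2: fitting does not change the hyperparameter
print(pca.components_.shape)  # (2, 6): learned from the data
```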

There is no magic rule for selecting the optimal value of n_components. It depends on what we really want from PCA. Visual inspection and domain knowledge can also help in choosing the right value for n_components.
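As one example of the kind of inspection that can guide the choice, the sketch below fits PCA with all components and looks at the cumulative explained variance. This is a common heuristic rather than a definitive rule, and several things here are assumptions made for illustration: the breast cancer dataset bundled with Scikit-learn, standardizing the variables first, and the 95% threshold.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Example dataset bundled with Scikit-learn (an assumption for illustration)
X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)  # put variables on a common scale

# Leaving n_components unset keeps all components so we can inspect them
pca = PCA()
pca.fit(X_scaled)

# Share of total variance captured by the first 1, 2, 3, ... components
cumulative = np.cumsum(pca.explained_variance_ratio_)

threshold = 0.95  # assumed target; adjust to what we want from PCA
# Smallest k whose cumulative explained variance reaches the threshold
k = int(np.argmax(cumulative >= threshold)) + 1

print(f"Components needed for {threshold:.0%} of the variance: {k}")
print(np.round(cumulative[:k], 3))
```

A different threshold, or a different goal such as 2D visualization, would lead to a different k, which is why the choice depends on what we want from PCA.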


Published in TDS Archive


Written by Rukshan Pramoditha

3,000,000+ Views | BSc in Stats (University of Colombo, Sri Lanka) | Top 50 Data Science, AI/ML Technical Writer on Medium
