How to Select the Best Number of Principal Components for the Dataset
Six methods you should follow
Selecting the best number of principal components is the major challenge when applying Principal Component Analysis (PCA) to the dataset.
In technical terms, selecting the best number of principal components is a hyperparameter-tuning problem: we choose the optimal value for the n_components hyperparameter of the Scikit-learn PCA() class.
from sklearn.decomposition import PCA
pca = PCA(n_components=?)
In other words, when we apply PCA to an original dataset with p variables to obtain a transformed dataset with k variables (principal components), n_components equals k, where k is much smaller than p.
Since n_components is a hyperparameter, its value is not learned from the data. We have to specify it manually (tune the hyperparameter) before fitting the PCA() model.
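As a minimal sketch of this workflow, the snippet below applies PCA with a manually chosen k = 2 to a dataset with p = 4 variables. The Iris dataset is used purely for illustration; it is not part of the original article.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Iris has p = 4 original variables (illustrative choice)
X = load_iris().data              # shape: (150, 4)

# k = 2 is set manually before fitting: n_components is a hyperparameter
pca = PCA(n_components=2)
X_transformed = pca.fit_transform(X)

print(X.shape)                    # (150, 4) -> p = 4
print(X_transformed.shape)        # (150, 2) -> k = 2
```

Note that nothing in the data itself tells PCA() to pick k = 2; that decision is ours, which is exactly why it needs tuning.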
There is no magic rule for selecting the optimal value of n_components. It depends on what we really want from PCA. Some visual inspection and domain knowledge may also help us choose the right value.
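One common form of the visual inspection mentioned above is checking the cumulative explained variance ratio of the fitted components. The sketch below (again using Iris as a stand-in dataset, an assumption not taken from the article) fits PCA with all components kept and prints how much variance each value of k would retain:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Keep all components so we can inspect the full variance profile
pca = PCA()
pca.fit(X)

# Cumulative fraction of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, cv in enumerate(cumulative, start=1):
    print(f"{k} component(s): {cv:.1%} of variance explained")
```

A typical rule of thumb is to pick the smallest k past which the cumulative curve flattens, but as noted above, the right choice ultimately depends on the goal of the analysis.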