How to not be dumb at applying Principal Component Analysis (PCA)?

Laurae: This post is an answer about how to use PCA properly. The initial post can be found at Kaggle. It answers three critical questions: how much information you allow yourself to lose, why you would truncate the PCA, and what should be fed into your machine learning algorithm if you intend to understand what you are working with.

DoctorH wrote:
Let’s say I have a classification problem. First, I will do some feature engineering, possibly using one hot encoding. This may mean that I end up with, say, 500 features. Presumably, the correct thing to do at this point is a PCA. But how?
Okay, here are some explicit questions:
Should the PCA be used merely for feature selection? In other words, should I look at the Pearson correlation of the features with the first few PCA vectors, and let that guide which features to choose? Or perhaps it is better to forget the old features altogether, and train my algorithm on the PCA vectors?
When applying the algorithm that finds the PCA vectors, should I feed into it only the 500 features, or is it better to also feed into it the category column (one hot encoded) as well? Obviously the test data doesn’t have a category column, but one can do the following: use the PCA vectors trained on the 500 features + the category column (one hot encoded), and then project the test data to the linear subspace spanned by the projection of those vectors to the first 500 coordinates. Presumably that might be better, because then those vectors might detect patterns regarding what correlates with various categories, no? Do people do that sort of thing? Why is it a bad idea, if they don’t?

Answering your questions point by point.

DoctorH wrote:
1. Let’s say I have a classification problem. First, I will do some feature engineering, possibly using one hot encoding. This may mean that I end up with, say, 500 features. Presumably, the correct thing to do at this point is a PCA. But how?

Not always. It depends on the objective you have. If:

  • You are looking for maximum performance: you take all principal components together with the initial features and feed them through an L1-regularized model to do “fast” feature selection, or you use any other feature selection method you like. You can also keep only the first principal components (for example, enough of them to explain 95% of the variance).
  • You are looking for maximum interpretability: do not use PCA unless your data is in good shape afterwards. See picture below.
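The two "maximum performance" options above can be sketched as follows. This is only an illustration under assumptions that are not in the original post: the data is synthetic, scikit-learn is used, `LinearSVC` with an L1 penalty stands in for "an L1 regularization" (the post does not name a model), and `C=0.05` is an arbitrary regularization strength.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the engineered feature matrix (assumption).
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
X_std = StandardScaler().fit_transform(X)

# Option A: keep the first components explaining at least 95% of the variance.
pca95 = PCA(n_components=0.95, svd_solver="full").fit(X_std)
X_pca95 = pca95.transform(X_std)

# Option B: stack the initial features with all principal components,
# then let an L1-penalized linear model zero out the columns it does not need.
X_all_pcs = PCA().fit_transform(X_std)
X_combined = np.hstack([X_std, X_all_pcs])
l1_model = LinearSVC(penalty="l1", dual=False, C=0.05, max_iter=5000)
selector = SelectFromModel(l1_model).fit(X_combined, y)
X_selected = selector.transform(X_combined)

print("components kept for 95% variance:", X_pca95.shape[1])
print("columns kept by L1 selection:", X_selected.shape[1],
      "out of", X_combined.shape[1])
```

In a real project the scaler, the PCA and the selector would be fitted on the training data only and then applied to the test data.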
DoctorH wrote:
2. Should the PCA be used merely for feature selection? In other words, should I look at the Pearson correlation of the features with the first few PCA vectors, and let that guide which features to choose? Or perhaps it is better to forget the old features altogether, and train my algorithm on the PCA vectors?

Yes and no. Principal components are all uncorrelated with each other (correlation = 0). However, concentrating a higher share of the variance into a smaller number of variables does not mean the result is better. See the picture below.
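To make the "correlation = 0" statement concrete, here is a small numerical check. It is a sketch on synthetic data (an assumption, not the data from the original thread), with PCA computed directly from an SVD in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three strongly correlated features plus two independent ones (synthetic).
base = rng.normal(size=(500, 1))
correlated = [base + 0.1 * rng.normal(size=(500, 1)) for _ in range(3)]
independent = [rng.normal(size=(500, 2))]
X = np.hstack(correlated + independent)

# PCA via SVD of the centered data: rows of Vt are the principal axes.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T  # coordinates of each sample on the principal components

corr_features = np.corrcoef(X, rowvar=False)
corr_scores = np.corrcoef(scores, rowvar=False)
off_diag = corr_scores - np.diag(np.diag(corr_scores))

print("correlation between features 0 and 1:", round(corr_features[0, 1], 3))
print("largest |correlation| between components:", np.abs(off_diag).max())
```

The original features are highly correlated, while the component scores are uncorrelated up to floating-point error. This says nothing about whether the components are more useful for the task.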

In any case, it depends on the machine learning algorithm you are going to apply. For instance:

  • If you are going to apply an algorithm that is not robust to correlation (e.g. LDA, Linear Regression…): you must remove all high correlations, which might hurt the performance of the algorithm, and also break all the correlation chains (i.e. break each one-hot encoded feature once: remove one column from the final encoding). Or you just use all PCA vectors.
  • If you are going to apply a correlation-robust algorithm (e.g. Random Forests, xgboost…): you do not need to care about correlation.
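The "remove one column from the final encoding" advice can be illustrated with a tiny design matrix. This is a minimal sketch with a made-up three-level category (an assumption): the full one-hot columns always sum to 1, so together with an intercept they are perfectly collinear, and dropping one column removes that dependency.

```python
import numpy as np

# Hypothetical categorical column with three levels.
categories = np.array(["a", "b", "c", "a", "b", "c", "a", "c"])
levels = np.unique(categories)

# Full one-hot encoding: one column per level.
one_hot = (categories[:, None] == levels[None, :]).astype(float)
intercept = np.ones((len(categories), 1))

# With an intercept, the full encoding is rank deficient
# (the one-hot columns sum to the intercept column).
full = np.hstack([intercept, one_hot])
# Dropping the first level breaks the chain.
reduced = np.hstack([intercept, one_hot[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print("full encoding:   ", full.shape[1], "columns, rank", rank_full)
print("reduced encoding:", reduced.shape[1], "columns, rank", rank_reduced)
```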
DoctorH wrote:
3. When applying the algorithm that finds the PCA vectors, should I feed into it only the 500 features, or is it better to also feed into it the category column (one hot encoded) as well? Obviously the test data doesn’t have a category column, but one can do the following: use the PCA vectors trained on the 500 features + the category column (one hot encoded), and then project the test data to the linear subspace spanned by the projection of those vectors to the first 500 coordinates. Presumably that might be better, because then those vectors might detect patterns regarding what correlates with various categories, no? Do people do that sort of thing? Why is it a bad idea, if they don’t?
When applying the algorithm that finds the PCA vectors, should I feed into it only the 500 features, or is it better to also feed into the category column (one hot encoded) as well?

PCA looks at variance. If you do not standardize your features, they will have different weights in the PCA: features with a larger variance dominate the first components. As a good starting point, it is common to standardize to {mean, variance} = {0, 1}, and thus {mean, std} = {0, 1}.
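The effect of standardization on the weights can be seen on two independent synthetic features that differ only in scale. The data and the helper `first_pc` are illustrative assumptions, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent features: one on a unit scale, one on a scale of 100.
small = rng.normal(scale=1.0, size=1000)
large = rng.normal(scale=100.0, size=1000)
X = np.column_stack([small, large])


def first_pc(data):
    """Return the loadings of the first principal component (via SVD)."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


# Without standardization the large-scale feature takes almost all the weight.
pc_raw = first_pc(X)

# Standardize to {mean, std} = {0, 1}, then redo the PCA.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc_std = first_pc(X_std)

print("first PC loadings, raw data:         ", np.abs(pc_raw).round(3))
print("first PC loadings, standardized data:", np.abs(pc_std).round(3))
```

On the raw data the first component is essentially the large-scale feature alone. After standardization both features carry the same weight in it.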

If we assume your category column becomes 200 columns once one-hot encoded, these 200 columns together must have the same weight in the PCA as one other column. Therefore, you would standardize each of these 200 columns to {mean, variance} = {0, 1/200} = {0, 0.005}, i.e. {mean, std} = {0, ~0.0707}. This way, the 200 columns derived from 1 column have the same total weight as any other single column.
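The arithmetic above can be checked numerically. This is a sketch on a randomly generated 200-level category (an assumption): each one-hot column is standardized and then rescaled to std = sqrt(1/200), so the variances of the 200 columns add up to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

n_levels = 200
# Hypothetical category column with 200 levels, one-hot encoded.
labels = rng.integers(0, n_levels, size=5000)
one_hot = np.eye(n_levels)[labels]

# Target per-column std so that the block's total variance is 1.
target_std = np.sqrt(1.0 / n_levels)  # ~0.0707

# Standardize each column to {mean, std} = {0, 1}, then shrink it.
centered = one_hot - one_hot.mean(axis=0)
scaled = centered / one_hot.std(axis=0) * target_std

total_variance = scaled.var(axis=0).sum()

print("per-column std:", round(target_std, 4))
print("total variance of the 200 columns:", round(total_variance, 6))
```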

Presumably that might be better, because then those vectors might detect patterns regarding what correlates with various categories, no?

Yes and no. Check picture below.

Do people do that sort of thing? Why is it a bad idea, if they don’t?

Yes and no. It all started in the literature, and some researchers warned that it is not always a good idea and that you must check what you have after applying the transformation. This also applies to similar methods, such as Independent Component Analysis, MCA, FA… The best picture to understand why is below (for the fourth time :p ).