Natural Order vs Predictive Power
The natural order of transformed features doesn’t necessarily map to their predictive power.
A typical process for training a supervised learning model looks like this:

original features → model training → trained model
Oftentimes the original features are not the optimal representation of the underlying phenomena, so people may apply a feature transformation to the original feature space as a pre-processing step before model training. The supervised learning pipeline then looks like this:

original features → feature transformation → transformed features → model training → trained model
Transformations of the feature space typically come with a natural ordering of the transformed features. For instance, the commonly used Principal Component Analysis (PCA) produces transformed features (or principal components) that are sorted by how much of the original dataset’s variance each component captures. Accordingly, the first principal component captures more of the original dataset’s variance than the second component, and so on.
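This ordering is easy to see in code. The sketch below (using scikit-learn on a synthetic dataset, purely for illustration) shows that PCA returns its components already sorted by explained variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: five features with deliberately different scales,
# so the principal components have clearly different variances.
X = rng.normal(size=(500, 5)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5])

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_

# The ratios come back in descending order: PC1 explains the most variance.
assert all(ratios[i] >= ratios[i + 1] for i in range(len(ratios) - 1))
print(ratios)
```

`explained_variance_ratio_` sums to 1 across all components, which is what makes the cumulative-variance heuristic in the next paragraph so convenient.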
Feature selection is a staple of the machine learning process: it aims to increase model performance and reduce model complexity by eliminating features that do not provide much value to the model. Intuitively, feature selection removes features that are either irrelevant to the predictive task at hand or redundant. The natural ordering of transformed features serves as a convenient heuristic for this elimination. For instance, a common practice is to keep the principal components that explain the vast majority of the variance in the original dataset (say 98%) and discard the rest. Nice and simple, right? Or is it?
As it turns out, there is no guarantee that the natural ordering of principal components (for instance, by the amount of variance they explain) will match their ordering by predictive power (how much they improve the model’s predictions of the target variable). It could very well be the case that a later principal component PCj has more predictive power than an earlier principal component PCi (where i < j), even though PCi explains more of the data’s variance than PCj.
The key intuition behind this observation is that PCA and other statistical feature transformation techniques are unsupervised learning procedures: no particular target feature is being optimized for; instead, the procedure optimizes for a better representation of the underlying statistical structure of the data (according to some metric). In short, the target variable of the overarching prediction process is irrelevant to the feature transformation procedure.
In their brilliant book “Applied Predictive Modeling”, Max Kuhn and Kjell Johnson give a simple example of a binary classification task where, on a particular (real) dataset, the second principal component PC2 (accounting for only about 8% of the variance in the data) separates the two outcome classes far better than the first principal component PC1 (which accounts for 92% of the variance). The difference in predictive power is clear from the box plots below: PC2 exhibits much less overlap between the two classes than PC1, and consequently PC2 has more predictive power than PC1. Find the book here and read more about this particular case on the authors’ blog here.
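The same effect is easy to reproduce on synthetic data (this is not the book’s dataset, just a toy construction of my own): give the data one high-variance direction that carries no class signal and one low-variance direction that carries all of it, and PC2 ends up far more discriminative than PC1.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
n = 500
y = np.repeat([0, 1], n)

# High-variance direction: identical for both classes (pure noise).
noise_axis = rng.normal(0.0, 10.0, size=2 * n)
# Low-variance direction: small spread, but the class means differ.
signal_axis = rng.normal(0.0, 0.5, size=2 * n) + np.where(y == 0, -1.5, 1.5)

X = np.column_stack([noise_axis, signal_axis])
Z = PCA(n_components=2).fit_transform(X)

def separation(component):
    # Standardized distance between the two class means along one component.
    a, b = component[y == 0], component[y == 1]
    return abs(a.mean() - b.mean()) / component.std()

# PC1 picks up the high-variance noise; PC2 carries the class signal.
print(separation(Z[:, 0]), separation(Z[:, 1]))
```

Here PC1 explains the overwhelming majority of the variance yet is nearly useless for classification, while PC2 cleanly separates the classes, mirroring the Kuhn and Johnson example.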
The take-home message here is: a transformed feature’s predictive power doesn’t necessarily map to its natural importance in the transformed feature space. The way to go is to compute some metric of feature importance for each of the transformed features and then use that as an indicator for feature selection. Of course, on planet earth people don’t often analyze individual features in isolation; all sorts of feature interactions come into play, especially with linear methods like PCA. Still, the message is a good one to keep in mind.
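One possible importance metric (among many) is the estimated mutual information between each transformed feature and the target. The sketch below scores PCA components with scikit-learn’s `mutual_info_classif` on the same kind of synthetic data as above; the choice of metric is an assumption for illustration, not a recommendation from the post:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(7)
n = 400
y = np.repeat([0, 1], n)
X = np.column_stack([
    rng.normal(0.0, 10.0, size=2 * n),                               # high variance, no signal
    rng.normal(0.0, 0.5, size=2 * n) + np.where(y == 0, -1.5, 1.5),  # low variance, all the signal
    rng.normal(0.0, 1.0, size=2 * n),                                # pure noise
])

Z = PCA().fit_transform(X)

# Score each component by its estimated relevance to the target,
# instead of by the variance it explains.
scores = mutual_info_classif(Z, y, random_state=0)
ranking = np.argsort(scores)[::-1]
print(ranking, scores.round(3))
```

Ranking by a supervised score like this, rather than by explained variance, is exactly the “indicator for feature selection” the paragraph above calls for.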
How did this post come to be? Every week at Optima, everyone on the team gets five minutes or so to share a “nugget” of data science, algorithms, or related knowledge. The only rule is that it can be explained and grasped in 5 to 10 minutes. Lately we decided to share these nuggets with the world. So here we are.