Feature Engineering: Common Methods to Select Features with Pros and Cons
In the first stage, it helps to start with data visualization and a data summary. We can check the data distributions and get a rough idea of which features might be completely useless. This stage is essentially a filtering process that removes obviously useless or garbage features; for features we are agnostic about, we should keep them and move to the second stage.
In the second stage of feature selection, we can start testing models. We should always start with simple models, such as linear regression or logistic regression. We need to make sure that the model is somewhat interpretable, or at least more interpretable than the model we will actually be using: e.g. an XGBoost model is not very interpretable, but compared with a neural network, it is more interpretable.
If we implement a linear model, then the coefficients measure the "usefulness" of the features. Because of the strong intrinsic interpretability of linear models, one can tell a story easily with this approach. A confidence interval test can also be applied to support feature selection. The pros are thus that a linear model is easy to explain and has well-understood statistical properties.
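As a minimal sketch of this idea, the snippet below fits an ordinary least squares model on synthetic data (the data and feature names are invented for illustration) and computes approximate 95% confidence intervals for each coefficient. A coefficient whose interval covers zero is a candidate for removal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Synthetic data: y depends strongly on x0, weakly on x1, and not at all on x2
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Add an intercept column and fit OLS by least squares
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)

# Standard errors: sqrt of the diagonal of sigma^2 * (X'X)^-1
resid = y - Xd @ beta
sigma2 = resid @ resid / (n - Xd.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))

# Approximate 95% confidence intervals for each coefficient
for name, b, s in zip(["intercept", "x0", "x1", "x2"], beta, se):
    lo, hi = b - 1.96 * s, b + 1.96 * s
    print(f"{name}: {b:+.2f}  95% CI [{lo:+.2f}, {hi:+.2f}]")
```

In practice a library such as statsmodels reports these intervals directly; the point is only that linear models come with this statistical machinery for free.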
The con of using linear models is that they ignore interactions between features. That is, what if two variables do not contribute individually to the predictability of the model, but in combination they are a significant predictor?
For example, in the well-known UC Berkeley admissions data, if we look at gender alone, the overall admission rate suggests that males are much more likely to be admitted. But if we look at pairs of major and gender, we see that within most majors females are actually more likely to be admitted. This is an example of Simpson's paradox. To address it, one can introduce interaction terms, which are products of pairs of features.
However, this method raises another concern. If we have N features, then a linear model with pairwise interactions will have O(N²) terms. If we have 1 million features, imagine how many terms we would need to compute by including interactions.
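The quadratic blow-up is easy to make concrete: with N features there are C(N, 2) = N(N−1)/2 pairwise interaction terms.

```python
from math import comb

# Pairwise interaction terms grow quadratically: C(N, 2) = N * (N - 1) / 2
for n_features in [10, 1_000, 1_000_000]:
    n_interactions = comb(n_features, 2)
    print(f"{n_features:>9} features -> {n_interactions:,} pairwise interactions")
```

At 1 million features, that is roughly 500 billion interaction terms, which is why adding all pairwise products is infeasible at that scale.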
If we are also testing ensemble tree-based models, the model itself will not explicitly give a feature importance. Instead, it outputs the average decrease of the loss function when a feature is included. Alternatively, one can permute a feature, compare model performance before and after the permutation, and use the difference as the permutation feature importance. SHAP values are also very powerful for explaining tree-based models.
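Permutation importance works for any fitted model, not just trees. The sketch below uses a simple least-squares fit as a stand-in for a trained model (the data is synthetic, invented for illustration): shuffling a useful feature noticeably worsens the error, while shuffling a useless one barely moves it.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 3))
# y depends only on the first two features (toy setup)
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# "Model": an ordinary least-squares fit, standing in for any fitted model
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def mse(X_, y_):
    return np.mean((X_ @ beta - y_) ** 2)

baseline = mse(X, y)

# Permutation importance: shuffle one feature at a time and record
# how much the error increases
importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    importances.append(mse(Xp, y) - baseline)

print([round(i, 3) for i in importances])
```

The irrelevant third feature receives an importance near zero, so it is a natural candidate to drop.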
So, with ensemble models, we can indirectly measure the usefulness of features and select accordingly. For example, we can keep the top 10% of features ranked by feature importance.
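Given an importance score per feature (here random numbers standing in for, say, a tree ensemble's importances), the top-10% selection is a one-liner with argsort:

```python
import numpy as np

# Hypothetical importance scores for 20 features (random stand-ins)
rng = np.random.default_rng(7)
importances = rng.random(20)

# Keep the top 10% of features by importance
k = max(1, int(0.10 * len(importances)))
top_idx = np.argsort(importances)[::-1][:k]
print(sorted(top_idx.tolist()))
```

The indices in `top_idx` are the columns to keep from the original feature matrix.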
The pro of tree-based models is that they consider feature interactions while building the model. For instance, people from major A and gender B might end up in the same leaf, in contrast to the linear regression model, where each feature is treated separately. So we do not need to manually add interaction features. However, ensemble methods are not as interpretable as linear models. Also, they may take a long time to run when the number of features or observations is large.
Dimension Reduction 1: PCA
If some subset of features is sparse, i.e. does not contain much information, then we can reduce the dimensionality of those features so that the information density is increased. PCA is one of the most common methods for dimensionality reduction.
The PCA algorithm applies a linear transformation to project the features onto a lower-dimensional space. The transformation is derived from the covariance matrix of the features in the dataset: PCA determines new orthogonal basis vectors that correspond to the directions of highest variance in the original vector space. Note that high variance means more information.
By applying PCA, 1,000 features can be compressed into, say, 10 features. However, the meaning of those features is lost. For example, before the transformation the features are age, gender, salary, and so on, but after the transformation we do not really know what the new components represent.
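The covariance-matrix recipe described above can be sketched in a few lines of numpy. The toy data below (invented for illustration) has 5 observed features that are noisy mixtures of 2 latent factors, so almost all of the variance survives the projection down to 2 components:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Toy data: 5 observed features built from 2 latent factors plus small noise
latent = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(n, 5))

# PCA via the covariance matrix, as described above
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)              # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # largest variance first

# Project onto the top 2 orthogonal directions
X_reduced = Xc @ eigvecs[:, :2]

explained = eigvals[:2].sum() / eigvals.sum()
print(f"reduced shape: {X_reduced.shape}, variance kept: {explained:.1%}")
```

In practice one would use a library implementation such as sklearn's PCA, but the mechanics are exactly this eigendecomposition.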
Dimension Reduction 2: Embedding
Word2Vec is a well-known embedding model that projects words into 300-dimensional vectors. One might think this increases the dimension; however, consider how we can represent a word mathematically in a model. One-hot encoding is an intuitive choice. This means a word is actually a vector of length ~10,000 if we consider the whole dictionary as the set of possible word inputs. With an embedding, the dimension is reduced to 300.
How does an embedding work? At a high level, we need input vectors and some labels for those vectors. The input vectors are the vectors to be embedded. We build a deep learning model to fit the vectors to the labels, and then use the hidden layer before the output layer as our embedding. It is that easy! Of course, there are many more technical details than what is mentioned here.
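The key mechanical fact is that feeding a one-hot vector through a dense layer just selects one row of the weight matrix; after training, those rows are the embeddings. The sketch below uses a random weight matrix standing in for trained weights:

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, embed_dim = 10_000, 300

# First weight matrix of the network; after training, its rows would be
# the word embeddings (random here, standing in for trained weights)
W = rng.normal(size=(vocab_size, embed_dim))

# A word as a one-hot vector over the whole vocabulary
word_id = 42
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# The one-hot input times the weight matrix selects a single row of W:
# a 10,000-dimensional input becomes a 300-dimensional hidden vector
hidden = one_hot @ W
assert np.allclose(hidden, W[word_id])
print(hidden.shape)  # (300,)
```

This is why embedding layers are implemented as table lookups rather than actual matrix multiplications.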
One might ask: is embedding a feature selection method? Well, that is debatable. We can see it as one because it reduces dimensions, and feature selection is essentially a problem of reducing dimensionality.