How to break the “Curse of Dimensionality”?
Less is more | Feature selection in ML
More the data, better the analysis. Or, is that a myth? Time and again, we keep hearing the term — ‘curse of dimensionality’. Why is having more dimensions (more features in a dataset) a curse? Let’s take a deep dive!
Let’s say, we have 25 observations and only one feature, that is, there is only one attribute that define these observations. We plot these 25 observations as points in a plane. Since there is only one feature, there will be only one dimension. From the image below, we can see that the points in the 1D space are tightly packed.
Keeping the number of observations constant, if we increase the number of features to two, the points become sparse in the 2D space. The sparsity increases with the addition of the third feature. You can imagine what might happen when there are 100 or more dimensions.
How does the exponential increase in space become the curse?
The data is usually split into train and test data. Model is built on the train data and evaluated on the test data, that is, the model’s performance on unseen data is measured.
Assume we are building a model for employee attrition and the only feature is gender. We will need at-least two samples in the train data (one for each category — female and male), so that the model can accurately predict when the same category is encountered in the test data. Now, we add another feature, age group (with two categories, 25–40 years, 40–60 years). If we need a minimum of one sample for each combination (male & 25–40 years, male & 40–60 years, female & 25–40 years, female & 40–60 years), we will need at-least four samples in the train data. With a third feature, say, job level with three categories, we will require 12 (2*2*3) samples in the train data.
Thus, the sample size has to increase with increasing number of features, so that the model is reasonably accurate. If there are 100+ features, there should be enough samples to explain each feature. In the sense, if the model is trained only on select combination of features, it is possible that the model may not predict the outcome accurately, when a new/less frequent feature combination(s) is/are fed to the model. This is a typical scenario of overfitting.
The distance based algorithms like kNN, Clustering, etc. would consider each record as equidistant or each sample will be a unique combination of features (due to the under-representation).
The more features we have, the more number of samples we will need, to have all combinations of feature values well represented in our sample.
The scenario of having more features than the number of samples (assuming each sample explains a feature) is called the Curse of Dimensionality.
What’s the cure?
Well, the obvious and the straight forward answer to this, is to increase the number of samples. What if the dimensions are so huge, that adding more samples is going to make the model complex? The next best option is to reduce dimensions.
While a large dimension poses a threat to model performance (overfitting), a very low dimension doesn’t help the model to learn the underlying pattern in the data (underfitting). Know more about over-fitting & under-fitting (bias variance trade off) from here.
What’s the trade-off between Bias and Variance?
“It is better to be approximately right rather than precisely wrong.”- Warren Buffet
The model performance increases with an increase in the number of features. It reaches a peak accuracy with the optimal number of features. The performance drops thereafter. This is called Hughes Phenomenon.
The following are the two ways to reduce dimensions.
#1 — Feature Extraction :- This is a technique for reducing the dimensions, in which new set of features are created as a combination of the original features, such that, number of new features < number of original features. Principal Component Analysis (PCA) & Factor Analysis are examples of feature extraction. Check out my blog on a detailed explanation of PCA.
Multi-Dimensional Data— boon or bane?
The ‘why, what & how’ of PCA (Principal Component Analysis)
#2 — Feature Selection :- This technique involves selecting a subset of relevant/useful features from the original feature space. There are several ways by which the features can be selected. This blog talks about some of the common methods in feature selection.
Feature Selection methods
#1 — Filter methods :-
Filter methods are part of the preprocessing steps and are independent of the ML algorithm used. These methods can be further classified as unsupervised and supervised, depending on whether the target variable is considered for the feature selection.
→ Unsupervised Filter methods :-
The target variable is not considered for these methods. These methods are useful in removing the redundant/useless variables.
- Missing values ratio :- The features that have several missing values, that is, if the ratio of missing values is more than a threshold, then such features do not convey any information and hence, may be dropped before model building. The threshold may be chosen depending on the data and the domain.
- Low Variance :- The features that have a very small or negligible variance can be considered as constants. Without much variation, these features carry only little or no information. An appropriate threshold may be chosen, basis which, the low variance features can be dropped. Normalizing the values may be ideal, since the variance is dependent on the range of the features and is expected to be different for each feature.
- High Correlation :- The features that have a high correlation (linear relationship) with each other typically convey similar information. Such features become redundant. For example, as age increases, job level increases, so does the salary. One or more of these features may be dropped.
→ Supervised Filter methods :-
The target variable is considered for these methods. These methods are useful in removing the irrelevant variables, by measuring the relationship between the independent features and the target variable.
Mutual Information is the amount of information obtained about the target variable, through the independent feature. The features with maximum mutual information are selected as important/relevant features.
- Functions for feature selection using mutual information — mutual_info_classif, mutual_info_regression
The p-values could also be calculated for the independent features to determine their significance. Higher the p-value, lower is the significance. The least significant feature(s) may be dropped.
- Techniques for feature selection using p-values — ANOVA (Analysis Of Variance), Chi-square, Pearson’s Coefficient
#2 — Wrapper methods :-
A subset of features are used to train a model. Basis the model performance on out-of-sample data, features are added or eliminated and subsequent models are built.
The following are some of the wrapper methods that are popularly used.
- Forward Selection :- An iterative process of model building which starts with one feature in the initial model. An additional feature that best improves the model is added before building the model again. This process continues until the addition of a new feature does not improve the model performance.
- Backward Selection :- This is similar to the forward selection, except the initial model is built with all the features. The least significant feature is removed at every iterative step, until the model performance does not improve.
- Recursive Feature Elimination :- This uses an external estimator, such as Linear Regression or Random Forest, depending on the type of supervised learning. Initial model is built with all the features. Basis the values of the coefficients or the feature importance, the least significant feature(s) are determined. The next model is built after removing the least significant feature(s). Only the best model from these is chosen. This best model is used for comparison with the model built in the consecutive step. Since, it looks at the best model, only for the given step, this algorithm is also called as a greedy optimization algorithm. The size of the feature subspace (the optimal number of features to be selected) is a hyperparameter that can be tuned to improve model performance. The number of features to be removed at each step is also a hyperparameter. By default, only the least significant feature is removed. This iterative process continues until the optimal number of features (hyperparameter) is reached.
#3 — Embedded methods :-
These methods combine the qualities of filter and wrapper methods. When an insignificant feature is removed (in the wrapper methods), it primarily means that the co-efficient of that feature becomes zero. Instead of removing the features, the embedded methods keep the insignificant variables, however, adds a penalty to their coefficients, thereby reducing its importance.
Ridge Regression (penalizes the sum of squared coefficients) and Lasso Regression (penalizes the sum of absolute value of coefficients) are popular examples of embedded methods.
Why do we need to reduce dimension (in a high dimensional dataset)?
Lesser (optimal) number of features will …
- Have lesser training time.
- Reduce model complexity and hence, more interpretability.
- Reduce the possibility of overfitting.
- Improve accuracy.