Dimensionality Reduction: How to deal with the features in your dataset (Part 1)
Real-world data is messy and often contains unwanted or redundant features. These redundant features can make it very difficult for our predictive models to work as expected. So dimensionality reduction, the process of eliminating, including and transforming existing features, becomes a crucial step in data preprocessing.
Once we have concise and relevant data, it helps us with:
- Better visualization and exploration of the data set
- Occupying less space in memory
- Reducing the complexity of the predictive model, making it much easier to interpret
- Reducing Overfitting
- Improving the performance of the model by choosing the right features
The process can be broadly classified into 2 approaches:
- Feature Selection - the process of excluding or including attributes present in the data without changing them.
- Feature Extraction - the process of creating new combinations of attributes by applying transformations to the existing attributes.
In this story we will look at the Feature Selection Part. Feature Extraction will be discussed in Part 2 of this story.
Feature Selection is divided into 3 main categories.
- Filter Methods
- Wrapper Methods
- Embedded Methods
We will discuss each in detail:
1. Filter Methods
Here the most relevant features are selected based on their correlation with the target, their uniqueness, and their statistical significance. The ML algorithm used to train on the dataset is not involved in selecting the features.
Some of the filter methods used frequently:
a. Missing Value Ratio: Often we come across data where certain columns have many missing values. If a column is mostly empty, it does not provide enough relevant information. So we can find the missing value ratio of each column and remove the columns whose ratio exceeds an agreed threshold.
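As a rough sketch of this idea with pandas (the columns, values and the 40% threshold below are purely illustrative):

```python
import numpy as np
import pandas as pd

# Toy data: the "age" column is mostly missing (illustrative values)
df = pd.DataFrame({
    "age":    [25, np.nan, np.nan, np.nan, 40],
    "salary": [50000, 60000, 55000, 65000, 70000],
    "city":   ["A", "B", None, "A", "B"],
})

threshold = 0.4                      # assumed threshold: drop columns > 40% missing
missing_ratio = df.isnull().mean()   # fraction of missing values per column
keep = missing_ratio[missing_ratio <= threshold].index
df_reduced = df[keep]

print(missing_ratio)
print(df_reduced.columns.tolist())   # 'age' is dropped
```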
b. Low Variance Filter: If the data in a column is mostly homogeneous, its variance tends to zero. Such columns do not contribute much to predicting the target variable. So, after deciding on a threshold variance value, we can eliminate columns whose variance falls below it. But variance depends on the spread/range of the data, so it is important to normalize the data before applying this method.
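One way to sketch this with scikit-learn's VarianceThreshold (the data and the 0.01 cut-off are assumptions for illustration; the data is min-max scaled first so the variances are comparable):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the first column is constant, so its variance is zero
X = np.array([
    [1.0, 200.0, 3.0],
    [1.0, 220.0, 3.1],
    [1.0, 240.0, 2.9],
    [1.0, 210.0, 3.0],
])

X_scaled = MinMaxScaler().fit_transform(X)       # normalize so ranges are comparable

selector = VarianceThreshold(threshold=0.01)     # assumed variance threshold
X_reduced = selector.fit_transform(X_scaled)

print(selector.variances_)     # per-column variance after scaling
print(selector.get_support())  # mask of the columns that are kept
```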
c. Information Gain: This helps in measuring the dependency between 2 variables. It is mainly used in classification problems. IG looks at each feature and measures how important that feature is in classifying the target variable. It involves the measure of entropy: Information Gain = Entropy(parent) - weighted average Entropy(children after the split). The higher the information gain, the better the classification.
Let us consider a dataset with 2 features x1 and x2 and a target variable y with values 0s and 1s. Let us classify y twice: once on the basis of x1 and once on the basis of x2.
Let the proportions of 1s and 0s for a particular split be p1 and p2; the entropy for that split is then Entropy = -p1·log2(p1) - p2·log2(p2).
Let's find the entropy and information gain for each split of y.
Here we see that the IG is greater for the split on x2 and the classification is also better. Thus x2 is a better feature and should be selected.
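A hedged sketch of the same idea with scikit-learn, whose mutual_info_classif estimates the mutual information (the quantity behind information gain) between each feature and the target; the synthetic data below is only for illustration:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)        # binary target
x1 = rng.randn(200)                    # pure noise, unrelated to y
x2 = y + 0.1 * rng.randn(200)          # strongly related to y
X = np.column_stack([x1, x2])

scores = mutual_info_classif(X, y, random_state=0)
print(scores)   # the score for x2 should be clearly higher than for x1
```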
d. Pearson's Correlation: This measures the dependency of a continuous target column on another column that also contains continuous values, i.e. the linear association between 2 variables.
If the value is near ±1, it is said to be a perfect correlation: as one variable increases, the other variable tends to also increase (if positive) or decrease (if negative).
When the value is near zero the variables are said to have no correlation. More on this here.
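For instance, a minimal sketch using pandas' Pearson correlation (the synthetic data and the 0.5 cut-off are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({"x1": rng.randn(100), "x2": rng.randn(100)})
df["y"] = 3 * df["x1"] + 0.5 * rng.randn(100)   # y depends mostly on x1

# Absolute Pearson correlation of each feature with the target
corr_with_target = df.corr()["y"].drop("y").abs()
selected = corr_with_target[corr_with_target > 0.5].index.tolist()  # assumed cut-off

print(corr_with_target)
print(selected)   # only 'x1' should survive
```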
e. ANOVA: This measures the dependency of a target column with continuous values on another column containing categorical values. Before we discuss it, I urge readers to understand the concept of ANOVA from here.
Let us consider a categorical feature x with possible values A, B and C, and a target column y with various continuous values. Now we will group these continuous target values y by the categories of the feature x.
After grouping the values of y by the categories of x, we get a table like this.
Now ANOVA will determine whether the means of each of these groups (A, B, C) of y values are basically equal (Null Hypothesis) or whether there is a significant difference between them (Alternative Hypothesis).
If our Null Hypothesis is true then we will conclude that categorical feature X does not have any influence on Y.
Otherwise, if the Null Hypothesis is rejected, we will conclude that the different categories of feature X do influence Y, and hence X should be selected by our feature selection technique.
- If Statistic < Critical Value: not significant result, do not reject null hypothesis (Ho), independent.
- If Statistic >= Critical Value: significant result, reject null hypothesis (Ho), dependent.
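A small sketch of this test with SciPy's one-way ANOVA; the group means are made up, and the decision is phrased with a p-value and a 0.05 significance level, which is equivalent to comparing the statistic against a critical value:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
# Continuous target y grouped by the categories A, B, C of feature x (illustrative)
group_A = rng.normal(loc=10, scale=2, size=30)
group_B = rng.normal(loc=14, scale=2, size=30)
group_C = rng.normal(loc=18, scale=2, size=30)

f_stat, p_value = stats.f_oneway(group_A, group_B, group_C)
print(f_stat, p_value)

alpha = 0.05   # assumed significance level
if p_value < alpha:
    print("Reject H0: group means differ, so x influences y -> keep the feature")
else:
    print("Fail to reject H0: x shows no influence on y -> drop the feature")
```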
f. Chi Square: The Chi square test determines if there is a significant relationship between two categorical variables. You can go through these to understand the Chi square distribution and how to calculate a Chi square statistic. Now, how can we use this for feature selection in machine learning?
It basically determines whether the frequency distribution of the different groups (Male, Female) of a particular categorical variable X (sex) across the different categories (Science, Art, Maths) of another variable Y (interest) is the same or not.
Here we have 2 frequency distributions (male and female) grouped by Science, Art and Maths. So we will determine whether these 2 sets of frequencies are equal (Null Hypothesis) or there is a significant difference between them (Alternative Hypothesis).
- If Statistic < Critical Value: not significant result, do not reject null hypothesis (Ho), independent.
- If Statistic >= Critical Value: significant result, reject null hypothesis (Ho), dependent.
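Continuing the sex/interest example, a sketch with SciPy's chi2_contingency (the counts in the table are invented for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table of sex vs interest (illustrative counts)
#             Science  Art  Maths
table = np.array([
    [30, 10, 20],    # Male
    [25, 30, 15],    # Female
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value, dof)

alpha = 0.05   # assumed significance level
if p_value < alpha:
    print("Reject H0: sex and interest are dependent -> the feature carries signal")
else:
    print("Fail to reject H0: sex and interest look independent")
```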
2. Wrapper Methods
Wrapper methods use a search strategy (often a greedy one) to evaluate different combinations of features and select the combination that produces the best result for a specific machine learning algorithm. They iteratively add or discard features from a subset based on the performance of that algorithm. Because a model must be trained and evaluated for every candidate subset, wrapper methods are computationally expensive.
Some of the most common Wrapper methods:
a. Forward Feature Selection
Here, features are selected one by one, starting with a single feature.
In the first phase the algorithm is trained with each feature individually, and the best-performing feature is selected.
In the second phase that feature is combined with each of the other features, and the best combination of 2 is selected.
This continues until the best combination with the required number of features is selected.
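As a sketch, scikit-learn (version 0.24 or later) provides SequentialFeatureSelector, which implements this greedy forward search; the estimator, the Iris data and the target of 2 features are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Greedily add one feature at a time (evaluated with 5-fold CV)
# until 2 features have been selected
estimator = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)

print(sfs.get_support())   # boolean mask of the selected features
```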
b. Recursive Feature Elimination
The recursive feature elimination process starts with all the features in the dataset. It removes each feature once, in a round-robin way, evaluates the performance on the remaining subset, and keeps the best-performing subset.
With this selected subset of (# of features - 1), each of the remaining features is again removed once and the performance is evaluated; the best-performing subset of (# of features - 2) is kept.
This process continues until we get the best-performing feature subset that meets the required criteria.
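scikit-learn's RFE gives a similar backward procedure as a sketch; note that instead of re-scoring every leave-one-out subset it drops the weakest feature by coefficient/importance at each round, and the choice of estimator and of 2 surviving features below is only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Repeatedly fit the model and remove the weakest feature until 2 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)    # which features survived
print(rfe.ranking_)    # 1 = selected; larger numbers were eliminated earlier
```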
3. Embedded Methods
Having more features can sometimes increase the noise. The model might end up memorizing the noise instead of learning the trend of the data, and these inaccuracies can lead to a low-quality model if it is not trained carefully. This is called overfitting.
The main idea behind avoiding overfitting is to keep the model as simple as possible; simple models do not (usually) overfit. On the other hand, we need to pay attention to the gentle trade-off between overfitting and underfitting a model. This is achieved through Regularization.
The basic idea behind regularization is to penalize the loss function for large values of the learned weights (w). This prevents some of the weights from growing too large and causing overfitting.
Let's understand this in detail.
Let us consider a record in the training space having:
- n features represented by x[0], x[1], x[2] … x[n]
- learned parameters or weights w[0], w[1], w[2] … w[n]
- a target value y
Let the predicted value be ŷ = w[0]·x[0] + w[1]·x[1] + … + w[n]·x[n].
Now the loss function can be defined as the sum of squared errors over the training records: Loss = Σ (y - ŷ)² … (1.2)
Our whole motive is to minimize the loss function defined in 1.2.
So if, due to some feature x[j], the corresponding weight w[j] explodes, this can lead to overfitting. To avoid this we penalize the loss function for those exploding weights, giving a regularized loss of the form: Loss = Σ (y - ŷ)² + λ·(penalty on the weights w[j]).
Here we add the regularization parameter λ along with the weights, so that while minimizing the cost function some of the weights are shrunk, making the model less complex.
a. Ridge Regression
In ridge regression the penalty is λ times the sum of the squares of the weights w[j].
Because of this, by increasing λ the weights are regularized and shrunk, but they never reach exactly 0.
More on Ridge Regression here
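A quick sketch with scikit-learn's Ridge on synthetic data (the true weights and alpha values are made up) shows the weights shrinking as λ (alpha) grows, but never hitting exactly zero:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
true_w = np.array([3.0, 0.0, 1.5, 0.0, 2.0])
y = X @ true_w + 0.1 * rng.randn(100)

# Larger alpha (the lambda above) shrinks the weights more, but not to exactly 0
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))
```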
b. Lasso Regression
In Lasso the penalty is λ times the sum of the absolute values of the weights w[j].
This can lead to weights that are exactly zero, i.e. some of the features are completely ignored when evaluating the output, thus eliminating those features entirely.
More on Lasso Regression here.
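The same sketch with Lasso (again on made-up data) shows some weights being driven to exactly zero as the penalty grows, which is what removes those features:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
true_w = np.array([3.0, 0.0, 1.5, 0.0, 2.0])
y = X @ true_w + 0.1 * rng.randn(100)

# With a strong enough L1 penalty, some coefficients become exactly 0,
# effectively dropping those features from the model
for alpha in [0.01, 0.5, 2.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))
```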
So we have covered the major dimensionality reduction techniques that remove or shrink less important features to create a simpler model. The extraction of new features will be taken up in Part 2.