Feature Selection & Feature Engineering

Utkarsh Kulshrestha · Published in Analytics Vidhya · May 15, 2020

From a data science perspective, features are also known as dimensions, independent variables, or columns. Selecting and processing these features is one of the foremost parts of building any machine learning model. Various methodologies are available for working with features, and they directly affect a model's Accuracy, Recall, Precision, and other metrics. The idea here is to understand these techniques in detail in order to build a better model.

Feature Selection

Feature Selection refers to the methods applied to a dataset to identify the potential or important features, i.e., the ones that have a high impact on the dependent variable. Its benefits include an optimized dataset, lower memory consumption, and improved model performance.

  • Domain Expertise: This is a manual method of identifying the important variables in the data with the help of a domain or business expert. A domain expert can give the analyst a glimpse of the business so that the analyst can understand and identify the important features; a fortunate analyst might even get the exact features needed for model building directly from the expert.
  • Missing Values: The curious case of missing values. Any feature column that contains more than 40%–50% missing values can be dropped directly, because imputing with a measure of central tendency will not help at that point; the remaining features are kept for modeling (a minimal sketch follows).
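A minimal pandas sketch of this rule; the data and the 40% threshold are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up dataset: column "b" is mostly missing
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [np.nan, np.nan, np.nan, 4, np.nan],  # 80% missing
    "c": [10, np.nan, 30, 40, 50],             # 20% missing
})

# Keep only columns whose fraction of missing values is at most 40%
threshold = 0.40
df = df.loc[:, df.isnull().mean() <= threshold]
print(df.columns.tolist())  # ['a', 'c'] -- column 'b' is dropped
```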
  • Correlation: When the independent features are correlated with each other, a problem of collinearity arises. In simple terms, correlation disturbs the balance of the weights a model learns during training. The correlation coefficient is calculated with Pearson's r formula:
Pearson’s r coefficient for correlation: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²)

The value of r lies between −1 and +1, where −1 indicates a perfectly negative correlation, 0 indicates no correlation, and +1 indicates a perfectly positive correlation.

Correlation Coefficient Interpretation
Correlation Matrix Heatmap
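Pairwise Pearson correlations and the heatmap can be produced with a short pandas/seaborn sketch; the dataset below is made up for illustration:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up numeric features; "sqft" and "rooms" move together
df = pd.DataFrame({
    "sqft":  [850, 900, 1200, 1500, 1700],
    "rooms": [2, 2, 3, 4, 4],
    "age":   [30, 25, 10, 5, 3],
})

corr = df.corr(method="pearson")  # Pearson's r for every pair of features

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix Heatmap")
plt.show()
```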
  • Multicollinearity: Multicollinearity is another form of correlation, in which an independent feature can be linearly predicted from a combination of the others; it can heavily affect the coefficient calculations of a regression model. It can be checked with the VIF (Variance Inflation Factor), which ranges from 1 upward and can be interpreted as:

VIF = 1: not correlated; 1 < VIF ≤ 5: moderately correlated; VIF > 5: highly correlated
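VIF can be computed with statsmodels; here is a minimal sketch on made-up data:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Made-up features; "sqft" and "rooms" are strongly related
X = pd.DataFrame({
    "sqft":  [850, 900, 1200, 1500, 1700],
    "rooms": [2, 2, 3, 4, 4],
    "age":   [30, 25, 10, 5, 3],
})

# VIF assumes a regression with an intercept, hence the constant column
X_const = add_constant(X)

vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif.drop("const"))  # values above 5 flag high multicollinearity
```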

  • Zero-Variance Check: If any feature has zero variance, i.e., holds a single constant value, it should be dropped, because such a feature does not help the model generalize over the data and provides no relationship between the independent and dependent variables (see the sketch after the next item).
  • Feature Derivation: Features can also be derived from existing ones, a practice sometimes called a feature split. For example, a column holding a timestamp (dd/mm/yy hh:mm:ss) can be split into two separate features, date and time. Conversely, two features can be clubbed into a single new one: if one column contains the selling price and another the cost price, a new feature called profit can be generated from the two and used directly for model building.
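A combined sketch of the zero-variance check and both derivation examples; the column names and values are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["01/05/20 10:30:00", "02/05/20 14:05:00"],
        format="%d/%m/%y %H:%M:%S",
    ),
    "selling_price": [120.0, 150.0],
    "cost_price":    [100.0, 110.0],
    "country":       ["IN", "IN"],  # constant column: zero variance
})

# Zero-variance check: drop columns that hold a single unique value
df = df.loc[:, df.nunique() > 1]

# Feature split: one timestamp column becomes separate date and time features
df["date"] = df["timestamp"].dt.date
df["time"] = df["timestamp"].dt.time

# Feature clubbing: two price columns become a single profit feature
df["profit"] = df["selling_price"] - df["cost_price"]
df = df.drop(columns=["timestamp", "selling_price", "cost_price"])
print(df)
```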

Feature Engineering

Feature Engineering refers to the methodologies applied to features to process them into a form that a particular machine learning model can consume. Feature engineering, or feature processing, helps reduce overfitting and underfitting and supports weight optimization.

  • Normalization: This is a method of bringing the data onto a particular scale; normalization usually means scaling a variable so that its values lie between 0 and 1, using the min-max formula:
Min-max scaling for normalization: x′ = (x − min(x)) / (max(x) − min(x))
  • Standardization: This is another commonly used feature-scaling technique. It rescales the data so that the resulting distribution has a mean of 0 and a standard deviation of 1. The standardization formula is also called the Z-score formula; it converts a normal distribution into a standard normal distribution.
Z-score: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation

Both normalization and standardization are feature-scaling techniques, but care must be taken when choosing between them: normalization is strongly affected by outliers, whereas standardization is much less sensitive to them and is generally the favored scaling technique in the analytics industry. A short sketch comparing the two follows.
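Here is a minimal scikit-learn sketch contrasting the two scalers on a made-up feature that contains one outlier:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One feature with an outlier (values are made up)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_norm = MinMaxScaler().fit_transform(x)    # squeezed into [0, 1]
x_std = StandardScaler().fit_transform(x)   # mean 0, standard deviation 1

print(x_norm.ravel())  # the outlier compresses every other value toward 0
print(x_std.ravel())
```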

  • Log Transformations: This is another technique for rescaling data; it compresses values into a narrower range. Log transformations are generally used for skewed data, where they pull values of high magnitude toward the center of the distribution.
Log transformation: x′ = log(x), or log(1 + x) when x can be zero
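A short NumPy example illustrates the effect on made-up, right-skewed values:

```python
import numpy as np

# Right-skewed values dominated by a few large magnitudes (made up)
x = np.array([1, 2, 3, 10, 100, 1000], dtype=float)

# log1p computes log(1 + x), which stays defined when x contains zeros
x_log = np.log1p(x)
print(x_log)  # large magnitudes are pulled in toward the rest of the data
```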
  • Feature Encoding: Feature encoding is used to convert categorical features into numerical values. The encoding methodologies most commonly used in the analytics industry are One-Hot Encoding, Orthogonal Encoding, and Dummy Encoding; short examples follow.
One-Hot Encoding
Orthogonal Encoding
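With pandas, one-hot and dummy encoding are each a single call; the city column below is made up:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Pune", "Delhi"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["city"], prefix="city")

# Dummy encoding: drop the first category to avoid perfect collinearity
dummy = pd.get_dummies(df["city"], prefix="city", drop_first=True)

print(one_hot)
print(dummy)
```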
  • Principal Component Analysis: Commonly known as PCA, this is a technique for dimensionality reduction. It takes all the features, correlated or uncorrelated, and generates new features called principal components, which are orthogonal to each other, i.e., uncorrelated. PCA should only be used when the number of features is large (see the sketch after this list).
  • Data Imputation: Missing values in the columns can be imputed using measures of central tendency: the mean or median for numerical features and the mode for categorical features. A few other techniques are also available, such as zero fill, backward fill, and forward fill.
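A minimal scikit-learn sketch of PCA on a synthetic matrix of 10 correlated features built from 3 latent variables:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 correlated features generated from 3 latent variables
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 10)) + rng.normal(scale=0.1, size=(100, 10))

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # roughly (100, 3): few, uncorrelated components
print(pca.explained_variance_ratio_)  # variance captured by each component
```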
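And a short pandas sketch of mean/mode imputation on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, 30, np.nan, 40],
    "city": ["Delhi", np.nan, "Pune", "Delhi"],
})

# Numerical feature: impute with the mean (the median is the robust alternative)
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical feature: impute with the mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Other options include df.fillna(0) (zero fill), df.bfill(), and df.ffill()
print(df)
```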
