Learning Feature Selection for Building and Improving your Machine Learning Model

Amit Maurya · Published in Analytics Vidhya · Jul 28, 2019

Feature selection/creation/transformation is one of the most commonly overlooked areas in model building by aspiring Data Scientists.

Usually, the task of model building gets reduced to trying all sorts of fancy algorithms, from standard machine learning algorithms to deep learning models. But if we feed garbage to our machine learning algorithm, garbage is going to come out of it (GIGO). So, the best thing to do before we move on to selecting the best machine learning algorithm is to build a baseline model, powered by a base algorithm or any go-to algorithm, and then concentrate on improving the accuracy of this model using feature creation/transformation/selection.

In model building, feature selection/creation is the step where the maximum time should be spent. Feature selection is somewhat easier than feature creation: it is a well-researched area, and most Data Science libraries in Python or R have automated the process.

Feature creation is a bigger dragon to slay. Smart features reduce training time and increase accuracy drastically. Creating features requires a lot of problem context. Expertise in the area of feature creation comes with practice and knowledge of the domain.

In this blog, I will explore both feature selection and creation using Python as the medium. I will use the dataset at this link. The dataset is very simple: our task is to build a model that can predict whether a person will be promoted or not. A few of the variables we have are region, department, education, gender, age, KPIs_met>80%, etc.

So, let’s start the analysis

Let’s first run the base model. Our base model includes all the columns given; it’s an XGBoost classifier.

from xgboost import XGBClassifier
from sklearn.metrics import classification_report, matthews_corrcoef

# Baseline: XGBoost classifier trained on all available columns
clf = XGBClassifier().fit(x_train[cols], y_train)
print(classification_report(y_test, clf.predict(x_test[cols])))
print(matthews_corrcoef(y_test, clf.predict(x_test[cols])))
Results of the Base model- link

The numbers above will be our baseline. We will try to improve them through feature selection and feature creation.

There are broadly two types of variables: categorical variables and numerical/continuous variables.

Below are the categorical variables from the dataset-

Below are the numerical variables from the dataset-

Each type of variable has its own way of being handled.

There are two major classes of categorical data- Nominal and Ordinal

**Nominal** - there is no concept of order, e.g. racial types: Asian, American, European, etc. There is no order here; representing Asians with 1 and Americans with 2 doesn’t mean one racial type is superior or inferior to another.

**Ordinal** - there is some sense of order among the values, e.g. shoe sizes S, M, L, XL, XXL.

**Handling Nominal Variables** - a few popular methods:

1. Creation of Dummies

2. Mean Encoding

  • **Dummy creation** - we create dummies for all the unique values that the variable can take. If the variable can take m unique values, we create m-1 dummies.

Example - take the column **recruitment channel**. There are 3 unique values in total, so we will create 2 dummies. The problem with this approach is that if the number of unique values the variable can take is high (high cardinality), it will increase the number of columns drastically.
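Below is a minimal sketch of dummy creation with pandas, assuming the column is named recruitment_channel in the DataFrame data1:

import pandas as pd

# Create m-1 dummies for the 3-level column; drop_first=True drops one level.
# The column name 'recruitment_channel' is assumed from the dataset description.
data1 = pd.get_dummies(data1, columns=['recruitment_channel'], drop_first=True)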

  • **Mean Encoding** - replace each value of a categorical variable with the proportion of positive labels observed for that value. The problem with this methodology is overfitting. Mean encoding is one of the key transformations applied to categorical variables when we are using Gradient Boosting: tree depth in Gradient Boosting is monitored and kept low (growing deeper trees leads to overfitting), and a mean-encoded column lets a shallow tree use the information in a high-cardinality variable.
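A minimal sketch of mean encoding, assuming data1 holds the training data and is_promoted is the binary target (in practice the mapping should be learned on the training split only, to limit the overfitting mentioned above):

# Mean-encode recruitment_channel: map each channel to the share of
# promoted employees observed for that channel.
mean_enc = data1.groupby('recruitment_channel')['is_promoted'].mean()
data1['recruitment_channel_mean_enc'] = data1['recruitment_channel'].map(mean_enc)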
# Normalise avg_training_score by the mean score of the employee's
# region/department group (used for the domain-specific feature below)
data1 = pd.merge(data1,
                 data1[['region', 'department', 'avg_training_score']]
                     .groupby(['region', 'department']).mean(),
                 how='left', on=['region', 'department'])
data1 = data1.rename(columns={'avg_training_score_x': 'avg_training_score',
                              'avg_training_score_y': 'mean_reg_dpt'})
data1['new_avg_trng_score'] = data1['avg_training_score'] / data1['mean_reg_dpt']

**Domain-specific features** - one must first check the problem context and try to get as much information as possible. For the given dataset we are provided with the average training score. But before we comment on this feature's importance, we should take a step back and think: how would a person be promoted in a multi-state, multi-department company?

· Promotion would happen region-wise

· The number of promotions would depend on the department; some departments inherently promote a greater number of people than others

Once I took the above factors into account in the feature transformation of the average training score, the new average training score looked slightly better (is_promoted=1: red density plot, is_promoted=0: green density plot).

As we can see, the engineered feature is able to classify more accurately, as the region of non-overlap is more clearly defined.
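A minimal sketch of how such a density comparison could be plotted with seaborn, assuming the engineered column new_avg_trng_score and the target is_promoted live in data1:

import seaborn as sns
import matplotlib.pyplot as plt

# Compare the engineered score's distribution for promoted vs. not promoted
sns.kdeplot(data1.loc[data1['is_promoted'] == 1, 'new_avg_trng_score'],
            color='red', label='is_promoted=1')
sns.kdeplot(data1.loc[data1['is_promoted'] == 0, 'new_avg_trng_score'],
            color='green', label='is_promoted=0')
plt.legend()
plt.show()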

**Binning Strategies and handling numeric variables**- Binning is of two types- **Fixed** and **Adaptive**

  • **Fixed Binning** - as the name suggests, the bin boundaries are predefined, which may lead to imperfect bins with irregular density in a few of them.
  • **Adaptive Binning** - quantile-based binning is a good strategy for adaptive binning. Quantiles are specific values or cut-points which help in partitioning the continuous-valued distribution of a specific numeric field into discrete, contiguous bins or intervals. Thus, q-quantiles help in partitioning a numeric attribute into q equal partitions. Popular examples include the 2-quantile, known as the median, which divides the data distribution into two equal bins; the 4-quantiles, known as quartiles, which divide the data into 4 equal bins; and the 10-quantiles, known as deciles, which divide the data into 10 equal bins.
data1['age_bin'] = pd.qcut(data1['age'], q=[0,.10,.20,.30,.40,.50,.60,.70,.80,.90,1], labels=False)

Below is a comparative analysis: pre (baseline model with no transformations) and post (after a few feature transformations have been done). We can clearly see that for the positive class, the F1 score has improved from 0.45 to 0.52. Overall accuracy improved only marginally, from 94% to 95%, as the classes are highly imbalanced.

In the case of highly imbalanced classes (say 10 cases with target label 1 and 200 cases with target label 0), if we predict 0 for every observation we still get an accuracy of 200/210 (95.2%). In cases like this, the F1-score is a better way to judge a model.
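A quick sketch of this point on synthetic labels matching the example above:

from sklearn.metrics import accuracy_score, f1_score
import numpy as np

# 10 positives and 200 negatives; predicting all zeros still looks "accurate"
y_true = np.array([1] * 10 + [0] * 200)
y_pred = np.zeros_like(y_true)
print(accuracy_score(y_true, y_pred))  # ~0.952
print(f1_score(y_true, y_pred))        # 0.0 - F1 exposes the useless model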

Feature selection-

As has been mentioned earlier in the blog, if we are going to put garbage through the machine learning algorithm, garbage is going to come out. So, feature selection becomes one of the key steps in modelling.

Reasons to use feature selection-

· It reduces the training time of the machine learning algorithm

· Having fewer features makes the model parsimonious and easily interpretable to human eyes

· It helps in improving the accuracy of the model and reduces the chance of overfitting

If you are building a regression model, standard statistical tests come in handy for pruning the model's dimensionality and making it more accurate. One can do step-wise regression, wherein the modeller adds variables step by step and checks the F-score to see whether each additional variable adds value to the model. Alternatively, the modeller can start with all the available variables and delete them one by one. But, based on my experience, I would recommend using domain knowledge before deleting a variable. It may happen that sampling/data collection was biased or not done properly, which makes a variable look insignificant in our analysis simply because the evidence in support of it is too thin.
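As an illustration, here is a minimal sketch of backward elimination using statsmodels; the DataFrame X and target y are hypothetical, and in practice domain knowledge should override a purely p-value-driven deletion:

import statsmodels.api as sm

# Repeatedly drop the variable with the highest p-value until all remaining
# p-values fall below the threshold.
def backward_eliminate(X, y, threshold=0.05):
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = model.pvalues.drop('const')
        worst = pvals.idxmax()
        if pvals[worst] < threshold:
            break
        cols.remove(worst)
    return cols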

Nevertheless, there are a few standard statistical tests one can use to reduce the feature set:

· Coefficient test - if we can reject the null hypothesis (β=0) with high confidence, then the variable is important for the model. Caution is needed before rejecting a variable from the model, as I have pointed out earlier. Let's say we are building a model of income, and one of the variables, experience, turns out to be insignificant; yet we know from domain knowledge that years of experience must be one of the key independent variables in the model.

· Joint Hypothesis test - used to test whether a group of coefficients is jointly zero; here the null hypothesis is that the coefficients on X2 and X3 are both equal to zero. Writing SSR for the sum of squared residuals (subscript R for the restricted model, UR for the unrestricted one), the test statistic is

F = [(SSR_R - SSR_UR) / q] / [SSR_UR / (n - k - 1)]

which is distributed F with q and n - k - 1 degrees of freedom, where q is the number of restrictions and k is the number of regressors in the unrestricted model.
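A minimal sketch of a joint hypothesis test with statsmodels, assuming hypothetical column names x1, x2, x3 in a DataFrame X and a target y:

import statsmodels.api as sm

# The unrestricted model uses x1, x2, x3; the restricted model drops x2 and x3,
# which corresponds to the null hypothesis that their coefficients are zero.
unrestricted = sm.OLS(y, sm.add_constant(X[['x1', 'x2', 'x3']])).fit()
restricted = sm.OLS(y, sm.add_constant(X[['x1']])).fit()
f_stat, p_value, df_diff = unrestricted.compare_f_test(restricted)
print(f_stat, p_value, df_diff)  # a small p-value rejects the joint null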

  • LDA: Linear Discriminant Analysis is used to find a linear combination of features that characterizes or separates two or more classes (or levels) of a categorical variable
  • Chi-Square: a statistical test applied to groups of categorical features to evaluate the likelihood of correlation or association between them using their frequency distribution (see the sketch below)
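A minimal sketch of chi-square-based feature selection with scikit-learn, assuming cat_cols is a hypothetical list of at least five categorical columns already encoded as non-negative integers:

from sklearn.feature_selection import SelectKBest, chi2

# Score the encoded categorical features against the target with chi-square
# (chi2 requires non-negative feature values) and keep the top 5.
selector = SelectKBest(score_func=chi2, k=5)
selector.fit(x_train[cat_cols], y_train)
selected = [c for c, keep in zip(cat_cols, selector.get_support()) if keep]
print(selected)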

The above analysis is for educational purposes only, to explore the options that a budding Data Scientist has in their quiver. By no means is the analysis complete for the given dataset, nor do I claim that it would yield the best results. Model building is a very subjective process, and a modeller may take a different approach and still arrive at the same accuracy, or improve on it slightly.

The complete code can be found on GitHub.
