Feature Selection — Using Genetic Algorithm

Dr. Samiran Bera (PhD)
Published in Analytics Vidhya · 3 min read · Jul 20, 2020

Let’s combine the power of Prescriptive and Predictive Analytics

Source: analyticsvidhya.com

All Machine Learning models train on large volumes of data to predict future patterns, which makes them susceptible to the quality of that data. Even minute errors in the data can lead a model to yield infeasible or inferior results. Thus, the quality of the training data is of utmost concern to an organization.

Feature selection plays a crucial role in this direction. Techniques such as forward selection, backward elimination, and stepwise selection can be used to select a feature set. However, most of these approaches are performed manually and are computationally expensive and time-consuming. Therefore, in this article, the Genetic Algorithm is used to obtain an optimal feature set within a reasonable amount of time.

The structure of this article is as follows.

  • Feature Selection — What is it? And why do we need it?
  • Genetic Algorithm — What is it? And why do we need it?
  • Genetic Algorithm in Feature Selection — How to do it?

Feature Selection: Let's cut the clutter

Feature Selection is the step in Data Wrangling where the features that contribute most to the Target Variable are selected. Learning from irrelevant features in the data can decrease the Accuracy and Performance of the model.

The most common ways to remove irrelevant features are Univariate Selection, Feature Importance, and the Correlation Matrix. This article provides an excellent overview of feature selection.

To proceed with feature selection, certain preprocessing steps must first be carried out: missing value imputation, removing outliers, dropping irrelevant features, and identifying the feature set and the target variable.

The python code to preprocess the data, using the Heart Disease Dataset, is provided below.
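The original gist is not embedded here; the sketch below shows one plausible preprocessing pipeline for a dataset of this shape. The helper name `preprocess()` and the specific cleaning rules (median imputation, a 3-sigma outlier cut, one-hot encoding) are assumptions, not the article's exact code.

```python
import numpy as np
import pandas as pd

def preprocess(df, target_col):
    # Impute missing numeric values with the column median (assumed rule)
    df = df.fillna(df.median(numeric_only=True))
    # Drop rows whose numeric features lie beyond 3 standard deviations
    num_cols = df.select_dtypes(include=np.number).columns.drop(target_col, errors="ignore")
    if len(num_cols) > 0:
        z = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()
        df = df[(z.abs() <= 3).all(axis=1)]
    # One-hot encode categorical columns (cp, thal, etc. in the heart data)
    df = pd.get_dummies(df)
    # Separate the feature set from the target variable
    X = df.drop(columns=[target_col])
    y = df[target_col]
    return X, y
```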

Genetic Algorithm: The popular meta-heuristics

Genetic Algorithm (GA) is one of the most popular Evolutionary Algorithms (EA) used by experts from academia and industry. GA uses three operators: selection, crossover & mutation to improve the quality of solutions. This article provides a clear understanding of GA and how to implement operators on a Transportation Problem.

The Genetic Algorithm can be applied to many kinds of problems with very little modification, and thus provides development flexibility at very low cost. Further, the Genetic Algorithm has fewer constraints (such as parameter tuning) than other meta-heuristic techniques and can therefore obtain an optimal solution efficiently. The python code for the basic Genetic Algorithm operators is provided below.
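As an illustration of the three operators on binary chromosomes, a minimal sketch follows; the function names, the tournament-selection scheme, and the default rates are assumptions, and the article's own gist may differ.

```python
import random

def tournament_selection(population, fitnesses, k=3):
    # Pick k random candidates and keep the fittest one
    contestants = random.sample(range(len(population)), k)
    best = max(contestants, key=lambda i: fitnesses[i])
    return population[best]

def crossover(parent1, parent2):
    # Single-point crossover: swap the tails of the two parents
    point = random.randint(1, len(parent1) - 1)
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(chromosome, rate=0.05):
    # Flip each bit with probability `rate`
    return [1 - g if random.random() < rate else g for g in chromosome]
```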

Genetic Algorithm for Feature Selection

To implement the Genetic Algorithm for Feature Selection, the accuracy of the predictive model is taken as the fitness of a solution, where the accuracy of the model is obtained using Logistic Regression.

For each chromosome, the predictive_model() function evaluates the accuracy score, which the get_fitness() function aggregates over the entire population. The python code for the fitness function of the predictive model is provided below.
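A sketch of what such a fitness function might look like, assuming each chromosome is a binary mask over the columns of a feature matrix; the train/test split ratio and solver settings are illustrative choices, not the article's exact code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def predictive_model(X, y, chromosome):
    # A chromosome is a binary mask over the columns of X
    mask = np.asarray(chromosome, dtype=bool)
    if not mask.any():
        return 0.0  # an empty feature set gets zero fitness
    X_sel = X[:, mask]
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3,
                                              random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

def get_fitness(X, y, population):
    # Aggregate the accuracy score of every chromosome in the population
    return [predictive_model(X, y, c) for c in population]
```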

The main program is executed using the ga() function. This function takes the following arguments:

  • data: The dataset used in the study
  • feature_list: A set of features to be optimized
  • target: Denotes the dependent variable
  • n: The size of the population
  • max_iter: The number of iterations to evaluate

The function returns the optimal setting of feature selection as a binary array with the best accuracy score.
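A self-contained sketch of such a ga() function is shown below. The population initialization, tournament size, mutation rate, and the extra seed argument are assumptions layered on the signature described above, and the logistic-regression fitness inside mirrors the earlier description rather than the article's exact gist.

```python
import random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def ga(data, feature_list, target, n=20, max_iter=10, seed=0):
    rng = random.Random(seed)
    y = data[target].values

    def fitness(chrom):
        # Fitness = test accuracy of Logistic Regression on the selected features
        selected = [f for f, g in zip(feature_list, chrom) if g]
        if not selected:
            return 0.0
        X = data[selected].values
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                                  random_state=42)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return accuracy_score(y_te, model.predict(X_te))

    # Random initial population of binary chromosomes
    population = [[rng.randint(0, 1) for _ in feature_list] for _ in range(n)]
    best, best_fit = None, -1.0
    for _ in range(max_iter):
        fits = [fitness(c) for c in population]
        for c, f in zip(population, fits):
            if f > best_fit:
                best, best_fit = c[:], f
        # Tournament selection, single-point crossover, bit-flip mutation
        new_pop = []
        while len(new_pop) < n:
            p1 = max(rng.sample(range(n), 3), key=lambda i: fits[i])
            p2 = max(rng.sample(range(n), 3), key=lambda i: fits[i])
            pt = rng.randint(1, len(feature_list) - 1)
            child = population[p1][:pt] + population[p2][pt:]
            child = [1 - g if rng.random() < 0.05 else g for g in child]
            new_pop.append(child)
        population = new_pop
    # Binary array marking the selected features, plus its accuracy
    return best, best_fit
```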

The Big Bonus

Executing the above program produces an optimal set of features. However, it is interesting to observe that more than one optimal feature set can exist, i.e., the problem can have Multiple-Optimal solutions. This can be observed from the outputs obtained by executing the following code.
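The run script itself is not reproduced here. As a self-contained stand-in, the multiple-optima effect can be demonstrated on synthetic data by duplicating a column, so that two different chromosomes score identically; the synthetic setup and the accuracy() helper are assumptions replacing the heart-disease data and the ga() call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in data: column 4 duplicates column 0, so two different
# chromosomes can reach exactly the same accuracy
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X = np.hstack([X, X[:, [0]]])

def accuracy(mask):
    # Score a chromosome (binary mask over the columns of X)
    X_sel = X[:, np.asarray(mask, dtype=bool)]
    X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3,
                                              random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

# The two masks select different columns yet score identically
print(accuracy([1, 0, 0, 0, 0]), accuracy([0, 0, 0, 0, 1]))
```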

In the first output, seven features were selected, which provides the optimal accuracy score of 90%.

OUTPUT 1:
Optimal Feature Set
['thalach', 'cp_4', 'restecg_1', 'exang_1', 'ca_2', 'thal_6', 'thal_7']
Optimal Accuracy = 90.0 %

And in the second output, twelve features were selected, which provides the optimal accuracy score of 90%.

OUTPUT 2:
Optimal Feature Set
['Age', 'chol', 'oldpeak', 'Sex_1', 'cp_4', 'fbs_1', 'restecg_2', 'exang_1', 'slope_3', 'ca_1', 'thal_6', 'thal_7']
Optimal Accuracy = 90.0 %

Both feature sets give an accuracy of 90%, so either one can be used. However, choosing between feature set combinations also requires knowledge and experience of the business process. Thus, even with the Genetic Algorithm to filter out the best feature set, it is always good to make decisions based on business objectives rather than simply building a high-accuracy model.


Dr. Samiran Bera (PhD)

Senior Data Scientist | PhD | Machine Learning & Optimisation