The best feature selection technique for text classification

Amruthjithraj V.R
Published in Analytics Vidhya · 3 min read · Oct 14, 2020

A simple code snippet for feature selection!

Before we jump into the code, let's first understand a few things about feature selection.

1. What is feature selection?

Feature selection is the process of reducing the number of input variables when developing a predictive model. Reducing the number of input variables is desirable both to lower the computational cost of modeling and, in some cases, to improve the performance of the model.

2. How does feature selection work?

Feature selection is the process where you automatically or manually select the features that contribute most to the prediction variable or output you are interested in. Having irrelevant features in your data can decrease the accuracy of your models and make them learn from irrelevant features.

3. Importance of feature selection in text classification.

Feature selection is one of the most important steps in text classification, because text data usually suffers from high dimensionality. Feature selection techniques are used to mitigate this curse of dimensionality. The basic idea is to keep only the important features and remove those that contribute little.

Issues associated with high dimensionality are as follows:

1. Adds unnecessary noise to the model

2. High space and time complexity

3. Overfitting

The feature selection technique we will look at today is chi-square feature selection.

The Chi-square test is used in statistics to test the independence of two events. More specifically in feature selection, we use it to test whether the occurrence of a specific term and the occurrence of a specific class are independent.
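To make the independence test concrete, here is a toy sketch (the counts below are made up purely for illustration). Suppose the term "goal" appears in 60 of 100 sports documents but only 5 of 100 politics documents; scipy's chi2_contingency can test whether the term and the class are independent:

from scipy.stats import chi2_contingency

# Toy 2x2 contingency table (made-up counts):
# rows = class (sports, politics), columns = term "goal" (present, absent)
observed = [[60, 40],
            [5, 95]]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.4f}")

A large chi-square statistic (and a tiny p-value) means the term is strongly associated with the class, which is exactly what makes it a useful feature.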

Without further ado, let's jump into the code:

# Load libraries
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# The k features with the highest chi-square statistics are selected
chi2_features = SelectKBest(chi2, k=1000)  # k can be any number; 1000 is a placeholder
X = chi2_features.fit_transform(X, y)

The first two lines of the code just import the packages needed for chi-square feature selection. The SelectKBest function selects the top K features based on their chi-square scores. K can be any number, depending on how many features you are dealing with.
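To see how this plugs into a real text pipeline, here is a minimal, self-contained sketch on a tiny made-up corpus (the documents, labels, and k=3 below are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Tiny made-up corpus with two classes: 1 = sports, 0 = politics
docs = ["the team scored a late goal",
        "the striker scored twice",
        "parliament passed the new bill",
        "the senate debated the bill"]
y = [1, 1, 0, 0]

# Bag of words: rows = documents, columns = term counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Keep only the 3 terms with the highest chi-square scores
selector = SelectKBest(chi2, k=3)
X_selected = selector.fit_transform(X, y)

# Which terms survived the cut?
kept = vectorizer.get_feature_names_out()[selector.get_support()]
print(kept)  # e.g. class-specific words like 'scored' or 'bill'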

For example, suppose you have 50,000 features (columns) after creating a bag of words. In that case, you could try K values between 10,000 and 40,000 and check which one gives the best results. Once you find the number that gives the best accuracy, you can set it as your final K value.
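As a sketch of that tuning loop (the candidate K values and the MultinomialNB classifier are just assumptions for illustration; X and y are the bag-of-words matrix and labels, assumed to exist already):

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# X: bag-of-words matrix with ~50,000 columns, y: labels
best_k, best_score = None, 0.0
for k in [10_000, 20_000, 30_000, 40_000]:  # candidate K values
    pipe = Pipeline([
        ("select", SelectKBest(chi2, k=k)),
        ("clf", MultinomialNB()),
    ])
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"k={k}: mean CV accuracy = {score:.3f}")
    if score > best_score:
        best_k, best_score = k, score

print(f"Best k: {best_k}")

Putting SelectKBest inside the Pipeline means the chi-square scores are recomputed on each training fold, so the selection step never sees the validation data.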

This article is for people who are starting with NLP and are stuck on the question of which feature selection technique to use and how to implement it. Feature selection for text classification can be a headache in most cases. This code covers the most basic feature selection technique for text classification and can be used straight away.

Thank you for reading; please consider following for more blogs like this!

I hope you learned something new!

Cheers.
