Feature Selection — Filter Method

Zaur Rasulov
Analytics Vidhya

--

To explore data easily, build models, and obtain good results, it is important to preprocess the data, and one of the best ways to do this is Feature Selection.

What is Feature Selection?

The process of selecting and retaining the most important features is called Feature Selection. It helps to reduce noise and the computational cost of the model, and sometimes improves model performance.

There are three methods for Feature Selection, namely:
· Filter method;
· Wrapper method;
· Embedded method.

Filter Method:

This method is generally used as a preprocessing step. Features are selected according to various statistical tests or based on univariate metrics such as:
· Variance;
· Correlation;
· Chi-square;
· Mutual information.

Different methods exist for measuring the relationship between features, and some popular ones are listed below.

· Pearson’s correlation: measures the linear correlation between two continuous variables.

· Linear Discriminant Analysis (LDA): used to assess the relationship between continuous features and a categorical target.

· Chi-square: tests the statistical association between two or more categorical features.

· ANOVA: similar to Linear Discriminant Analysis, except that it works with one or more categorical independent features and a continuous dependent feature.
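For a quick illustration of such univariate tests in practice, scikit-learn’s SelectKBest can rank features by a chosen score function. The sketch below is only illustrative and assumes a non-negative feature matrix X and a categorical target y, which are not defined in this article:

from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Keep the 10 features with the highest chi-square score
# (chi2 requires non-negative feature values, e.g. counts or min-max scaled data)
chi2_selector = SelectKBest(score_func=chi2, k=10)
X_chi2 = chi2_selector.fit_transform(X, y)

# The same interface works with mutual information
mi_selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_mi = mi_selector.fit_transform(X, y)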

Advantages:

· Computationally very fast;
· Avoids overfitting;
· Does not depend on the model, only on the features;
· Based on different statistical methods.

Disadvantages:

· Does not remove multicollinearity;
· May sometimes fail to select the most useful features.

The filter method can be categorized into two groups: the univariate filter method and the multivariate filter method.

In the Univariate Filter Method, each feature is evaluated independently according to a particular criterion, such as the Fisher score, mutual information, or variance. However, there is an important drawback: this method can select redundant features because it ignores the relationships between features.

The Multivariate Filter Method, on the other hand, can deal with redundancy and is used to remove duplicate and correlated features.

Removing Constant Features:

Constant features contain only one value and thus have no impact on classification or modelling, so it is recommended to delete them.

To remove the constants in Python, it is vital to first split the given dataset into train and test sets and then drop the constant columns; both steps are shown below.
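A minimal sketch of the split step, assuming the data is held in a pandas DataFrame called dataset with a target column named target (both names are illustrative), could look like this:

import pandas as pd
from sklearn.model_selection import train_test_split

# Separate the predictors from the target column
X = dataset.drop(labels=['target'], axis=1)
y = dataset['target']

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

With X_train and X_test in place, the constant columns can be removed: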

# Constant features have zero standard deviation on the training set
constant_features = [var for var in X_train.columns if X_train[var].std() == 0]

# Drop the constant columns from both the train and the test set
X_train.drop(labels=constant_features, axis=1, inplace=True)
X_test.drop(labels=constant_features, axis=1, inplace=True)

X_train.shape, X_test.shape

Removing Quasi-constant Features:

Quasi-constant (almost constant) features have the same value for the largest part of the dataset. Such features are not useful for prediction, but there is no single correct variance threshold for detecting them. As a rule of thumb, features whose dominant value covers around 99% of the rows can be removed.

To remove quasi-constants in Python, first import the libraries and split the dataset into train and test sets as before. After that, use VarianceThreshold to drop the quasi-constant features as follows:

from sklearn.feature_selection import VarianceThreshold

# Define the variance threshold as 0.01 (features with lower variance are dropped)
q_remover = VarianceThreshold(threshold=0.01)

# Learn which features have a variance above the threshold on the training set
q_remover.fit(X_train)

# Number of features that will be kept
sum(q_remover.get_support())

# Apply the selection to both datasets (note: the result is a NumPy array)
X_train = q_remover.transform(X_train)
X_test = q_remover.transform(X_test)
X_train.shape, X_test.shape
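As an alternative to the variance threshold, the 99% rule mentioned above can also be applied directly with pandas. The sketch below assumes X_train and X_test are still pandas DataFrames; the 0.99 cut-off and variable names are illustrative:

# Drop features whose single most frequent value covers at least 99% of the rows
quasi_constant = []
for col in X_train.columns:
    # Share of rows taken by the most common value in this column
    top_share = X_train[col].value_counts(normalize=True).iloc[0]
    if top_share >= 0.99:
        quasi_constant.append(col)

X_train.drop(labels=quasi_constant, axis=1, inplace=True)
X_test.drop(labels=quasi_constant, axis=1, inplace=True)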

Removing Duplicate Features:

Duplicate features are columns that contain exactly the same values as another column. They add no new information to the dataset but can slow down training, so it is recommended to remove them.

# Compare every pair of columns and record the later one of each identical pair
duplFeatures = []
for i in range(0, len(X_train.columns)):
    oneCol = X_train.columns[i]
    for othCol in X_train.columns[i + 1:]:
        if X_train[oneCol].equals(X_train[othCol]):
            duplFeatures.append(othCol)

# Drop the duplicated columns from both datasets
X_train.drop(labels=duplFeatures, axis=1, inplace=True)
X_test.drop(labels=duplFeatures, axis=1, inplace=True)
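For wide datasets, the nested loop above can be slow. A shorter sketch, assuming X_train and X_test are pandas DataFrames, is to transpose the frame and let pandas flag duplicated rows, which correspond to duplicated columns of the original:

# Duplicated columns of X_train become duplicated rows of its transpose
dup_mask = X_train.T.duplicated()
duplicated_cols = dup_mask[dup_mask].index.tolist()

X_train.drop(labels=duplicated_cols, axis=1, inplace=True)
X_test.drop(labels=duplicated_cols, axis=1, inplace=True)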

Removing Correlated Features:

One of the important terms in statistics and data science is correlation. If features are close to each other in linear space, they are correlated. Correlated features can be informative and should be examined; however, they can also create redundancy, in which case it is better to remove all but one of them.

To remove correlated features, the proper libraries should be imported and the dataset should be split into train and test sets. Afterwards, the code below can be used to drop one feature from each highly correlated pair.

# Collect one feature from each pair of highly correlated features
correl_Feat = set()
correl_matrix = X_train.corr()

for i in range(len(correl_matrix.columns)):
    for j in range(i):
        # 0.8 is the correlation threshold; adjust it to suit the dataset
        if abs(correl_matrix.iloc[i, j]) > 0.8:
            colName = correl_matrix.columns[i]
            correl_Feat.add(colName)

# Drop the selected correlated features from both datasets
X_train.drop(labels=list(correl_Feat), axis=1, inplace=True)
X_test.drop(labels=list(correl_Feat), axis=1, inplace=True)

In the code above, 0.8 is the correlation threshold; it can be made stricter or looser depending on the dataset.
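To choose a sensible threshold, it often helps to inspect the correlation matrix visually first; a minimal plotting sketch, assuming matplotlib and seaborn are installed, is given below:

import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of pairwise correlations between the training features
plt.figure(figsize=(10, 8))
sns.heatmap(X_train.corr(), cmap='coolwarm')
plt.show()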

Feature selection is a very important step in Machine Learning because it plays an enormous role in the performance and training time of any ML model. In this article, the Filter Method of Feature Selection was described and some examples were provided.

The whole code can be observed on my Github Profile.
