Correlation-based Feature Selection in a Data Science Project

Sariq Sahazada
Mar 24, 2023


In a data science project, feature selection identifies the variables that contribute most to a model’s performance. It can reduce the number of input variables, improve model accuracy, and help prevent overfitting. In this blog, we will explore one of the most popular feature selection techniques, Correlation-based Feature Selection.

Correlation-based Feature Selection

Correlation-based Feature Selection (CFS) is a feature selection technique that selects subsets of features that are highly correlated with the target variable but have low correlation with each other. The idea behind CFS is to identify a subset of features that can provide the maximum amount of information about the target variable while minimizing redundancy among the features.

The CFS algorithm first calculates the correlation between each feature and the target variable, and then the correlation between every pair of features. It then looks for the subset whose features are, on average, strongly correlated with the target but weakly correlated with one another, trading the two off through a “merit” score (sketched below).
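To make that trade-off concrete, here is a minimal sketch of the merit heuristic commonly associated with CFS, assuming absolute Pearson correlations as the correlation measure; the function name cfs_merit is our own, not part of any library.

import numpy as np
import pandas as pd

def cfs_merit(df, features, target='target'):
    # Average absolute correlation between each candidate feature and the target
    r_cf = df[features].corrwith(df[target]).abs().mean()
    k = len(features)
    if k > 1:
        # Average absolute pairwise correlation among the candidate features
        # (sum of the correlation matrix minus the k ones on the diagonal)
        corr_ff = df[features].corr().abs()
        r_ff = (corr_ff.values.sum() - k) / (k * (k - 1))
    else:
        r_ff = 0.0
    # Merit grows with feature-target correlation and shrinks with redundancy
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

A higher merit means the subset is, on balance, more predictive and less redundant; a greedy or exhaustive search over candidate subsets then picks the one with the highest score.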

CFS can be applied to both regression and classification problems. For regression problems, the correlation between each feature and the target variable is calculated using Pearson’s correlation coefficient. For classification problems, the correlation is calculated using the Symmetrical Uncertainty measure, which takes into account the mutual information between the features and the target variable.
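As a rough illustration, symmetrical uncertainty between a discretized feature x and the class labels y can be computed from the mutual information and the two entropies. The sketch below assumes the inputs are already encoded as non-negative integers (for example, binned values); the helper name symmetrical_uncertainty is our own.

import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def symmetrical_uncertainty(x, y):
    # x and y are assumed to be arrays of non-negative integer codes
    mi = mutual_info_score(x, y)       # mutual information I(X; Y), in nats
    h_x = entropy(np.bincount(x))      # entropy H(X) from value counts
    h_y = entropy(np.bincount(y))      # entropy H(Y) from value counts
    return 2.0 * mi / (h_x + h_y) if (h_x + h_y) > 0 else 0.0

The result lies between 0 (no shared information) and 1 (the two variables carry exactly the same information), which makes scores comparable across features.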

Let’s illustrate CFS using a code example. We will use the Breast Cancer Wisconsin (Diagnostic) dataset, which contains 569 samples of malignant and benign tumor cells. The goal is to predict whether a tumor is malignant or benign based on features such as radius, texture, and perimeter.

First, let’s import the necessary libraries and load the dataset:

import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = pd.Series(data.target)

Next, let’s compute the correlation matrix and extract each feature’s correlation with the target variable (dropping the target’s trivial correlation with itself):

corr_matrix = df.corr()
corr_with_target = corr_matrix['target'].drop('target')

We can now sort the correlation values in descending order and select the top k features:

k = 10
top_k = corr_with_target.abs().sort_values(ascending=False)[:k].index
selected_features = df[top_k]

Finally, we can inspect the correlation matrix of the selected features to see how redundant they are:

selected_corr_matrix = selected_features.corr()
selected_corr_matrix
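The steps above only handle the “highly correlated with the target” half of CFS; several of the selected features (for example, mean radius and mean perimeter) are still strongly correlated with each other. A simple greedy way to address the redundancy half is to drop one feature from every pair whose absolute correlation exceeds a threshold; the 0.9 cutoff below is an arbitrary choice for illustration.

threshold = 0.9
to_drop = set()
for i, f1 in enumerate(top_k):
    for f2 in top_k[i + 1:]:
        # For each nearly redundant pair, keep the feature that is
        # more strongly correlated with the target
        if abs(selected_corr_matrix.loc[f1, f2]) > threshold:
            weaker = f1 if abs(corr_with_target[f1]) < abs(corr_with_target[f2]) else f2
            to_drop.add(weaker)

final_features = [f for f in top_k if f not in to_drop]
print(final_features)

The surviving features are both predictive of the diagnosis and relatively non-redundant, which is the spirit of CFS.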

Conclusion

In this blog, we have explored Correlation-based Feature Selection, a popular feature selection technique used in data science projects. CFS is a powerful technique that can help identify the most relevant variables that contribute to a model’s performance. We have also provided a code example using the Breast Cancer Wisconsin (Diagnostic) dataset to illustrate how CFS can be implemented in a real-world scenario.

Thank you so much for reading my article on Correlation-based Feature Selection. I hope you found it informative and enjoyable!

If you’d like to stay up to date on my future writing and support my work, please consider buying me a coffee. Your support helps me continue creating content like this.

Additionally, you can follow me on LinkedIn and Twitter for more articles on Data Science in the coming weeks and months. I’d love to have you along for the journey.

Here is my LinkedIn: Sariq Sahazada

Thanks again for your support, and I hope to connect with you soon!

