Feature Selection Using Lasso Regression

Saurav Agrawal
Jun 5, 2023


Photo by Nick Collins on Pexels

Lasso Regression is regularized linear regression with an L1 penalty added to the loss function. Because the L1 penalty can shrink coefficients exactly to zero, Lasso can also be used for feature selection, which is what this article demonstrates.
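For context, in scikit-learn's parameterization the objective Lasso minimizes is:

\min_w \; \frac{1}{2\,n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1

The larger alpha is, the more coefficients are driven exactly to zero, which is what makes Lasso usable as a feature selector.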

1. Loading the libraries.
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, KFold

2. Loading the dataset.

df = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")

print(df.head())
print("Shape of the Dataset: {}".format(df.shape))
Dataset Head and Shape

3. Preprocessing and separating the train and test data.

# Segregating the Feature and Target
X = df.drop("Outcome", axis=1).values
y = df["Outcome"].values

# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)

print("Shape of Train Features: {}".format(X_train.shape))
print("Shape of Test Features: {}".format(X_test.shape))
print("Shape of Train Target: {}".format(y_train.shape))
print("Shape of Test Target: {}".format(y_test.shape))
Separating the Train and Test Data
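One caveat worth noting before tuning: Lasso applies the same alpha to every coefficient, so features on larger scales are penalized relatively less, which can skew the selection. Standardizing the features first makes the coefficient magnitudes comparable. A minimal optional sketch using scikit-learn's StandardScaler (the article itself works on the raw values, so the scaled arrays here use separate names):

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)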

4. Using GridSearchCV to find the best hyperparameter.

# 500 candidate alpha values between 0.00001 and 10 for GridSearchCV to test
params = {"alpha": np.linspace(0.00001, 10, 500)}

# Number of Folds and adding the random state for replication
kf=KFold(n_splits=5,shuffle=True, random_state=42)

# Initializing the Model
lasso = Lasso()

# GridSearchCV with the model, parameter grid and folds; tuned on the training split only
lasso_cv = GridSearchCV(lasso, param_grid=params, cv=kf)
lasso_cv.fit(X_train, y_train)
print("Best Params {}".format(lasso_cv.best_params_))
Best alpha parameter
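As an aside, scikit-learn also ships LassoCV, which picks alpha from its own regularization path via cross-validation and is usually faster than an explicit grid search. A minimal sketch, not part of the original article (lasso_cv2 is just an illustrative name):

from sklearn.linear_model import LassoCV

# LassoCV selects alpha by cross-validation over its own alpha path
lasso_cv2 = LassoCV(cv=5, random_state=42).fit(X_train, y_train)
print("LassoCV alpha: {}".format(lasso_cv2.alpha_))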

5. Feature Column Names.

names = df.drop("Outcome", axis=1).columns
print("Column Names: {}".format(names.values))
Feature Column Names

6. Using Lasso Regressor to plot the best features.

# Refitting the model with the best alpha found by the grid search
lasso1 = Lasso(alpha=lasso_cv.best_params_["alpha"])
lasso1.fit(X_train, y_train)

# Taking absolute values: the coefficient magnitude reflects feature importance
lasso1_coef = np.abs(lasso1.coef_)

# plotting the Column Names and Importance of Columns.
plt.bar(names, lasso1_coef)
plt.xticks(rotation=90)
plt.grid()
plt.title("Feature Selection Based on Lasso")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.ylim(0, 0.15)
plt.show()
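To read the exact values rather than eyeballing the bars, the same coefficients can also be printed as a table (a small addition beyond the original article):

# Pairing each feature name with its coefficient magnitude, largest first
coef_table = pd.Series(lasso1_coef, index=names).sort_values(ascending=False)
print(coef_table)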

7. Segregating the features based on Step 6.

# Subsetting the features whose coefficient magnitude exceeds 0.001
feature_subset = np.array(names)[lasso1_coef > 0.001]
print("Selected Feature Columns: {}".format(feature_subset))

# Adding the target to the list of features
feature_subset = np.append(feature_subset, "Outcome")
print("Selected Columns: {}".format(feature_subset))
Creating New List of Important Feature Columns and Target Column
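The same subsetting can also be done with scikit-learn's SelectFromModel, which applies a threshold to the absolute coefficients of an already-fitted estimator. A minimal sketch, not part of the original workflow:

from sklearn.feature_selection import SelectFromModel

# Wrapping the fitted Lasso; features with |coefficient| below the threshold are dropped
selector = SelectFromModel(lasso1, threshold=0.001, prefit=True)
print("Selected Feature Columns: {}".format(np.array(names)[selector.get_support()]))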

8. Subsetting best features to form a new dataset.

df_new = df[feature_subset]
print(df_new.head())
New Dataset Based on the Best Features

In this article, we first searched for the best alpha value for Lasso Regression, then fit a Lasso model with that alpha to measure the importance of every feature in the dataset. Finally, we subset the important columns into a new dataset, which can now be used for further classification or regression.
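As an illustration of that last point, here is a minimal sketch of feeding the reduced dataset into a classifier (LogisticRegression is my choice here, not something the article prescribes):

from sklearn.linear_model import LogisticRegression

# Splitting the reduced dataset and fitting a simple classifier on it
X_new = df_new.drop("Outcome", axis=1).values
y_new = df_new["Outcome"].values
X_tr, X_te, y_tr, y_te = train_test_split(X_new, y_new, test_size=0.20, random_state=42, stratify=y_new)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("Test Accuracy: {}".format(clf.score(X_te, y_te)))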

I appreciate you and the time you took out of your day to read this!

Linkedin: https://www.linkedin.com/in/saurav-agrawal-137500214/

StackOverFlow: https://stackoverflow.com/users/11842006/saurav-agrawal

Email: agrawalsam1997@gmail.com
