Feature Selection Using Lasso Regression
Lasso Regression is regularized linear regression with an L1 penalty on the coefficients. Because the L1 penalty shrinks some coefficients exactly to zero, Lasso can also be used for feature selection, which this article demonstrates.
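Concretely, scikit-learn's Lasso minimizes the least-squares loss plus an L1 term with regularization strength alpha:

$$\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

The larger the alpha, the more coefficients are driven exactly to zero, and it is this zero/non-zero pattern that we exploit below.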
1. Loading the libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, KFold
2. Loading the dataset.
df = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")
print(df.head())
print("Shape of the Dataset: {}".format(df.shape))
3. Preprocessing and separating the train and test data.
# Segregating the Feature and Target
X = df.drop("Outcome", axis=1).values
y = df["Outcome"].values
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42, stratify=y)
print("Shape of Train Features: {}".format(X_train.shape))
print("Shape of Test Features: {}".format(X_test.shape))
print("Shape of Train Target: {}".format(y_train.shape))
print("Shape of Test Target: {}".format(y_test.shape))
4. Using GridSearchCV to find the best hyperparameter.
# 500 candidate alpha values to be tested by GridSearchCV
params = {"alpha": np.linspace(0.00001, 10, 500)}
# Number of Folds and adding the random state for replication
kf=KFold(n_splits=5,shuffle=True, random_state=42)
# Initializing the Model
lasso = Lasso()
# GridSearchCV with model, params and folds.
lasso_cv=GridSearchCV(lasso, param_grid=params, cv=kf)
lasso_cv.fit(X_train, y_train)  # fit on the training split only, so the test set stays unseen
print("Best Params {}".format(lasso_cv.best_params_))
5. Feature Column Names.
names=df.drop("Outcome", axis=1).columns
print("Column Names: {}".format(names.values))
6. Using the Lasso regressor to plot the best features.
# calling the model with the best parameter
lasso1 = Lasso(alpha=lasso_cv.best_params_["alpha"])
lasso1.fit(X_train, y_train)
# Using np.abs() to take the magnitude of each coefficient.
lasso1_coef = np.abs(lasso1.coef_)
# plotting the Column Names and Importance of Columns.
plt.bar(names, lasso1_coef)
plt.xticks(rotation=90)
plt.grid()
plt.title("Feature Selection Based on Lasso")
plt.xlabel("Features")
plt.ylabel("Importance")
plt.ylim(0, 0.15)
plt.show()
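If you want the numbers behind the bars, here is a small sketch that pairs each column with its absolute coefficient and prints them largest first:

# Pair each feature with its absolute coefficient, sorted descending
for name, coef in sorted(zip(names, lasso1_coef), key=lambda t: t[1], reverse=True):
    print("{}: {:.4f}".format(name, coef))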
7. Segregating the features based on Step 6.
# Subsetting the features which have more than 0.001 importance.
feature_subset=np.array(names)[lasso1_coef>0.001]
print("Selected Feature Columns: {}".format(feature_subset))
# Adding the target to the list of features.
feature_subset=np.append(feature_subset, "Outcome")
print("Selected Columns: {}".format(feature_subset))
8. Subsetting best features to form a new dataset.
df_new = df[feature_subset]
print(df_new.head())
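As a quick, hypothetical illustration of using this reduced dataset downstream, the sketch below fits a LogisticRegression on the selected features; the classifier and its settings are assumptions for demonstration, not part of the original pipeline:

from sklearn.linear_model import LogisticRegression
# Hypothetical follow-up: train a classifier on the selected features only
X_sel = df_new.drop("Outcome", axis=1).values
y_sel = df_new["Outcome"].values
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_sel, y_sel, test_size=0.20, random_state=42, stratify=y_sel)
clf = LogisticRegression(max_iter=1000)
clf.fit(Xs_train, ys_train)
print("Test Accuracy: {:.3f}".format(clf.score(Xs_test, ys_test)))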
In this article, we first searched for the best alpha value for Lasso Regression, then used that alpha in a Lasso model to measure the importance of every feature in the dataset. We subsetted the important columns into a new dataset, which can now be used for further classification or regression.
I appreciate you and the time you took out of your day to read this!
Linkedin: https://www.linkedin.com/in/saurav-agrawal-137500214/
StackOverFlow: https://stackoverflow.com/users/11842006/saurav-agrawal
Email: agrawalsam1997@gmail.com