Linear Regression of Selected Features

Part 2 of LGBM. Taking the important features and running a linear regression model on them

Iftekher Mamun
3 min read · Sep 13, 2019

My earlier post on LGBM models and using them to extract important features can be found here. In this post, I will take those important features, run them through linear regression, do further analysis, and identify the relationship between these features and the target variable.
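For anyone who skipped part 1, pulling the top features out of the trained LightGBM model looks roughly like this (a minimal sketch with assumed names; lgbm_model, X, and the resulting features frame are stand-ins, not the exact code from that post):

import pandas as pd

# Rank the columns by the trained LightGBM model's feature importances
# (lgbm_model and X are assumed to carry over from the earlier post)
importances = pd.Series(lgbm_model.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(14).index.tolist()

# Keep only the top 14 features for the rest of the analysis
features = X[top_features]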

I decided to work with the top 14 features from the feature importance section. Using seaborn, I plotted a heatmap of their correlation matrix:

import seaborn as sns
sns.heatmap(features.corr(), center=0);

Here we see that fc, pso, sd, and dso are positively correlated with each other but negatively correlated with the rest, and many of the features have almost no correlation at all (all the black spots). To quantify this, I ran a simple Ordinary Least Squares (OLS) model on these features against the target:

import statsmodels.api as sm

X = feature_train
y = target_train
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
Just focus on the t-statistics and the p-values

From here we see that there is a division: quite a few of these features have no significant impact on the target variable (the scalar coupling constant), while the first four variables have high t-statistics and p-values below 0.05. This further confirms the correlation heatmap: even among the important features, some play a much bigger role than the rest.
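Rather than reading the summary table by eye, the same split can be pulled straight from the fitted results, since statsmodels exposes the p-values as a Series (a quick sketch using the usual 0.05 cutoff):

# p-value for each feature from the fitted OLS results
pvalues = model.pvalues

# keep only the features that are significant at the 5% level
significant_features = pvalues[pvalues < 0.05].index.tolist()
print(significant_features)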

So I isolated the features fc, pso, sd, and dso into a separate set and ran them against the target:

X = feature_train[['fc', 'pso', 'sd', 'dso']]
y = target_train
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
Notice that the R-squared did not change

Then I isolated the rest of the features and ran them against the target variable:

X = feature_train[['mulliken_charge_atom1', 'YY', 'XX_atom1', 'XX', 'ZZ', 'YY_atom1',
                   'mulliken_charge', 'potential_energy', 'ZZ_atom1', 'Z']]
y = target_train
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
Observe the huge change in R-squared
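The same comparison can be made directly in code by fitting the two subsets separately and reading off their R-squared values; note that because no intercept was added (there is no sm.add_constant call), statsmodels reports the uncentered R-squared (a sketch, keeping the two fits under separate names):

# fit the two feature subsets separately and compare their R-squared values
sig_cols = ['fc', 'pso', 'sd', 'dso']
rest_cols = [c for c in feature_train.columns if c not in sig_cols]
model_sig = sm.OLS(target_train, feature_train[sig_cols]).fit()
model_rest = sm.OLS(target_train, feature_train[rest_cols]).fit()
print(model_sig.rsquared, model_rest.rsquared)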

This showed that these features are not as important for determining the target. The next step is to compute the Root Mean Squared Error (RMSE) to see whether all of these features from LGBM can actually predict the target well:

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
model = linreg.fit(feature_train, target_train)
y_hat_train = linreg.predict(feature_train)
y_hat_test = linreg.predict(feature_test)
train_residuals = y_hat_train - target_train
test_residuals = y_hat_test - target_test

# just to make sure all the dimensions are correct
print(len(feature_train), len(target_train), len(feature_test), len(target_test))
# 419233 419233 46582 46582
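(For reference, feature_train, feature_test, target_train, and target_test come from a standard split of the prepared data; something along the lines of the sketch below, with roughly a 90/10 split judging by those row counts. The variable names features and target are assumptions, not the original code.)

from sklearn.model_selection import train_test_split

# roughly 90/10 split of the selected features and the scalar coupling constant target
feature_train, feature_test, target_train, target_test = train_test_split(
    features, target, test_size=0.1, random_state=42)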

Now for the actual RMSE calculation. Generally, the smaller the number, the better your algorithm is doing:


import numpy as np

mse_train = np.sum((target_train - y_hat_train)**2) / len(target_train)
mse_test = np.sum((target_test - y_hat_test)**2) / len(target_test)
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)
print("Train MSE", mse_train)
print("Test MSE", mse_test)
print("Train RMSE", rmse_train)
print("Test RMSE", rmse_test)
Both show a rather small root mean squared error, which suggests the model predicts reasonably well. However, this is with all of the selected features included, not just the significant ones identified by the OLS model above
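As a sanity check on the hand-rolled formula, sklearn's mean_squared_error gives the same numbers (a small sketch reusing the predictions from above):

from sklearn.metrics import mean_squared_error
import numpy as np

# same calculation via sklearn; take the square root for RMSE
print("Train RMSE", np.sqrt(mean_squared_error(target_train, y_hat_train)))
print("Test RMSE", np.sqrt(mean_squared_error(target_test, y_hat_test)))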

I liked these numbers, despite what my OLS model showed earlier. So I decided to check this by plotting the residuals to see how well the model predicts:

# residuals on the training set
y = y_hat_train - target_train
sns.residplot(target_train, y_hat_train, data=feature_target)
Shows huge outliers for some variables. Now to test it with the OLS-determined significant features

Now we know that there are some heavy outliers present, and based on the OLS summary I can guess that this is due to the features that are not contributing enough. I will test the model again, but only using the significant features from the OLS model.
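That follow-up would look something like the sketch below: refit on just the four OLS-significant features and compare the test RMSE against the all-features model (this is the planned next step, not a result from this post):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# refit the linear regression on only the OLS-significant features
sig_cols = ['fc', 'pso', 'sd', 'dso']
linreg_sig = LinearRegression().fit(feature_train[sig_cols], target_train)

# compare the test RMSE against the all-features model above
y_hat_test_sig = linreg_sig.predict(feature_test[sig_cols])
print("Test RMSE (significant features only)",
      np.sqrt(mean_squared_error(target_test, y_hat_test_sig)))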
