Linear Regression of Selected Features
Part 2 of the LGBM series: taking the important features and running a linear regression model on them
My earlier post on LGBM models, and how to use them to extract important features, can be found here. In this post, I take those important features, run them through linear regression, do further analysis, and identify the relationship between these features and the target variable.
I decided to work with the top 14 features from the feature importance section. Using seaborn's correlation map, I extracted this heatmap:
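The heatmap can be produced directly from the feature frame's correlation matrix. A minimal sketch with toy data (the small frame and its column names here are illustrative stand-ins for the real top-14 feature columns):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Toy frame standing in for the top-14 feature columns (names are illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=['fc', 'pso', 'sd', 'dso'])

corr = df.corr()                 # pairwise Pearson correlations
sns.heatmap(corr, cmap='magma')  # dark cells correspond to near-zero correlation
plt.show()
```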
Here we see that fc, pso, sd, and dso are positively correlated with each other but negatively correlated with the rest. Many of the features have almost no correlation at all (the dark cells). To quantify this, I ran a simple Ordinary Least Squares (OLS) model on these features against the target:
import statsmodels.api as sm

X = feature_train
y = target_train
model = sm.OLS(y, X).fit()   # note: sm.OLS adds no intercept unless you use sm.add_constant
predictions = model.predict(X)
model.summary()
From the summary we see a clear division: quite a few of these features have no significant impact on the target variable (the scalar coupling constant), while the first four variables have high t-statistics and p-values below 0.05. This confirms what the correlation heatmap suggested: even among the important features, some play a much bigger role than the rest.
So I isolated the features fc, pso, sd, and dso into a separate set and ran them against the target:
X = feature_train[['fc', 'pso', 'sd', 'dso']]
y = target_train
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
Then I isolated the rest of the features and ran them against the target variable:
X = feature_train[['mulliken_charge_atom1', 'YY', 'XX_atom1', 'XX', 'ZZ', 'YY_atom1',
                   'mulliken_charge', 'potential_energy', 'ZZ_atom1', 'Z']]
y = target_train
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
This showed that these features are not as important for determining the target. The next step is to compute the Root Mean Squared Error (RMSE) to see whether all of the LGBM features together can actually predict well:
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
model = linreg.fit(feature_train, target_train)
y_hat_train = linreg.predict(feature_train)
y_hat_test = linreg.predict(feature_test)
train_residuals = y_hat_train - target_train
test_residuals = y_hat_test - target_test

# Sanity check that all the dimensions line up
print(len(feature_train), len(target_train), len(feature_test), len(target_test))
# 419233 419233 46582 46582
Now to compute the actual RMSE. Generally, the smaller the value, the better the model:
import numpy as np

mse_train = np.sum((target_train - y_hat_train)**2) / len(target_train)
mse_test = np.sum((target_test - y_hat_test)**2) / len(target_test)
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)

print("Train MSE", mse_train)
print("Test MSE", mse_test)
print("Train RMSE", rmse_train)
print("Test RMSE", rmse_test)
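The manual calculation above can also be cross-checked against scikit-learn's built-in metric. A small sketch with toy arrays standing in for target_test and y_hat_test:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy arrays standing in for target_test and y_hat_test
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # same as np.mean((y_true - y_pred)**2)
rmse = np.sqrt(mse)
print(mse, rmse)   # 0.375 0.6123724356957945
```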
I liked these numbers, despite what my OLS summary showed earlier. To double-check, I plotted the residuals on the training data to see how well the model actually fits:
import seaborn as sns

sns.residplot(x=target_train, y=y_hat_train)
Now we can see that there are some heavy outliers present, and based on the OLS summary I would guess this is due to the features that are not contributing enough. Next, I will test the model again using only the significant features from the OLS results.
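That follow-up fit would look roughly like this. A sketch with toy data, where `selected` stands in for the OLS-significant subset (fc, pso, sd, dso) and the frame and column names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy data: the target is driven by the 'fc'-like column; 'noise_feature' is irrelevant
rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(400, 3)), columns=['fc', 'pso', 'noise_feature'])
target = 4.0 * df['fc'] + rng.normal(scale=0.5, size=400)

selected = ['fc', 'pso']   # stand-in for the OLS-significant subset
X_tr, X_te, y_tr, y_te = train_test_split(df[selected], target,
                                          test_size=0.1, random_state=0)

linreg = LinearRegression().fit(X_tr, y_tr)
rmse = np.sqrt(np.mean((linreg.predict(X_te) - y_te) ** 2))
print("Test RMSE", rmse)   # should sit near the noise level (~0.5)
```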