LGBM and Feature Extraction

Using a Light Gradient Boosting Machine (LGBM) model to find the important features in a dataset with many features

Iftekher Mamun
4 min read · Aug 18, 2019

In my last post, I talked about how I used some basic EDA and Seaborn to explore my molecule prediction project. You can find that post by clicking here: Steps Before Machine Learning. By the time I finished the heatmap/correlation map for my data, I had learned that Seaborn by itself was not sufficient. Since my dataset is rather massive, I needed to look into other methods to help visualize and, ultimately, highlight the important features.

I went through multiple different visualizers such as Plotly and D3.js. Plotly was not able to handle even a subset of the data; it kept breaking, most likely because of the categorical variables. And after many attempts I was not able to set up D3.js with my program: the preset dataframe structure D3 expected did not line up with my dataset, and I was not able to align them at the time.

So I decided to look into machine learning tools, like Principal Component Analysis, that would help identify my important features. After some searching and advice, I came across the Light Gradient Boosting Machine, or LGBM. Based on what I read, this was perfect: it would give me a baseline model using my subsample, and it would also tell me which features contribute most to the prediction of the target variable.

However, it was very difficult to install LGBM on my Mac. It kept crashing, even when I went through the official documentation. Then I finally found a detailed walkthrough for installing LGBM with brew install in the terminal. The installation guide can be found here. Remember to check your computer's OS, as installing for the wrong system will cause it to crash. After following the installation, I was able to import it directly into my Jupyter notebook with the following command:

import lightgbm
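
As a quick sanity check that the install actually worked, you can print the version (nothing more than that):

import lightgbm
print(lightgbm.__version__)  # shows the installed LightGBM version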

Now that I had the most important library ready, I decided to train-test split my dataset using sklearn.

from sklearn.model_selection import train_test_split

feature = df_dropped_sample.drop(['scalar_coupling_constant'], axis=1)
target = df_dropped_sample[['scalar_coupling_constant']]

feature_train, feature_test, target_train, target_test = train_test_split(feature, target, test_size=0.12)

With that finished, I wanted to check the length and shape of my dataset.

print('total feature training features: ', len(feature_train))
print('total feature testing features: ', len(feature_test))
print('total target training features: ', len(target_train))
print('total target testing features: ', len(target_test))
#And the results were:
total feature training features: 40991
total feature testing features: 5590
total target training features: 40991
total target testing features: 5590
feature_train.shape
(40991, 71)
target_train.shape
(40991, 1)

Okay, now that I had training and testing sets, it was time to feed them into the LightGBM model. I used an example from Kaggle to create a basic LGBM model. For an LGBM model to work, you have to wrap your dataframes in LightGBM's own Dataset object:

# categorical_features is the list of categorical column names, defined earlier (not shown)
train_data = lightgbm.Dataset(feature_train, label=target_train, categorical_feature=categorical_features)
test_data = lightgbm.Dataset(feature_test, label=target_test)

Now we had our training and testing sets ready. Then I took the basic parameters from the Kaggle example and dropped them straight into my code:

# basic parameters:
parameters = {
    'application': 'binary',
    'objective': 'binary',
    'metric': 'auc',
    'is_unbalance': 'true',
    'boosting': 'gbdt',
    'num_leaves': 31,
    'feature_fraction': 0.5,
    'bagging_fraction': 0.5,
    'bagging_freq': 20,
    'learning_rate': 0.05,
    'verbose': 0
}

Finally, I created the model, fit it, and ran it against the test set:

model = lightgbm.train(parameters,
                       train_data,
                       valid_sets=test_data,
                       num_boost_round=5000,
                       early_stopping_rounds=100)
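
A side note if you are following along on a newer LightGBM release: the early_stopping_rounds argument has since been moved out of train() and into callbacks, so the equivalent call would look roughly like this (a sketch under that assumption, not the exact code I ran):

# equivalent call on newer LightGBM releases, reusing the parameters,
# train_data and test_data objects defined above
model = lightgbm.train(parameters,
                       train_data,
                       valid_sets=[test_data],
                       num_boost_round=5000,
                       callbacks=[lightgbm.early_stopping(stopping_rounds=100)])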

The run stopped at [749] valid_0's auc: 0.999874, an AUC of nearly 1. The model is obviously overfitting on the data, but my primary concern is which features are being weighted most heavily. Thankfully, LGBM has a built-in plot function that shows you exactly that:

import matplotlib.pyplot as plt

ax = lightgbm.plot_importance(model, max_num_features=40, figsize=(15,15))
plt.show()

And it showed me this:

Here we see that most of the categorical features carry almost no weight, which is obvious. But we also see the top ten features we should concern our machine with. The next step of this project will be to focus on using these features to build a proper linear regression model.
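
To give a sense of what that next step could look like, here is a rough sketch of pulling the top features out of the trained booster and handing them to a plain sklearn LinearRegression. The top-ten cutoff and the direct fit on raw columns are illustrative assumptions on my part, not results from this post:

import pandas as pd
from sklearn.linear_model import LinearRegression

# rank features by the booster's split-based importance
importance = pd.Series(model.feature_importance(), index=model.feature_name())
top_features = importance.sort_values(ascending=False).head(10).index.tolist()

# fit a simple linear regression on just those columns
# (assumes the selected columns are numeric; categorical ones would need encoding first)
lin_reg = LinearRegression()
lin_reg.fit(feature_train[top_features], target_train)
print(lin_reg.score(feature_test[top_features], target_test))  # R^2 on the held-out split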
