LightGBM Starter Code

By Udbhav Pangotra
Published in Geek Culture, Jan 16, 2022

Here is your first LightGBM code!

Photo by Bench Accounting on Unsplash

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

  • Faster training speed and higher efficiency
  • Lower memory usage
  • Better accuracy
  • Support of parallel and GPU learning
  • Capable of handling large-scale data

Let’s start with the imports:

import numpy as np  # numerical helpers
import pandas as pd  # to work with dataframes
import matplotlib.pyplot as plt  # we will plot something at the end
import lightgbm as lgb  # the model we are going to use
from sklearn.decomposition import PCA, FastICA  # extra decomposition features
from sklearn.metrics import mean_squared_error, r2_score  # the scores we report
from sklearn.model_selection import train_test_split  # hold-out validation split
from sklearn.preprocessing import LabelEncoder  # encodes categorical columns

Let’s read the data: train, target and test

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

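# Encode every categorical (object) column as integers
for c in train.columns:
    if train[c].dtype == 'object':
        lbl = LabelEncoder()
        # fit on train and test together so both share one mapping
        lbl.fit(list(train[c].values) + list(test[c].values))
        train[c] = lbl.transform(list(train[c].values))
        test[c] = lbl.transform(list(test[c].values))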
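
The loop above fits each encoder on the train and test values together. A minimal sketch of why that matters, using made-up toy frames and a hypothetical column name `X0` (neither comes from the original data): a category that appears only in the test set still gets a code, whereas an encoder fitted on train alone rejects it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames (hypothetical data): "d" appears only in the test set.
toy_train = pd.DataFrame({"X0": ["a", "b", "a", "c"]})
toy_test = pd.DataFrame({"X0": ["b", "d"]})

# Fit on the union of both frames so every category gets an integer code.
encoder = LabelEncoder()
encoder.fit(list(toy_train["X0"]) + list(toy_test["X0"]))

train_codes = list(encoder.transform(toy_train["X0"]))
test_codes = list(encoder.transform(toy_test["X0"]))
print(list(encoder.classes_))  # categories, sorted
print(train_codes, test_codes)

# An encoder fitted on train only cannot transform the unseen "d".
train_only = LabelEncoder().fit(toy_train["X0"])
unseen_error = False
try:
    train_only.transform(toy_test["X0"])
except ValueError as err:
    unseen_error = True
    print("train-only encoder fails:", err)
```

Keep in mind that the integer codes are arbitrary labels and carry no real ordering.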

Shape of the dataset

print('Shape train: {}\nShape test: {}'.format(train.shape, test.shape))

n_comp = 6

# PCA
pca = PCA(n_components=n_comp, random_state=42)
pca2_results_train = pca.fit_transform(train.drop(["y"], axis=1))
pca2_results_test = pca.transform(test)

# ICA
ica = FastICA(n_components=n_comp, random_state=42)
ica2_results_train = ica.fit_transform(train.drop(["y"], axis=1))
ica2_results_test = ica.transform(test)

# Append decomposition components to datasets
for i in range(1, n_comp+1):
    train['pca_' + str(i)] = pca2_results_train[:, i-1]
    test['pca_' + str(i)] = pca2_results_test[:, i-1]

    train['ica_' + str(i)] = ica2_results_train[:, i-1]
    test['ica_' + str(i)] = ica2_results_test[:, i-1]


# remove duplicates - needs to be applied to test too
# train = train.T.drop_duplicates().T
# test = test.T.drop_duplicates().T


y_train = train["y"]
y_mean = np.mean(y_train)
train.drop('y', axis=1, inplace=True)
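
The decomposition step above can be tried in isolation. The following is only a sketch on synthetic data: the frames, the column names `f0`..`f9`, `n_comp = 3` and `max_iter=1000` are assumptions made for illustration. It shows the pattern used in the post: fit PCA and FastICA on the training features only, apply the fitted transforms to the test features, then append the components as new columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(42)
cols = [f"f{i}" for i in range(10)]
# Synthetic stand-ins for the feature matrices (hypothetical data).
toy_train = pd.DataFrame(rng.uniform(size=(200, 10)), columns=cols)
toy_test = pd.DataFrame(rng.uniform(size=(50, 10)), columns=cols)

n_comp = 3
pca = PCA(n_components=n_comp, random_state=42)
ica = FastICA(n_components=n_comp, random_state=42, max_iter=1000)

# Fit on train only, then reuse the fitted transforms on test.
pca_train = pca.fit_transform(toy_train)
pca_test = pca.transform(toy_test)
ica_train = ica.fit_transform(toy_train)
ica_test = ica.transform(toy_test)

# Append the components as extra feature columns.
for i in range(n_comp):
    toy_train[f"pca_{i + 1}"] = pca_train[:, i]
    toy_test[f"pca_{i + 1}"] = pca_test[:, i]
    toy_train[f"ica_{i + 1}"] = ica_train[:, i]
    toy_test[f"ica_{i + 1}"] = ica_test[:, i]

print(toy_train.shape, toy_test.shape)
```

Computing all four transforms before appending any column keeps the new `pca_*` columns from leaking into the ICA input.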

Training the model!

X_train, X_valid, y_train, y_valid = train_test_split(
    train, y_train, test_size=0.2, random_state=9127)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_valid = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

# to record eval results for plotting
evals_result = {}

# The r2 is: 0.648019302812 the rmse is: 7.2525692268
# specify your configurations as a dict
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2'},
    'num_leaves': 5,
    'learning_rate': 0.06,
    'max_depth': 4,
    # 'subsample' is an alias of 'bagging_fraction' in LightGBM,
    # so only one of the two should be set
    'feature_fraction': 0.9,
    'bagging_fraction': 0.85,
    'bagging_freq': 4,
    'min_data_in_leaf': 4,
    'min_sum_hessian_in_leaf': 0.8,
    'verbose': 10
}

print('Start training...')

# train
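# Note: LightGBM 4.0 removed the evals_result, verbose_eval and
# early_stopping_rounds arguments of lgb.train; callbacks replace them.
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=8000,  # upper bound; early stopping usually ends sooner
                valid_sets=[lgb_train, lgb_valid],
                callbacks=[
                    lgb.early_stopping(stopping_rounds=50),  # stop after 50 rounds without improvement
                    lgb.log_evaluation(period=10),  # log the metric every 10 rounds
                    lgb.record_evaluation(evals_result),  # keep the history for plotting
                ])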

#print('\nSave model...')
# save model to file
#gbm.save_model('model.txt')

print('Start predicting...')
# predict
y_pred = gbm.predict(X_valid, num_iteration=gbm.best_iteration)
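
The parameter dictionary above sets both `num_leaves` and `max_depth`. A binary tree limited to depth d can have at most 2^d leaves, so it can help to check which of the two settings actually limits tree size. The helper below is hypothetical (it is not part of LightGBM) and is only a rough sanity check:

```python
def max_leaves_for_depth(max_depth: int) -> int:
    """Upper bound on the leaves of a binary tree with at most max_depth split levels."""
    return 2 ** max_depth


# Values copied from the params dict in this post.
tree_params = {"num_leaves": 5, "max_depth": 4}

ceiling = max_leaves_for_depth(tree_params["max_depth"])
binding = "num_leaves" if tree_params["num_leaves"] < ceiling else "max_depth"

print("leaf ceiling implied by max_depth:", ceiling)
print("setting that limits tree size:", binding)
```

With these values the depth limit would allow up to 16 leaves, so `num_leaves=5` is the tighter constraint and the trees stay very small.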

Evaluation

# feature importances
print('Feature importances:', list(gbm.feature_importance()))

# -------------------------------------------------------
print('Plot metrics during training...')
ax = lgb.plot_metric(evals_result, metric='l2')
plt.show()

print('Plot feature importances...')
ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()
# -------------------------------------------------------
# eval r2-score
from sklearn.metrics import mean_squared_error, r2_score
r2 = r2_score(y_valid, y_pred)

# eval rmse (lower is better)
print('\nThe r2 is: ',r2, 'the rmse is:', mean_squared_error(y_valid, y_pred) ** 0.5)

# -------------------------------------------------------
print('\nPredicting test set...')
y_pred = gbm.predict(test, num_iteration=gbm.best_iteration)

# y_pred = model.predict(dtest)
output = pd.DataFrame({'id': test['ID'], 'y': y_pred})
output.to_csv('submit-lightgbm-ICA-PCA.csv', index=False)

# -----------------------------------------------------------------------------
print("Finished.")
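
To make the two scores printed in the evaluation code less of a black box, here is a small sketch that computes RMSE and R² by hand with NumPy. The target and prediction values are made up for illustration; the formulas are the standard definitions, which should agree with `mean_squared_error(...) ** 0.5` and `r2_score(...)`.

```python
import numpy as np


def rmse(y_true, y_hat):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_hat) ** 2)))


def r2(y_true, y_hat):
    """R^2: 1 minus residual sum of squares over total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up targets and predictions (hypothetical values).
toy_true = [100.0, 110.0, 90.0, 120.0]
toy_hat = [102.0, 108.0, 93.0, 115.0]

toy_rmse = rmse(toy_true, toy_hat)
toy_r2 = r2(toy_true, toy_hat)
# Predicting the mean everywhere is the baseline that scores R^2 = 0.
baseline_r2 = r2(toy_true, [np.mean(toy_true)] * len(toy_true))

print("rmse:", toy_rmse, "r2:", toy_r2, "baseline r2:", baseline_r2)
```

RMSE is in the same units as the target (lower is better), while R² compares the model against always predicting the mean (higher is better).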

This should work for you! Cheers!

Do reach out and comment if you get stuck!

Other articles you might be interested in:
- Getting started with Apache Spark — I | by Sam | Geek Culture | Jan, 2022 | Medium
- Getting started with Apache Spark II | by Sam | Geek Culture | Jan, 2022 | Medium
- Getting started with Apache Spark III | by Sam | Geek Culture | Jan, 2022 | Medium
- Streamlit and Palmer Penguins. Binged Atypical last week on Netflix… | by Sam | Geek Culture | Medium
- Getting started with Streamlit. Use Streamlit to explain your EDA and… | by Sam | Geek Culture | Medium

Cheers and do follow for more such content! :)

