TPS-Mar21, Top 14% Leaderboard, XGB, CatBoost, LGBM + Optuna 🚀

Part 2: XGBoost, CatBoost, and LightGBM with Optuna…

Hasan Basri Akçay
DataBulls
5 min read · Feb 7, 2022


Design vector created by freepik (www.freepik.com)

Modeling is one of the most important parts of prediction: you need to find the machine learning model that suits your problem best. In part 1, we worked on EDA and feature engineering; you can read that article here.

In this part of the article, we compare three gradient boosting models: XGBoost, CatBoost, and LightGBM. The competition metric is the Area Under the Receiver Operating Characteristic Curve (ROC AUC).

You can find the dataset here, and the full Python code is linked at the end of the article.

Introduction

First, we import the libraries and calculate baseline scores for the three models.

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
import optuna
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_param_importances

# XGBClassifier baseline: 5-fold cross-validated ROC AUC
xgbc_model = XGBClassifier(min_child_weight=0.1, reg_lambda=100, booster='gbtree', objective='binary:logitraw', random_state=42)
xgbc_score = cross_val_score(xgbc_model, train_X, train_y, scoring='roc_auc', cv=5)
print('xgbc_score: ', xgbc_score.mean())

# LGBMClassifier baseline
ligthgbmc_model = LGBMClassifier(boosting_type='gbdt', objective='binary', random_state=42)
ligthgbmc_score = cross_val_score(ligthgbmc_model, train_X, train_y, scoring='roc_auc', cv=5)
print('ligthgbmc_score: ', ligthgbmc_score.mean())

# CatBoostClassifier baseline
cbc_model = CatBoostClassifier(loss_function='Logloss', random_state=42, verbose=False)
cbc_score = cross_val_score(cbc_model, train_X, train_y, scoring='roc_auc', cv=5)
print('cbc_score: ', cbc_score.mean())
Outputs:
xgbc_score: 0.8898202612356174
ligthgbmc_score: 0.8879385374274603
cbc_score: 0.8909648517647316
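
The imports also bring in StratifiedKFold. cross_val_score already uses stratified folds for classifiers when you pass cv=5, but if you want shuffled, explicitly stratified folds with a fixed seed, you can hand it a splitter object instead. A minimal sketch, reusing the CatBoost baseline from above:

# Optional: explicit stratified, shuffled folds with a fixed seed
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cbc_model = CatBoostClassifier(loss_function='Logloss', random_state=42, verbose=False)
cbc_score = cross_val_score(cbc_model, train_X, train_y, scoring='roc_auc', cv=skf)
print('cbc_score (shuffled folds): ', cbc_score.mean())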

XGBoost + Optuna

According to the baseline scores, CatBoost is the best model, but that can change after hyperparameter tuning. You can see how XGBoost is tuned with Optuna below.

def objective(trial, data=X, target=y):
    # Hold-out split used for early stopping and validation scoring
    X_train, X_val, y_train, y_val = train_test_split(data, target, test_size=0.2, random_state=42)

    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 32),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.005, 0.02, 0.05, 0.08, 0.1]),
        'n_estimators': trial.suggest_int('n_estimators', 2000, 8000),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
        'gamma': trial.suggest_float('gamma', 0.0001, 1.0, log=True),
        'alpha': trial.suggest_float('alpha', 0.0001, 10.0, log=True),
        'lambda': trial.suggest_float('lambda', 0.0001, 10.0, log=True),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 0.8),
        'subsample': trial.suggest_float('subsample', 0.1, 0.8),
        'tree_method': 'gpu_hist',
        'booster': 'gbtree',
        'random_state': 42,
        'use_label_encoder': False,
        'eval_metric': 'auc'
    }

    model = XGBClassifier(**params)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=333, verbose=False)
    y_pred = model.predict_proba(X_val)[:, 1]
    roc_auc = roc_auc_score(y_val, y_pred)

    return roc_auc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Best value: ', study.best_value)
Outputs:
Best value: 0.8951492161710065
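
After the study finishes, study.best_params holds the winning hyperparameters, so you can refit a final model on the full training set. A minimal sketch (test_X, the competition test features from part 1, is an assumed name here; the fixed arguments mirror the objective above):

# Refit XGBoost on all training data with the tuned hyperparameters
best_params = dict(study.best_params)
best_params.update({'tree_method': 'gpu_hist', 'booster': 'gbtree', 'random_state': 42,
                    'use_label_encoder': False, 'eval_metric': 'auc'})
final_xgb = XGBClassifier(**best_params)
final_xgb.fit(X, y)
test_preds = final_xgb.predict_proba(test_X)[:, 1]  # test_X: assumed test-feature frame from part 1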

CatBoost + Optuna

def objective(trial, data=X, target=y):
    X_train, X_val, y_train, y_val = train_test_split(data, target, test_size=0.2, random_state=42)

    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 64),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.005, 0.02, 0.05, 0.08, 0.1]),
        'n_estimators': trial.suggest_int('n_estimators', 2000, 8000),
        'max_bin': trial.suggest_int('max_bin', 200, 400),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 300),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 0.0001, 1.0, log=True),
        'subsample': trial.suggest_float('subsample', 0.1, 0.8),
        'random_seed': 42,
        'task_type': 'GPU',
        'loss_function': 'Logloss',
        'eval_metric': 'AUC',
        'bootstrap_type': 'Poisson'  # Poisson bootstrap is only available on GPU
    }

    model = CatBoostClassifier(**params)
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], early_stopping_rounds=222, verbose=False)
    y_pred = model.predict_proba(X_val)[:, 1]
    roc_auc = roc_auc_score(y_val, y_pred)

    return roc_auc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Best value:', study.best_value)
Outputs:
Best value: 0.8925910141177894
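
The LGBM objective below passes the categorical column indices explicitly; CatBoost can be given the same information through its cat_features argument, which usually works better than treating those columns as numeric. A minimal sketch, assuming the same column indices as in the LGBM section and that those columns are integer or string typed after the part 1 preprocessing:

# Sketch: declare the categorical columns to CatBoost directly
cat_idx = list(range(11, 68))  # assumed: same indices as in the LGBM objective below
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
cbc_model = CatBoostClassifier(loss_function='Logloss', eval_metric='AUC', random_seed=42)
cbc_model.fit(X_train, y_train, cat_features=cat_idx,
              eval_set=(X_val, y_val), early_stopping_rounds=222, verbose=False)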

LGBM + Optuna

After hyperparameter optimization, LGBM turns out to be the best model.

def objective(trial, data=X, target=y):
    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.15, random_state=42)

    params = {
        'reg_alpha': trial.suggest_float('reg_alpha', 0.001, 10.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.001, 10.0),
        'num_leaves': trial.suggest_int('num_leaves', 11, 333),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'max_depth': trial.suggest_int('max_depth', 5, 64),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.01, 0.02, 0.05, 0.005, 0.1]),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 0.5),
        'n_estimators': trial.suggest_int('n_estimators', 2000, 8000),
        'cat_smooth': trial.suggest_int('cat_smooth', 10, 100),
        'cat_l2': trial.suggest_int('cat_l2', 1, 20),
        'min_data_per_group': trial.suggest_int('min_data_per_group', 50, 200),
        # Indices of the categorical columns in the training data
        'cat_feature': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
                        32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
                        53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67],
        'n_jobs': -1,
        'random_state': 42,
        'boosting_type': 'gbdt',
        'metric': 'AUC',
        'device': 'gpu'
    }

    model = LGBMClassifier(**params)
    model.fit(train_x, train_y, eval_set=[(test_x, test_y)], eval_metric='auc', early_stopping_rounds=300, verbose=False)
    preds = model.predict_proba(test_x)[:, 1]
    auc = roc_auc_score(test_y, preds)

    return auc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print('Best value:', study.best_value)
Outputs:
Best value: 0.8966645758299353
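
Before plotting, it can be useful to inspect the raw trial results; Optuna exposes them as a pandas DataFrame. A small sketch:

# Inspect the best trial and the full trial history
print('Best params:', study.best_params)
trials_df = study.trials_dataframe()
print(trials_df[['number', 'value', 'state']].sort_values('value', ascending=False).head())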

Visualizations

Optimization History

# Optimization history
plot_optimization_history(study)
Optimization History Plot — image by author

Hyperparameter Importances

# Hyperparameter importance
plot_param_importances(study)
Hyperparameter Importances Plot — image by author
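
Both helpers return Plotly figures, so you can also save the plots as standalone HTML files instead of only rendering them in the notebook. A small sketch (the file names are just examples):

# Save the Optuna plots outside the notebook
fig_hist = plot_optimization_history(study)
fig_hist.write_html('optimization_history.html')
fig_imp = plot_param_importances(study)
fig_imp.write_html('param_importances.html')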

Conclusion

This is part 2 of my work on the TPS-Mar21 competition, in which I placed in the top 14% of the leaderboard. In this article, we compared well-known gradient boosting models to get better predictions. Based on the results, LightGBM is the best model for this problem.

The best boosting model changes from problem to problem, and sometimes speed matters more than accuracy. You can find more detailed information about when to choose which boosting model in this article.

You can see the full Python code and all plots here 👉 Kaggle Notebook.

👋 Thanks for reading. If you enjoy my work, don’t forget to like it 👏 and follow me on Medium and LinkedIn. It motivates me to create more content for the Medium community! 😊

References:

[1]: https://www.kaggle.com/hasanbasriakcay/xgb-catboost-lgbm-optuna-lb-14/notebook
[2]: https://www.kaggle.com/c/tabular-playground-series-mar-2021/data
[3]: https://optuna.readthedocs.io/en/stable/reference/study.html
[4]: https://xgboost.readthedocs.io/en/stable/
[5]: https://catboost.ai/en/docs/
[6]: https://lightgbm.readthedocs.io/en/latest/
[7]: https://neptune.ai/blog/when-to-choose-catboost-over-xgboost-or-lightgbm
