Published in Analytics Vidhya

ML25: Top 7% in Give-Me-Some-Credit on Kaggle

Utilizing LR, RF & XGBoost with two-layer stacking

  • This article is highly simplified. Check the repository on GitHub for complete information.
  • After thorough feature engineering, I leveraged LR, RF & XGBoost, then did a two-layer stacking. Finally, I got 14.83% (137/924) on the public leaderboard and 6.82% (63/924) on the private leaderboard, which is equivalent to getting a bronze medal in this long-closed competition.

Outline
(1) Introduction
(2) Feature Engineering
(3) Model Selection: LR, RF & XGBoost
(4) Two-Layer Stacking
(5) Outcomes & Ranking
(6) Conclusion
(7) References

(1) Introduction

  • Give Me Some Credit is a Kaggle competition that closed in 2011. Competitors were required to predict credit default on an imbalanced dataset whose target distribution was (0, 1) = (93.32%, 6.68%); accordingly, the evaluation metric was AUC. A minimal sketch of checking the class ratio and computing AUC follows this list.
  • I did this project in June 2020 as the final project of the graduate-level course “R Computing for Business Data Analytics” offered by the Department of MIS at NCCU, in which I earned 97 (A+).
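
A minimal sketch in R, assuming a data frame train holding the dataset with its SeriousDlqin2yrs target column and a vector pred of predicted probabilities (both names are illustrative); the pROC package is used for AUC:

```r
# Minimal sketch: inspect class imbalance and evaluate by AUC with pROC.
# `train` and `pred` are assumed to exist already.
library(pROC)

prop.table(table(train$SeriousDlqin2yrs))     # roughly 93.3% zeros vs 6.7% ones

roc_obj <- roc(train$SeriousDlqin2yrs, pred)  # response, predicted probability
auc(roc_obj)                                  # area under the ROC curve
```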

(2) Feature Engineering

  • Adjusting outliers (e.g., capping extreme values).
  • Taking log() of the features, since most of them are positively skewed. A minimal sketch of both steps follows this list.
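
A minimal sketch of these two steps in R, assuming a data frame train with the Give Me Some Credit column names; the 99th-percentile cap is an illustrative choice, not necessarily the one used in the project:

```r
# Cap extreme values at the 99th percentile (winsorize the right tail).
cap_outliers <- function(x, probs = 0.99) {
  q <- quantile(x, probs, na.rm = TRUE)
  pmin(x, q)
}

train$RevolvingUtilizationOfUnsecuredLines <-
  cap_outliers(train$RevolvingUtilizationOfUnsecuredLines)

# log1p() handles zeros in positively skewed features.
train$MonthlyIncome <- log1p(train$MonthlyIncome)
train$DebtRatio     <- log1p(train$DebtRatio)
```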

2–1 EDA BEFORE Feature Engineering

2–2 EDA AFTER Feature Engineering

(3) Model Selection: LR, RF & XGBoost

3–1 Logistic Regression

I generated higher-degree terms as well as interaction terms, then leveraged stepwise logistic regression (both directions) to pick out influential features. Afterwards, I refined the feature set based on AIC, BIC, and the p-values of the features.
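
A minimal sketch of this step in R with MASS::stepAIC; the squared term and interaction shown are illustrative, not the exact features used in the project:

```r
library(MASS)

# Start from a model that includes higher-degree and interaction terms.
full_model <- glm(
  SeriousDlqin2yrs ~ age + I(age^2) + MonthlyIncome + DebtRatio +
    MonthlyIncome:DebtRatio + NumberOfTimes90DaysLate,
  data = train, family = binomial
)

# Both-direction stepwise selection by AIC; use k = log(nrow(train)) for BIC.
step_model <- stepAIC(full_model, direction = "both", trace = FALSE)
summary(step_model)   # inspect the p-values of the remaining features
```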

3–2 Tree-Based Methods: RF & XGBoost

Since XGBoost outperforms RF on this dataset, we conclude that its noise is negligible: boosting tends to beat bagging when the data are relatively clean, whereas heavy noise favors the averaging of RF. Confirming whether noise is present has completely different implications for tuning.
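
A minimal sketch of fitting both models and comparing validation AUC in R; the hyperparameters are illustrative, train/valid are assumed to be pre-split with missing values already imputed, and the target is assumed to be the first column:

```r
library(randomForest)
library(xgboost)
library(pROC)

# Random forest (bagging).
rf_fit  <- randomForest(as.factor(SeriousDlqin2yrs) ~ ., data = train, ntree = 500)
rf_pred <- predict(rf_fit, valid, type = "prob")[, 2]

# XGBoost (boosting).
x_train <- as.matrix(train[, -1]); x_valid <- as.matrix(valid[, -1])
xgb_fit <- xgboost(data = x_train, label = train$SeriousDlqin2yrs,
                   nrounds = 300, max_depth = 4, eta = 0.1,
                   objective = "binary:logistic", eval_metric = "auc", verbose = 0)
xgb_pred <- predict(xgb_fit, x_valid)

auc(roc(valid$SeriousDlqin2yrs, rf_pred))    # bagging
auc(roc(valid$SeriousDlqin2yrs, xgb_pred))   # boosting; higher here suggests low noise
```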

(4) Two-Layer Stacking

I adopted two-layer stacking. First, I averaged the predictions of several LR, RF & XGBoost models respectively to obtain LR_stacking, RF_stacking & XGBT_stacking. Next, I averaged these stacked predictions together to obtain the final predictions and the resulting AUC, as sketched below.
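
A minimal sketch of the two layers in R, assuming several prediction vectors per model family on the test set (the vector names are illustrative) and the competition's Id / Probability submission format, with Id taken as the row index:

```r
# Layer 1: average predictions within each model family.
lr_stack  <- rowMeans(cbind(lr_pred_1,  lr_pred_2,  lr_pred_3))
rf_stack  <- rowMeans(cbind(rf_pred_1,  rf_pred_2,  rf_pred_3))
xgb_stack <- rowMeans(cbind(xgb_pred_1, xgb_pred_2, xgb_pred_3))

# Layer 2: average across the three stacked model families.
final_pred <- rowMeans(cbind(lr_stack, rf_stack, xgb_stack))

submission <- data.frame(Id = seq_along(final_pred), Probability = final_pred)
write.csv(submission, "submission.csv", row.names = FALSE)
```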

(5) Outcomes & Ranking

The outcomes and leaderboard rankings are summarized below.

  • After thorough feature engineering, I leveraged LR, RF & XGBoost, then did a two-layer stacking. Finally, I got 14.83% (137/924) on the public leaderboard and 6.82% (63/924) on the private leaderboard, which is equivalent to getting a bronze medal in this long-closed competition.

(6) Conclusion

  • We chose the submission with the best public score, 14.83% (137/924) on the public leaderboard, which corresponds to 6.82% (63/924) on the private leaderboard and is equivalent to getting a bronze medal in this long-closed competition.
  • We might add interaction terms to the RF & XGBoost models for better performance.

(7) References

  1. Ozdemir, S., & Susarla, D. (2018). Feature Engineering Made Easy. Birmingham, UK: Packt Publishing.
  2. Zheng, A., & Casari, A. (2018). Feature Engineering for Machine Learning. Sebastopol, CA: O'Reilly Media.
  3. Bonaccorso, G. (2017). Machine Learning Algorithms (2nd ed.). Birmingham, UK: Packt Publishing.
  4. Battiti, R., & Brunato, M. (2017). The LION Way: Machine Learning plus Intelligent Optimization. Trento, Italy: LIONlab, University of Trento.
  5. Zumel, N., & Mount, J. (2014). Practical Data Science with R. Shelter Island, NY: Manning Publications.
  6. Online forum of the dataset “Give Me Some Credit” (2011) on Kaggle. Retrieved from https://bit.ly/3eWviGl

Yu-Cheng (Morton) Kuo

ML/DS using Python & R. A Taiwanese who earned an MBA from NCCU and a BS from NTHU with a major in MATH and a minor in ECON. Email: morton.kuo.28@gmail.com