[ ML ] Applying XGBoost to Kaggle

May 8, 2017

what is xgboost, how to tune parameters, kaggle tutorial

I still don't understand it 100%, but I want to write down what I did understand from the post above. I expect these notes to be useful later.

Introduction

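xgboost belongs to the same family of tree ensembles as random forest. Random forest builds its models on resampled data, which has the advantage of lowering variance.
It is based on a boosting algorithm, so the trees are built one after another instead of independently.
It has recently become very popular among Kaggle users.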
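
To make the "based on boosting" point concrete for myself, here is a minimal sketch of gradient boosting for squared error using depth-1 trees (stumps) in plain Python. This is not how xgboost is implemented (it adds regularization, second-order gradients, and much more); the function names `fit_stump`, `predict_stump`, and `boost` and the toy data are my own assumptions for illustration.

```python
# Minimal gradient boosting sketch (squared error, depth-1 trees).
# Illustration only; all names here are hypothetical, not xgboost's API.

def fit_stump(xs, residuals):
    """Pick the threshold on a 1-D feature that best splits the residuals."""
    best = None
    # Skip the largest value so the right side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, eta=0.3):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + eta * predict_stump(stump, x) for p, x in zip(preds, xs)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history


xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
base, stumps, mse_history = boost(xs, ys)
print(f"training MSE after round 1: {mse_history[0]:.3f}")
print(f"training MSE after round {len(stumps)}: {mse_history[-1]:.3f}")
```

The contrast with random forest is that every tree here depends on the errors left by the trees before it, whereas a random forest trains each tree independently on its own resample.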

Parallel Computing
Regularization :
One of the ways to avoid overfitting in linear and tree-based models.
Enabled Cross Validation :
A CV function is built in.
Missing Values :
Missing values are handled internally. This was really convenient when I actually applied it to Kaggle.
Flexibility :
The objective function defines the loss the model is trained to minimize, and xgboost lets you supply user-defined objective functions and evaluation metrics.
Availability :
Available in many languages, including Python and R.
Save and Reload
Tree Pruning :
In ordinary gradient boosting, splitting a node stops as soon as a split yields a negative loss reduction. xgboost instead grows the tree up to max_depth and then prunes backward, removing splits whose improvement in the loss function falls short of a threshold.
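
To see what the pruning threshold looks like in numbers, here is a small sketch of the split gain formula given in the xgboost documentation ("Introduction to Boosted Trees"), where G and H are the sums of first- and second-order gradients in a node. Reading the "threshold" above as the `gamma` parameter is my own interpretation, and `split_gain` and `keep_split` are hypothetical helper names, not library functions.

```python
# Split gain as described in the xgboost docs:
#   gain = 0.5 * [G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda)
#                 - (G_L+G_R)^2/(H_L+H_R+lambda)] - gamma
# Helper names are hypothetical; this is not library code.

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    def score(g, h):
        return g * g / (h + reg_lambda)

    improvement = (score(g_left, h_left)
                   + score(g_right, h_right)
                   - score(g_left + g_right, h_left + h_right))
    return 0.5 * improvement - gamma


def keep_split(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """A split survives pruning only if its gain is positive."""
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# The same candidate split, evaluated with two different gamma values.
print(keep_split(-4.0, 4.0, 4.0, 4.0, gamma=0.0))
print(keep_split(-4.0, 4.0, 4.0, 4.0, gamma=5.0))
```

With a larger gamma the same split no longer pays for itself, so it would be pruned away.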

xgboost can be used for both regression and classification.

Regression Problem : both booster = gbtree and booster = gblinear can be used / the linear model is optimized with regularization and gradient descent
Classification Problem : uses booster = gbtree / each new tree gives higher weight to the points the previous tree misclassified

The author of that post says the following about parameters.

“ using xgboost without parameter tuning is like driving a car without changing its gears; you can never up your speed ”

nrounds : maximum number of boosting iterations; with the tree booster this is the number of trees
eta : learning rate (shrinkage applied to each new tree); lower eta -> more rounds needed -> slower computation
gamma : minimum loss reduction required to make a split; larger gamma -> stronger regularization
max_depth : maximum depth of a tree; larger depth -> more complex model -> higher chance of overfitting; should be tuned using CV
min_child_weight : minimum sum of instance weights (hessian) required in a child node; for squared-error regression this equals the number of instances. Simply put, it blocks potential feature interactions to prevent overfitting
subsample : fraction of training rows sampled for each tree
colsample_bytree : fraction of features (columns) sampled for each tree
objective : the loss function to minimize
eval_metric : metric used for evaluation, e.g. RMSE, error