Predicting Heart Failure Using Machine Learning, Part 2

Andrew A Borkowski
Oct 10, 2020 · 5 min read

The easy way to XGBoost parameter optimization

Photo by Robina Weermeijer on Unsplash

In my previous article, I predicted heart failure using Random Forest, XGBoost, a Neural Network, and an ensemble of models. In this post, I would like to go over XGBoost parameter optimization to increase the model’s accuracy.

According to the official XGBoost website, XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.

XGBoost is very popular with Kaggle competitors because it can achieve very high model accuracy. Its only drawback is the number of parameters one has to tune to get good results.

XGBoost has three types of parameters: general parameters, booster parameters, and learning task parameters. General parameters select which booster is used for boosting, commonly a tree or linear model; booster parameters depend on the chosen booster; learning task parameters specify the learning task and the corresponding learning objective. A detailed description of all parameters can be found here.

Going over all parameters is beyond the scope of this article. Instead, I will concentrate on optimizing the following selected tree booster parameters to increase the accuracy of our XGBoost model:

  1. Parameters that help prevent overfitting (the aliases are for the XGBoost Python scikit-learn wrapper, which follows the sklearn naming convention)

eta [default=0.3, range: [0,1], alias: learning_rate]

  • Step size shrinkage used in updates to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative.

max_depth [default=6, range: [0,∞]]

  • Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit. A value of 0 indicates no limit on depth and is only accepted with the lossguide growing policy when tree_method is set to hist. Beware that XGBoost aggressively consumes memory when training a deep tree.

min_child_weight [default=1, range: [0,∞]]

  • Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node whose sum of instance weight is less than min_child_weight, the building process gives up further partitioning. In a linear regression task, this simply corresponds to the minimum number of instances needed in each node. The larger min_child_weight is, the more conservative the algorithm will be.

gamma [default=0, range: [0,∞], alias: min_split_loss]

  • Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.

subsample [default=1, range: [0,1]]

  • Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which helps prevent overfitting. Subsampling occurs once in every boosting iteration.

colsample_bytree [default=1, range: [0,1]]

  • Subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed.

lambda [default=1, alias: reg_lambda]

  • L2 regularization term on weights. Increasing this value will make the model more conservative.

alpha [default=0, alias: reg_alpha]

  • L1 regularization term on weights. Increasing this value will make the model more conservative.

  2. Parameter for handling an imbalanced dataset

scale_pos_weight [default=1]

  • Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative instances) / sum(positive instances).

  3. Other parameters

n_estimators [default=100]

  • Number of gradient boosting trees. Equivalent to number of boosting rounds.

All the above parameter definitions are from the official XGBoost website.

With this brief overview of the tree booster parameters, let’s import the libraries, load our dataset, create the independent and dependent variables, and split the data into training and testing sets.
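A minimal sketch of that setup might look like the following; the CSV file name, the DEATH_EVENT target column, the 80/20 split, and the random_state are assumptions based on the publicly available heart failure clinical records dataset rather than the exact original code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the heart failure dataset (file name is an assumption).
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

# Independent variables (features) and dependent variable (target);
# DEATH_EVENT is the target column in the public version of this dataset.
X = df.drop("DEATH_EVENT", axis=1)
y = df["DEATH_EVENT"]

# Hold out 20% of the data for validation (split ratio and seed are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```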

Now, let’s train our model with the default parameters.
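Continuing from the setup above, a baseline run with default parameters might look like this; the printed accuracy and AUC correspond to the validation metrics discussed below.

```python
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Baseline model with default parameters.
model = XGBClassifier()
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC:     ", roc_auc_score(y_test, y_proba))
```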

As you can see, even with default parameters, the model gave us acceptable results. To find optimal parameters, I used GridSearchCV, a utility from scikit-learn’s model_selection module. It loops over a predefined hyperparameter grid, fits the model on the training set for each combination, and lets us select the best-performing parameters. I tried three values for each of the following parameters: learning_rate, max_depth, min_child_weight, gamma, subsample, and colsample_bytree.
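A sketch of such a search is shown below; the candidate values in the grid, the scoring metric, and the number of cross-validation folds are illustrative assumptions, and only the choice of six parameters with three values each comes from the text.

```python
from sklearn.model_selection import GridSearchCV

# Three candidate values per parameter (the specific values are assumptions).
param_grid = {
    "learning_rate": [0.1, 0.2, 0.3],
    "max_depth": [4, 6, 8],
    "min_child_weight": [1, 3, 5],
    "gamma": [0, 0.1, 0.2],
    "subsample": [0.8, 0.9, 1.0],
    "colsample_bytree": [0.8, 0.9, 1.0],
}

grid = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=param_grid,
    scoring="accuracy",  # scoring metric is an assumption
    cv=5,                # 5-fold cross-validation is an assumption
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_)
```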

Next, I trained the model with the updated parameters. Since I had decreased learning_rate, I increased the number of gradient boosting trees (n_estimators).
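Continuing from the search above, the retraining step might look like the following; the n_estimators value is an illustrative assumption chosen to compensate for the lower learning rate.

```python
# Retrain with the parameters suggested by GridSearchCV and more boosting rounds.
tuned_model = XGBClassifier(
    n_estimators=500,     # raised from the default of 100 (value is an assumption)
    **grid.best_params_,  # tuned learning_rate, max_depth, min_child_weight, gamma, subsample, colsample_bytree
)
tuned_model.fit(X_train, y_train)

y_pred = tuned_model.predict(X_test)
y_proba = tuned_model.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC:     ", roc_auc_score(y_test, y_proba))
```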

With this one simple step, I managed to increase validation accuracy from 76.67% to 80.00%, and validation AUC (area under the curve) from 68.81% to 75.48%.

With plenty of time and computing power, one can expand the range of values searched for the booster parameters and use the GridSearchCV results as a base for further investigation. For example, if GridSearchCV lowers learning_rate from the default of 0.3 to 0.2, the next round of search can shift the range further to the left, such as [0.05, 0.1, 0.2].
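A second, narrowed search over learning_rate alone might look like this (the values are illustrative):

```python
# Shift the learning_rate grid to the left around the previous best value.
refined_grid = {"learning_rate": [0.05, 0.1, 0.2]}
refined_search = GridSearchCV(
    XGBClassifier(), refined_grid, scoring="accuracy", cv=5, n_jobs=-1
)
refined_search.fit(X_train, y_train)
print(refined_search.best_params_)
```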

Next, I tried to optimize the regularization parameters reg_lambda and reg_alpha.
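A sketch of that search, with assumed candidate values for the two regularization terms, might look like the following.

```python
# Grid over the L1 and L2 regularization terms (candidate values are assumptions).
reg_grid = {
    "reg_alpha": [0, 0.1, 1, 10],
    "reg_lambda": [0.1, 1, 10],
}
reg_search = GridSearchCV(
    XGBClassifier(**grid.best_params_),  # keep the previously tuned parameters fixed
    reg_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1,
)
reg_search.fit(X_train, y_train)
print(reg_search.best_params_)
```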

GridSearchCV found the default values of reg_alpha and reg_lambda to be optimal. The last parameter left to optimize was scale_pos_weight, which I kept increasing until a value of 4 gave the best results. I then trained my model with these final optimized parameters. Below is the code with extra metrics, including sensitivity, specificity, positive predictive value, and negative predictive value.
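A sketch of that final step is shown below; the scale_pos_weight value of 4 comes from the text, while the other parameter values are carried over from the earlier illustrative snippets rather than the original notebook.

```python
from sklearn.metrics import confusion_matrix

# Final model: tuned tree parameters plus scale_pos_weight for class imbalance.
final_model = XGBClassifier(
    n_estimators=500,     # assumption carried over from the earlier snippet
    scale_pos_weight=4,   # value found by trial in the article
    **grid.best_params_,
)
final_model.fit(X_train, y_train)

y_pred = final_model.predict(X_test)
y_proba = final_model.predict_proba(X_test)[:, 1]

# Derive the extra metrics from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Accuracy:   ", accuracy_score(y_test, y_pred))
print("AUC:        ", roc_auc_score(y_test, y_proba))
print("Sensitivity:", tp / (tp + fn))   # recall of the positive class
print("Specificity:", tn / (tn + fp))
print("PPV:        ", tp / (tp + fp))   # positive predictive value
print("NPV:        ", tn / (tn + fn))   # negative predictive value
```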

Conclusion: With the help of GridSearchCV, in a few simple steps, I managed to increase validation accuracy from 76.67% to 78.33% and validation AUC (area under the curve) from 68.81% to 77.09%. Although one can get the best accuracy improvements with feature engineering, XGBoost parameter optimization is also worthwhile.

Thank you for taking the time to read this post.

Best wishes in these difficult times.
Andrew
@tampapath
