Hyperparameter tuning for hyperaccurate XGBoost model

No Data Scientist is the Same — part 4

Tom Blanke
Published in Cmotions
15 min read · Feb 27, 2022


This article is part of our series about how different types of data scientists build similar models differently. No human is the same, and therefore no data scientist is the same. And the circumstances under which a data challenge needs to be handled change constantly. For these reasons, different approaches can and will be used to complete the task at hand. In our series we explore the four different approaches of our data scientists — Meta Oric, Aki Razzi, Andy Stand, and Eqaan Librium. They are presented with the task of building a model to predict whether employees of a company — STARDATAPEPS — will look for a new job or not. Based on their distinct profiles discussed in the first blog, you can already imagine that their approaches will be quite different.

In the previous article Meta Oric’s aim was to quickly create a default XGBoost, and therefore she stuck with the default settings. In this article Aki Razzi searches for better model performance by tuning the hyperparameters. Before we discuss how she does that, let me first remind you of who Aki Razzi is:

Aki Razzi: ‘Accuracy is what truly matters’

Aki has won multiple Kaggle competitions, since her models achieve the highest possible performance. Time and resources do not matter that much to her. Hail the almighty accuracy, precision and recall. She does not care whether a technique is easy to explain or not. Similarly, she is no stranger to using ensemble models to achieve near-perfect performance, as well as very convoluted feature engineering techniques.

Using XGBoost to predict which Data Scientists are likely to change jobs

First of all, Aki imports the necessary packages, among which ‘xgboost’, which enables her to create an XGBoost model:
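A minimal sketch of these imports is shown below; the exact list in the original notebook may differ:

```python
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import roc_auc_score
```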

Second, Aki loads the dataset. A bit of preparation on this data was done, as described here. The target variable ‘target’ indicates whether a data scientist in this historic dataset has left the company. All other columns in the dataset are possible predictors of whether a data scientist is likely to leave the company soon.

Aki’s aim is to assess if she can improve Meta’s default XGBoost by tuning the hyperparameters. To make a fair comparison she performs the same data preparation steps as Meta did. She imputes the missing values, converts the categorical variables into dummies and standardizes the numerical variables. In addition, just like Meta did, Aki also uses pipelines to prep the data, to train her XGBoost, and to make predictions.

In the code below Aki separates the target from the features, creates a train and test dataset, and creates the pipelines to prep the features:
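A sketch of these steps could look like the following; the file name and the exact imputation strategies are assumptions, not necessarily the ones Aki used:

```python
# Load the prepared data; the file name is illustrative.
df = pd.read_csv('HR_data.csv')

# Separate the target from the features.
y = df['target']
X = df.drop(columns=['target'])

# Create a train and test dataset, stratified to keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Impute missing values, dummy-encode the categorical variables and
# standardize the numerical variables, depending on the column type.
num_cols = X.select_dtypes(include='number').columns
cat_cols = X.select_dtypes(exclude='number').columns

numeric_prep = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])
categorical_prep = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('dummies', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer([
    ('num', numeric_prep, num_cols),
    ('cat', categorical_prep, cat_cols),
])
```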

Next, Aki recreates Meta’s default XGBoost:
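Recreating that baseline might look like this (a sketch building on the snippets above; Meta’s exact settings may differ):

```python
# Meta's default XGBoost inside the same prep pipeline, evaluated with
# 5-fold cross validation on the train set.
from sklearn.model_selection import cross_val_score

xgb_cl = xgb.XGBClassifier(random_state=42)  # all hyperparameters at default
xgb_pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', xgb_cl),
])

scores = cross_val_score(xgb_pipeline, X_train, y_train,
                         cv=5, scoring='roc_auc')
print(f'Mean ROC AUC (5-fold CV): {scores.mean():.4f}')
```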

Output:

With only the default parameters and no hyperparameter tuning, Meta’s XGBoost got a ROC AUC score of 0.7915. As you can see below, XGBoost has quite a lot of hyperparameters that Aki can tune to try to improve on Meta’s default XGBoost.
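One way to list them is via the get_params method of the scikit-learn API:

```python
# Print every tunable hyperparameter of the default XGBoost classifier.
for name, value in sorted(xgb_cl.get_params().items()):
    print(f'{name}: {value}')
```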

Output:

After this introduction to Aki Razzi, you can imagine that she is not yet satisfied with an XGBoost that only uses the default parameters. Aki attempts to improve Meta’s default XGBoost with the use of the GridSearchCV function from the scikit-learn package. GridSearchCV accepts candidate values for the provided hyperparameters and fits a separate model on the given data for each combination of hyperparameters. The performance of each combination is evaluated, and afterwards the best performing model can easily be selected. Thus, GridSearchCV enables Aki to tune multiple hyperparameters at once. It is not feasible to tune all hyperparameters in a single search, because that would result in far too many models. Aki optimizes eight hyperparameters with the use of 5-fold cross validation; if she tried to tune all of them in one grid search with, say, five candidate values per parameter, 5x5x5x5x5x5x5x5 = 390,625 combinations would have to be evaluated, each of them fitted five times for the 5-fold cross validation. Hence, Aki tunes the model in multiple steps.

As Meta did in the previous article, Aki evaluates each model created in the grid search based on ROC AUC score. She does so with the use of this function:
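A helper along these lines does the job; the function name and exact reporting format are assumptions for this sketch, not necessarily Aki’s original code:

```python
def evaluate_grid_search(grid_search):
    """Report the best parameter combination found by a fitted
    GridSearchCV and its mean cross-validated ROC AUC."""
    print(f'Best parameters: {grid_search.best_params_}')
    print(f'Mean ROC AUC (5-fold CV): {grid_search.best_score_:.4f}')
```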

Aki starts by searching for the optimal values for the learning rate and the number of estimators (n_estimators). She begins her search with the commonly used starting value of 0.8 for both subsample and colsample_bytree and keeps all other parameters at their defaults.

  • Learning_rate (eta): determines how fast the XGBoost model learns. In the boosting process, each additional tree modifies the overall model, and the magnitude of that modification is controlled by the learning rate. A low learning rate makes computation slower and requires more trees to achieve the same reduction in residual error as a model with a high learning rate, but it improves the chances of reaching the best optimum (optimum = perfect bias/variance tradeoff = no underfitting and no overfitting). Typically used values are 0.01–0.3 and its default value is 0.3.
  • N_estimators: determines how many decision trees will be built and boosted. If n_estimators is set to 1, just a single decision tree is created and no boosting takes place, making the result similar to training a standard decision tree. The default XGBoost above (xgb_cl: Meta’s XGBoost) shows that the default value for n_estimators is 100, and it must be an integer greater than 0. The larger the value of n_estimators, the better the model fits the trainset, due to the gradient boosting algorithm. But it takes longer to train the model and it might overfit on your trainset.

In the code below you see that the GridSearchCV contains a couple of parameters. First of all, it uses the created XGBoost pipeline: ‘xgb_pipeline’. Second, it uses a specified grid that will be tested in the grid search: ‘param_grid’. A third parameter that is specified for the grid search is n_jobs, the number of jobs to run in parallel. Aki has set it equal to -1, meaning that the grid search will use all available processors. Fourth, Aki uses 5-fold cross validation to tune the hyperparameters. Finally, the performance evaluation metric for the cross-validation is set to the ROC AUC score with the scoring parameter.
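A sketch of this first grid search, with illustrative candidate values (the original grid may differ):

```python
# Step 1: tune the learning rate and the number of trees. Parameter names
# are prefixed with 'clf__' to address the XGBoost step inside the pipeline.
param_grid = {
    'clf__learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3],
    'clf__n_estimators': [100, 200, 400, 600, 800],
    'clf__subsample': [0.8],         # common starting value
    'clf__colsample_bytree': [0.8],  # common starting value
}
grid_search = GridSearchCV(xgb_pipeline, param_grid,
                           n_jobs=-1, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)
evaluate_grid_search(grid_search)
```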

Output:

So, by changing only the number of estimators and the learning rate, Aki already improves the ROC AUC score from 0.7915 to 0.8037 compared to Meta’s XGBoost with default settings.

Next up, with the best values identified for the number of estimators and the learning rate, Aki continues with optimizing the parameters max_depth and min_child_weight; a sketch of this grid search follows the parameter descriptions below.

  • Max_depth: the maximum depth of a tree. It is used to control overfitting. Increasing this value makes the model more complex, allowing it to learn relations very specific to the sample at hand and making it more likely to overfit. In addition, XGBoost aggressively consumes memory when training a deep tree. Typically used values are 3–10 and its default value is 6.
  • Min_child_weight: the minimum sum of weights of all observations needed in a child node. So, if the tree partition step results in a leaf node with a sum of instance weights less than min_child_weight, the building process gives up further partitioning. This parameter is also used to control overfitting. Higher values prevent a model from learning relations that might be highly specific to the particular sample selected for a tree, but too high values can lead to underfitting. Hence, it should be tuned using CV. The min_child_weight can be any non-negative number and the default is 1.
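A sketch of this second grid search, carrying over the winners of the first step via set_params (the candidate values are again illustrative):

```python
# Step 2: tune tree depth and minimum child weight, keeping the step-1
# winners (learning rate, number of trees, subsampling) fixed.
xgb_pipeline.set_params(**grid_search.best_params_)
param_grid = {
    'clf__max_depth': [3, 4, 5, 6, 8, 10],
    'clf__min_child_weight': [1, 3, 5, 7],
}
grid_search = GridSearchCV(xgb_pipeline, param_grid,
                           n_jobs=-1, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)
evaluate_grid_search(grid_search)
```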

Output:

The search for optimal values for the maximum depth of a tree and the minimum child weight resulted in only changing the maximum depth from six to five and keeping the minimum child weight at the default of 1. Because the parameters barely changed, the performance of the new best performing model is comparable to the previous one. The mean ROC AUC from the cross validation in this grid search is even slightly lower: it decreased from 0.8037 to 0.8036.

Third, Aki tries to improve the parameters subsample and colsample_bytree; the corresponding grid search is sketched after the descriptions below.

  • Subsample: the fraction of observations XGBoost will randomly take for constructing each tree. If you set this value to 0.5, XGBoost will randomly collect half of the records to grow each tree. A lower value makes the algorithm more conservative and prevents overfitting, but too small values might lead to underfitting. Typically used values are 0.4–1 and its default value is 1.
  • Colsample_bytree: the fraction of columns XGBoost will randomly take for constructing each tree. This parameter can also be used to control overfitting. Typically used values are 0.4–1 and its default value is 1.
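A sketch of this third grid search, again carrying over the winners so far (values illustrative):

```python
# Step 3: tune row and column subsampling per tree, keeping the step-2
# winners (max_depth, min_child_weight) fixed.
xgb_pipeline.set_params(**grid_search.best_params_)
param_grid = {
    'clf__subsample': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'clf__colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
}
grid_search = GridSearchCV(xgb_pipeline, param_grid,
                           n_jobs=-1, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)
evaluate_grid_search(grid_search)
```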

Output:

The grid search above resulted in the optimal values: 0.5 for subsample and 0.9 for colsample_bytree. In other words, the grid search shows that the best model performance is achieved by constructing each tree based on half of the records and 90% of the features. Tuning these parameters improves the model performance from a ROC AUC score of 0.8036 to 0.8039.

Finally, Aki tries to improve the model even further by tuning gamma and lambda; a sketch of this final grid search follows the parameter descriptions below.

  • Gamma (min_split_loss): specifies the minimum loss reduction required to make a split. A node only splits when the resulting split gives a reduction in the loss function above the gamma threshold. It is a pseudo-regularization hyperparameter in gradient boosting: the higher the gamma, the higher the regularization and the more conservative the algorithm. Gamma depends on both the training set and the other parameters you use. Gamma can be any non-negative value and its default value is 0.
  • Lambda: the L2 regularization term on weights (analogous to Ridge regression). This term is a constant that is added to the second derivative (Hessian) of the loss function during gain and weight (prediction) calculations. Lambda affects the choice of split points as well as the weight size. Although many data scientists don’t tune this parameter, it should be explored to reduce overfitting. Increasing this value makes the model more conservative. Lambda can be any non-negative value and its default value is 1. XGBoost is also known as a ‘regularized boosting’ technique, due to its regularization terms.
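A sketch of this final grid search. Note that in the scikit-learn API of XGBoost the lambda parameter is called reg_lambda; the value grids are again illustrative:

```python
# Step 4: tune gamma (min_split_loss) and lambda (L2 regularization),
# keeping the step-3 winners (subsample, colsample_bytree) fixed.
xgb_pipeline.set_params(**grid_search.best_params_)
param_grid = {
    'clf__gamma': [0, 0.1, 0.5, 1, 5],
    'clf__reg_lambda': [0.1, 0.5, 1, 5, 10],  # 'lambda' in native XGBoost
}
grid_search = GridSearchCV(xgb_pipeline, param_grid,
                           n_jobs=-1, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)
evaluate_grid_search(grid_search)
```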

Output:

This final grid search shows that the default values should not be changed. Therefore, Aki keeps gamma equal to 0 and lambda equal to 1, resulting in the same and final ROC AUC score of 0.8039. By tuning the model in four steps and searching for the optimal values for eight different hyperparameters, Aki manages to improve Meta’s default XGBoost from a ROC AUC score of 0.7915 to 0.8039. This results in the best set of hyperparameters, which are shown below.

Output:

Now that both Meta and Aki have found the final parameters for their XGBoost algorithms, we can evaluate their models on the test set:
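A sketch of that comparison, building on the snippets above. Since GridSearchCV refits the best model on the full train set by default, best_estimator_ is ready to use:

```python
# Compare both final models on the held-out test set.
meta_model = Pipeline([('prep', preprocessor),
                       ('clf', xgb.XGBClassifier(random_state=42))])
meta_model.fit(X_train, y_train)
aki_model = grid_search.best_estimator_  # refit on the full train set

for name, model in [('Meta (default)', meta_model), ('Aki (tuned)', aki_model)]:
    proba = model.predict_proba(X_test)[:, 1]
    print(f'{name}: ROC AUC = {roc_auc_score(y_test, proba):.4f}')
```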

Output:


As on the trainset, Aki’s tuned XGBoost outperforms Meta’s default XGBoost. Aki’s tuning resulted in an improved ROC AUC score of 0.7149, compared to Meta’s ROC AUC score of 0.6993. As we did for Meta in the previous article, we also save Aki’s model to be able to compare results later on. Just to be sure, we quickly test whether her model was saved correctly.
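A sketch of the save-and-check step using joblib; the file name is illustrative:

```python
# Save Aki's tuned pipeline and reload it to verify that the stored model
# reproduces the same test score.
import joblib

joblib.dump(aki_model, 'aki_tuned_xgboost.joblib')

reloaded = joblib.load('aki_tuned_xgboost.joblib')
proba = reloaded.predict_proba(X_test)[:, 1]
print(f'Reloaded model ROC AUC: {roc_auc_score(y_test, proba):.4f}')
```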

Output:


I hope you enjoyed reading this article and getting to know Aki. Her approach didn’t really focus on prepping the features; instead, she focused on improving Meta’s XGBoost by tuning the hyperparameters. In the upcoming articles we investigate whether we can improve her data preparations. Topics that we will look into are common data problems, dealing with high cardinality, and dealing with missing data. We will not only focus on improving model performance, but also on how to improve the interpretability and explainability of the models. This is something both Andy Stand and Eqaan Librium value very much when practicing data science.

This article is part of our No Data Scientist Is The Same series. The full series is written by Anya Tonne, Jurriaan Nagelkerke, Karin Gruijs-Vodde and Tom Blanke. The series is also available on theanalyticslab.nl.

An overview of all articles on Medium within the series:

  1. Introducing our data science rock stars
  2. Data to predict which employees are likely to leave
  3. Good model by default using XGBoost
  4. Hyperparameter tuning for hyperaccurate XGBoost model
  5. Beat dirty data
  6. The case of high cardinality kerfuffles
  7. Guide to manage missing data
  8. Visualise the business value of predictive models
  9. No data scientist is the same!

Do you want to do this yourself? Please feel free to download the Notebook on our gitlab page.
