Implementation of XGBoost on Income Data

Cansu Ergün
Published in HYPATAI · Aug 27, 2020
A scene from Money, Money, Money video clip — Abba

After covering some of the theory behind the XGBoost framework in the previous stories What Makes XGBoost Fast and Powerful? and How XGBoost Handles Sparsities Arising From Missing Data? (With an Example), it is now time to get our hands dirty and apply it to a small dataset for classifying and predicting income level.

Let’s say you have applied for a mortgage loan. It is important for the bank to assess your credibility, and they consider your income a good indicator of whether to lend you the money. You are pretty new to this bank, so at most they might find some demographic information about you, but not much more. That makes it hard for the bank you applied to for your mortgage to assess your creditworthiness. So you lied: you told them you earned a lot and had a fantastic job, just to get the money. But somehow they checked their limited records about you, found out that you make less than 50K a year, and nooo!, you’ve been rejected. 👎🏼

But how did this happen? Wait, what? They have an XGBoost classifier model to check your income level? But how? Here comes the how part below: 😜

The data used in the model, and the code for preprocessing and model building used in this story, can be found on my GitHub page here.

There will be other stories on Hypatai showing the implementation of other frameworks such as H2O and LightGBM on this same dataset; therefore, the data preparation notebook on GitHub is designed to produce a different model dataset for each of the related algorithms.

Now we are ready to start!

Let’s read the prepared data, ready for the modeling phase:
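A minimal sketch of this step, assuming the preprocessing notebook wrote the model-ready data to a CSV file (the file name xgb_model_data.csv and the target column income_level are placeholders for illustration, not the actual names in the repository):

```python
import pandas as pd

# Load the model-ready dataset produced by the preprocessing notebook
# (file and column names below are assumptions for illustration).
model_data = pd.read_csv("xgb_model_data.csv")

# Separate the one-hot encoded features from the binary target
# (1 -> income > 50K, 0 -> income <= 50K).
X = model_data.drop(columns=["income_level"])
y = model_data["income_level"]

print(X.shape)
print(y.value_counts())
```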

Taking a quick approach and keeping mostly the default values, let’s set our model hyperparameters (a sketch of the resulting configuration follows the list below).

  • max_depth: Maximum depth for each tree. Default value was kept here.
  • eta: Learning rate. Shrinks the feature weights to make the boosting process more conservative after each boosting step. Default value was kept here.
  • objective: Since our target is the income level (≤50K vs. >50K), this is a binary classification problem.
  • seed: This is needed to reproduce the same results each time we train the model.
  • min_child_weight: As stated in the story What Makes XGBoost Fast and Powerful?, in the part explaining the Weighted Quantile Sketch algorithm, weight (or cover) in XGBoost is the number of points in the node (or leaf) for regression problems. Therefore, maintaining equal weight among quantiles is the same thing as maintaining an equal number of observations for regression. In other words, quantiles in regression are just ordinary quantiles. For classification problems, however, it has a different calculation. Namely, badly predicted instances have higher weights, and to maintain equal weight in each quantile, instances with large residuals go into more specialized quantiles, leaving those quantiles with fewer data points. This results in an increase in accuracy by possibly increasing the number of quantiles. min_child_weight is therefore the minimum threshold for the sum of hessians in each child. The higher this parameter, the less prone the model is to overfitting. The default value is 1 for XGBoost. Here I used a value of 5, found with quick trial and error.
  • n_estimators: Number of gradient boosted trees, equivalent to the number of boosting rounds. The default value is 100. Since we have about 104 features to train the model on, I increased this value to 250 to avoid underfitting.
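Putting the list above together, the configuration looks roughly like this. This is a sketch using the scikit-learn wrapper; the exact seed value is a placeholder, and max_depth/learning_rate simply repeat the XGBoost defaults mentioned above.

```python
import xgboost as xgb

# Hyperparameters discussed above; the seed value is an
# illustrative placeholder, not necessarily the one used in the notebook.
clf = xgb.XGBClassifier(
    max_depth=6,                   # default tree depth kept
    learning_rate=0.3,             # "eta", default kept
    objective="binary:logistic",   # binary target: <=50K vs >50K
    random_state=42,               # fixed seed for reproducible results
    min_child_weight=5,            # chosen with quick trial and error
    n_estimators=250,              # upper bound on boosting rounds
)
```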

There are of course always other parameters to tune; however, to keep the example simple, I stopped here and moved on to the training part.

The data was already shuffled in the preprocessing step, so I used an 80 to 20 split for the train and validation sets in this phase. The metric for evaluating model performance is the area under the curve (AUC). I also set early_stopping_rounds to 20, meaning that if our metric does not improve for 20 rounds at any point before round 250 (the value we set for the n_estimators parameter), training will stop.
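A sketch of the split and the training call under these settings (the fit-time eval_metric / early_stopping_rounds arguments match the XGBoost versions available when this story was written; newer releases expect them in the constructor instead):

```python
from sklearn.model_selection import train_test_split

# The data is already shuffled, so a plain 80/20 split is enough.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

clf.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],
    eval_metric="auc",          # area under the ROC curve
    early_stopping_rounds=20,   # stop if validation AUC stalls for 20 rounds
    verbose=True,               # print the evaluation log each round
)
```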

Here comes the point where our training ended:
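The training log itself is not reproduced here, but the stopping point can be read back from the fitted model, roughly like this:

```python
# With early stopping, the model records where validation AUC peaked.
print("Best iteration:", clf.best_iteration)
print("Best validation AUC:", clf.best_score)
```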

Seems like we did not need to set our n_estimators hyperparameter that high; thankfully, early_stopping_rounds did a good job here and saved us from going further.

Our validation set shows that we got an AUC of about 93%, which does not seem to be a bad result. We will publish another story soon on how to calculate and interpret various model performance metrics for classification problems, so for now I would like to continue by getting a better idea of what kind of model we have constructed. Here is the feature importance list showing the top features of our model, calculated from the total gain generated across all trees.
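This ranking can be pulled from the fitted model roughly as follows (a sketch; total_gain is the importance type the story refers to):

```python
import pandas as pd

# Total gain contributed by each feature across all trees in the model.
importance = clf.get_booster().get_score(importance_type="total_gain")
importance = pd.Series(importance).sort_values(ascending=False)
print(importance.head(10))
```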

Let’s define our function for plotting top ten features of our model:
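The original plotting code is in the notebook linked above; a minimal stand-in built on matplotlib could look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_top_features(model, top_n=10, importance_type="total_gain"):
    """Horizontal bar chart of the top_n most important features."""
    scores = pd.Series(
        model.get_booster().get_score(importance_type=importance_type)
    ).sort_values(ascending=True).tail(top_n)
    scores.plot(kind="barh", figsize=(8, 6))
    plt.xlabel(f"Feature importance ({importance_type})")
    plt.title(f"Top {top_n} features")
    plt.tight_layout()
    plt.show()

plot_top_features(clf)
```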

Looks like being a married spouse (a one-hot encoded feature) and capital gain are the most important features after the age variable when we need to predict the income level.

Next story will be all about AUCs, optimal probability thresholds, confusion matrices, and other metrics to evaluate our model performance on validation and test sets.

Stay tuned!
