How XGBoost Handles Sparsity Arising From Missing Data (With an Example)

Cansu Ergün
Published in HYPATAI
5 min read · Jun 29, 2020

To deal with sparsity arising from missing data, it is crucial to handle it in the data preprocessing step. In a business setting, for example, domain knowledge and information about how a feature is stored in the database become very important for doing the right preprocessing. Say we want to build a machine learning model that predicts children's heights given the number of times they have played basketball since birth. If this feature is missing for some observations in our data, chasing the root cause of the missing values is a good approach. Maybe some children have never played basketball, so no value was ever stored in the database. If we are 100% sure that this is the only situation in which a value can be missing, then it is wise to impute the missing observations with 0. Otherwise, values could be missing for any number of reasons, and we need a more careful way to handle them.

Most of the time, imputation methods such as mean, median, mode, or regression imputation are applied before the modeling phase to deal with such cases. However, imputation always introduces some bias, and it adds yet another phase to the preprocessing step, making it longer both for the data scientist and for the machine processing the data. This is where XGBoost's sparsity-aware split finding algorithm shows its importance: you can let XGBoost deal with missing observations itself, and it knows what it is doing. Again, do this only if you are not sure about the real meaning of a missing observation; in situations where you know the reason behind it, it is better to replace missing values with the right value (like replacing them with 0 in our tiny example).
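To make this preprocessing decision concrete, here is a minimal sketch with made-up numbers (the feature values below are invented for illustration) showing the two options: imputing 0 when domain knowledge guarantees that missing means "never played", versus leaving `np.nan` in place for XGBoost, which treats `np.nan` as missing by default.

```python
import numpy as np

# Hypothetical feature: times each child has played basketball
# (np.nan = no value stored in the database).
games_played = np.array([12.0, np.nan, 3.0, np.nan, 7.0])

# Option 1: domain knowledge guarantees a missing entry means "never played",
# so imputing 0 is the semantically correct fix.
imputed = np.where(np.isnan(games_played), 0.0, games_played)
# -> values 12, 0, 3, 0, 7

# Option 2: the cause of missingness is unknown; leave np.nan in place and let
# XGBoost's sparsity-aware split finding learn a default direction for these
# rows (XGBoost treats np.nan as missing out of the box).
feature_for_xgboost = games_played
```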

The sparsity-aware split finding algorithm is one of the characteristics that make XGBoost so powerful. With this algorithm, XGBoost handles sparsity in data arising from 1) missing values, 2) frequent zero entries, and 3) one-hot encoded features. To make the algorithm aware of this sparsity, XGBoost learns a default direction for sparse values at each split: it tries sending them both left and right and keeps the direction that yields the maximum gain. Only non-missing observations are visited while enumerating split candidates, which lowers the computational complexity.

Now let’s visualize how XGBoost does its job when it sees missing values in data with an example.

  • In our tiny train set, we try to predict height given age. We have 6 children, 4 of whom have known ages. XGBoost starts training by setting the predicted height to 0.5 by default and refines its prediction with each tree. It actually fits regression trees to the residuals (actual value − predicted value).
  • We also have 2 children with unknown ages in our train set; XGBoost starts their initial prediction at 0.5 as well, so we can compute their residuals too.
  • As stated in the sparsity-aware split finding algorithm, candidate thresholds for splitting on the age variable are the midpoints of consecutive non-missing age values after sorting them. So our candidate thresholds become 6.5, 8, and 12. The root node contains all residuals, including those of the missing-age children. For each candidate threshold, XGBoost tries both directions for the residuals of the missing values to find their optimal direction. First, it puts them in the left node for threshold 6.5 and calculates the gain resulting from this split. (Note: the gain is calculated by subtracting the similarity score before the split from the sum of the similarity scores of the two nodes after the split.)
  • Then it does the same for the same threshold, this time putting the residuals of the missing values into the right node.
  • And will do the same thing for 8,
  • And for 12.
  • After both directions have been tried for all candidate thresholds, let's see what we have at hand. The bar chart below shows that the default direction for missing values is to the left of threshold 8, since that direction gives the maximum gain.
  • Let’s see our winner split again.
  • Now that we have determined the default direction, we can easily imagine what happens when we see a child with no age information in the prediction data, as shown in the pictures below.
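The sweep described in the bullets above can be sketched in a few lines. The ages below are chosen so that the candidate thresholds come out to 6.5, 8, and 12 as in the example, but the residual values themselves are made up (the original figures are not reproduced here), so only the mechanics — not the exact numbers — mirror the article.

```python
import numpy as np

LAMBDA = 1.0  # L2 regularization term (XGBoost's default)

def similarity(res):
    # Similarity score of a node: (sum of residuals)^2 / (count + lambda).
    return np.sum(res) ** 2 / (len(res) + LAMBDA)

def sparsity_aware_best_split(ages, residuals, missing_residuals):
    order = np.argsort(ages)
    ages = np.asarray(ages, dtype=float)[order]
    residuals = np.asarray(residuals, dtype=float)[order]
    missing_residuals = np.asarray(missing_residuals, dtype=float)
    thresholds = (ages[:-1] + ages[1:]) / 2  # midpoints of consecutive sorted ages
    root = similarity(np.concatenate([residuals, missing_residuals]))
    best = None
    for t in thresholds:
        left, right = residuals[ages < t], residuals[ages >= t]
        # Try both default directions for the missing-age residuals.
        candidates = {
            "left": (np.concatenate([left, missing_residuals]), right),
            "right": (left, np.concatenate([right, missing_residuals])),
        }
        for direction, (l, r) in candidates.items():
            g = similarity(l) + similarity(r) - root
            if best is None or g > best[0]:
                best = (g, float(t), direction)
    return best  # (gain, threshold, default direction)

best_gain, threshold, direction = sparsity_aware_best_split(
    ages=[6, 7, 9, 15],            # gives candidate thresholds 6.5, 8 and 12
    residuals=[-10, -8, 5, 12],    # hypothetical residuals (height - 0.5)
    missing_residuals=[-9, -11],   # hypothetical residuals of the missing-age children
)
```

With these made-up residuals the winner is threshold 8 with the missing values sent left, matching the shape of the walkthrough above: the missing-age residuals are similar to the left node's residuals, so grouping them there maximizes the gain.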

I hope you enjoyed reading about missing-value treatment in XGBoost and my tiny example. 🙂
