Explainability of the features? No! Of the hyperparameters.

Dany Srage · Published in get-defacto · Mar 15, 2024

As data scientists, we enjoy training models, but the enthusiasm often fades when it’s time to fine-tune the model’s hyperparameters after preparing all the features.

We’ve all been in situations where a project stakeholder asks if we can add a feature or change some tiny detail, and it turns out to be far more time-consuming than they expected because we need to run the hyperparameter optimization all over again.

Although there’s an abundance of definitions for the hyperparameters of various models, translating them into practical implications can be challenging. The purpose of this article is to provide a tangible method for understanding how hyperparameters influence model performance, highlighting the crucial ones and explaining how they interact. It also recaps how to explain an algorithm with respect to its features.

Each hyperparameter matters on its own, but since many of them trade off overfitting against model complexity, we ultimately need to balance them against one another, and I hope this analysis will help in finding that balance.

Methodology

Models that we put in production need to be explainable: we need to understand how each feature impacts the overall predictions.

The whole idea of this article is: instead of explaining the features, why not explain the hyperparameters? That way, we could see how each one impacts the model’s performance.

Indeed, when we take models such as linear regressions, the prediction function looks like this:

ŷ = w_1·x_1 + w_2·x_2 + … + w_n·x_n

where the w_i are the coefficients and the x_i are the features. If you think about it, the features and the parameters of the model are almost symmetrical. So what if we swapped their roles in our explanation models?

A common method to interpret a model is the SHAP approach. It’s based on game theory and, in short, for each prediction it tells you how much each feature contributed to it. You can find more information here.

To use this method, we first need to build a machine learning model that predicts the model’s performance from its hyperparameters. Once we have validated that this meta-model gives very good predictions, we can observe and learn from its SHAP summary. We’ll see an example below.

We’ll focus on the XGBoost algorithm, as it’s one of the most widely used and has many hyperparameters. To illustrate our findings, we ran the analysis on both an internal dataset and the well-known Titanic dataset. While a comprehensive analysis across many datasets would be ideal, we hope this one will still help you better understand these hyperparameters and the methodology behind them. Since both are classification problems and our internal dataset is imbalanced, I focused on the average precision metric.
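To make the setup concrete, here is a minimal sketch of how such a meta-dataset and meta-model could be built. It is an illustration under assumptions, not the exact setup used for the figures below: the search ranges, the number of sampled configurations, the RandomForestRegressor meta-model, and the X/y variables (the features and binary target of the dataset under study) are all placeholders. The variables model and X_test defined here are the ones reused in the SHAP snippets that follow.

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# X, y: features and binary target of the dataset under study (e.g. Titanic).
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=0)

rng = np.random.default_rng(0)
rows = []
for _ in range(300):  # number of sampled configurations, adjust to your budget
    params = {
        "learning_rate": float(10 ** rng.uniform(-3, 0)),
        "max_depth": int(rng.integers(2, 12)),
        "n_estimators": int(rng.integers(50, 1000)),
        "min_child_weight": int(rng.integers(1, 20)),
        "colsample_bytree": float(rng.uniform(0.3, 1.0)),
        "subsample": float(rng.uniform(0.3, 1.0)),
        "gamma": float(rng.uniform(0.0, 5.0)),
    }
    clf = xgb.XGBClassifier(**params, n_jobs=-1)
    clf.fit(X_train, y_train)
    score = average_precision_score(y_valid, clf.predict_proba(X_valid)[:, 1])
    rows.append({**params, "average_precision": score})

# Meta-dataset: one row per configuration, hyperparameters in, performance out.
meta = pd.DataFrame(rows)
X_meta = meta.drop(columns="average_precision")
y_meta = meta["average_precision"]

# Hold out part of the meta-dataset to validate the meta-model before
# trusting its SHAP values.
X_fit, X_test, y_fit, y_test = train_test_split(X_meta, y_meta, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_fit, y_fit)
print("meta-model R² on held-out configurations:", model.score(X_test, y_test))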

To keep the article concise, we won’t delve extensively into the definitions of all the parameters of XGBoost, as you can find detailed descriptions in the official documentation here.

Let’s now look at the SHAP summaries:

Generating these summaries is as easy as the following in Python.

import shap

# `model` is a trained sklearn model whose inputs (x_i) are the hyperparameters
# and whose target (y_i) is the resulting model performance
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test) # X_test is our test data
shap.summary_plot(shap_values, X_test)

Below you can see the SHAP summaries for the two datasets we’ll analyze.

SHAP Summary for the Titanic dataset
SHAP Summary for our internal dataset

How to read SHAP summaries?

As a reminder of how SHAP summaries work, each row of the plot represents a feature, and features are sorted by importance. Each dot on a row represents a single prediction, and its position on the x-axis is that feature’s marginal impact on the prediction. If a feature has a SHAP value of -0.01 for a prediction, it means that this feature’s value pushed the prediction down by that amount.

On the right, you can find a color scale for feature values: a red dot means the value of that feature is high for that prediction. And if the red dots sit toward the right, it means that high values of the feature push the prediction up, which here means better predicted performance.

Note that when many predictions share a similar contribution for a given feature, the row gets thicker at that spot (the dots pile up on the summary).

Insights and learnings

Before diving into each hyperparameter one by one, the first insight we have is that the most important features to optimize are:

  • learning_rate, min_child_weight, colsample_bytree and max_depth

while the least important one seems to be:

  • gamma

💡Note how the impact of min_child_weight and n_estimators differs from one dataset to the other.

Let’s now see what we can learn about each parameter one by one. For this, we generated SHAP dependency plots using the following:

for i in range(len(X_test.columns)):
    shap.dependence_plot(i, shap_values, X_test)

Insights about hyperparameters:

Learning rate: it seems to be the most important feature in both datasets. The red part of the graph is on the left, meaning that the larger the learning rate, the worse the model performs.

This parameter controls how much each new tree in the boosting learns from the residual error. A smaller value means the model learns more slowly and needs more trees; a higher value converges faster but might miss the optimum. That is why, according to our Shapley values, the feature it interacts with the most is the number of estimators (see the n_estimators section and graph, and the sketch just below).
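For instance, reusing shap_values and X_test from the snippets above (and assuming the hyperparameter columns are named learning_rate and n_estimators), the interaction can be inspected directly by forcing the coloring of the dependence plot:

# Color the learning_rate dependence plot by n_estimators to look at this
# interaction explicitly (by default SHAP picks the strongest interaction itself).
shap.dependence_plot("learning_rate", shap_values, X_test, interaction_index="n_estimators")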

Below is the dependency plot of the performance as a function of the learning rate. The highest learning rate (purple line) tends to overfit quickly, while the best-performing models are those with the lowest learning rates, even though they need more iterations to reach their best performance.

Max depth: the maximum depth of each tree. As the plot below shows, too high a value leads to overfitting because the trees become too complex; too low a value makes the model too simple, which also hurts results.

colsample_bytree: the ratio of features to sample for each new tree. SHAP automatically picks the feature it interacts with the most and colors each point by that feature’s value (see graph below). Here, we can see in the figure below that the strongest interaction is with the learning rate: when the column-sampling ratio is low (0.5), a lower learning rate helps the model perform better.

Intuitively, when column sampling is low, the trees are very different from one another. It is therefore better and “more stable” to learn only a little from each additional tree, hence a lower learning rate being optimal.

This shows that if we want a low column-sampling ratio, it’s important to keep lower learning rates in the search space.

n_estimators: the number of trees in our boosting algorithm. There are two learnings here. First, more trees are usually better, and one could easily shrink the hyperparameter search space by stopping where the curve below becomes flat:

Second, the parameter it interacts with the most is the learning rate, as discussed above. With few estimators, it’s better to have a high learning rate because the model doesn’t have many rounds to learn; with many estimators, we can afford a lower learning rate, which gives better performance overall.
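As a side note, one practical way to exploit this flattening of the curve (not something the analysis above relied on) is XGBoost’s built-in early stopping: set n_estimators to a deliberately high value and stop training once the validation metric stops improving, which removes n_estimators from the search altogether. A hedged sketch, reusing the train/validation split assumed earlier:

import xgboost as xgb

# Early stopping on the validation average precision ("aucpr" in XGBoost).
# In recent xgboost releases early_stopping_rounds is a constructor argument;
# in older ones it was passed to fit() instead.
clf = xgb.XGBClassifier(
    n_estimators=2000,          # deliberately large upper bound
    learning_rate=0.05,
    eval_metric="aucpr",
    early_stopping_rounds=50,
)
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("stopped at iteration:", clf.best_iteration)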

subsample: the ratio of training rows to sample before growing each tree. We see very different behaviors on the two datasets: on one, higher is better; on the other, the reverse is true. Subsample therefore seems to depend heavily on the dataset. The parameter it interacts with the most is again the learning rate, for the same reason as colsample_bytree.

min_child_weight: controls how much “instance weight” is needed in a child node; a higher value means fewer splits and thus less overfitting. It seems much more important on the Titanic dataset than on the other one.

We can see its impact on our internal dataset and on the Titanic dataset, respectively, in the graphs below:

Two things strike me:

  • the feature it interacts with the most changes with the dataset.
  • high values always decrease performance on the Titanic dataset, while this is not true on our internal one.

This parameter thus seems highly dependent on the dataset.

gamma: the minimum loss reduction needed to make a further split. On both datasets, it seems to be the least impactful parameter and, for our use case, it could be dropped from the search space when looking for the optimal solution, as sketched below.
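Putting these observations together, a pruned search space for this kind of problem might look like the sketch below; the exact values are illustrative, not the ones used in this analysis.

# Illustrative, pruned search space reflecting the findings above: gamma is
# left at its default instead of being searched, and low learning rates stay
# available to pair with low column-sampling ratios.
search_space = {
    "learning_rate": [0.01, 0.03, 0.1],
    "max_depth": [3, 5, 7],
    "min_child_weight": [1, 5, 10],
    "colsample_bytree": [0.5, 0.8, 1.0],
    "subsample": [0.6, 0.8, 1.0],
    "n_estimators": [200, 500, 1000],
    # "gamma": not searched, as it was the least impactful parameter on both datasets
}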

Validating the results

Since this analysis is based on a model that itself has some flaws, I wanted a more classic method to understand how each hyperparameter impacts model performance. For each hyperparameter, I created a partial dependency plot with respect to every other studied hyperparameter and then looked at the range of the evaluation metric. The hypothesis is that if a hyperparameter is important, changing its value leads to a large variation in the evaluation metric. We got the following results:

The results were very similar, except on the Titanic dataset, where min_child_weight appears much more important than it did in the SHAP analysis.
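For the curious, here is a minimal sketch of one way to approximate this check directly from the recorded runs (the meta DataFrame built earlier), without going through the meta-model: since the hyperparameters are sampled independently in a random search, a simple binned average of the metric is close to a one-dimensional partial dependence, and its spread gives a rough importance score. This is my reading of the approach, not the exact code behind the results above.

import pandas as pd

# For each hyperparameter, bin its sampled values, average the recorded
# average precision per bin, and take the spread of those averages.
hyperparams = [c for c in meta.columns if c != "average_precision"]
spreads = {}
for col in hyperparams:
    bins = pd.qcut(meta[col], q=10, duplicates="drop")
    per_bin = meta.groupby(bins, observed=True)["average_precision"].mean()
    spreads[col] = per_bin.max() - per_bin.min()

print(pd.Series(spreads).sort_values(ascending=False))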

Conclusion

I hope that with this article you get:

  • a better understanding of how to study the explainability of models.
  • a better understanding of each hyperparameter and how changing their values can impact the model results.
  • an overall method to get a global vision of your search space of hyperparameters and how to optimize it.

As a recap, my top learnings from this analysis are:

  • the learning rate is the most important hyperparameter to tune, and it interacts strongly with n_estimators and the sampling parameters.
  • more trees are usually better, provided they are paired with a lower learning rate.
  • min_child_weight and subsample behave very differently from one dataset to another.
  • gamma is the least impactful parameter and can often be dropped from the search space.

Obviously, these learnings might differ with different datasets, and running such an analysis with your dataset might be helpful to save time down the line.

There are other methods to optimize hyperparameters that were not covered in this article, such as Bayesian search, which are also worth looking into.

Thanks to Yohan, Data Scientist at Defacto, and Marc-Henri, CTO of Defacto, for their feedback.
