Make a Model interpret like a Human

Roma Jain
EdTech in-depth | iSchoolConnect
6 min read · Jul 7, 2020
Machine learning model interpretability: explaining predictions in terms of features

Most machine learning engineers build a model, evaluate it, and then either switch to a gaming tab or feel good about the results, take a screenshot, and post it. But have you ever felt that your model could be trustworthy even though it is not a domain expert? If not, let’s dive into how you can build trust with your model and collaborate with your baby product to make both of your lives better 😄

Popular machine learning models are more accurate than traditional rule-based models because they learn higher-level representations of the raw data. But that also makes them harder to interpret. Why?

Your baby tells you she has predicted that this toy is going to be her Superman. When you ask her why, she says:

‘Cz it is a weighted combo of attractive looks, hair color, and manly voice. You know, all that. Well, you hooman will not understand. It’s complex.’ Lol.

Exactly. Non-convex error surfaces, with no single clear global minimum to reach, give the model multiple ways to converge to the target. And that’s why a non-linear, non-monotonic model cannot explain itself the way human experts do.

A non-convex error surface simply means that the relationship between the input variables and the target variable is not linear: the loss landscape has multiple local minima rather than one clean global minimum.

A non-monotonic model means that increasing an input variable does not consistently push the output in one direction: the effect can change sign as the input changes, and its rate is not fixed.

Well, when you have domain knowledge about the data, it becomes easier to constrain your model to be monotonic. If there is a positive correlation between an input variable and your target variable, you can constrain the model to learn a positive monotonic relationship between that input and the target.

Applying monotonicity constraints to an XGBoost model ensures you can help your baby explain herself better.

But, how does it work?

As the tree grows, a feature gets picked at a node because it gives a high gain. If the relationship for this feature is constrained to be positive with the target, and the weight of the right child is higher than that of the left child, the split is allowed. Otherwise, the XGB model sets the gain to negative infinity, which makes it abandon the split and so maintain the positive-correlation constraint.

Now, how do we ensure that, along with the positive and negative constraints at a single split, the increasing/decreasing relationship is preserved further down the tree (the monotonicity check)?

Well, the XGB model checks one more condition. The weights of a new split’s children are also bounded by the weights of the parent split: roughly, a descendant cannot assign a weight that falls on the wrong side of what the parent already committed to for higher and lower feature values. Because when trees make splits, the node-level relationship and the descendant-level relationship both matter, and they need to stay consistent across a whole branch.
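To make the first check concrete, here is a minimal sketch of the sign test on a candidate split. This is an illustration of the idea, not XGBoost’s actual source; the weight variables and the `constraint` flag are hypothetical names.

```python
NEG_INF = float("-inf")

def monotone_split_gain(gain, left_weight, right_weight, constraint):
    """Simplified sketch of the monotonicity sign check on a candidate split.

    constraint = +1 (increasing): the right child (higher feature values)
    must not get a lower weight than the left child; constraint = -1 is
    the mirror case. A violating split gets a gain of -inf, so it is
    never chosen.
    """
    if constraint == +1 and right_weight < left_weight:
        return NEG_INF  # abandon the split
    if constraint == -1 and right_weight > left_weight:
        return NEG_INF
    return gain
```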

Let’s see this in action. Consider a dataset where the ground truth is either yes or no, and the features have different correlations with the target variable, which you know from domain knowledge.

After calculating correlations with the target variable, you can see that f18 has the most negative correlation and f3 has the most positive correlation, which is in fact what the domain knowledge also expects from the model.
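A rough sketch of that step, assuming the data lives in a pandas DataFrame `df` with a binary `target` column (both names are assumptions, not from the post):

```python
import pandas as pd

# df holds the features f1..fN plus a binary 'target' column (1 = yes, 0 = no)
corr = df.corr()['target'].drop('target')

# Sorted view: the most negative feature (e.g. f18) at one end,
# the most positive (e.g. f3) at the other
print(corr.sort_values())
```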

Assign the monotonicity constraints by simply looking at the sign of each correlation: +1 for positive, -1 for negative.

The XGB model has a ‘monotone_constraints’ parameter, which we set to our data’s constraints.
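A minimal sketch of wiring those signs into XGBoost, assuming the `corr` Series from the step above and the usual `X_train`/`y_train` splits:

```python
import numpy as np
import xgboost as xgb

# +1 for positively correlated features, -1 for negative
# (a 0 would leave that feature unconstrained)
constraints = "(" + ",".join(str(int(np.sign(c))) for c in corr) + ")"

model = xgb.XGBClassifier(
    n_estimators=200,
    monotone_constraints=constraints,  # e.g. "(1,-1,1,...)", one entry per feature
)
model.fit(X_train, y_train)
```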

After training, you can look at the variable importance.

GBM variable importance is a global measure of the overall impact of an input variable on the GBM model predictions. Global variable importance values give an indication of the magnitude of a variable’s contribution to model predictions for all observations. To enhance trust in the GBM model, variable importance values should typically conform to human domain knowledge and reasonable expectations.

XGBoost’s plot_importance, with ‘importance_type’ set to ‘weight’ (the default), tells you the number of times a feature occurs across all the splits.
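For example, with the `model` trained above:

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

# 'weight' (the default) counts how many splits each feature appears in
plot_importance(model, importance_type='weight', max_num_features=15)
plt.show()
```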

Now, to see the effect of one feature on the model’s prediction, partial dependence plots (PDPs) are used. They help you understand whether an input variable has a linear, monotonic, or more complex relationship with the predictions. Take features S: the features you are interested in, and features C: the ones you do not want to compute the dependence for, which are kept intact.

To understand PDPs better, here is an excerpt from Christoph Molnar’s book:

The partial function tells us for the given value(s) of features S what the average marginal effect on the prediction is. In this formula, features C are actual feature values from the dataset for the features in which we are not interested, and n is the number of instances in the dataset. An assumption of the PDP is that the features in C are not correlated with the features in S. If this assumption is violated, the averages calculated for the partial dependence plot will include data points that are very unlikely or even impossible.
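The formula the excerpt refers to is the standard Monte Carlo estimate of the partial dependence function (notation as in Molnar’s book):

```latex
\hat{f}_S(x_S) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}\!\left(x_S,\, x_C^{(i)}\right)
```

In words: fix the features in S at the value x_S, plug in each observed combination of the features in C, and average the model’s predictions.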

In simpler terms, if you play around with a feature that is highly correlated with other features in the dataset, you are distorting the reality of the data. Changing one feature while keeping a correlated feature unchanged produces combinations that are unrealistic, effectively an ‘unknown’ region of the feature space.

The feature values x_S are binned between their minimum and maximum. Then, for each bin value, the feature is set to that value for every row in a batch of, say, 20 test rows, and the PDP at that value is the mean prediction probability of the model over those rows.

The basic idea behind the PDP is that if you vary one feature while keeping the others intact, the change in the model’s predictions tells you how much the model is impacted by a change in that feature.
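Here is a rough, hand-rolled sketch of that computation for a binary classifier; `X_test` and the pandas-style indexing are assumptions:

```python
import numpy as np

def partial_dependence(model, X, feature, n_grid=20):
    """Hand-rolled PDP: for each grid value of `feature`, set every row's
    value of that feature to the grid value, keep the other features
    intact, and average the model's predicted probability."""
    grid = np.linspace(X[feature].min(), X[feature].max(), n_grid)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[feature] = value          # other features stay untouched
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return grid, np.array(averages)

# e.g. grid, pdp = partial_dependence(model, X_test, 'f18')
```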

The red curve shows the global relationship between the feature f18 and the model’s prediction. The decreasing curve of the prediction probability against the feature’s value confirms our interpretation that this feature affects the prediction negatively!

The curve is nonlinear but monotonically decreasing, and now it is interpretable. I can now turn this baby product into explainable text insights termed ‘reason codes’. Reason codes are nothing but sentences that summarize your data for a customer.

An interesting advantage is that reason codes can help you see whether high weight is being given to problematic inputs such as gender or age-related features, so you can remove the model’s bias towards sensitive attributes.

ICE plots show the local dependence between a feature and the target at each individual test row. We will look into ICE plots in upcoming posts! If you liked this article, make sure you check out the other interesting blogs on iSchoolConnect’s Medium page :)
