Machine Learning Model Evaluation and Interpretation with Python

Răzvan Fișer
softplus-publication
7 min read · Oct 22, 2021

The Machine Learning Black Box

Machine Learning models are widely seen as mysterious black boxes that receive some sort of input and “spew” out an output. Very often one has no idea what happens inside this box, and therefore the output cannot be justified. You might also need to explain to a non-machine-learning-enthusiast why a certain prediction was made, so it is useful to show why and how certain features “nudge” the model in a certain direction, because that is a very straightforward and simple way to look at the problem.

It is also worth noting that there is usually a trade-off between model accuracy and interpretability: as a rule of thumb, the more accurate (and therefore more complex) a model’s predictions are, the less interpretable the model tends to be.

The inverse relation between Performance and Explainability (Interpretability) of Machine Learning Models

Linear Models

Linear models are among the easiest to explain, since they take a matrix that contains our observations of each feature as input and learn a vector of weights, each corresponding to a feature (a column) of that matrix. So the whole thing becomes quite simple: the larger the absolute value of a weight, the more important its corresponding feature (assuming the features are on comparable scales). If a weight is really close to 0, then its feature is quite unimportant, since we could approximate the model really well by simply excluding it.

Let’s quickly train a Linear Regression model on the UCI Machine Learning Red Wine Quality Dataset, which relates various wine measurements to wine quality (I assume that you are following along in a Jupyter notebook):
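A minimal sketch of that step might look like this, assuming the dataset has been downloaded locally as winequality-red.csv (the file name and the 80/20 split are assumptions; the UCI file is semicolon-separated):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the Red Wine Quality data; adjust the path to your copy of the file.
wine = pd.read_csv("winequality-red.csv", sep=";")

X_wine = wine.drop(columns="quality")
y_wine = wine["quality"]

Xw_train, Xw_test, yw_train, yw_test = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=42
)

lin_reg = LinearRegression()
lin_reg.fit(Xw_train, yw_train)
```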

Now that we’ve trained the model, we can actually access the weight vector of the newly created model. Then we can simply make a bar plot to visually compare the importance of each feature.
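Reusing the lin_reg model from the sketch above, the weights live in lin_reg.coef_ and can be plotted roughly like this:

```python
import matplotlib.pyplot as plt

# Pair each learned weight with its feature name and sort for readability.
weights = pd.Series(lin_reg.coef_, index=X_wine.columns).sort_values()

weights.plot.barh()
plt.xlabel("weight")
plt.title("Linear Regression weights on the Red Wine dataset")
plt.tight_layout()
plt.show()
```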

This will produce a result similar to the plot below, from which we can gather that the density feature of our dataset is the most important and it tends to negatively impact the quality of a red wine.

A bar plot of the various weights obtained after training a Linear Regression model on the Red Wine Dataset

Interpreting Classification Models

Confusion matrices are the most basic way of interpreting a classification model. For a binary classifier they hold four important values that describe its results: True Positives (TP), False Positives (FP), False Negatives (FN) and True Negatives (TN). When we plot a confusion matrix, we would like the values on the main diagonal (TP and TN) to be as large as possible. In the following code we train a Logistic Regression model on the Breast Cancer Wisconsin (Diagnostic) Data Set, which can be used to predict whether a tumor is malignant or benign, and then we plot a confusion matrix based on the predictions on the test set:
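A sketch of that workflow, assuming the copy of the dataset that ships with scikit-learn (the original may load it from a CSV instead):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score
from sklearn.model_selection import train_test_split

# In scikit-learn's encoding, 0 = malignant and 1 = benign.
X_cancer, y_cancer = load_breast_cancer(return_X_y=True, as_frame=True)

Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42
)

log_reg = LogisticRegression(max_iter=10_000)
log_reg.fit(Xc_train, yc_train)

yc_pred = log_reg.predict(Xc_test)
print("accuracy:", accuracy_score(yc_test, yc_pred))

# Plot the confusion matrix for the test-set predictions.
ConfusionMatrixDisplay.from_predictions(yc_test, yc_pred)
plt.show()
```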

The accuracy score alone is not very descriptive of what happened. What if the target values are imbalanced and the model learned to predict only benign tumors? That would still lead to a high accuracy score if benign tumors are heavily overrepresented. This is where the confusion matrix comes in handy:

Confusion Matrix of Logistic Regression model on the Breast Cancer Dataset

We can see above that the classes were in fact predicted pretty well: although they are imbalanced, the model did not develop any “preference” towards predicting benign tumors.

Since we are talking about cancer, notice that we’ve predicted that four people have benign tumors, although they actually had malignant ones. This can prove very damaging to a person’s life. Perhaps it would be a good idea to “nudge” the model so that it has a slightly stronger tendency to predict malignant tumors, so that we catch all of the cases where people are actually very sick, even though we might falsely categorize some people as having cancer. The cost of a life is greater than the cost of an additional check-up.

This is why we might need an additional metric, which is called recall. This is how we calculate it:
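Recall = TP / (TP + FN)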

Source: https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9

Notice that when the number of false negatives is small, the recall gets closer to 1. This is exactly what we want to achieve when diagnosing tumors: to avoid leaving cancer untreated because of a false negative.

On the other hand, precision (calculated as TP / (TP + FP), so it is high when there are few false positives) might be of more importance to us if we want to predict which clients will leave a mobile phone service. Companies tend to call clients they think will leave and offer them a better deal, but since they have a lot of clients, they might want to save resources and call a smaller number of them. In this way they avoid calling and giving a better deal to people who wouldn’t have left anyway, even if in doing so they miss some people who are actually about to leave.

This is how to get the precision and recall scores from sklearn:
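Both scores can be computed from the test-set predictions above. By default they refer to the positive class, which is label 1 (“benign”) in scikit-learn’s encoding, so pass pos_label=0 to score the malignant class:

```python
from sklearn.metrics import precision_score, recall_score

# Scores for the positive class (label 1 = benign in this encoding).
print("precision:", precision_score(yc_test, yc_pred))
print("recall:   ", recall_score(yc_test, yc_pred))

# Recall for the malignant class, which is what we care about here.
print("recall (malignant):", recall_score(yc_test, yc_pred, pos_label=0))
```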

Interpreting Regression Models

Besides the basic metrics, such as the Mean Squared, Mean Absolute and Root Mean Squared Errors, which are computed in the code below, we can use a seaborn lmplot to compare how the predicted and true values grow together. lmplot will plot them and automatically try to fit a straight line through the points; the closer this line is to the 45-degree diagonal, the better the predictions are. This code applies the above-mentioned metrics and lmplot to the Red Wine Dataset:
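A sketch of those metrics and the lmplot, reusing the lin_reg model and the wine train/test split from earlier:

```python
import numpy as np
import seaborn as sns
from sklearn.metrics import mean_absolute_error, mean_squared_error

yw_pred = lin_reg.predict(Xw_test)

mse = mean_squared_error(yw_test, yw_pred)
mae = mean_absolute_error(yw_test, yw_pred)
rmse = np.sqrt(mse)
print(f"MSE: {mse:.3f}  MAE: {mae:.3f}  RMSE: {rmse:.3f}")

# Compare predicted vs. true quality; lmplot fits a regression line through the points.
results = pd.DataFrame({"true quality": yw_test, "predicted quality": yw_pred})
sns.lmplot(data=results, x="true quality", y="predicted quality")
```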

lmplot on the Red Wine dataset

So apparently we haven’t done a very good job with our predictions: for small true values we get both large and small predicted values, so the predicted values hardly grow with the true values at all. This is a sign that we must do some further work on our dataset in order to get better results.

Agnostic Methods (LIME and SHAP Interpretability)

Model-agnostic methods such as LIME and SHAP do not care what kind of model we are trying to interpret, as the name suggests. They try to describe the influence of each feature, and of the values that feature takes, on the outcome.

From now on we are going to work with the UCI Heart Disease Dataset, which we preprocess in an elementary fashion like the other datasets.
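A hypothetical setup sketch: the file name heart.csv, the target column name and the choice of a random forest classifier are all assumptions, since any model works with the agnostic methods below:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; adjust them to your copy of the UCI data.
heart = pd.read_csv("heart.csv").dropna()

X_heart = heart.drop(columns="target")   # "target": 1 = heart disease, 0 = no disease
y_heart = heart["target"]

Xh_train, Xh_test, yh_train, yh_test = train_test_split(
    X_heart, y_heart, test_size=0.2, random_state=42
)

# The interpretation methods below are model-agnostic; a random forest is just one choice.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(Xh_train, yh_train)
```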

LIME Interpretability

We can explain a single entry from the UCI Heart Disease Dataset with LIME by first creating an explainer, then calling its .explain_instance method on that entry, and finally plotting the result:
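A sketch with the lime package, assuming the rf model and the heart-disease splits from above; the row index and the number of features shown are arbitrary choices:

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=Xh_train.values,
    feature_names=Xh_train.columns.tolist(),
    class_names=["no disease", "disease"],
    mode="classification",
)

exp = explainer.explain_instance(
    Xh_test.values[0],      # the single entry we want to explain
    rf.predict_proba,       # LIME needs class probabilities, not hard labels
    num_features=10,
)

exp.as_pyplot_figure()      # or exp.show_in_notebook() inside Jupyter
```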

The plot of a LIME instance explanation

SHAP Interpretability

SHAP interpretability is based on the game-theoretic concept of Shapley values. Without getting too technical, SHAP tries to explain the importance of each feature by considering the power set of the features, i.e. every possible subset of features, including the empty set, the full set and each feature on its own. Then the model is evaluated on each of these subsets of features, which in the exact formulation means training it 2^F different times, where F is the number of features in our dataset.

This means that SHAP might require a lot of time depending on the number of features and the size of our dataset (in practice the shap library relies on model-specific algorithms and sampling-based approximations rather than literally retraining the model 2^F times).
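For reference, the Shapley value of feature i is the weighted average of its marginal contributions over all feature subsets S that do not contain i:

φ_i = Σ_{S ⊆ features \ {i}} [ |S|! · (F − |S| − 1)! / F! ] · ( f(S ∪ {i}) − f(S) )

where F is again the number of features and f(S) is the model’s prediction when only the features in S are used.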

After we create an explainer, like we did with LIME, we can create a number of interesting plots:
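A sketch of the explainer step, assuming the rf model and the heart-disease splits from above; for a binary classifier some shap versions return one set of values per class, so we keep only the values for the positive (“disease”) class:

```python
import shap

explainer = shap.TreeExplainer(rf)   # shap.Explainer(rf, Xh_train) also works
shap_values = explainer(Xh_test)

# Some shap versions add a per-class dimension for classifiers;
# if so, keep the explanation for the positive class only.
if len(shap_values.shape) == 3:
    shap_values = shap_values[:, :, 1]
```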

Bar Plot

A SHAP bar plot made on the UCI Heart Disease Dataset

This resembles the bar plot we made for the linear model above, but this time the scores are SHAP values (the mean absolute SHAP value of each feature).
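With the shap_values computed above, the plot comes down to a single call:

```python
shap.plots.bar(shap_values)   # bar height = mean absolute SHAP value per feature
```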

Waterfall Plot

SHAP Waterfall plot on the UCI Heart Disease Dataset

This is similar to the LIME explain_instance plot above, but it shows more graphically how we start from a base value (0.5 here) and how each feature and its value pushes the prediction away from it. Different values of the same feature might nudge the model in different directions.
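Again using the shap_values from above, the waterfall plot explains a single row of the test set:

```python
shap.plots.waterfall(shap_values[0])   # explanation of the first test instance
```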

Bee Swarm Plot

Bee Swarm plot on the UCI Heart Disease Dataset

This one shows how different feature values (represented by color) impact the model in different ways. Points are stacked on top of one another to show density. Regions where both large and small values of a feature appear mixed together are problematic and unstable, because you cannot find an obvious correlation between the feature’s value and the direction it pushes the model towards.
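The corresponding call for the plot above:

```python
shap.plots.beeswarm(shap_values)
```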

Force Plot

SHAP Force plot on the UCI Heart Disease Dataset

The force plot again shows how, for a single instance, certain features push the prediction in a certain direction, similar to the Waterfall plot but illustrated differently.
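The force plot is rendered with JavaScript inside a notebook, so shap.initjs() is needed first; passing matplotlib=True gives a static image instead:

```python
shap.initjs()
shap.plots.force(shap_values[0])                     # interactive, in a notebook
# shap.plots.force(shap_values[0], matplotlib=True)  # static alternative
```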

This concludes our article on Model Metrics and Interpretation! I hope it has helped shed a little light on the mysterious black box of machine learning models!
