Machine Learning Model Explanation

Abubakar Alaro
The Startup
Published in
11 min readSep 4, 2020

A Case Study of Shap and pdp using Diabetes dataset

Explaining Machine Learning Models has always been a difficult concept to comprehend in which model results and performance stay black box (hidden). In this post, I seek to explain the result of a ML model I built with the help of two major libraries SHAP and Partial Dependence Plot. The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on. The major goal of the data was to answer the question “Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes”.

The steps took to build the machine learning model are as follows:

  1. Exploratory Data Analysis (Uni-variate and Bi-variate Analysis)

2. Feature Engineering and Feature Selection

3. Baseline and comparison of several ML models on a metric

4. Performing hyperparameter Tuning on the Selected Algorithm

5. Interpretation of the model Result

Exploratory Data Analysis

Exploring Data is a paramount and important step for any data science project. This will enable you (Data Scientist) to understand the kind of features you are working with and how it relate to the question you are trying to answer. EDA can be broadly split into Uni-variate Analysis and Bi-variate Analysis.

Uni-variate Analysis broadly entails checking the feature distribution, this can help answer the question if the feature distribution is skewed or if there are missing values in the feature.

Bi-variate Analysis is concerned with two features to explore how they relate with each other. Generally, it is advised to take one (1) feature and the target variable as this will save time and effort especially if a model is to be built. The complete analysis can be found here

Statistical feature Summary

Let us Understand the describe function of pandas
Quick Question: What does it mean to have a standard deviation (std) close to the mean? Firstly, what is std? Standard deviation is a number used to tell how measurement for a group are spread out from the average (mean), or expected value. A low std means that most of the numbers are closer to the meanwhile a higher std means that the numbers are more spread out.
So to answer the question what it means to have a std closer to mean like mean is 100 and std is 50, then the data in this group are widely spread out and if a group has mean = 100 and std = 10, then the data is marginally spread out that is the data has a long cone when visually checked out, that is the data are within a small range.

Another Quick Question: What do you understand by 25% and 75% of a group of data?
so to answer that we need to know what percentile really means? so what is percentile?
Percentile is defined in various ways, percentile is a number where a certain percentage of scores fall below or equal to that number. so in a test you score 67 out of 90 and you fall into the 90th percentile, that means you scored better than 90% of people who took the test. the 25th percentile is called first quartile, the 50th percentile is called the median and 75th percentile is called the third quartile. The difference between the third and first quartile is the interquartile range.

So to answer the question, 25% means 25% of the data falls below that value and the 75% means 75% of the data falls below that value.

According to the site where the data was gotten from, 0 are used to represent missing values so when we check for missing with isnull() we get nothing we need to check for zero and need to be careful doing so too because we don't want to change the target values.

Distribution of pregnancy count
Box plot Glucose distribution separated by target value
Pregnancies count separated by Target values

Feature Engineering

This is the process of generating/ adding new features to already existing features either by aggregating or transforming features in other to improve the performance of the model. Log transform was taken to correct the data skewness.

df['log_SkinThickness'] = np.log(df['SkinThickness'])df['log_Insulin'] = np.log(df['Insulin'])df['log_BMI'] = np.log(df['BMI'])df['log_Age'] = np.log(df['Age'])df.head()
Feature Engineering
df['BloodSugar'] = np.abs(df['Insulin'] - df['Glucose'])df['high_BMI'] = df['BMI'].apply(lambda x: 1 if x > 30 else 0)cm = np.corrcoef(df[cols].values.T)f, ax = plt.subplots(figsize=(14, 12))sns.heatmap(cm, vmax=.8, linewidths=0.01, square=True, annot=True, cmap='viridis', linecolor='white',xticklabels=cols.values, annot_kws={'size':12}, yticklabels=cols.values)

Select Features for model

cols = cols.drop(['Outcome'])features = df[cols]target = df['Outcome']features.head()
Features for modelling

Performance Metric = Accuracy

The baseline for this model based on the chosen metric will be 70%. What this means is that any algorithm giving accuracy less than 70% will be discarded, at the long run the algorithm with the highest accuracy score will be selected. The data was scaled using Standard Scaler to reduce model bias.

Establishing a Baseline Model

A Baseline Model is a simple model that is to be improved. It is basically, building a model using the default parameters of an algorithm to understand the performance and detect some important features. Some Models that were built include Logistic Regression, Random Forest, Gradient Boosting, K-nearest Neighbors.

Logistic Regression

log_reg = linear_model.LogisticRegression()list_scores = []log_reg.fit(features, target)log_reg_score = cross_val_score(log_reg, features, target, cv=10, scoring='accuracy').mean()print(log_reg_score)list_scores.append(log_reg_score)0.7695488721804511

KNeighbours Classifier

cv_scores = []# --- number of folds ---folds = 10#---creating odd list of K for KNN--ks = list(range(1,int(len(features) * ((folds - 1)/folds)), 2))#---perform k-fold cross validation--for k in ks:knn = neighbors.KNeighborsClassifier(n_neighbors=k)score = cross_val_score(knn, features, target, cv=folds, scoring='accuracy').mean()cv_scores.append(score)#---get the maximum score--knn_score = max(cv_scores)#---find the optimal k that gives the highest score--optimal_k = ks[cv_scores.index(knn_score)]print(f"The optimal number of neighbors is {optimal_k}")print(knn_score)0.7747436773752565list_scores.append(knn_score)The optimal number of neighbors is 17

Other Models that were built include: Gradient Boost Classifier, Support Vector Machine and Random Forest.

Selecting Best Model

model performance based on accuracy

K-Neighbor performance as measured by accuracy is the best model and i moved to hyper-parameter tuning.

Performing Hyper parameter Tuning on the Selected Algorithm

cv_scores = []# --- number of folds ---folds = 30#---creating odd list of K for KNN--ks = list(range(1,int(len(features) * ((folds - 1)/folds)), 2))#---perform k-fold cross validation--for k in ks:    knn = neighbors.KNeighborsClassifier(n_neighbors=k)    score = cross_val_score(knn, features, target, cv=folds,            scoring='accuracy').mean()   cv_scores.append(score)#---get the maximum score--knn_score = max(cv_scores)#---find the optimal k that gives the highest score--optimal_k = ks[cv_scores.index(knn_score)]print(f"The optimal number of neighbors is {optimal_k}")print(knn_score)# result.append(knn_score)The optimal number of neighbors is 19New Accuracy after Hyper-parameter tune is 0.80

Interpreting the model Result

Permutation Importance using ELI5 library

What features does a model think are important? What features might have a greater impact on the model predictions than the others? This is called feature importance and permutation importance which are techniques used widely for calculating feature importance. It helps to see when our model produces counter-intuitive results and it helps to show the others when our model is working as we’d hope.

The idea is simple: Randomly permutate or shuffle a single column in the validation dataset leaving all the other columns intact. A feature is considered important if the model’s accuracy drops a lot and causes an increase in error when it not included as a feature to build the model. On the other hand, a feature is considered ‘unimportant’ if shuffling its values do not affect the models accuracy.

Permutation importance is useful for debugging, understanding your model, and communicating a high-level overview from your model. Permutation importance is calculated after a model has been fitted. ELI5 is a python library which allows to visualize and debug various ML models using unified API. it has built-in support for several ML frameworks and provides a way to explain black-box models

# calculating and Displaying importance using the eli5 libraryplt.figure(figsize=[10,10])import eli5from eli5.sklearn import PermutationImportanceperm = PermutationImportance(model, random_state=1).fit(X_test, y_test)eli5.show_weights(perm, feature_names=X_test.columns.tolist())
Feature Importance using ELi5

Interpretation

The features at the top are the most important and at the bottom, the least. For example, Glucose level is the most important feature which can decide whether a person will have diabetes, which also makes sense.
The number after +/- measures how performance varied from one reshuffling to the next. Some weights are negative which means that the feature does not have high impact in deciding whether or not patient has diabetes.

Partial Dependence Plots

The partial dependence plot (pdp) shows the marginal effect one or two features have on the predicted outcome of a ML model. PDP shows how a feature affects predictions. PDP can show the relationship between the target and the selected features via 1d or 2D plots.

from pdpbox import pdp, get_dataset, info_plotsfeatures_name = [i for i in features.columns]pdp_goals = pdp.pdp_isolate(model=model, dataset=X_test, model_features=features_name, feature='Glucose')pdp.pdp_plot(pdp_goals, 'Glucose')plt.show()
Glucose level impact as explained by pdp

From the chart above, glucose level below 100, the probability of a patient having diabetes is low. As the Glucose level increases beyond 110, the chance of patients having diabetes increases rapidly from 0.2 through 0.4 at Glucose level 140 and impulsively to 0.8 at Glucose level 200.

Glucose level beyond 110 poses patients at high risk of having diabetes while if below 100, the risk is low.

pdp_goals = pdp.pdp_isolate(model=model, dataset=X_test, model_features=features_name, feature='log_Insulin')
pdp.pdp_plot(pdp_goals, 'log_Insulin')
insulin chart

The Y-axis represents the change in prediction from what it would be predicted at the baseline or leftmost value. Blue area denotes the confidence interval.
For the ‘Insulin’ graph, we observe that probability of a person having diabetes slightly increases continuously as the Insulin level goes up and then remains constant.

pregnancy count chart

The chart above says if number of pregnancies is greater than 7, probability of having diabetes will increase by 0.1. This means that, as the number of pregnancy increases beyond 8, the probability of having diabetes also increases.

age pdp chart

From the Age plot above,it can be observed that younger people has lower chance of having diabetes. The age range (27–29) experience a slight downward trend in probability of having diabetes. According to healthline, Middle-aged and older adults are still at the highest risk for developing type-2 diabetes. In 2015, adults aged 45 to 64 were the most diagnosed age group for diabetes and it conforms to what the model explains as the probability of having diabetes continuously increases as the age of patient increases beyond 49 years.

Using SHAP values to explain model results

SHAP stands for SHapley Additive exPlanation. It helps to break down a prediction to show the impact of each feature. it is based on Shapley values, a technique used in game theory to determine how much each player in a collaborative game has contributed to its success. Normally, getting the trade-off between accuracy and interpreting model results the right way can be a difficult task but with SHAP values, it easy to achieve both.

SHAP is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions.

The plot shows Glucose to have much impact on having diabetes and pregnancies follows suite in decreasing order

At a lower Glucose level and a small number of pregnancy, the chance/probability of having diabetes is low. As the value of Glucose level increases the probability also increases which further suggest that higher Glucose level means higher probability of having diabetes.

This shows that Glucose interact with Pregnancies frequently

Glucose and number of pregnancy are good predictors to determine if a patient has diabetes or not and it can be concluded that the higher their values are the higher the chance of having diabetes.

With the chart above, you can interact with some of the features used to build the model and see how each vary and affect the decision of the model.

Recommendation and Conclusion

I will like to start with knowing about the data and give a summary about the findings

Know about the data

Explain Findings

Recommendation

Know about the data This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

Can you build a machine learning model to accurately predict whether or not the patients in the dataset have diabetes or not? I believe i have answered this question.

Explain Findings Features like Glucose and Blood Pressure are normally distributed while Features like Age, BMI, Insulin, Skin Thickness are Skewed. The Skewed features was handled by taking the logarithm of the values and that was how it was use to build the model.
While Interpreting the result of the model, Assumptions like if pregnancy is more than 7, increases the chance of having diabetes was verified to be true. Also assumption that: if glucose level is high i.e glucose > 140, then probability of having diabetes is high or increases was also verified to be true.
it was discovered that features like Glucose contribute to people having diabetes, also high BMI will increase the chance of having diabetes. if Insulin level is high, the chance of having diabetes will reduce. Also, a low level of Glucose and small value of Pregnancy will reduce the chance of having diabetes.

Recommendation

a. Always do Cardio at least 20 mins per day to reduce the Glucose level

b. Eat bitter leafs and reduce the intakes of sweet things like snacks

c. Do Child Control and use condoms if you do not wish to get pregnant

d. Tell your Husband he don do nah you wan kill me :)

This was put together as part of effort to explain black box machine learning models, feel free to reach me on Twitter.

Special Thanks to Yabebal, 10Academy Team and Babatunde. The complete notebook can be found here

--

--