The fundamental series of articles for applying explainability in your ML models: Part 1

Eduardo González Mesones
Published in LatinXinAI · 7 min read · Nov 13, 2023

As the first part of the series, here you will see the most useful methods for building a clear understanding of the predictions your model is making, so that you can turn this information into value.

In this article I will first discuss the different approaches we can take to understand our model, and then demonstrate how to implement the simulation method in Python.

METHODS 📋

Global methods:

Global interpretability methods focus on understanding the model as a whole and provide an overview of the relative importance of features in the model’s decision-making.

  • Shapley Values
  • Permutation

Local methods:

Local interpretability methods focus on understanding how a specific decision is made for an individual data instance. These methods aim to provide specific explanations for a particular prediction, without necessarily considering the global structure of the model.

  • Lime

Let’s dive🥽

Shapley values

Let's understand it using an example.

Imagine we want to predict whether a person will default, and we have a few simple variables to infer it: age, income, credit history, and debt. First, we look at how the default prediction is influenced by each variable separately (playing alone). Then, we examine how combinations of two variables affect it (for example, age and income). Finally, we observe how all of them together impact the decision.

Contribution Measurement: For each combination, we measure how much the default decision changes by adding an additional variable. How much does each variable contribute to the outcome?

Fair Distribution: We calculate a weighted average of these contributions, considering how many times each variable appeared in different combinations.

Final Result: Each variable (age, income, credit history) receives a value indicating how much it contributes to the decision of predicting default or not.
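To make this concrete, here is a minimal sketch, not taken from this article, of how Shapley values can be computed for a tree-based model with the shap library. The data, model, and feature names are hypothetical stand-ins for the default example.

import numpy as np
import pandas as pd
import shap
import xgboost

# Hypothetical stand-in data for the default example
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Edad': rng.normal(30, 5, 500),
    'Ingresos': rng.normal(50000, 10000, 500),
    'Deuda': rng.normal(2000, 500, 500),
    'Historial_Crediticio': rng.integers(0, 3, 500)
})
y = rng.integers(0, 2, 500)

# Any tree ensemble works here; XGBoost is used only for illustration
model = xgboost.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree models
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one value per instance and feature (log-odds units)

# Global view: mean absolute Shapley value per feature
print(pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns).sort_values(ascending=False))

The same shap_values array can also be read row by row, which is why Shapley values double as a local explanation for a single client.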

Permutation

Each feature in your data (such as age, income, credit history, etc.) contributes in some way to the model’s ability to make predictions. The feature permutation method is like an experiment. You take a feature, let’s say “credit history,” and shuffle (permute) its values randomly so that the credit-history values are no longer matched to the right individuals in your data. Then, you observe how this shuffling affects the model’s performance.

If, after permutation, the model performs significantly worse in its predictions, it indicates that the “credit history” feature was originally crucial for the model. If it doesn’t change much, perhaps that feature was not as crucial. In summary, the permutation method helps you understand how much your model depends on each feature.
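As a minimal sketch, this experiment is already implemented in scikit-learn as permutation_importance; the data and model below are hypothetical stand-ins.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Edad': rng.normal(30, 5, 500),
    'Ingresos': rng.normal(50000, 10000, 500),
    'Deuda': rng.normal(2000, 500, 500),
    'Historial_Crediticio': rng.integers(0, 3, 500)
})
y = rng.integers(0, 2, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature several times and measure the average drop in score
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
for name, drop in sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f'{name}: {drop:.4f}')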

Lime

The process for the implementation of LIME is as follows:

  1. Sample Creation: LIME starts by creating slightly modified versions of the point of interest by introducing small random perturbations to the data. These perturbations simulate different versions of the same client.
  2. Local Predictions: For each perturbed version of the client, LIME uses the original machine learning model to make predictions, obtaining a set of local predictions.
  3. Simple Explanatory Model: LIME fits a simple explanatory model, such as a linear regression, to mimic the behavior of the original model in the vicinity of the point of interest. This simple model is easily interpretable.
  4. Feature Importance: By observing how local predictions change when varying the features, LIME determines the relative importance of each feature in the model’s prediction for the point of interest.
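Here is a minimal sketch of these four steps with the lime package (assuming it is installed, e.g. with pip install lime); the data, model, and class names below are hypothetical stand-ins.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# Hypothetical stand-in data and model
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Edad': rng.normal(30, 5, 500),
    'Ingresos': rng.normal(50000, 10000, 500),
    'Deuda': rng.normal(2000, 500, 500),
    'Historial_Crediticio': rng.integers(0, 3, 500)
})
y = rng.integers(0, 2, 500)
model = RandomForestClassifier(random_state=42).fit(X, y)

# Steps 1-3: LIME perturbs the instance, queries the model, and fits a local linear model
explainer = LimeTabularExplainer(
    X.values,
    feature_names=list(X.columns),
    class_names=['No default', 'Default'],
    categorical_features=[3],  # index of Historial_Crediticio
    mode='classification'
)
explanation = explainer.explain_instance(X.values[0], model.predict_proba, num_features=4)

# Step 4: feature importances for this single prediction
print(explanation.as_list())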

Interaction between features

A significant difference between individual contribution and joint contribution suggests a substantial interaction between variables. In other words, the effect of variables together is not simply the sum of their individual effects.

Strong Interaction: If the joint contribution is much greater or much less than the sum of individual contributions, this indicates a strong interaction. Strong interaction implies that variables are influencing each other in a way that cannot be fully explained by considering their effects separately.

Complementarity or Competition: You can interpret the direction of the difference. If the joint contribution is significantly greater than the sum of individual contributions, it could indicate complementarity (variables work together to amplify their impact). If it is significantly less, it could indicate competition (variables compete with each other).
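One illustrative way to quantify this, not taken from this article, is through SHAP interaction values, which TreeExplainer can compute for tree ensembles; the data and model are the same kind of hypothetical stand-ins used in the earlier sketches.

import numpy as np
import pandas as pd
import shap
import xgboost

# Hypothetical stand-in data and model
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Edad': rng.normal(30, 5, 500),
    'Ingresos': rng.normal(50000, 10000, 500),
    'Deuda': rng.normal(2000, 500, 500),
    'Historial_Crediticio': rng.integers(0, 3, 500)
})
y = rng.integers(0, 2, 500)
model = xgboost.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)

# Interaction values: an (n_samples, n_features, n_features) array whose
# off-diagonal entries capture pairwise interaction effects
interactions = shap.TreeExplainer(model).shap_interaction_values(X)

# Mean absolute interaction strength between each pair of features
print(pd.DataFrame(np.abs(interactions).mean(axis=0), index=X.columns, columns=X.columns).round(3))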

Simulations

  1. Compare with logical/expected results.
  2. Find variable importance.
  3. Establish scenarios: How do probabilities change under certain scenarios?
  4. Validate your model.

Errors

  1. Error Analysis: Analysis of errors due to the individual absence of each variable.
  2. Calculation of Individual Log Loss
  3. Calculation of SHAP Values
  4. Transformation of SHAP Values to Probabilities
  5. Calculation of Mean Contribution
  6. Calculation of Log Losses by Removing Each Variable
  7. Calculation of Mean Losses
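Below is a minimal sketch of steps 2, 6, and 7, where "removing" a variable is approximated by replacing it with its training mean (one common convention, not necessarily the exact one intended above); the data and model are hypothetical stand-ins.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data and model
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Edad': rng.normal(30, 5, 500),
    'Ingresos': rng.normal(50000, 10000, 500),
    'Deuda': rng.normal(2000, 500, 500),
    'Historial_Crediticio': rng.integers(0, 3, 500)
})
y = pd.Series(rng.integers(0, 2, 500))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Step 2: individual log loss per test instance
probs = model.predict_proba(X_test)[:, 1]
individual_log_loss = [log_loss([t], [p], labels=[0, 1]) for t, p in zip(y_test, probs)]
print('Mean log loss:', np.mean(individual_log_loss))

# Steps 6-7: mean log loss after "removing" each variable
# (for a categorical code such as Historial_Crediticio, the mode may be more appropriate)
for feature in X.columns:
    X_removed = X_test.copy()
    X_removed[feature] = X_train[feature].mean()
    p_removed = model.predict_proba(X_removed)[:, 1]
    print(feature, 'removed ->', round(log_loss(y_test, p_removed, labels=[0, 1]), 4))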

LET'S GO PYTHONIC 🐍

I will explain the simulation method step by step.

  1. LOAD DATA (Easy and classy imports)
import numpy as np
import xgboost
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# DATA EXAMPLE GENERATION
np.random.seed(42)
data = pd.DataFrame({
    'Edad': np.random.normal(30, 5, 1000),
    'Ingresos': np.random.normal(50000, 10000, 1000),
    'Deuda': np.random.normal(2000, 500, 1000),
    'Historial_Crediticio': np.random.choice(['Bueno', 'Regular', 'Malo'], size=1000),
    'Impago': np.random.choice([0, 1], size=1000)
})

# FROM CATEGORICAL TO NUMERICAL
data['Historial_Crediticio'] = data['Historial_Crediticio'].astype('category').cat.codes.astype('int')

# DIVIDE INTO X AND y
X = data.drop('Impago', axis=1)
y = data['Impago']

# Train , test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random forest model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Model evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

2. CREATE FUNCTION

# Function to perform Risk Sensitivity Analysis
def risk_sensitivity_analysis(model, instance, feature_ranges, n_simulations=1000):
    """
    Performs Risk Sensitivity Analysis for a given instance.
    """
    simulations = []  # List to store simulation results
    lista = [0, 1, 2]  # Possible codes for the 'Historial_Crediticio' feature
    perturbed_data = pd.DataFrame(columns=instance.index)  # DataFrame to store perturbed data

    # Perform n_simulations perturbations
    for _ in range(n_simulations):
        perturbed_instance = instance.copy()

        # Perturb each feature based on the specified feature_ranges
        for feature, (lower, upper) in feature_ranges.items():
            if feature == 'Historial_Crediticio':
                perturbed_value = np.random.choice(lista)
            else:
                perturbed_value = np.random.uniform(lower, upper)

            perturbed_instance[feature] = perturbed_value

        # Add the perturbed instance to the DataFrame
        perturbed_data = pd.concat([perturbed_data, perturbed_instance.to_frame().T], ignore_index=True)

        # Make a prediction on the perturbed instance and store the probability of the positive class
        simulations.append(model.predict_proba(perturbed_instance.values.reshape(1, -1))[:, 1])

    # Combine perturbed data and simulation results into a DataFrame
    return pd.concat([perturbed_data, pd.DataFrame(simulations, columns=['Predicted_Probability'])], axis=1)

# Specify features and ranges for Risk Sensitivity Analysis
# (the keys must match the column names the model was trained on)
feature_ranges = {
    'Edad': (25, 35),
    'Ingresos': (45000, 55000),
    'Deuda': (1500, 2500),
    'Historial_Crediticio': (0, 2)  # Codes for Bueno, Regular, Malo
}

3. LET'S ANALYZE A REAL CASE

instance_to_analyze = X_test.iloc[0]
simulations = risk_sensitivity_analysis(model, instance_to_analyze, feature_ranges)
y_test_prueba = pd.DataFrame(y_test.tolist(), columns=['real prob'])
X_prueba = X_test.copy().head(1)
X_prueba
print(y_test_prueba.loc[[0]])

4. UNDERSTAND THE VARIABLES

import seaborn as sns

sns.pairplot(data=data, hue='Impago', vars=['Ingresos', 'Edad'])
X_referencia = pd.concat([X_prueba.reset_index(), y_test_prueba.loc[[0]]], axis=1)
print(X_referencia)
Here is our reference case.

5. WHAT HAPPENS IF WE CHANGE DATA?

simulations
Here we have 1,000 rows of data simulated from this case, showing how changes in the features affect the prediction.

6. IMPLEMENTATION TO SEE REAL IMPACT

from sklearn.metrics import log_loss

def scenario_analysis(original_instance, simulations, model, target_column):
    """
    Conducts a scenario analysis by comparing a specific record with simulations.

    Parameters:
        original_instance: DataFrame with the original record.
        simulations: DataFrame with the simulations.
        model: the trained model.
        target_column: name of the target column.

    Returns:
        DataFrame summarizing predictions for the original record and the simulations.
    """
    # Original prediction
    original_prediction = model.predict_proba(original_instance)[:, 1][0]
    # Original log loss (uses the true label of the analysed record)
    original_log_loss = log_loss([y_test.values[0]], [original_prediction], labels=[0, 1])

    # DataFrame with the original scenario data
    results_df = pd.DataFrame({'Escenario': 'Original',
                               'Predicción': original_prediction,
                               'Log Loss': original_log_loss}, index=[0])

    # Loop over the simulations
    for i, simulated_instance in simulations.iterrows():
        # Prediction for the simulated scenario
        simulated_prediction = model.predict_proba(simulated_instance.values.reshape(1, -1))[:, 1][0]

        # Log loss of the simulated scenario
        simulated_log_loss = log_loss([y_test.values[0]], [simulated_prediction], labels=[0, 1])

        # Difference between the simulated and the original feature values
        diff_variables = simulated_instance - original_instance.values[0]
        diff_variables = diff_variables.add_prefix('dif_')

        # Append the scenario info to the results DataFrame
        results_df = pd.concat([results_df, pd.DataFrame({'Escenario': [f'Simulación {i+1}'],
                                                          'Predicción': [simulated_prediction],
                                                          'Log Loss': [simulated_log_loss],
                                                          **diff_variables})],
                               ignore_index=True)

    return results_df


# Select an X_test instance
index_of_interest = 0

# Get the original record and reuse the simulations from the previous step
original_instance = X_test.iloc[[index_of_interest]]

# Scenario analysis results (drop the Predicted_Probability column from the simulations)
analysis_results = scenario_analysis(original_instance, simulations.iloc[:, :-1], model, 'Impago')

print(analysis_results)
Here are the changes in variables and predictions

📌THIS IS VALUABLE FOR….

  • Scenario Evaluation
  • Validation of Your Model’s Correct Performance
  • Determining Variable Importance
  • Detailing how the values of certain variables modify the model’s predictions.

Being able to explain the predictions your ML model produces is necessary, and very effective, for delivering complete work. Throughout this article I have explained different methods that can be used to address this problem. I covered global methods, which let us observe which of our variables are the most discriminant in the model, whether they push the prediction towards or away from the target variable (default), and how small variations (simulations) in the data modify the model’s prediction; and local methods, which let you see, for specific and real rows of data, what makes the model decide one way or another.


Do you identify as Latinx and work in artificial intelligence, or do you know someone who is Latinx and works in artificial intelligence?

Don’t forget to hit the 👏 below to help support our community — it means a lot!
