Exploring LIME (Local Interpretable Model-agnostic Explanations) — Part 2

Sze Zhong LIM · Published in Data And Beyond · Jan 13, 2024 · 9 min read

Much like squeezing lemons for a zesty taste, LIME extracts the essence of model predictions, offering a tangibly interpretable understanding from the complexity of AI.

Photo by Mariah Hewines on Unsplash

This is a continuation of Part 1. In Part 1, we covered the basics of LIME (how it is used, how it works, how it compares to SHAP, and the limitations of the packages). For Part 2, I will be sharing some code on how to use a PySpark based dataset and model with marcotcr's LIME module. You may find Part 1 here. You may also find other resources from the navigational index here.

Background

I was tasked to use the lime package by marcotcr to explain individual samples. The dataset was a pyspark.sql.DataFrame, and the model was from the mmlspark.lightgbm module. It wasn't straightforward, as there were wrappers around the code for reasons that are not relevant to our topic.

I will be focusing on TabularLIME instead of the other types of LIME.

One of the main limitations or problems I had with marcotcr’s LIME code was that it took in specific inputs and functions, which were based on NumPy arrays.

We can see in his example that he creates an explainer which takes in:
1) The train dataset, which is a numpy.ndarray with a shape of (no. of rows, no. of columns), e.g. (120, 4) for the iris dataset.
2) The feature names, which is a list of the feature / column names. In this case it is ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'].
3) The class names, which are actually the target labels. In this case they are the types of flower. It takes in a numpy array with the shape of (no. of labels,). Since the target labels are array(['setosa', 'versicolor', 'virginica'], dtype='<U10'), the shape is (3,).
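For reference, here is a minimal sketch along the lines of marcotcr's iris example (assuming scikit-learn and the lime package are installed; the exact parameters may differ from his notebook):

import lime.lime_tabular
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
train, test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, train_size=0.80, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(train, labels_train)

# Explainer built from the training array, feature names and class names.
explainer = lime.lime_tabular.LimeTabularExplainer(
    train,                               # numpy.ndarray, shape (120, 4)
    feature_names=iris.feature_names,    # list of the 4 feature names
    class_names=iris.target_names,       # array of the 3 target labels, shape (3,)
    discretize_continuous=True)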

After he creates the explainer, he uses it to explain any instance he wants to understand. explain_instance takes in:
1) One row of the test dataset. It is a numpy array with the shape of (no. of columns,), in this case (4,) since there are 4 features.
2) rf.predict_proba, the function from the model that returns the probability for each of the classes / labels. Note that if we run predict_proba on a single row of the dataset, we have to reshape it first, and it returns a shape of (1, no. of classes), which in this case is (1, 3). rf.predict_proba(test[19]) will return an error; we need to use rf.predict_proba(test[19].reshape(1, -1)). Alternatively, if we run rf.predict_proba(test), it works and returns a shape of (no. of rows, no. of classes), which in this case is (30, 3). This is shown in the sketch below.
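Continuing the sketch above, this is roughly how the shapes behave (an illustration, not marcotcr's exact notebook):

# One row of the test set has shape (4,); predict_proba expects a 2-D array.
rf.predict_proba(test[19].reshape(1, -1))   # works, returns shape (1, 3)
rf.predict_proba(test)                      # works, returns shape (30, 3)

# explain_instance takes the 1-D row plus the probability function directly.
exp = explainer.explain_instance(test[19], rf.predict_proba, num_features=4)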

It is very important to understand this basic information, as we will need to create a wrapper to load our data and model.

My data also had categorical features, which meant that I had to encode the data accordingly through my transformations. The categorical-feature example code used by marcotcr has since been deprecated, so it won't work as-is. But understanding what happens in the code is sufficient to get yours to work similarly.

Basically, the categorical features are encoded as integers, and the explainer takes in a dictionary whose keys are the indexes of the categorical columns and whose values are numpy arrays of the distinct categories for those columns.
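As an illustration of the expected layout (the column indexes and category values here are made up):

import numpy as np

categorical_features = [0, 2]                        # indexes of the categorical columns
categorical_names = {0: np.array(['a', 'b', 'c']),   # distinct categories of column 0
                     2: np.array(['x', 'y'])}        # distinct categories of column 2

# Note: the training array itself must hold the integer codes (0, 1, 2, ...)
# that index into these arrays, not the raw category strings.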

Running LIME

Let's cover the basics of how the code will be run.

I will use a wrapper module, LIME_helper.py, which contains a class named XXX for simplicity's sake. The row identifier that we want to explain is 'rowABC'. We will prepare the information above to be fed into the LIME explainer.

from LIME_helper import XXX

import lime
import lime.lime_tabular
import numpy as np

# For viewing purposes; used by the lime module's visualizations.
import matplotlib.pyplot as plt

# Pyspark related initiations not included here. SparkSession is initiated as spark.

xxx = XXX(spark)
class_array = np.array(['unlikely', 'likely'])
col_name_array, values_array = xxx.data_array('rowABC')
training_data_array = xxx.load_train_data_array('train_data_chunk_001.npy')
output_array = xxx.predict_proba(values_array)

The training_data_array is a numpy array extracted from the pyspark dataframe using a sampling fraction. This is because the pyspark dataframe was too big and would cause memory issues. To save processing time, we saved the converted training data chunk so it can be loaded every time we want to run the explainer. We will go into more detail on that later.
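Creating that chunk is a one-off step. A sketch of how it can be done with the wrapper's train_data_array method (the fraction value here is just an illustration):

# Sample a fraction of the Spark training data, convert it to numpy, and save it
# locally so the explainer can load it quickly on subsequent runs.
td_array = xxx.train_data_array(fractiontoextract=0.01,
                                save_path='train_data_chunk_001.npy')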

# Loading the LIME Explainer
tab_explainer = lime.lime_tabular.LimeTabularExplainer(training_data_array,
                                                       mode='classification',
                                                       feature_names=col_name_array,
                                                       class_names=class_array,
                                                       categorical_features=xxx.list_cat_feat,
                                                       categorical_names=xxx.dict_cat_names,
                                                       discretize_continuous=True,
                                                       kernel_width=None)

# Running the explain_instance on the individual instance we want to understand.
exp = tab_explainer.explain_instance(values_array, xxx.predict_proba,
                                     num_features=10, top_labels=2)

# Need to ensure matplotlib is loaded, or nothing will show.
exp.show_in_notebook(show_table=True, show_all=False)

For explain_instance(), I recommend that top_labels include all the possible classes / labels. This helps with the mapping of the features later. If we only put top_labels=1 and there are 3 classes, it will only return the feature importance for the topmost label (i.e. 1 class).
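As a quick sanity check, the explanation object for classification exposes an available_labels() method, which shows which labels were actually explained:

# With top_labels=1 and multiple classes, only the highest-probability class
# gets an explanation; with top_labels=2 both of our labels do.
print(exp.available_labels())   # e.g. [1, 0]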

For exp.as_list(), the list is as long as the num_features you set in explain_instance.

# Returns a list of the feature names and conditions and their importance.
exp.as_list()

# Output as below.
[('feature_name <= 0.00', 0.3821213), ('feature_name_2 <= condition', 0.12131)]

For exp.as_map(), the dictionary will contain only 1 key if top_labels=1; if top_labels=2, it will contain 2 keys. Each key holds a list of tuples of (index of the column, weightage value). We would normally use this function, then change the index values to the actual column names for easy interpretability, as shown in the sketch after the output below.

# Returns a clean dictionary of the label, the index of the column, and the importance.
exp.as_map()

# Output as below.
{1: [(34, 0.31313),
     (131, 0.21313),
     (1313, 0.012314)]}
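A short sketch of that index-to-name mapping, using the col_name_array we prepared earlier:

# Replace the column indexes from as_map() with actual column names for label 1.
label = 1
readable = [(col_name_array[idx], weight) for idx, weight in exp.as_map()[label]]
# e.g. [('name_of_col_34', 0.31313), ('name_of_col_131', 0.21313), ...]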

Now that we have covered how the code runs, let's go into the wrapper module, which configures the data and model.

Wrapper Module

As the actual wrapper module I created is specific to the data and model being used, I will just share some of the issues I faced and the transformations I did.

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, DateType, StringType
from pyspark.sql import Window
from pprint import pprint # Just for me to visualize during the construction phase.
from datetime import datetime, timedelta
import json

import os
import pandas as pd
import numpy as np
import copy
import sys

class XXX:
    def __init__(self, spark):
        self.model_metadata_path = 'xxxxxxx'
        self.model_object_path = 'xxxxxxx'
        self.spark = spark
        self.col_name_array = None
        self.values_array = None

        # dict_idx is a dictionary of categorical column name / index number pairs.
        self.dict_idx = {'col_0': 0, 'col_200': 200, 'col_400': 400}
        # dict_col_name is dict_idx with the keys/values interchanged.
        # Might be unnecessary, but I did it for clarity while developing.
        self.dict_col_name = {0: 'col_0', 200: 'col_200', 400: 'col_400'}
        # list_cat_feat is a list of categorical column indexes.
        self.list_cat_feat = [0, 200, 400]

        # dict_cat_names maps each categorical column index to its distinct values.
        # Can be done in a simplified manner, but I did this for clarity.
        self.dict_cat_names = {0: np.array(['a', 'b', 'c']),
                               200: np.array(['d', 'e', 'f']),
                               400: np.array(['g', 'h', 'i'])}

    def exec_prepare_data_for_apply(self, df_merged_data):
        '''Featurize the data.'''
        # featurize_df() won't be shared as it is project specific and not relevant.
        # It is just a function that featurizes the data before putting it into the model to transform.
        df_data_featurized = self.featurize_df(df_merged_data)
        keep_cols = ['features', 'rowIdentifier']
        df_data_compact = df_data_featurized.select(*keep_cols)

        return df_data_compact

    def main(self, sample_data):
        '''Load the sample data into the model, which creates a prediction.'''
        trained_model = ...  # Code to load your model here.
        df_data_compact = self.exec_prepare_data_for_apply(sample_data)

        # This can be any code that loads the data into the model and creates a prediction.
        df_final_score = trained_model.transform(df_data_compact)

        # Any code to transform or play around with the threshold / prediction should
        # be done here too.

        # I transformed to ensure the 'final_probability_1' and 'final_probability_0' columns
        # were present here, to construct my predict_proba output later.

        return df_final_score

    def predict_proba(self, val_array):
        '''
        1) Takes in a value array (no. of columns,) or value array (no. of rows, no. of columns),
           and returns a probability array (no. of rows, 2).
        2) Takes in arrays with encoded categorical values that need to be changed back.
        '''
        # Assume the number of columns is 2500.
        if val_array.shape == (2500, 1) or val_array.shape == (2500,):
            temp_val_array = val_array.reshape(1, 2500)
        else:
            temp_val_array = val_array

        temp_val_array = self.ori_cat(temp_val_array, 'cattoori')
        pdf = pd.DataFrame(temp_val_array, columns=self.col_name_array)

        df = self.spark.createDataFrame(pdf)
        df_score = self.main(df)
        array = df_score.select('final_probability_0', 'final_probability_1').toPandas().to_numpy()

        return array

    def read_data(self):
        '''Load the data from storage.'''
        path_to_data = 'yyyyyyyyy'
        merged_data = ...  # Input code to load your data from your storage here.

        return merged_data

    def sample_data(self, row_id):
        '''Load the sample data from the data loaded from storage.'''
        merged_data = self.read_data()
        if isinstance(row_id, str):
            df = merged_data.filter(F.col('rowIdentifier') == row_id)
        else:
            df = merged_data.filter(F.col('rowIdentifier').isin(row_id))

        df.persist()

        return df

    def data_array(self, row_id):
        '''
        Returns the feature array and value array based on row_id.
        row_id can be a string, or a list of row_ids.
        '''
        sample_df = self.sample_data(row_id)
        sample_df_col_array = np.array(sample_df.columns)  # Returns array shape (2500,)

        sample_df_val_array = self.convert_to_string(sample_df).toPandas().to_numpy()

        # Condition to check for a list or an individual row_id.
        if len(sample_df_val_array) == 1:
            sample_df_val_array = self.ori_cat(sample_df_val_array, mode='oritocat').reshape(2500,)
        else:
            sample_df_val_array = self.ori_cat(sample_df_val_array, mode='oritocat')

        self.col_name_array = sample_df_col_array
        self.values_array = sample_df_val_array

        return sample_df_col_array, sample_df_val_array

    def convert_to_string(self, df):
        '''Converts DateType columns to String to prevent toPandas() errors.'''
        target_types = [DateType()]
        for col_name in df.columns:
            col_type = df.schema[col_name].dataType
            if col_type in target_types:
                df = df.withColumn(col_name, df[col_name].cast(StringType()))

        return df

    def dict_creator_big(self):
        '''Returns a conversion dict keyed by column name plus a direction suffix.'''
        # dictoflists contains keys which are the categorical columns.
        # The values are lists of the distinct values for those columns.
        # You could write code to extract this; I am displaying it here for clarity.
        dictoflists = {'col_0': ['a', 'b', 'c'],
                       'col_200': ['d', 'e', 'f'],
                       'col_400': ['g', 'h', 'i']}
        suffix1 = '_oritocat'
        suffix2 = '_cattoori'
        conversion_dict = {}

        def dict_creator(listofitems):
            temp_dict = {}
            temp_dict_opp = {}
            for x in range(len(listofitems)):
                temp_dict[listofitems[x]] = x
                temp_dict_opp[x] = listofitems[x]
            return temp_dict, temp_dict_opp

        for col_name, distinct_list in dictoflists.items():
            oritocat_dict, cattoori_dict = dict_creator(distinct_list)
            conversion_dict[f'{col_name}{suffix1}'] = oritocat_dict
            conversion_dict[f'{col_name}{suffix2}'] = cattoori_dict

        return conversion_dict

    def create_dict_cat_names(self, makenew=False):
        '''Creates the dict for LIME's categorical_names argument.'''
        if not makenew:
            if self.dict_cat_names is not None:
                print('Using existing dict_cat_names')
                return self.dict_cat_names
            else:
                print('self.dict_cat_names is None')

        newdict = {}
        for col_name, idx in self.dict_idx.items():
            conv_dict = self.dict_creator_big()[f'{col_name}_cattoori']
            temp_le_label_list = []
            for x in range(len(conv_dict)):
                temp_le_label_list.append(conv_dict[x])
            newdict[idx] = np.array(temp_le_label_list)

        self.dict_cat_names = newdict
        return newdict

    def ori_cat(self, array, mode='oritocat'):
        '''Two modes: oritocat or cattoori.'''
        # The below code is to prevent type conversion errors.
        # The numpy object dtype can hold str / int etc.
        if array.dtype == object:
            newarray = copy.deepcopy(array)
        else:
            newarray = np.array(array, dtype=object)

        b = self.dict_creator_big()
        for col_name, idx in self.dict_idx.items():
            for row in newarray:
                ori = row[idx]
                cat = b[f'{col_name}_{mode}'][ori]
                row[idx] = cat

        return newarray

    def train_data_array(self, fractiontoextract: float, save_path: str):
        '''Converts the training data to a numpy array and saves it locally.'''
        merged_data = self.read_data()
        sampled_train_data = merged_data.sample(fraction=fractiontoextract, seed=88)
        td_array = self.convert_to_string(sampled_train_data).toPandas().to_numpy()
        np.save(save_path, td_array)

        return td_array

    def load_train_data_array(self, array_path: str):
        '''Returns the loaded numpy train data with categorical values converted to integers.'''
        td_array = np.load(array_path, allow_pickle=True)
        newarray = self.ori_cat(td_array, mode='oritocat')

        return newarray

The wrapper module is actually pretty simple. The slightly more complicated part is adjusting your code to match the label encoding method used by marcotcr's code.
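Concretely, for 'col_0' with the categories 'a', 'b' and 'c', dict_creator_big() builds both directions of the mapping:

conversion_dict = {
    'col_0_oritocat': {'a': 0, 'b': 1, 'c': 2},   # original value -> integer code
    'col_0_cattoori': {0: 'a', 1: 'b', 2: 'c'},   # integer code -> original value
}
# ori_cat(..., mode='oritocat') encodes rows before they reach LIME, while
# predict_proba uses mode='cattoori' to decode them before rebuilding the Spark DataFrame.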

The above is an intuition and a first draft that works. The code could probably be refined to be much simpler and more adaptable to other pyspark dataframes.

Comment on MMLSpark.TabularLIME

I tried the mmlspark.explainers.TabularLIME package, and despite it being relatively simple to fit and transform, it lacked the flexibility that marcotcr's code has. For instance, with marcotcr's LIME I could create a wrapper around the model and output a modified probability based on some algorithm, whereas the mmlspark version doesn't allow that.

I also had issues because my model used SparseVectors as features, which mmlspark.TabularLIME could not handle, so I had to change them to DenseVectors. There are also far fewer examples of how to execute mmlspark.TabularLIME.

# Use this code snippet to convert a SparseVector column to DenseVector.
import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf

# Create a udf to convert SparseVectors to DenseVectors.
def sparsetodense(vector):
    return Vectors.dense(vector.toArray())

sparsetodense_udf = udf(sparsetodense, VectorUDT())

# Assume the featurized dataset contains a column called 'features' which is a SparseVector.
df_featurized = df_featurized.withColumn('dense_vector', sparsetodense_udf(F.col('features')))\
                             .drop('features')\
                             .withColumnRenamed('dense_vector', 'features')

Wrap Up

We have reached the end of my exploration of LIME. I think there is still much more to be developed in this area compared to the progress made on SHAP. I look forward to learning more as XAI progresses.
