Ensuring fairness and explainability in credit default risk modeling (SHAP, IBM Toolkit AIX 360, AIF 360)

(Tony) Junhong Xu
24 min read · Dec 14, 2019

Catherine Yu Miao, Karan Palsani, Michael Sparkman, Jenny Tseng, Junhong (Tony) Xu

Code used for this project can be found in this GitHub link

The current black-box situation of the loan application process. Credit: singularityhub.com

Financial institutions are increasingly focusing on Machine Learning to facilitate their various decision-making processes. A well-known example is to assess consumers’ default risks with predictive models, which in turn help determine whether to issue loans. These models generally have no answer to the question: “If rejected, then why?” This is the black box the image above is referring to and the question we would like to address.

Introduction and Background

Many people struggle to get loans due to a variety of unknown reasons. Quite often, they are left wondering what went wrong. Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this under-served population has a positive loan experience, Home Credit makes use of a variety of alternative data — including telco and transactional information — to predict their clients’ repayment abilities. The Home Credit dataset was provided in a Kaggle competition.

Leveraging the data, our goal is to predict default instances while ensuring that, if a client is rejected, we can demonstrate the decision was fair and provide the underlying reasons. Specifically, we aim to apply FATML (Fairness, Accountability, Transparency in Machine Learning) principles in predicting home loan default propensity.

The line between approving and rejecting a loan may be fine for the person (or machine) making the decision, but it can make a world of difference for the applicant. Credit: HBR.org

Our approach incorporates mechanisms for ensuring fairness and explainability into the standard machine learning workflow:

  • Dataset preprocessing: correlation analysis, outlier detection, null value treatment, feature engineering, measuring the disparate impact of protected attributes (to ensure the training dataset is fair), and balancing the dataset
  • Training traditional ML models with hyperparameter tuning and threshold optimization: logistic regression, Random Forest, and XGBoost
  • Training explainable models: BRCG and GLRM (as advised by a distinguished data scientist at IBM)
  • Explaining traditional ML models: SHAP with Random Forest and XGBoost

Note: While there are three ways to ensure fairness — pre-processing, in-processing, and post-processing — we mainly relied on pre-processing while examining the fairness of our models. As can be seen later, our dataset does not actually need much adjustment to ensure fairness.

Data Description

The Kaggle competition provides seven data tables containing home loan applicant information. We leveraged six of them for our modeling efforts after considering the usefulness of each table:

  • Application (308k x 122): Main application table containing data for each applicant
  • Bureau (1.72m x 17): Previous credits with other financial institutions for select applicants
  • Credit_card_balance (3.84m x 23): Monthly balances of previous credit cards for select applicants
  • Installments_payments (13.6m x 8): Repayment history for previous Home Credit loans for select applicants
  • POS_CASH_balance (10m x 8): Monthly balances of previous Home Credit point of sales (POS) and cash loans for select applicants
  • Previous_application (1.67m x 37): All previous applications for Home Credit loans, if any, for every applicant

After examination and summarization of each supplemental table, we merged them with the main Application table to obtain an intermediate dataset with 307,511 rows x 138 columns (prior to codifying categorical columns).

Data Pre-Processing and Exploration

To start out, we conducted EDA on the main application table:

  • Identified the potentially discriminatory features, i.e., protected attributes, in the dataset as age (“DAYS_BIRTH”), gender (“CODE_GENDER”), and marital status (“NAME_FAMILY_STATUS”), based on US laws
  • Detected outliers with the .describe() function and histogram plotting, treated as deemed appropriate
  • Handled missing values as appropriate
  • First, we identified the missing value percentage in each column:
Missing value percentages on training data
  • We then analyzed the correlation between each pair of columns to identify potential columns (those with high percentages of NAs and high correlation with columns that do not have many missing values) to drop
  • Finally, we inspected the remaining columns to determine the optimal way to fill the missing values of each column (including mean, median, mode, 0’s, and “special cases” for categorical columns that required research); the example below demonstrates how we filled the NAs for special cases
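As an illustration, here is a minimal sketch of this missing-value treatment in pandas, assuming the main application table is loaded as app; the specific columns (AMT_ANNUITY, NAME_TYPE_SUITE, OWN_CAR_AGE) and fill choices are illustrative assumptions, not necessarily the exact ones we used.

import numpy as np
import pandas as pd

# Percentage of missing values per column, used to flag columns to drop or fill
missing_pct = app.isnull().mean().sort_values(ascending=False) * 100
print(missing_pct.head(10))

# Numeric column: fill with the median to stay robust to outliers
app['AMT_ANNUITY'] = app['AMT_ANNUITY'].fillna(app['AMT_ANNUITY'].median())

# Categorical column: fill with the mode
app['NAME_TYPE_SUITE'] = app['NAME_TYPE_SUITE'].fillna(app['NAME_TYPE_SUITE'].mode()[0])

# "Special case": applicants without a car have no car age, so 0 is a sensible fill
app.loc[app['FLAG_OWN_CAR'] == 'N', 'OWN_CAR_AGE'] = 0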

Special Feature Engineering

We then conducted EDA on each of the supplemental tables in a similar manner. Additionally, we identified the useful information in each table, summarized it appropriately, and then merged it with the main table. As an example, from the installments_payments table, we determined that the percentage each applicant still owes relative to his or her total installment amount is important. As such:

  • We grouped the table by the applicant ID (“SK_ID_CURR”) and calculated the necessary aggregated information
  • We then used the aggregated information to calculate each applicant’s percentage owed (a sketch of both steps follows below)
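A minimal sketch of this aggregation, assuming the installments table is loaded as inst and the main table as app; the PERCENT_OWED name and the exact aggregation are illustrative.

import pandas as pd

# Aggregate installment and payment amounts per applicant
inst_agg = (inst.groupby('SK_ID_CURR')
                .agg(total_installment=('AMT_INSTALMENT', 'sum'),
                     total_payment=('AMT_PAYMENT', 'sum'))
                .reset_index())

# Percentage still owed relative to the total installment amount
inst_agg['PERCENT_OWED'] = ((inst_agg['total_installment'] - inst_agg['total_payment'])
                            / inst_agg['total_installment'])

# Merge the summarized feature back onto the main application table
app = app.merge(inst_agg[['SK_ID_CURR', 'PERCENT_OWED']], on='SK_ID_CURR', how='left')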

After analyzing and summarizing the five supplemental tables, we merged them with the main table for a combined total of 258 predictors. For any applicants who did not have information in the supplemental tables, we filled their null values with zeros. We believed this was a sensible approach: for example, if an applicant had no credit card information, they likely never held a credit card and therefore should have no “days past due” with respect to credit card balances.

Next, we checked whether bias exists in the three protected attributes of the training set, as unfair data would likely lead to unfair models. To expand on an earlier point, for our default risk modeling efforts, being fair means our models cannot systematically indicate an applicant would default based on the following attributes:

  • CODE_GENDER (gender of the applicant; female or male)
  • DAYS_BIRTH (how many days before the application the applicant was born; i.e., a proxy for age)
  • NAME_FAMILY_STATUS (marital status of the applicant; single, civilly married, married, separated, or widowed)

We leveraged IBM’s AIF 360 tool to check for potential biases. To help verify that our dataset does not discriminate based on the above attributes, the tool offers various fairness metrics. Regardless of the metric chosen, a protected attribute needs to be identified and converted into a binary variable, and the privileged class needs to be specified. In our case, we determined the privileged and unprivileged groups for each protected attribute based on the percentage of unfavorable outcomes (i.e., defaults) in the training set: the group with the lower percentage is deemed privileged, the other unprivileged. The breakdown is as follows:

Following IBM’s recommendation, we determined each attribute’s disparate impact, i.e., the probability of a favorable outcome for unprivileged instances divided by the probability of a favorable outcome for privileged instances, after binarizing the attributes. Below is a code sample for one of our calculations:
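A minimal sketch of this calculation for CODE_GENDER, assuming AIF360's BinaryLabelDataset and BinaryLabelDatasetMetric and a numerically encoded training frame train_df; treating the gender coded 1 as the privileged group is an assumption for illustration.

from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Wrap the training frame; TARGET = 1 means default, i.e., the unfavorable outcome
bld = BinaryLabelDataset(df=train_df,
                         label_names=['TARGET'],
                         protected_attribute_names=['CODE_GENDER'],
                         favorable_label=0,
                         unfavorable_label=1)

# Privileged group: the gender with the lower default rate in the training data
metric = BinaryLabelDatasetMetric(bld,
                                  privileged_groups=[{'CODE_GENDER': 1}],
                                  unprivileged_groups=[{'CODE_GENDER': 0}])
print('Disparate impact:', metric.disparate_impact())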

As a quick note, the ideal value of disparate impact is one, which indicates perfect fairness between the two groups. A value less than one implies a higher benefit for the privileged group, while a value greater than one implies a higher benefit for the unprivileged group. The convention is that values between 0.8 and 1.2 indicate fairness. The three attributes have the following measures:

  • CODE_GENDER: 0.9663
  • DAYS_BIRTH: 0.9573
  • NAME_FAMILY_STATUS: 0.9924

As one can see, the disparity between the privileged and the unprivileged group is not severe for any of the attributes. This indicates that the data for our subsequent explainable modeling is fair, and no pre-processing of data (with respect to fairness, e.g., reweighing) needs to be done.

As the last preprocessing step, we prepared a SMOTED (SMOTE: Synthetic Minority Over-sampling TEchnique) dataset. The Kaggle dataset has an imbalanced target class: 92% of the applicants did not default. While we planned to try varying thresholds for each classifier, we prepared a SMOTED dataset as an alternative way to combat the imbalanced classes. Below is the code snippet to prepare the SMOTED dataset:
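A minimal sketch using imbalanced-learn's SMOTE, where X_train and y_train denote the preprocessed training split:

import pandas as pd
from imblearn.over_sampling import SMOTE

# Oversample the minority (default) class so the two classes are balanced
sm = SMOTE(random_state=42)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)

print(pd.Series(y_train).value_counts(normalize=True))     # roughly 92% / 8% before
print(pd.Series(y_train_sm).value_counts(normalize=True))  # 50% / 50% after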

Learning and Modeling

We built various models, both traditional ML models and those in IBM’s AIX360 toolkit, to compare AUROC and explainability. The models we chose to build, and the rationale for each, are:

  • Logistic Regression: Efficient and explainable model that offers a baseline AUROC to compare with
  • Random Forest: Traditional performant model that needs to be supplemented with tools such as SHAP for explainability
  • XGBoost: Modern performant model that needs to be supplemented with tools for explainability
  • BRCG: An IBM explainable classifier providing succinct, explainable rules for predictions
  • GLRM: An IBM explainable classifier providing succinct, explainable rules with coefficients for predictions

As mentioned earlier, both BRCG and GLRM were recommended for our specific use case by a distinguished data scientist at IBM.

To note, given the limitations of our PCs’ computing power, we were unable to train XGBoost and the two IBM classifiers (the latter because all columns must be binarized) on all 258 features. Therefore, we used a Random Forest model to identify the top 20 features by feature importance for those three models. We experimented with different numbers of features (up to 50) and discerned no significant improvement in the logistic regression or Random Forest AUROCs or F1 scores beyond 20 features. Next, we dive into parameter selection for the Random Forest and XGBoost models.
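Before doing so, here is a minimal sketch of the Random Forest-based feature selection described above (variable names such as X_train and top20 are illustrative):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a Random Forest on all 258 predictors to rank the features
rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

# Keep the 20 features with the highest impurity-based importance
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
top20 = importances.sort_values(ascending=False).head(20).index.tolist()

X_train_top20 = X_train[top20]
X_test_top20 = X_test[top20]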

Given that we have an imbalanced dataset, we also leveraged the class_weight parameter of Logistic Regression and Random Forest to see whether model performance improves relative to the unbalanced versions.

Lastly, we trained two logistic regression models and two random forest models (one with all features, the other with the top 20 features) with the SMOTED dataset to see which balancing method would produce better results. Also importantly, the two IBM classifiers could only produce meaningful results with the SMOTED dataset, so these models provide a benchmark for comparing the IBM tools.

Parameter Tuning

Since the primary focus of our project is not on obtaining the best prediction performance but rather on exploring the explainability and fairness of a model, we did not devote a great deal of effort to optimizing parameters. Instead, we opted for parameters that give us reasonable performance.

For the Random Forest models, the only parameter we tuned was the number of trees (n_estimators). Increasing the number of trees from the default (100) to 200 yielded a minor improvement; for example, the AUROC increased from 0.69 to 0.74 for the all-feature, unbalanced Random Forest model. Beyond 200 trees, training-set metrics continued to improve, but we saw no improvement in the test-set metrics. Since we did not want to overfit the model or add unnecessary training time, we used 200 trees for all Random Forest models.

Parameter grid and best estimator for the XGBoost model

For the XGBoost model, we trained on the top 20 features as mentioned. The untuned XGBoost model performed significantly worse than the all-feature, unbalanced Random Forest model, which prompted us to search for better parameters. Since training XGBoost models is time-consuming, we used the RandomizedSearchCV() function to randomly select a number of parameter combinations from the parameter grid above (left). The parameters selected by three-fold cross-validation are also shown above (right). This set of parameters gives us an AUROC of 0.75, the highest of all our models. Since the parameters are cross-validated, the chances of the model over- or under-fitting are reduced.
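A minimal sketch of that randomized search; the parameter ranges below are illustrative rather than our exact grid.

from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_grid = {
    'n_estimators': [100, 200, 400],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# Randomly sample parameter combinations, scored by AUROC with three-fold cross-validation
search = RandomizedSearchCV(XGBClassifier(objective='binary:logistic', n_jobs=-1),
                            param_distributions=param_grid,
                            n_iter=20, scoring='roc_auc', cv=3,
                            random_state=42, verbose=1)
search.fit(X_train_top20, y_train)
print(search.best_params_, search.best_score_)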

After training the logistic regression, Random Forest, and XGBoost models, we also estimated the optimal threshold for each model to achieve the best possible F1 scores. The code snippet below illustrates the function we used for said selection. In general, the unbalanced models needed updated thresholds (i.e., not 0.5). Surprisingly, the Random Forest-balanced and SMOTED models also required new thresholds. For all unbalanced models (logistic regression, Random Forest, and XGBoost) and the balanced Random Forest models, the optimal thresholds were between 0.08 and 0.10, similar to the 8% of defaulted applicants. The SMOTED Random Forest models optimized around a threshold of 0.2.

Code for selecting the optimal threshold
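A minimal sketch of such an F1-optimizing threshold search, assuming scikit-learn's precision_recall_curve and a fitted classifier; the Random Forest rf and the test split from the earlier sketches are used here purely for illustration.

import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(clf, X_valid, y_valid):
    """Return the probability threshold that maximizes F1 on a validation set."""
    probs = clf.predict_proba(X_valid)[:, 1]
    precision, recall, thresholds = precision_recall_curve(y_valid, probs)
    # precision/recall have one more entry than thresholds; drop the last point
    f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
    return thresholds[np.argmax(f1)], f1.max()

threshold, f1 = best_f1_threshold(rf, X_test, y_test)
print('Optimal threshold: %.2f (F1 = %.3f)' % (threshold, f1))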

We next dive into our modeling efforts for the two IBM models, BRCG and GLRM, and describe their results in the sections that follow.

BRCG (Boolean Rules Column Generation)

As alluded to earlier, we used the top 20 features selected by the Random Forest model and then applied SMOTE to the resulting dataset to fix our class imbalance problem. While the unbalanced traditional models were able to perform well, without the SMOTED dataset the IBM models predicted everything as the dominant class (i.e., not default) and produced no rulesets. We theorized this might be because the IBM tools rely on accuracy as their cost function.

We started by binarizing our columns using IBM’s binarizer tool. These binarized features were then passed to the BRCG function, which takes several parameters, including:

  • A restriction λ on how long each rule can be
  • A restriction λ on how many rules can be in the ruleset
  • Whether to use disjunctive normal form (DNF) or conjunctive normal form (CNF) rules

We chose to stick with the default restriction parameters and used disjunctive normal form rules, as advised in the IBM AIX360 GitHub documentation.

Left: FeatureBinarizer code; Right: BRCG goes through several iterations of solving the subproblem
The output is a very simple two-rule ruleset
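A minimal sketch of the binarize-and-fit step, assuming AIX360's FeatureBinarizer and BooleanRuleCG, where X_train_sm_top20 / y_train_sm denote the SMOTED top-20 training split and X_test_top20 the corresponding test features:

from aix360.algorithms.rbm import BooleanRuleCG, FeatureBinarizer

# Binarize the features into threshold/category indicator columns (with negations)
fb = FeatureBinarizer(negations=True)
X_train_bin = fb.fit_transform(X_train_sm_top20)
X_test_bin = fb.transform(X_test_top20)

# Fit a DNF ruleset: predict default if ANY learned rule fires (default λ restrictions)
brcg = BooleanRuleCG(CNF=False)
brcg.fit(X_train_bin, y_train_sm)

print(brcg.explain())              # the learned ruleset
y_pred = brcg.predict(X_test_bin)  # hard 0/1 predictions only, no probabilities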

The way one would read the ruleset for our problem is: “If the first rule or the second rule is true, then classify as high risk for default.” It is clear how explainable the ruleset is. One drawback of BRCG is that it only produces predictions, not posterior probabilities, making it difficult to compare with other models for further tuning.

GLRM (Generalized Linear Rule Model)

Once again, we started by binarizing our features, as GLRM also accepts only binarized features. We then fit the GLRM with 2,000 iterations and the default parameters for the rule penalization. We were able to calculate many different test metrics but decided to focus on AUROC and F1 scores to keep the models comparable.

Next, we examined the rules and coefficients generated by the model. As shown below, the GLRM came up with four rules, already ordered by importance based on the magnitudes of their coefficients. Two of them, rules 2 and 3, are first-degree rules that we can use to create GAM (Generalized Additive Model) visuals; the remaining rules are of higher degree. Interestingly, once again only EXT_SOURCE_2 and EXT_SOURCE_3 were used, with no ordinal features considered (an ordinal feature would appear as a bare feature name with no inequality sign and can be interpreted like an ordinary logistic regression variable). This implies that a reasonable model can be built using rules based solely on EXT_SOURCE_2 and EXT_SOURCE_3. From the visuals below, we can see that the model puts a lot of weight on an EXT_SOURCE_3 value below 0.51 and an EXT_SOURCE_2 value below 0.55 when predicting that an individual is likely to default.

Left: GLRM Code; Right: GLRM ruleset with coefficients
GAM visuals showing EXT_SOURCE_3 and EXT_SOURCE_2
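A minimal sketch of the GLRM fit, assuming AIX360's LogisticRuleRegression and reusing the binarized features from the BRCG sketch above; the penalty values and the availability of predict_proba are assumptions for illustration.

from aix360.algorithms.rbm import LogisticRuleRegression

# A logistic model over column-generated conjunctive rules
glrm = LogisticRuleRegression(lambda0=0.005, lambda1=0.001)
glrm.fit(X_train_bin, y_train_sm)

print(glrm.explain())                   # rules with their coefficients
probs = glrm.predict_proba(X_test_bin)  # posterior probabilities (needed for AUROC)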

For those interested in more details on BRCG and GLRM, we included a bonus section below the reference list at the bottom of this blog post. We also provide reference links to IBM’s documentation in the corresponding section.

Results

Model Performance

The table below shows the AUROC and the F1 score each model obtained. The cells highlighted in light green on each row indicate the best performance for the specific model, with the number of features in parentheses. The XGBoost model trained on the top 20 features (highlighted in dark green) yields the best overall result.

Model performances with different feature sets

While BRCG and GLRM are not the most performant, their F1 scores are comparable. Additionally, as explained above, both are highly explainable. Moreover, with External Sources 2 and 3 being their most important features, we can even argue they are fair models: as can be seen from the table below, neither of the external sources is correlated with gender or marital status, and both are only weakly correlated with age.

Table of correlations between external sources 2 & 3 and the protected attributes

While the two IBM tools achieve fairness and explainability, we would like to further explore whether we can attain the same objectives with Random Forest and XGBoost, both of which outperform the IBM models in terms of AUROC and/or F1 scores, using SHAP.

SHAP

Feature Importance comparison

Feature importance of Random Forest (left) and XGBoost (right) using the top 20 variables

The original feature importance of the Random Forest model (above left) tells us that the most important features in our model are EXT_SOURCE_2, followed by EXT_SOURCE_3 and DAYS_BIRTH. When we look at the same plot for the XGBoost model (above right), the ranking becomes EXT_SOURCE_3, EXT_SOURCE_2, EXT_SOURCE_1. Based on the feature importance plots, all the external source variables seem to be important for both models, except for EXT_SOURCE_1 in the Random Forest model, where it is ranked only 19th.

The default importance measures differ between the two libraries: XGBoost’s importance plot defaults to “weight,” which counts how many times a feature is used to split across all the trees in the ensemble, while scikit-learn’s Random Forest reports impurity-based (Gini) importance averaged over the trees.

Feature importance (SHAP) of Random Forest (left) and XGBoost (right) using the top 20 features

Above is the “feature importance” plot for the two models using the SHAP package. In these two plots, features are ranked by the mean of their absolute SHAP values. Looking at the top three features for the Random Forest model, we can immediately see the difference from the original feature importance plots that come with the sklearn package. For the Random Forest model, the top two variables are EXT_SOURCE_2 and EXT_SOURCE_3, the same as in the original plot; however, the ranking changes dramatically after that. The third variable is now EXT_SOURCE_1, which was previously the 19th most important. The magnitudes for the less important variables changed too: the original plot shows a gradual decrease in importance, while with SHAP there is a large drop after the first two variables. Surprisingly, feature importance for the XGBoost model didn’t change much from the original plot to the SHAP plot.
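A minimal sketch of how such SHAP plots can be produced with the shap package, assuming xgb_model is the tuned XGBoost classifier from the earlier randomized search:

import shap

# Tuned XGBoost model from the randomized search sketched earlier
xgb_model = search.best_estimator_

# TreeExplainer supports tree ensembles such as Random Forest and XGBoost
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test_top20)

# Global "feature importance": mean absolute SHAP value per feature
# (for a sklearn RandomForestClassifier, shap_values is a list per class; use shap_values[1])
shap.summary_plot(shap_values, X_test_top20, plot_type='bar')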

Force plot for Applicant 32. Top: Random Forest. Bottom: XGBoost

One of the biggest advantages of using SHAP is that it offers local interpretability for complex ML models like Random Forest and XGBoost. Here we see an example for one of the applicants in our training set; both the Random Forest and XGBoost models predicted that this person would default. The length of the red and blue bars shows the magnitude of each feature’s contribution to the outcome. This type of local explanation has great value for financial companies such as Home Credit, since the criteria the model used to reject or accept the loan application are made visible. Companies can now offer their customers an explanation for loan rejections and help them improve their creditworthiness in the future.
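Continuing the sketch above, a local force-plot explanation for a single applicant; the index 32 mirrors the figure caption but is otherwise illustrative.

# Which feature values pushed applicant 32's prediction up (red) or down (blue)
i = 32
shap.force_plot(explainer.expected_value,
                shap_values[i, :],
                X_test_top20.iloc[i, :],
                matplotlib=True)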

Summary plot for Random Forest (left) and XGBoost (right) model

In the summary plot for SHAP values, there are a few important pieces of information shown:

  1. Feature value. The magnitude of each feature value is shown on a red-blue color spectrum: red indicates a larger feature value, and blue indicates a smaller one.
  2. SHAP value. The SHAP values associated with each feature are plotted in a violin plot, where each dot indicates one SHAP value. A positive SHAP value means a positive contribution to the predicted outcome (default on loan), and a negative SHAP value indicates a negative contribution.

Looking at the graph, some features appear to be monotonic. For instance, SHAP values for the external source variables increase as the feature values decrease, showing that the model predicts a higher risk of default when the external source variables are small, and vice versa. A similar monotonic trend can be observed for the DAYS_BIRTH variable in the Random Forest model: younger applicants generally have positive SHAP values and older applicants have negative SHAP values (note: the DAYS_BIRTH variable is inverted in our model; red means younger and blue means older). However, some features are not so monotonic. For the DAYS_BIRTH variable in the XGBoost model, the left tip of the violin plot has a few red dots, indicating that in those instances the applicants’ young age actually contributes to a prediction of not defaulting.

Dependency plots of DAYS_BIRTH and DAYS_EMPLOYED (RF: left; XGBoost: right)

To investigate further how the two models use the “DAYS_BIRTH” variable in their predictions, we also looked at interaction effects between age and other variables. In this example, we looked at the interaction between “DAYS_BIRTH” and “DAYS_EMPLOYED.” For the Random Forest model, even without the color coding, we can again see the monotonic trend observed in the summary plot: as age increases, the model predicts less risk, and vice versa. When we add the interaction layer, we can see that older people tend to have more total employment days, so it makes sense for the model to assume lower risk.

For the XGBoost model, the interaction effect is similar but more volatile. We can see an interesting inverted-U distribution of the SHAP values: the XGBoost model does not seem to like middle-aged applicants, as the SHAP value peaks around applicants who are roughly 41 years old. The inverted-U shape looks somewhat symmetrical, and the SHAP values at the two ends are both negative, indicating that younger and older applicants’ ages contribute similarly to the predicted outcomes. In terms of DAYS_EMPLOYED, the model again cautions against applicants with fewer employment days; notably, the XGBoost model identifies middle-aged applicants with short working histories as the riskiest group of customers.
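A minimal sketch of how these dependence plots can be produced from the same SHAP values:

# SHAP dependence plot: DAYS_BIRTH on the x-axis, points colored by DAYS_EMPLOYED
shap.dependence_plot('DAYS_BIRTH',
                     shap_values,
                     X_test_top20,
                     interaction_index='DAYS_EMPLOYED')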

From our analysis of the age variable (DAYS_BIRTH), we can see that both the Random Forest and XGBoost models display some discriminatory behavior based on age. The Random Forest model shows a monotonic trend of predicting higher risk as age decreases. The XGBoost model, on the other hand, does not show a monotonic trend but discriminates against middle-aged applicants.

So, which one is better? We obviously don’t want a monotonic trend that penalizes everyone who is younger and favors everyone who is older, because this would be blatant age discrimination. In addition, this type of monotonic trend also decreases the performance of the model, as some younger applicants are probably just as capable of repaying the loan as older applicants (this is evident in our model performance, as the XGBoost model has a higher AUROC than the Random Forest model). However, is our XGBoost model truly better? Is discrimination against middle-aged applicants better than uniform age-based discrimination?

Conclusion

In our efforts to leverage FATML in building explainable and fair models that predict credit default risk, we trained the following classifiers after ensuring our training data is fair against all protected attributes (Age, Gender, Marital Status): Logistic regression, Random Forest, XGBoost, BRCG, and GLRM.

Notably, the XGBoost model yields the best AUROC and F1 score. Moreover, using SHAP, we can bring interpretability to the decision on every application, which a loan officer can use when explaining a rejection to an applicant. SHAP also allows data scientists to oversee the model at a global level to identify potential biases. That said, the XGBoost model discriminates against middle-aged applicants; as such, our fairness objective is not fulfilled.

BRCG and GLRM, while having slightly lower AUROC (GLRM only) and F1 scores, are extremely explainable with only two and four rules, respectively. Importantly, with the two features they leverage, External Sources 2 and 3, the two models are also fair. Therefore, with regard to our two objectives, explainability and fairness, BRCG and GLRM are the two best models we trained (with GLRM slightly more preferable than BRCG, since it also offers posterior probabilities).

Lessons Learned

  • There is a tradeoff between how directly interpretable a model is vs. its performance: Logistic regression, BRCG, and GLRM are all interpretable models; and all had worse AUROCs and F1 scores compared to Random Forest and XGBoost
  • While Random Forest and XGBoost by themselves are unexplainable, post-hoc explainability tools like SHAP are very powerful in increasing their transparency
  • Calculating SHAP values can be extremely time-consuming for certain models (e.g., Random Forest), while astonishingly fast for others (e.g., XGBoost)
  • Ensuring fairness is a multi-step process: from ensuring the training data is fair, to training a fair classifier, to eventually adjusting predictions if necessary

Future Work

While we delved into fairness modeling as much as we could given the time allowed, we do feel that the scope for the fairness modeling could be expanded in the future. Specifically, we could try

  1. Reweighing the training set with respect to DAYS_BIRTH: While all three protected attributes had relatively fair outcomes in the training data, DAYS_BIRTH had the most skewed ratio between the unprivileged and the privileged classes. We would like to observe how reweighing the training set affects the subsequent modeling; specifically, would XGBoost continue to discriminate against middle-aged applicants?
  2. Training and tuning a fairness classifier using a model provided by AIF360, such as PrejudiceRemover. The snippet below shows our current attempt at training such a classifier. As can be seen from the metrics at the bottom, the model is extremely fair; that said, it has only 50% accuracy, indicating the need for tuning. As an FYI, metrics that help measure the fairness of a classifier’s results include statistical parity difference, equal opportunity difference, and average odds difference, among others. All three metrics have an ideal value of zero, with anything between -0.1 and 0.1 indicating fairness. To understand all the metrics and when to use which, see the guide provided by IBM.
  3. Changing uncertain XGBoost predictions by editing posteriors to satisfy fairness constraints. This would provide the best of both worlds (explainability and fairness) for XGBoost, which would then become our recommended model.
PrejudiceRemover (fairness classifier) modeling results
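A minimal sketch of that attempt, assuming AIF360 BinaryLabelDataset objects train_bld and test_bld built as in the earlier disparate-impact sketch; the eta value and the choice of CODE_GENDER as the single sensitive attribute are illustrative.

from aif360.algorithms.inprocessing import PrejudiceRemover
from aif360.metrics import ClassificationMetric

# In-processing fairness classifier; eta controls the strength of the fairness penalty
pr = PrejudiceRemover(sensitive_attr='CODE_GENDER', eta=25.0)
pr.fit(train_bld)
pred_bld = pr.predict(test_bld)

metric = ClassificationMetric(test_bld, pred_bld,
                              privileged_groups=[{'CODE_GENDER': 1}],
                              unprivileged_groups=[{'CODE_GENDER': 0}])
print('Accuracy:', metric.accuracy())
print('Statistical parity difference:', metric.statistical_parity_difference())
print('Equal opportunity difference:', metric.equal_opportunity_difference())
print('Average odds difference:', metric.average_odds_difference())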

For more of our experiences with AIF 360, see the Bonus section below the references.

If you’d like to take a detailed look into our team’s code, here’s the GitHub link. Please feel free to provide any feedback or comments.

Reference

[1] IBM AIX360

https://github.com/IBM/AIX360

[2] IBM AIF360

https://github.com/IBM/AIF360

[3] BRCG & GLRM

https://github.com/IBM/AIX360/blob/master/examples/tutorials/HELOC.ipynb

[4] Boolean Decision Rules via Column Generation

https://arxiv.org/pdf/1805.09901.pdf

[5] Boolean Decision Rules via Column Generation (NeurIPS 2018 slides)

https://nips.cc/media/Slides/nips/2018/517cd(06-09-45)-06-09-45-12722-Boolean_Decisio.pdf

[6] One Explanation Does Not Fit All: A Toolkit and Taxonomy of AI Explainability Techniques

https://arxiv.org/pdf/1909.03012.pdf

[7] An Introduction to Machine Learning Interpretability

https://pages.dataiku.com/hubfs/ML-interperatability.pdf

[8] Practical Techniques for Interpreting Machine Learning Models: Introductory Open Source Examples Using Python, H2O, and XGBoost

https://fatconference.org/static/tutorials/hall_interpretable18.pdf

[9] IBM AI Explainability and Fairness 360 Open Source Toolkits Explainability:

[10] Fairness:

[11] SHAP and LIME Python Libraries: Part 1 & 2

[12] SHAP Github

[13] Interpretable Machine Learning

[14] Hands-on Machine Learning Model Interpretation

https://towardsdatascience.com/explainable-artificial-intelligence-part-3-hands-on-machine-learning-model-interpretation-e8ebe5afc608

Bonus

Background on BRCG

Simple rulesets have often been overlooked in favor of more advanced machine learning techniques like logistic regression, decision trees, and even neural nets, due to their weak predictive power and time-consuming creation, which often requires going through every combination of features to find the optimal ruleset. However, with the rising need for directly explainable machine learning models, rulesets have been given another look, especially in IBM’s AIX360 toolkit. One such tool is the Boolean Rules Column Generation (BRCG) model, which makes use of disjunctive normal form and conjunctive normal form rules for binary classification.

As the name implies, BRCG uses column generation, an operations research technique for solving large linear programs, to overcome the problem of finding the optimal clauses for the ruleset. Column generation dates back to the 1950s and ’60s, having been used to solve scheduling problems and, more famously, the cutting stock problem. The algorithm starts with a small subset of the features and splits the problem in two. The first is the master problem, which is usually just the original problem; the second is the subproblem, which is used to find a new feature to add to the subset. Initially, the master problem is solved using the subset of features, and the resulting solution is used to generate the objective function of the subproblem. The subproblem is then solved over the features not in the subset. If a feature with negative reduced cost can be identified (typically the most negative), it is added to the subset and the master problem is re-solved. This cycle continues until no more features with negative reduced cost can be found, at which point the solution is optimal, or at least close to optimal. There are variations of this method, and IBM makes use of a heuristic beam search to help with column generation.

BRCG uses binarized variables that it conjoins to form clauses. These clauses are then used as features for column generation, and only those that reduce the cost the most are kept, yielding efficient and simple rulesets. Often, BRCG’s ruleset will be the most explainable and simple model when compared to more classical methods such as decision trees and logistic regression, while still being powerful enough not to sacrifice too much test performance, thanks to column generation’s near-optimal solutions. Below is a demonstration of how we used BRCG to create a very simple ruleset with F1 scores comparable to our other models.

Background on GLRM

If a more accurate model is needed for classification, IBM recommends using a Generalized Linear Rule Model (GLRM). Just as the name suggests, it combines the linear elements of a logistic regression with conjunctive rulesets that are created, once again, through column generation. This provides a model that can compete directly with the more classical classifiers in terms of accuracy while being simple enough to be interpreted directly with little explanation. The model can generate rules, plain coefficients on the features, or a combination of both; these combinations can yield more accurate models than BRCG, but at the expense of some explainability. Similar to BRCG, GLRM has two penalizing factors that can be used to restrict the length of the rules and the number of rules in the ruleset.

One interesting facet of GLRM is that if the rules generated are first-degree rules (i.e., they involve no interaction between different features), then the model can be visualized as a generalized additive model (GAM). These features can then be plotted to show how they affect the model’s output directly, adding a higher degree of explainability. Along the same lines, GLRM, like many other linear models, can also report feature importance, which, combined with the first-degree plots and coefficients, can help validate a practitioner’s underlying assumptions about their problem.

SHAP

Left: Table of actual and predicted processing time; Right: Time vs. Rows of SHAP values calculated

We want to note one obstacle we ran into when calculating SHAP values. When running the TreeExplainer algorithm in the Python SHAP package on our Random Forest model, we found the computation took a long time. Initially, we tried to calculate SHAP values for all the rows in our data (about 300,000), but the algorithm failed to return a result after four days of continuous run time. We were curious about the actual time it takes to calculate SHAP values, so we ran a few trials; the results are shown in the table and graph above. We found that it takes roughly 5.3 seconds to calculate SHAP values for each row, so calculating all 300,000 rows would take over 18 days (26,354 minutes)! However, calculating SHAP values for the XGBoost model takes significantly less time than for the Random Forest model (we were able to compute SHAP values for 100,000 rows within 10 minutes). At this moment, we don’t fully understand the reason behind the dramatic reduction in computing time, but for future projects it would be interesting to better understand the differences in SHAP processing time between different ML models, since computation time is crucial for real-life implementation.

Experience with AIF 360

While AIF 360 is a great tool for getting our feet wet with fairness modeling, we have some suggestions to improve it:

  • Adding the ability to include more than one protected attribute at a time (or better documenting this ability, if it exists): It appears that for both reweighing and PrejudiceRemover, only one protected attribute can be treated at a time. For reweighing, we were unable to identify a way to reweigh a dataset with all three protected features; for PrejudiceRemover, the constraint is mentioned in the classifier’s source code as of December 7, 2019. Given that our dataset included three protected features, if all three had high disparate impacts, three datasets would have to be prepared or three models trained to provide fair classification, which raises concerns about these techniques’ reliability and scalability.
  • Enabling protected attributes to be designated as continuous: Protected attributes currently need to be binarized, which is particularly problematic for age, a continuous feature. That said, while identifying the most appropriate cutoff(s) may not be easy, determining an acceptable cutoff is not complicated.
  • Stabilizing the AIF 360 library or ensuring efficient imports: Currently, importing the AIF 360 library can crash the Jupyter Notebook kernel depending on the computer (two of our team members were unable to import the library, as the kernel would die upon import). This greatly reduces the accessibility of the library.
