Justice from the Stanza Della Segnatura frescoes in the Vatican Museum by Raphael.

The role of data scientists in building fairer machine learning — Part 2

Jurriaan Parie
IBM Data Science in Practice
8 min read · Feb 22, 2022


Part 1 of this blog series discusses how data scientists can contribute to developing fairer machine learning (ML). This blog post elaborates further on quantitative and qualitative frameworks that help practitioners substantiate big concepts such as group fairness. Special focus is placed on training ML models that exclude protected attributes and on accuracy-fairness trade-offs.

Coding examples are provided for the discussed topics, using IBM’s open-source toolkit AIF360. Please note that other toolkits to detect and mitigate bias have been developed by Amazon (SageMaker Clarify), Google (the What-If Tool), and Microsoft (Fairlearn).

Case study

We continue with the following case study: deploying an ML risk assessment tool to predict defaulting loan applicants at a bank. We will examine how to deal with observed disparities in the data (the German Credit data set), i.e., the hypothesis that young loan applicants (age≤25) default more often than older applicants (age>25).

Recall the conclusion of Part 1 of this blog series: the US Consumer Credit Protection Act prohibits loan approval decisions from being based on the age of applicants. Including age as an input variable in a predictive risk assessment model that supports decisions for loan approval is therefore illegal.

In this blog, strategies are discussed to develop ML models that exclude protected attributes, such as age. We show how quantitative fairness metrics can help to select the ‘best’ ML model.

Figure 1: the pipeline for computing fairness metrics. First, in pre-processing, metrics are computed on the observed outcomes Y in the data; second, in-processing, on the predicted outcomes Ŷ during model validation; third, post-processing, on the predicted outcomes Ŷ on the test set.

ML workflow — training
The full ML workflow is depicted in Figure 1. We start by implementing the case study in AIF360 and splitting the original data set into training, validation, and test sets (see code below).

# imports (numpy and AIF360)
import numpy as np
from aif360.datasets import GermanDataset
# default_preprocessing is the helper defined in aif360.datasets.german_dataset
from aif360.datasets.german_dataset import default_preprocessing

# age as protected attribute
prot_attr = 'age'
age_level = 25

# (un)privileged groups
privileged_groups = [{'age': 1}]
unprivileged_groups = [{'age': 0}]

# pre-processing data set with AIF360
gd = GermanDataset(
    # specify protected attribute
    protected_attribute_names=[prot_attr],
    # initialize privileged class
    privileged_classes=[lambda x: x > age_level],
    # default pre-processing
    custom_preprocessing=default_preprocessing
)

# split data into training (50%), validation (30%) and test (20%) sets
gd_train, gd_val, gd_test = gd.split([0.5, 0.8], shuffle=True)

For our case study, we use a basic binary logistic regression (LR) classifier to predict defaulting. Two strategies to exclude the protected attribute age from the ML model are discussed:

  1. Excluding protected attributes before training: An intuitive strategy is to remove protected variables from the data before the data are fed to the ML model. In the code below, age is removed from the training data. The LR model is trained on the remaining 56 out of the 57 features in the German Credit data set.
# scikit-learn imports for the classifier pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# delete the age column (index 4) from the training features
gd_train1 = gd_train.copy()
gd_train1.features = np.delete(gd_train1.features, 4, 1)
# initialize pipeline
model1 = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear', random_state=1))
# model parameters
fit_params = {'logisticregression__sample_weight': gd_train1.instance_weights}
# fit model
LR1 = model1.fit(gd_train1.features, gd_train1.labels.ravel(), **fit_params)

An ML model can, however, discriminate even when protected characteristics are excluded from the data. For instance, if age holds predictive power for defaulting but is removed from the data, the ML model could assign more weight to proxy-variables for age by increasing the regression coefficients of these variables. Proxy-variables are variables that indirectly reveal the age of an applicant because they are closely correlated with age. In the German Credit data set, proxy-variables for age are, for instance, years of employment and savings, as older applicants tend to have more savings than younger ones.
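As a quick sanity check, one can inspect which features correlate most strongly with the protected attribute and are therefore candidate proxy-variables. The snippet below is a minimal sketch, not part of the original tutorial; it assumes the gd_train split created above.

# sketch: rank features by their correlation with the protected attribute
age_idx = gd_train.feature_names.index('age')       # column index of the protected attribute
age_col = gd_train.features[:, age_idx]
correlations = {}
for i, name in enumerate(gd_train.feature_names):
    if i != age_idx:
        # Pearson correlation between age and every other feature
        correlations[name] = np.corrcoef(age_col, gd_train.features[:, i])[0, 1]
# features most strongly (anti-)correlated with age are candidate proxy-variables
print(sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)[:5])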

So, although direct discrimination is prevented by excluding protected attributes, indirect discriminatory behavior is still possible. The scenario of discrimination through proxy-variables can be resolved by the following strategy:

2. Change model parameters for protected attributes after having trained the ML model: include the protected attribute in the training phase of the ML model. Now, age ‘absorbs’ its own predictive power, which prevents that power from being relocated to proxy-variables. Once the model is trained, we manually set the regression coefficient of the protected variable to 0 (sketched below).
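The code below sketches Strategy 2 under the setup introduced above; variable names such as model2, coef2, and intercept2 are illustrative and not taken from the accompanying repository.

# Strategy 2 (sketch): train with age included, then neutralize its coefficient
gd_train2 = gd_train.copy()                          # age column is kept in the features
model2 = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear', random_state=1))
fit_params2 = {'logisticregression__sample_weight': gd_train2.instance_weights}
LR2 = model2.fit(gd_train2.features, gd_train2.labels.ravel(), **fit_params2)
# after training, set the regression coefficient of the protected attribute to 0
age_idx = gd_train2.feature_names.index('age')
coef2 = LR2.named_steps['logisticregression'].coef_.copy()
intercept2 = LR2.named_steps['logisticregression'].intercept_
coef2[0, age_idx] = 0.0                              # age no longer contributes to predictions

These zeroed-out coefficients can then be passed to the validation step in the same way as the Strategy 1 coefficients below.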

Note that Strategy 2 is only applicable to interpretable classifiers, such as logistic regression and simple decision trees, but not to non-interpretable classifiers, such as random forests, XGBoost, and support vector machines.

Based on quantitative fairness measures as introduced in Part 1 of this blog series, we examine the effect of Strategy 1 and Strategy 2 on (un)equal treatment of the unprivileged class (age≤25) and privileged class (age>25) by the LR model.

ML workflow — validation

Once the model is trained, the validation set is used to select a suitable classification threshold for the LR model. The classification threshold is the decision-making mechanism that maps loan applicants, based on the probabilities returned by the logistic function, to the favorable class (no default) or the unfavorable class (default). For example, if the classification threshold is set to 0.8 and the LR model predicts that a loan applicant defaults with probability 0.75, the applicant is mapped to the favorable class (no default); an applicant with a predicted defaulting probability of 0.82 is mapped to the unfavorable class (default). As we will see, threshold policies play an important role in fairer ML.
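To make the threshold policy concrete, here is a minimal illustration with hypothetical probabilities (not output of the trained model):

# minimal illustration of the threshold policy (hypothetical predicted probabilities)
threshold = 0.8
proba_default = np.array([0.75, 0.82])               # P(default) for two applicants
# applicants at or above the threshold are mapped to the unfavorable class (default)
print(proba_default >= threshold)                    # [False  True]: no default, default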

Commonly, from a sequence of candidate thresholds, the threshold that returns the highest model accuracy on the validation set is chosen. This threshold is then used to produce results for the ML model on the test set. In the code below, performance of the LR model is computed for 50 classification thresholds between 0.5 and 1. The helper function test_model is used to compute the model performance of the LR model for all candidate thresholds. The full code can be found in this GitHub repository.

# specify candidate thresholds
thresh_arr = np.linspace(0.5, 1, 50)

# compute model metrics for every candidate threshold
# (coef1 and intercept1 are the coefficients and intercept of the trained LR model of Strategy 1)
metrics1 = test_model(dataset=gd_val,
                      coef=coef1,
                      intercept=intercept1,
                      thresh_arr=thresh_arr,
                      unprivileged_groups=unprivileged_groups,
                      privileged_groups=privileged_groups)

Figure 2 shows that thresholds 0.78 (Strategy 1) and 0.81 (Strategy 2) yield the highest model accuracy for the trained LR models on the validation set. Note that balanced accuracy is used here to assess model performance, since the target variable (default or not) in the German Credit data set is imbalanced (70:30). For imbalanced target variables, balanced accuracy is a more appropriate model performance metric than, for instance, precision or recall.

Figure 2: balanced accuracy (between 0.50 and 0.80) on the validation set plotted against the classification threshold (between 0.5 and 1.0) for Strategy 1 and Strategy 2. Both curves start just above 0.65 and reach their maximum at a threshold just below 0.8.
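The threshold selection itself boils down to a small search. The function below is a generic sketch using scikit-learn’s balanced_accuracy_score; the original post uses the test_model helper from its repository, and y_true and proba_unfavorable are placeholders for the validation labels and predicted probabilities.

# generic sketch: pick the candidate threshold that maximizes balanced accuracy
from sklearn.metrics import balanced_accuracy_score

def pick_threshold(y_true, proba_unfavorable, thresholds):
    # y_true: 1 for the unfavorable class (default), 0 otherwise
    scores = [balanced_accuracy_score(y_true, (proba_unfavorable >= t).astype(int))
              for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]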

Accuracy-fairness trade-off

When contemplating strategies to deal with protected attributes and fairness, the choice of the ‘best’ exclusion strategy or classification threshold does not depend merely on accuracy metrics; fairness measures join these considerations as well. We discuss how practitioners can deal with trade-offs between model accuracy and fairness.

Recall the two fairness measures introduced in Part 1 of this blog series: statistical parity difference and the disparate impact ratio. Based on the observed outcomes in the original data, we found a statistical parity difference of 0.10 and a disparate impact of 0.86. For statistical parity difference and disparate impact respectively, scores closer to 0 and 1 are considered ‘fairer’.
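In AIF360, both measures can be computed directly on a BinaryLabelDataset. The snippet below is a sketch for the observed outcomes in the original data set, reusing the gd object and group definitions from the training section.

# fairness metrics on the observed outcomes in the original data set
from aif360.metrics import BinaryLabelDatasetMetric

metric_orig = BinaryLabelDatasetMetric(gd,
                                       unprivileged_groups=unprivileged_groups,
                                       privileged_groups=privileged_groups)
# difference in favorable-outcome rates between unprivileged and privileged group (fair ≈ 0)
print(metric_orig.statistical_parity_difference())
# ratio of favorable-outcome rates between unprivileged and privileged group (fair ≈ 1)
print(metric_orig.disparate_impact())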

Note that fairness metrics are now based on LR model predictions and therefore depend on the chosen classification threshold, since the chosen threshold decides how many loan applicants are mapped to the favorable and unfavorable class. The two measures are plotted for both strategies in Figure 3.

Given the above accuracy and fairness scores, we examine which strategy we prefer for this case study to exclude protected attributes from the data. Determining the ‘best’ exclusion strategy is, however, not a purely quantitative optimization problem of maximizing model accuracy while optimizing fairness scores. Instead, qualitative reasoning again joins the fairness equation: how much accuracy are the model owners willing to give up to improve fairness scores? Given the goal of the ML model, are some fairness measures considered more important than others?

Settling the accuracy-fairness trade-off

There is no universally right answer to the above questions. As discussed in Part 1 of this blog series: fairness is a normative and context-dependent concept that is primarily driven by values and beliefs rather than objective (quantitative) ground truths. Ideally, a diverse group of stakeholders makes normative decisions together on a case-by-case basis through discussion. The role of data scientists is to inform this audience with quantitative insights, for example by creating accuracy-fairness plots such as Figure 3.

Let’s assess how accuracy and fairness are interconnected in our case study. From Figure 2, it becomes clear that Strategy 2 outperforms Strategy 1 in terms of model accuracy. On the other hand, Figure 3 shows that Strategy 1 outperforms Strategy 2 in terms of fairness metrics. For Strategy 1, the threshold of 0.78 that maximizes model performance on the validation set (0.73) yields a statistical parity difference of 0.15 and a disparate impact score of 0.77. For Strategy 2, no reasonable classification threshold returns a statistical parity difference below 0.22 or a disparate impact above 0.72. Since the Credit Protection Act urges us to err strongly on the side of equal treatment of age groups, one could argue for preferring Strategy 1 over Strategy 2 to exclude the protected attribute from the ML tool.

Note also that the fairness metrics of Strategy 1 can be improved further. When threshold 0.82 is selected, a bit of accuracy is given up (0.72) in return for a statistical parity difference closer to 0 (0.10) and a higher disparate impact score (0.85). These fairness scores almost equal the observed disparities in the original data set. Of course, perfect fairness scores are the holy grail, but, just like perfect model accuracy, they are not a realistic goal to aim for. Assessing where to settle the accuracy-fairness trade-off will therefore always be a normative exercise.

It is important to note that the fairness metrics computed above depend on the split of the original data set into training, validation, and test sets. Fairness metrics should therefore be recomputed on different splits of the original data set, to obtain summary statistics such as the mean, median, and variance of the fairness metrics, before a final exclusion strategy and classification threshold are chosen (a minimal sketch follows below).
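The snippet below sketches such a robustness check for an observed-data metric over repeated random splits; for the prediction-based metrics discussed above, the classifier would also have to be retrained and re-validated for every split.

# sketch: recompute an observed-data fairness metric over repeated random splits
spd_scores = []
for seed in range(10):
    np.random.seed(seed)                             # make each shuffled split reproducible
    gd_tr, gd_v, gd_te = gd.split([0.5, 0.8], shuffle=True)
    m = BinaryLabelDatasetMetric(gd_te,
                                 unprivileged_groups=unprivileged_groups,
                                 privileged_groups=privileged_groups)
    spd_scores.append(m.statistical_parity_difference())
print(np.mean(spd_scores), np.median(spd_scores), np.var(spd_scores))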

Moreover, this tutorial focuses primarily on computing fairness measures, whereas training other ML methods might improve model accuracy. Different model performance leads to different dynamics in the trade-off between accuracy and fairness, but more complex models could compromise the interpretability of the ML model as well.

Conclusion
No standardized procedures exist yet for handling protected attributes and their proxies in ML modeling. Choices about model accuracy and fairness are therefore driven by values and beliefs rather than quantitative ground truths. Settling accuracy-fairness trade-offs has always been implicitly present in human decision-making. The data-driven algorithmic era has simply brought those trade-offs to the foreground and encourages us to reason about them more precisely. It is up to data scientists to bring quantitative insights about these trade-offs to the table, so that normative modeling choices can be made together.

Interested in learning more about the IBM Data Science Community? Join here, and please leave a comment or send a direct message if you have any questions or would like to suggest a particular article subject to cover next time.

Jurriaan Parie
IBM Data Science in Practice

Applied data scientist with a strong interest in algorithmic fairness and its societal impact. Interdisciplinary academic background in stats and fair ML.