Justice, from the ceiling frescoes of the Stanza della Segnatura in the Vatican Museums, by Raphael.

The role of data scientists in building fairer machine learning – Part 1

Jurriaan Parie
IBM Data Science in Practice

--

Over the last few years, statistical measures to assess fairness in machine learning (ML) have gained prominence. Fairness, however, is a normative and context-dependent concept that defies simple quantitative definitions. To build fairer ML, practitioners therefore need frameworks that make broad concepts such as fairness concrete in practice.

In this blog series, the role of data scientists in building fairer ML is discussed. Part 1 sheds light on quantitative and qualitative reasoning about group fairness. Part 2 elaborates on training ML models that exclude protected attributes.

Coding examples are provided for the topics discussed, using IBM’s open-source toolkit AIF360. Please note that other toolkits to detect and mitigate bias have been developed by Amazon (SageMaker Clarify), Google (the What-If Tool) and Microsoft (Fairlearn).
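
To follow along with the code in this post, AIF360 needs to be installed in your Python environment. A minimal setup might look like the sketch below; note that AIF360 does not bundle the raw German Credit files, so they have to be downloaded into the toolkit’s data folder first, as explained in its documentation.

# install IBM's open-source AIF360 toolkit (run in a shell)
# pip install aif360

# imports used in the code samples throughout this post
from aif360.datasets import GermanDataset
from aif360.metrics import BinaryLabelDatasetMetric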

Rationale for deploying an ML model

First, a disclaimer: a key aspect of building fairer ML models is to design their lifecycle carefully. This IBM Medium blog post shares in greater detail what The Lifecycle View of Trustworthy AI involves. In the context of fairness, the conception phase of the ML lifecycle in particular needs careful review.

In this blog series, we look at an ML tool that predicts whether new loan applicants at a bank will default. Even before such a model is deployed, it is known that the model will never be fully accurate and will inevitably generate some wrong predictions. The lifecycle of the model should therefore start with the following questions:

  • What are the (business) reasons for developing an ML model?
  • What are possible alternatives to the ML model?
  • What are the consequences of not using an algorithmic model at all?

Even before specific ethical risks, such as discriminatory effects of the ML tool, come into play, organizations need clear answers to these questions. Fairness assessments should only be considered after ML applications have ‘proven’ their right to exist. The process of approving and rejecting the deployment of ML tools should be embedded in organizations’ model development policies. IBM is helping clients design such policies, but a detailed review of those policies is beyond the scope of this blog post.

ML workflow

So, let’s consider a scenario where a predictive risk assessment tool has gained approval to be deployed at a bank. Where in the ML workflow can fairness be assessed? To digest this large question, we conceptually break the ML workflow down into three phases: 1) pre-processing, 2) in-processing and 3) post-processing (see Figure 1). In each phase, fairness metrics can be computed for both individual loan applicants and groups of applicants. Here we focus on group fairness.

For this case study, we use the widely cited German Credit data set. This data set consists of 1000 entries and contains 20 attributes of loan applicants. We use this data set throughout this blog series to develop a predictive risk assessment tool, for which group fairness is assessed.

Figure 1 displays a typical ML workflow. The first step in building any ML model is pre-processing: splitting the original data set into training, validation and test sets. Next comes the in-processing phase: first, a binary classification model is trained on the training set to predict defaulting; then the validation set is used to select parameters for the final model, such as the classification threshold of a logistic regression model or the maximum tree depth of a random forest model. Finally, in post-processing, the trained model is used to make predictions on the test data.

Figure 1: Fairness metrics along the ML workflow: the observed outcome Y in the data during pre-processing, the predicted outcome Ŷ on the validation set during in-processing, and the predicted outcome Ŷ on the test set during post-processing.
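
To make these three phases concrete, below is a minimal, illustrative sketch in scikit-learn. The DataFrame df, the column name 'default' and the model settings are hypothetical placeholders; the actual AIF360-based code for the case study follows in the next sections.

# illustrative sketch of the three workflow phases (not the case-study code)
# assumes a pandas DataFrame `df` with numeric feature columns and a binary 'default' label
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1) pre-processing: split the original data into training (50%), validation (30%) and test (20%) sets
train_df, rest_df = train_test_split(df, test_size=0.5, random_state=0)
val_df, test_df = train_test_split(rest_df, test_size=0.4, random_state=0)

X_train, y_train = train_df.drop(columns=['default']), train_df['default']
X_val, y_val = val_df.drop(columns=['default']), val_df['default']
X_test, y_test = test_df.drop(columns=['default']), test_df['default']

# 2) in-processing: train a binary classifier on the training set and use the
#    validation set to choose model parameters such as the classification threshold
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
threshold = 0.5  # in practice, selected by evaluating candidate thresholds on (X_val, y_val)
val_pred = (model.predict_proba(X_val)[:, 1] >= threshold).astype(int)

# 3) post-processing: apply the trained model to the held-out test set
test_pred = (model.predict_proba(X_test)[:, 1] >= threshold).astype(int)
print("Test accuracy: %.3f" % accuracy_score(y_test, test_pred))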

Here in Part 1, we compute fairness metrics in the pre-processing phase. In the next post, Part 2, bias detection methods for the in- and post-processing phases are discussed.

Qualitative fairness metrics

In each phase of the ML workflow, quantitative and qualitative reasoning helps you to make fairness tangible. How can we apply those reasoning paradigms to build fairer ML?

Quantitative fairness metrics are based on statistics. For instance, one could compute the difference in the rate of favorable outcomes between certain demographic groups, as observed in the original data set. Which demographic groups are (un)privileged is, however, a qualitative question.

Let’s examine the default rate for different age groups in the German Credit data set. As can be seen in Figure 2, young loan applicants (ages 18–25) default more often than older applicants (age > 25). How should we deal with this quantitative disparity?

Figure 2: Share of loan applicants that did not default, per age group, in the German Credit data set: 58% for ages 18–25, 70% for ages 26–35, 76% for ages 36–45, 76% for ages 46–55, 74% for ages 56–65 and 72% for ages 66–75.
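
The breakdown in Figure 2 can be reproduced by grouping the raw data by age bracket and computing the share of non-defaults per group. The sketch below is illustrative: it assumes the raw German Credit data has been loaded into a pandas DataFrame df with an 'age' column in years and a binary 'default' column (1 = default, 0 = no default); the actual column names depend on how you load the data.

# illustrative sketch: no-default rate per age bracket (column names are hypothetical)
import pandas as pd

bins = [18, 25, 35, 45, 55, 65, 75]
labels = ['18-25', '26-35', '36-45', '46-55', '56-65', '66-75']

# assign each applicant to an age bracket
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, include_lowest=True)

# share of applicants per bracket that did not default
no_default_rate = 1 - df.groupby('age_group')['default'].mean()
print(no_default_rate.round(2))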

Consider two worldviews: we’re all equal (WAE) and what you see is what you get (WYSIWYG). The WAE worldview holds that all age groups have similar abilities to repay a loan. One could argue that the observed disparities in the data are due to poor data collection processes, or that the available data set is not representative of the entire population, which is a type of sampling bias. Under the WAE worldview, including age as a predictor variable in the risk assessment tool is considered discrimination.

In the WYSIWYG worldview, by contrast, the observations in the data reflect each age group’s ability to repay a loan. In this worldview, one accepts that, for whatever reason, young loan applicants default more often than older applicants. One might therefore want to include age as a predictor variable in the risk assessment tool, as it potentially adds predictive power to the ML model. Moreover, if age indeed adds predictive value, excluding this variable will in general lead to unjustified disparate impact, since over-25s would then be assigned higher risk scores than their observed repayment behavior warrants. Causation is, however, a tricky beast: one should examine carefully why the variable age, specifically, adds predictive power to the ML model.

So, in assessing group fairness in ML models, one should realize that fairness is primarily driven by values, beliefs and worldviews rather than by objective (quantitative) ground truths. Which values and beliefs prevail is subject to socio-cultural, political and environmental factors, and deciding on them is therefore a normative exercise. A diverse group of subject matter experts needs to examine, case by case, whether observed disparities in the data are justifiable (WYSIWYG) or unjustifiable (WAE). In performing such bias audits, data scientists play an important role by informing this audience with quantitative fairness measures for the ML model at hand. The AIF360 toolkit helps practitioners do exactly that.

Quantitative fairness metrics

Let’s apply techniques from the AIF360 toolkit to our case study. What additional quantitative fairness metrics can be computed to assess the alleged disparity between the unprivileged group (under-25s) and the privileged group (over-25s)? The code to implement this case study is given below; the full code can be found in this GitHub repository.

# imports from the AIF360 toolkit
from aif360.datasets import GermanDataset
from aif360.datasets.german_dataset import default_preprocessing

# age as protected attribute; applicants older than 25 form the privileged group
prot_attr = 'age'
age_level = 25

# (un)privileged groups, encoded as binary values of the protected attribute
privileged_groups = [{'age': 1}]
unprivileged_groups = [{'age': 0}]

# load and pre-process the German Credit data set with AIF360
gd = GermanDataset(
    # specify the protected attribute
    protected_attribute_names=[prot_attr],

    # mark applicants older than age_level as the privileged class
    privileged_classes=[lambda x: x > age_level],

    # use GermanDataset's default pre-processing
    custom_preprocessing=default_preprocessing
)

# split the data into training (50%), validation (30%) and test (20%) sets
gd_train, gd_val, gd_test = gd.split([0.5, 0.8], shuffle=True)

Statistical parity difference
Aggregated statistics about observed defaults across age groups in the training data set are displayed in Table 1. One can compute the difference in the rate of favorable outcomes (no default) between the age groups. The observed favorable-outcome rate for the over-25s equals 291/396 = 0.735; for the under-25s it equals 59/104 = 0.567. The difference between these rates is called the statistical parity difference; AIF360 computes it as the unprivileged rate minus the privileged rate, giving 0.567 - 0.735 = -0.168. A statistical parity difference closer to 0 can be considered ‘fairer’.

Table 1: Observed outcomes by age group in the training set

                             Default    No default    Total
Privileged (age > 25)            105           291      396
Unprivileged (age ≤ 25)           45            59      104
Total                            150           350      500

Disparate impact ratio
The ratio of the favorable-outcome rate of the unprivileged group to that of the privileged group (0.567/0.735 = 0.772) is called the disparate impact ratio. A disparate impact score closer to 1 can be considered ‘fairer’.
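
As a sanity check, both metrics can also be computed by hand from the counts in Table 1; the short sketch below does exactly that, using AIF360’s convention of reporting the statistical parity difference as the unprivileged rate minus the privileged rate.

# sanity check: compute both metrics directly from the counts in Table 1
priv_no_default, priv_total = 291, 396        # privileged group (age > 25)
unpriv_no_default, unpriv_total = 59, 104     # unprivileged group (age <= 25)

p_priv = priv_no_default / priv_total         # favorable-outcome rate, privileged: ~0.735
p_unpriv = unpriv_no_default / unpriv_total   # favorable-outcome rate, unprivileged: ~0.567

# statistical parity difference: unprivileged rate minus privileged rate
print("Statistical parity difference = %.3f" % (p_unpriv - p_priv))   # -0.168

# disparate impact: ratio of unprivileged to privileged favorable-outcome rate
print("Disparate impact = %.3f" % (p_unpriv / p_priv))                # 0.772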

Both fairness metrics can be computed using AIF360 (see code below).

# compute fairness metrics on the training set
from aif360.metrics import BinaryLabelDatasetMetric

metric_gd_train = BinaryLabelDatasetMetric(gd_train,
                                           unprivileged_groups=unprivileged_groups,
                                           privileged_groups=privileged_groups)

# statistical parity difference (unprivileged rate minus privileged rate)
print("Statistical parity difference = %f" % metric_gd_train.statistical_parity_difference())

# disparate impact (ratio of unprivileged to privileged rate)
print("Disparate impact = %f" % metric_gd_train.disparate_impact())

Again, a diverse group of stakeholders needs to decide whether the measured statistical parity difference (-0.168) and disparate impact score (0.772) are justifiable (WYSIWYG) or not (WAE). As a data scientist, you cannot answer this question alone. Note that in the nascent field of algorithmic fairness, no standardized approach exists yet for settling the WAE-WYSIWYG dispute. In informing the audience with quantitative fairness measures, data scientists should take into account at least the following aspects:

1. Reliability of fairness metrics: Fairness metrics depend on the split of the original data into training, validation and test sets. Fairness metrics should be recomputed on different splits of the original data set to obtain summary statistics (mean, median, variance, etc.) for the fairness metrics; a sketch of this is given after this list.

2. Sample bias: The computed bias scores might be unreliable due to sample bias. In our case study, however, the risk of sample bias is limited since we work with the well-known and peer-reviewed German Credit data set. Sample bias needs to be examined carefully when working with industry data sets.

3. Protected attributes: Even if the data set is representative, including certain variables in a predictive ML tool might be forbidden by law. For instance, the US Equal Credit Opportunity Act (part of the Consumer Credit Protection Act) prohibits credit decisions from being based on (among other attributes) the age, race, color, religion or sex of an applicant, regardless of whether that variable holds predictive power (based on historical data) for predicting defaults.
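
To illustrate point 1, the sketch below recomputes the statistical parity difference over several random training splits and summarizes the results; it assumes the GermanDataset object gd and the group definitions from the earlier code snippet are still in scope.

# sketch: variability of the statistical parity difference across random splits
# (assumes `gd`, `privileged_groups` and `unprivileged_groups` from the code above)
import numpy as np
from aif360.metrics import BinaryLabelDatasetMetric

spd_scores = []
for seed in range(10):
    np.random.seed(seed)  # make each shuffled split reproducible
    gd_train, gd_val, gd_test = gd.split([0.5, 0.8], shuffle=True)
    metric = BinaryLabelDatasetMetric(gd_train,
                                      unprivileged_groups=unprivileged_groups,
                                      privileged_groups=privileged_groups)
    spd_scores.append(metric.statistical_parity_difference())

print("Mean SPD: %.3f, standard deviation: %.3f" % (np.mean(spd_scores), np.std(spd_scores)))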

Conclusion

So, we end our case study with the conclusion that democratically legitimized regulators have already decided for us to adopt the WAE worldview. This means that age cannot serve as a predictor variable in a risk assessment ML model that supports loan approval decisions. We will learn more about training an ML model while excluding protected attributes in Part 2 of this blog series.

Interested in learning more about the IBM Data Science Community? Join here, and please leave a comment or a direct message if you have any questions or a request for a particular article subject to cover next time.

--


Jurriaan Parie
IBM Data Science in Practice

Applied data scientist with a strong interest in algorithmic fairness and its societal impact. Interdisciplinary academic background in statistics and fair ML.