Build Fairer and More Equitable AI with Bias Testing

Gleb Drobkov
May 11

By Haniyeh Mahmoudian, Sarah Khatry, and Gleb Drobkov

In the past few years, the media has surfaced many stories about bias in machine learning (ML) and AI, such as the hiring algorithm that favored men, targeted ads that perpetuated housing discrimination, or the widely used healthcare algorithm that exhibited significant racial bias. Proactively identifying and mitigating these and other risks or, in other words, following Responsible AI practices, is a top priority for many AI practitioners. But although the academic literature offers many approaches to measuring bias and fairness in algorithms, few hands-on methods are readily accessible to data scientists looking to assess and mitigate bias in the models their organizations create.

Data scientists need tools to guide their organizations in the process of building models while taking into account bias-related considerations.

Bias-detection tools are now available, running the gamut from open-source libraries that demand a high degree of in-house technical knowledge to commercial products designed for companies that lack the data science staff to implement complex tooling.

One such commercial offering is a new tool recently integrated into the enterprise AI platform, DataRobot. The platform contains tools to help data scientists identify and measure bias, investigate its source, and make final modeling decisions that take into account the tradeoff between fairness and accuracy.

Five Steps to Bias Reduction

The DataRobot tools support the Responsible AI workflow in five distinct steps:

1. Defining Algorithmic Fairness

2. Testing Model Bias

3. Investigating the Source of Model Bias

4. Mitigating the Bias

5. Choosing the Final Model

To illustrate these steps, we will use a practical case study based on the UCI Machine Learning Repository Adult dataset. This dataset consists of information on working-age individuals, including occupation, education, capital gain/loss, and demographic information such as gender or marital status. For the purposes of this case study, we generate a binary target variable that we know is biased against women: it is positive when an individual's income exceeds $50,000 per year.
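As a minimal sketch, the target construction looks like this (the rows and dollar amounts are illustrative; the actual Adult dataset encodes income as a categorical band rather than a number):

```python
import pandas as pd

# Illustrative rows; the real Adult dataset stores income as a
# categorical ">50K"/"<=50K" band rather than a dollar amount.
df = pd.DataFrame({
    "age":    [39, 50, 28, 44],
    "sex":    ["Male", "Male", "Female", "Female"],
    "income": [60_000, 45_000, 52_000, 30_000],
})

# Binary target: 1 if annual income exceeds $50,000, else 0.
df["high_income"] = (df["income"] > 50_000).astype(int)
print(df["high_income"].tolist())  # [1, 0, 1, 0]
```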

We analyzed the features of this data set in depth in a prior article, but for the context of this walkthrough, it is important to note that:

  • This data is skewed and not representative of the current US population[1]
  • Despite these caveats, our analysis focuses on measuring differences in rates. As such, these skews do not impact the relevance of our findings.

Using DataRobot’s AutoML platform, we built 80 competitive models, each predicting a binary indicator of an individual’s income based on the information mentioned above. Additionally, we used the platform’s Bias and Fairness feature to assess the bias introduced by sensitive features in the data[2], and to examine whether our models were exhibiting discriminatory behavior.

1. Defining Algorithmic Fairness

There are numerous, and sometimes conflicting, ways to measure algorithmic fairness with respect to protected or sensitive attributes. Researchers have published countless definitions of fairness and bias; a tutorial by Arvind Narayanan explains 21 of them. As that tutorial makes clear, it can be quite difficult to define "fairness" in the context of a use case and to choose the appropriate metric.

Across bias and fairness definitions, two broad categories of metrics can be identified: fairness by representation and fairness by error.

1. Fairness by representation focuses directly on model predictions to evaluate the likelihood of each group receiving the favorable outcome: Is a particular group more likely to be treated favorably or unfavorably by the model? Some examples include proportional parity or equal parity.

2. In fairness by error, the priority is model performance in terms of accuracy: Across groups, is a model committing similar types of error, or is one group impacted more than another by inaccurate predictions? This would include metrics such as specificity and sensitivity parity, or precision and negative predictive value parity.
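The distinction between the two categories can be sketched on a toy set of predictions. The function names and data below are illustrative, not DataRobot's implementation:

```python
import numpy as np

def proportional_parity(y_pred, groups):
    """Fairness by representation: favorable-prediction rate per group,
    expressed as a ratio to the most-favored group."""
    rates = {g: y_pred[groups == g].mean() for g in np.unique(groups)}
    top = max(rates.values())
    return {g: r / top for g, r in rates.items()}

def sensitivity_parity(y_true, y_pred, groups):
    """Fairness by error: true positive rate (sensitivity) per group."""
    return {g: y_pred[(groups == g) & (y_true == 1)].mean()
            for g in np.unique(groups)}

groups = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])
y_true = np.array([1, 0, 1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])

print(proportional_parity(y_pred, groups))   # M favored 3x as often as F
print(sensitivity_parity(y_true, y_pred, groups))
```

Note that the two views can disagree: a model could have equal selection rates across groups (fair by representation) while making far more false negatives for one group (unfair by error).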

For people who may be new to algorithmic fairness, or who might otherwise feel unsure how to choose the appropriate metric, the platform provides a questionnaire to guide the user towards the supported metric that is most relevant for a given use case.

Helper tool for selecting the appropriate fairness metric.

Metrics can often be conflicting, especially when comparing a fairness by representation metric to fairness by error. Choosing the right one depends on the context of the use case, the data, and the target.

2. Testing Model Bias

Unlike the analysis in the previously mentioned article, the following test is applied to the model's predictions on the validation set. For this use case, we will assess fairness by representation through the proportional parity metric, which calculates the likelihood of an individual in a particular group being assigned the favorable outcome (in this case, a higher income). Since this is a fairness-by-representation metric, our goal is to see that the model does not disproportionately assign favorable outcomes on the basis of gender.

The results of proportional parity from the validation set are presented in the figure below.

As you can see, our predictions show a significant difference in the probability of being assigned a high income between males and females, with women predicted to have high incomes at less than half the rate of men (19.4% of men are predicted to be high income vs. 8.7% of women, a ratio of 0.45).

Proportional Parity fairness metric — male and female model scores.

Thus, on the basis of proportional parity we would judge this to be a biased model, in which the bias is inherited from historical outcomes.
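The reported parity ratio follows directly from the two prediction rates:

```python
# Rates reported on the validation set (from the figure above).
male_rate = 0.194    # 19.4% of men predicted to be high income
female_rate = 0.087  # 8.7% of women predicted to be high income

ratio = female_rate / male_rate
print(round(ratio, 2))  # 0.45
```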

When measuring bias and fairness, the optimal scenario is for the data scientist to work with other stakeholders to identify the most appropriate metric, and to report that decision with clarity. Choosing other, less appropriate metrics could lead to different conclusions.

3. Investigating the Source of Model Bias

Using another tool in DataRobot’s bias and fairness feature, we will dig deeper into the data to understand what factors were contributing to this historical discrepancy. The Cross-Class Disparity tool compares the distribution of different features across the male and female groups in the dataset. It calculates feature importance and uses it to label observed disparities as having either minor, moderate, or major impact on the prediction bias.

In this instance, two features are labeled "moderate" (yellow): hours per week and type of occupation. One feature is labeled "major": the target itself, which reflects the historical skew present in the data.

In the screenshot below, note that:

  • For women, hours worked per week skews toward lower values relative to men, suggesting that many women in the dataset hold part-time jobs.
  • Looking at occupation, women in the dataset earn lower salaries on average than men in the same professions, a pattern also documented by the U.S. Bureau of Labor Statistics.
  • A combination of these features may be contributing to the biased income predictions for males and females.
Cross-class disparity evaluation tool.
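A cross-class disparity check of this kind can be approximated in a few lines. The sketch below is an illustrative approximation, not DataRobot's actual algorithm: it scores each feature's cross-group distributional gap with a two-sample Kolmogorov–Smirnov statistic, scales that by the feature's importance, and buckets the result (the thresholds are arbitrary assumptions):

```python
import numpy as np
import pandas as pd
from scipy import stats

def disparity_report(df, group_col, features, importances,
                     thresholds=(0.1, 0.3)):
    """Score each feature's cross-group distributional disparity with a
    two-sample KS statistic, scale by feature importance, and bucket the
    result into minor / moderate / major."""
    g1, g2 = df[group_col].unique()[:2]
    report = {}
    for feat in features:
        a = df.loc[df[group_col] == g1, feat]
        b = df.loc[df[group_col] == g2, feat]
        ks = stats.ks_2samp(a, b).statistic
        score = ks * importances[feat]
        level = ("minor" if score < thresholds[0]
                 else "moderate" if score < thresholds[1]
                 else "major")
        report[feat] = (round(score, 3), level)
    return report

# Synthetic example: women's weekly hours skew lower than men's.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sex": ["M"] * 500 + ["F"] * 500,
    "hours_per_week": np.concatenate([rng.normal(42, 8, 500),
                                      rng.normal(34, 10, 500)]),
})
report = disparity_report(df, "sex", ["hours_per_week"],
                          {"hours_per_week": 0.8})
print(report)
```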

4. Mitigating the Bias

Now that we have investigated the data, we will attempt to mitigate the bias. Several techniques are available. For example:

  • We can remove "Gender" and other features we identify as gender proxies from the model.
  • We can then evaluate that modeling approach side by side with the original models to see whether bias and/or accuracy are reduced.
  • Beyond that, more advanced techniques can mitigate bias at each stage of the modeling workflow: pre-processing, in-processing, and post-processing. We will not review these today, but toolkits such as Fairlearn provide options for data scientists looking for more advanced bias-mitigation methods.
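The first two steps, and their key limitation, can be sketched on synthetic data (the data-generating process and feature names here are invented for illustration). Note how dropping the sensitive feature improves parity but does not eliminate bias, because a correlated proxy remains:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def parity_ratio(pred, sex):
    """Female favorable-prediction rate divided by the male rate."""
    return pred[sex == 0].mean() / pred[sex == 1].mean()

# Synthetic data: income depends on hours worked and on sex, and hours
# itself correlates with sex, acting as a proxy feature.
rng = np.random.default_rng(42)
n = 2000
sex = rng.integers(0, 2, n)                      # 1 = male, 0 = female
hours = rng.normal(40, 10, n) + 4 * sex
y = (hours + 10 * sex + rng.normal(0, 5, n) > 48).astype(int)
X = pd.DataFrame({"sex": sex, "hours": hours})

full = LogisticRegression().fit(X, y).predict(X)
blind = LogisticRegression().fit(X[["hours"]], y).predict(X[["hours"]])

# Dropping "sex" improves parity, but the "hours" proxy keeps the
# gender-blind model's ratio below 1.
print(round(parity_ratio(full, sex), 2), round(parity_ratio(blind, sex), 2))
```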

This article addresses entry-level approaches to reducing bias by selecting models which are less biased yet still acceptably performant. However, the next articles in our Responsible AI series will address other approaches to tangibly reduce bias in models by removing sensitive features and in model post-processing to enforce a desired fairness scenario.

5. Choosing the Final Model

Having run the gender-blind models, we've come to the last step in the modeling process: how do we choose the final model to deploy and use? This requires a multidimensional analysis, balancing model performance against fairness.

Data science practitioners are used to discussing the trade-off between model performance, speed, and scalability. Now, with growing awareness of algorithmic fairness, we can also consider the trade-off between a model's accuracy and its level of discrimination. The acceptable balance should be decided at the organizational level, and tolerances for the chosen fairness metric should be set and tracked. In employment, the four-fifths rule[3] requires the result of a proportional parity test to have a greater-than-0.8 ratio across groups in your data. This is not prescriptive for all use cases, but it is an example of what a model's bias-tolerance threshold might look like. For some use cases, it may be sufficient to demonstrate a partial reduction in bias relative to historical practices, with the goal of incrementally improving outcomes for marginalized groups.
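The four-fifths check itself reduces to a one-line comparison (the group names and rates below are illustrative):

```python
def passes_four_fifths(selection_rates):
    """Adverse-impact check: every group's selection rate must be at
    least 80% of the highest group's rate (the four-fifths rule)."""
    top = max(selection_rates.values())
    return all(rate / top >= 0.8 for rate in selection_rates.values())

print(passes_four_fifths({"male": 0.194, "female": 0.087}))  # False
print(passes_four_fifths({"male": 0.20, "female": 0.17}))    # True
```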

For this modeling project with the Adult dataset, we compared the top-performing models that used "Gender" as a feature to a model that excluded it. In doing so, we observed:

Increasing fairness marginally decreases accuracy: The model that excludes gender (purple dot) was slightly fairer by representation (that is, through proportional parity). However, it was less accurate overall compared to the other four (which cluster together visibly in the chart).

All the models remain biased, because the skew in the data is pervasive: None of the models, including the one that excluded gender, achieved a proportional parity ratio greater than 0.8 (the green band on the chart). As such, further mitigation techniques are recommended.

Fairness metric and validation score (performance) for each candidate model.
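A selection policy consistent with the discussion above might look like the sketch below. The candidate names and numbers are hypothetical, not the project's actual leaderboard:

```python
# Hypothetical candidates: (name, validation accuracy, parity ratio).
candidates = [
    ("xgboost_full",  0.87, 0.42),
    ("gbm_full",      0.86, 0.44),
    ("logit_full",    0.85, 0.43),
    ("xgboost_blind", 0.84, 0.45),
]

def pick_final(candidates, min_ratio=0.8):
    """Prefer the most accurate model that clears the fairness
    threshold; if none does (as in this case study), fall back to the
    fairest one."""
    passing = [c for c in candidates if c[2] >= min_ratio]
    if passing:
        return max(passing, key=lambda c: c[1])
    return max(candidates, key=lambda c: c[2])

print(pick_final(candidates)[0])  # xgboost_blind
```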

Conclusions for a Data Scientist Practicing AI with DataRobot

Even the fairest model we tested, the gender-blind version, attained a proportional parity ratio of only slightly more than 0.4. But it is worth noting that trends in the underlying data are also contributing to the bias.

For example, more of the female data records indicate part-time rather than full-time employment, with women also overrepresented in clerical and administrative occupations when compared to men. The model is legitimately predicting that those characteristics tend to be associated with individuals who do not achieve high incomes. This process of considering the contribution of underlying features is a nuanced, yet important, step in interpreting model bias.

A different bias metric, called conditional proportional parity, would take that nuance into account in its evaluation of model performance. Conditional proportional parity seeks equivalent assignment of favorable outcomes when conditions, such as similar employment type or number of hours, are met. Many organizations might like to consider these different bias metrics when selecting their models, and track them over time to actively manage algorithmic impact.
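A minimal sketch of conditional proportional parity, under the assumption that we stratify on a single conditioning feature, follows. In the toy data below, each employment type treats the sexes alike, yet women are more often part-time, so unconditional rates differ while conditional parity is perfect:

```python
import pandas as pd

def conditional_parity(df, pred_col, group_col, condition_col):
    """Within each stratum of the conditioning feature, compare
    favorable-outcome rates across groups (min rate over max rate)."""
    out = {}
    for stratum, sub in df.groupby(condition_col):
        rates = sub.groupby(group_col)[pred_col].mean()
        out[stratum] = round(rates.min() / rates.max(), 2)
    return out

# Toy data: parity holds within each stratum, but women cluster in the
# lower-paid "part" stratum, dragging down their unconditional rate.
df = pd.DataFrame({
    "sex":   ["F", "F", "F", "M", "M", "M", "M", "M"],
    "hours": ["part", "part", "full",
              "part", "part", "full", "full", "full"],
    "pred":  [0, 1, 1, 0, 1, 1, 1, 1],
})
print(conditional_parity(df, "pred", "sex", "hours"))
# {'full': 1.0, 'part': 1.0}
```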

In conclusion, any evaluation of model bias is a multi-stakeholder and iterative process, and should be considered an integral part of the risk and impact assessment of any AI use case that may have an impact on humans. The approaches outlined in this article are simple ways to analyze the fairness of different models. We hope you will consider applying them to your own data science projects.

[1] Both the proportion of women in the dataset and the female-to-male high-income ratio differ from what we would expect: 33% women vs. roughly 50% in the population, and a 0.36 female-to-male high-income ratio vs. the 0.82 ratio of median salaries between women and men in the same professions in the US in 2019.

See the Equal Pay Day 2021 website for more details on US gender income disparity.

[2] A list of common protected attributes defined by the US EEOC includes Race, Color, Religion, Sex (including pregnancy, sexual orientation, or gender identity), National Origin, and Age. But each use case is different, and organizations may have varying definitions of protected groups and discriminatory bias.

[3] The four-fifths rule is a threshold set in US employment discrimination law for determining adverse impact. It states that if the selection rate for a certain group is less than 80 percent of that of the group with the highest selection rate, there is adverse impact on that group.


GAMMAscope - The Blog
