Measuring fairness in Machine Learning

Deepa Saw
Data Science at Microsoft
9 min read · Feb 14, 2023

By Deepa Saw and Dave Kooistra

What is fairness?

Fairness is commonly defined as the state of being fair (or equitable). But what’s fair can mean different things in different contexts to different people.

When considering fairness in AI, the fairness constraints to which a model is subjected can come from law, social science, philosophy, or other perspectives. These constraints commonly center on sensitive, legally protected attributes. Machine Learning researchers and practitioners want the model to perform as well as possible while also treating people “fairly” with respect to these sensitive attributes.

Overview of fairness in AI systems

We are “living in the age of AI.” It seems as if every week, there is a story about some new advancement in applying AI and Machine Learning to real-world problems. However, with all these opportunities come certain challenges. These challenges have received a lot of attention in the media and have highlighted how important it is to get AI right: How do we make sure that AI doesn’t discriminate, or further disadvantage already disadvantaged groups of people?

Researchers at Princeton University found that translating the sentences "He is a nurse." and "She is a doctor." into Turkish, a genderless language, and then back into English yielded the stereotypical (and in this case, incorrect) translations "She is a nurse." and "He is a doctor."

If you are interested in learning about more of these examples, please see the Microsoft Learn documentation, the Invisible Women book and podcast, the Weapons of Math Destruction book, and the Further Resources section of the Fairlearn website.

Fairlearn

Fairlearn is an open source, community-driven project to help data scientists improve the fairness of AI systems. It includes:

  • A Python library for fairness assessment and improvement (fairness metrics, mitigation algorithms, plotting, and so on).
  • Educational resources covering organizational and technical processes for unfairness mitigation (user guide, case studies, Jupyter notebooks, and more).

The project was started in 2018 at Microsoft Research. In 2021 it adopted a neutral governance structure and since then it has been completely community-driven.
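
Fairlearn is distributed on PyPI. Below is a minimal setup sketch; the package name and imports are real, while the data, feature names, and numbers in the later code sketches are purely illustrative.

# Install the library from PyPI
#   pip install fairlearn

# Core pieces used in the sketches later in this article
from fairlearn.metrics import MetricFrame, selection_rate
from fairlearn.reductions import ExponentiatedGradient, DemographicParity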

The Fairlearn Python module offers different metrics for evaluating fairness. In this article, we walk through examples for the following constraints:

  1. Demographic parity
  2. True Positive rate parity
  3. False Positive rate parity
  4. Equalized Odds
  5. Error rate parity
  6. Bounded group loss

Demographic parity

The Demographic parity constraint can be used with any of the mitigation algorithms to minimize disparity in the selection rate across sensitive feature groups.

Let us consider a scenario in which a model must predict whether a person will be hired. The Demographic parity constraint tries to ensure that the same proportion of positive predictions is made for each group of the sensitive feature, for example, gender.

A classifier h(X) satisfies Demographic parity for a sensitive feature A with groups a and b if:

P [ h(X) = 1 | A = a] = P [ h(X) = 1 | A = b]

A model predicting whether a man or a woman will be selected for an interview should not be biased by gender. In this case,

  • A is a gender in this example (sensitive feature)
  • Number of women (sensitive feature a) = 10
  • Number of men (sensitive feature b) = 50

The selection rate is the fraction of data points in each group classified as hired (labeled 1 in this binary classification), or mathematically: selection rate = (number of positive predictions in the group) / (number of data points in the group).

The selection rate of the model predicting Women being hired is: 2 / 10 = 20%

Meanwhile, the selection rate of the model predicting Men being hired is: 25 / 50 = 50%

The selection rates for women and men are different, hence Demographic parity is not satisfied. We can use the DemographicParity constraint from Fairlearn's reductions module to mitigate the issue for classification algorithms.
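
As a rough sketch of what this looks like in code, using synthetic hiring data (the feature names and numbers below are made up for illustration; only the Fairlearn and scikit-learn APIs are real):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from fairlearn.metrics import MetricFrame, selection_rate
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

# Synthetic data: 10 women and 50 men with a binary "hired" label.
rng = np.random.default_rng(0)
gender = np.array(["woman"] * 10 + ["man"] * 50)
X = pd.DataFrame({
    "years_experience": rng.integers(0, 15, size=60),
    "interview_score": rng.uniform(0, 10, size=60),
})
y = rng.integers(0, 2, size=60)

# Measure the selection rate per group for an unmitigated classifier.
clf = LogisticRegression().fit(X, y)
mf = MetricFrame(metrics=selection_rate,
                 y_true=y,
                 y_pred=clf.predict(X),
                 sensitive_features=gender)
print(mf.by_group)      # selection rate for "man" vs. "woman"
print(mf.difference())  # demographic parity difference between the groups

# Mitigate: search for a classifier whose selection rates are close across groups.
mitigator = ExponentiatedGradient(LogisticRegression(),
                                  constraints=DemographicParity())
mitigator.fit(X, y, sensitive_features=gender)
y_pred_mitigated = mitigator.predict(X)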

[Notebook reference for Demographic Parity]

True Positive rate parity

When we want to minimize disparity in the true positive rate across sensitive feature groups, we should use True Positive rate parity.

A classifier h(X) satisfies True Positive rate parity for a sensitive feature A with groups a and b if:

P[h(X) = 1| A = a, Y = 1] = P [ h(X)=1 | A = b, Y = 1]

Let us consider the same example we considered in Demographic parity. As a reminder:

  • A is a gender in this example (sensitive feature)
  • Number of women (sensitive feature a) = 10
  • Number of men (sensitive feature b) = 50
  • Y=1 is the state of being hired

The True Positive rate for men and women differs by 76 percent. We can mitigate this issue using TruePositiveRateParity from Fairlearn for classification algorithms.
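
A hedged sketch of how the disparity might be measured and mitigated, reusing the synthetic X, y, and gender arrays from the Demographic parity sketch above:

from sklearn.linear_model import LogisticRegression
from fairlearn.metrics import MetricFrame, true_positive_rate
from fairlearn.reductions import ExponentiatedGradient, TruePositiveRateParity

# True positive rate per gender for an unmitigated classifier
# (X, y, and gender are the synthetic arrays defined earlier).
clf = LogisticRegression().fit(X, y)
mf = MetricFrame(metrics=true_positive_rate,
                 y_true=y,
                 y_pred=clf.predict(X),
                 sensitive_features=gender)
print(mf.by_group)      # true positive rate for each group
print(mf.difference())  # gap in true positive rate between groups

# Mitigate the gap with the TruePositiveRateParity constraint.
mitigator = ExponentiatedGradient(LogisticRegression(),
                                  constraints=TruePositiveRateParity())
mitigator.fit(X, y, sensitive_features=gender)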

[Notebook reference for TruePositiveRateParity.]

False Positive rate parity

When we want to minimize disparities in the false positive rate across sensitive feature groups, we should use False Positive rate parity.

A classifier h(X) satisfies False Positive rate parity for a sensitive feature A with groups a and b if:

P [ h(X) = 1 | A=a, Y=0] = P [h(X) = 1 | A=b, Y=0]

Considering the same example as above:

  • A is a gender in this example (sensitive feature)
  • Number of women (sensitive feature a) = 10
  • Number of men (sensitive feature b) = 50
  • Y=1 is the state of being hired

The False Positive rate for men and women differs by 45.7 percent, which can be mitigated using FalsePositiveRateParity from Fairlearn for classification algorithms.
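
The code pattern mirrors the previous sketch; only the metric and constraint change (again reusing the synthetic X, y, and gender from above):

from sklearn.linear_model import LogisticRegression
from fairlearn.metrics import MetricFrame, false_positive_rate
from fairlearn.reductions import ExponentiatedGradient, FalsePositiveRateParity

# False positive rate per gender for an unmitigated classifier.
clf = LogisticRegression().fit(X, y)
mf = MetricFrame(metrics=false_positive_rate,
                 y_true=y,
                 y_pred=clf.predict(X),
                 sensitive_features=gender)
print(mf.by_group)      # false positive rate for each group
print(mf.difference())  # gap in false positive rate between groups

# Mitigate with the FalsePositiveRateParity constraint.
mitigator = ExponentiatedGradient(LogisticRegression(),
                                  constraints=FalsePositiveRateParity())
mitigator.fit(X, y, sensitive_features=gender)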

Equalized Odds

The Equalized Odds constraint can be used with any of the mitigation algorithms to minimize disparity in both the True Positive rate and the False Positive rate across sensitive feature groups. For example, in a binary classification scenario, this constraint tries to ensure that the True Positive rate and the False Positive rate are comparable across groups.

A classifier h(X) satisfies Equalized Odds for a sensitive feature A with groups a and b if:

P[h(X) = 1| A = a, Y = 1] = P [ h(X) = 1 | A = b, Y = 1] and

P[h(X) = 1 | A = a, Y = 0] = P [ h(X) = 1 | A = b, Y = 0]

For this example, we use a new scenario in which both sensitive feature groups have the same number of data points, which makes the rates easier to compare. In the example below, students from two different states (Karnataka and Delhi) are applying to different companies.

Equalized Odds is satisfied if qualified candidates from Delhi and Karnataka are equally likely to be hired, and unqualified candidates from Delhi and Karnataka are equally likely to be rejected.

For this example, we have:

  • Delhi and Karnataka are sensitive features a, b (A is state)
  • Y=1 is the state of being hired
  • Y=0 is the state of being rejected

For Karnataka:

For Delhi:

Note that although the percentage of qualified students getting hired is the same in Karnataka and Delhi, the percentage of unqualified students getting rejected is not. Hence, Equalized Odds is not satisfied. The EqualizedOdds constraint from Fairlearn can be used to mitigate the issue for classification algorithms.
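
A rough sketch for this scenario; the state labels, features, and numbers are illustrative, while equalized_odds_difference and EqualizedOdds are real Fairlearn APIs:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from fairlearn.metrics import equalized_odds_difference
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds

# Synthetic applications: the same number of candidates from each state.
rng = np.random.default_rng(1)
state = np.array(["Karnataka"] * 50 + ["Delhi"] * 50)
X = pd.DataFrame({
    "qualified": rng.integers(0, 2, size=100),
    "test_score": rng.uniform(0, 100, size=100),
})
y = rng.integers(0, 2, size=100)  # 1 = hired, 0 = rejected

clf = LogisticRegression().fit(X, y)

# Largest gap in either the true positive or false positive rate across states.
print(equalized_odds_difference(y, clf.predict(X), sensitive_features=state))

# Mitigate with the EqualizedOdds constraint.
mitigator = ExponentiatedGradient(LogisticRegression(), constraints=EqualizedOdds())
mitigator.fit(X, y, sensitive_features=state)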

[Notebook reference for Equalized Odds.]

Error rate parity

We can use this constraint with any of the reduction-based mitigation algorithms (such as exponentiated gradient and grid search) to ensure that the error for each sensitive feature group does not deviate from the overall error rate by more than a specified amount.

A classifier h(X) satisfies Error rate parity for a sensitive feature A with groups a and b, and target Y, if:

P [ h(X) ≠ Y | A = a] = P [ h(X) ≠ Y | A = b]

Let us again consider the example of women and men being hired. In this case:

  • A is a gender in this example (sensitive feature)
  • The number of women (sensitive feature a) = 10
  • The number of men (sensitive feature b) = 50

Because the error rates for the two sensitive feature groups differ substantially, this model does not satisfy Error rate parity. The ErrorRateParity constraint can be used to mitigate this for classification algorithms.
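
A sketch of the same measure-then-mitigate pattern for the error rate, reusing the synthetic X, y, and gender from the earlier hiring sketches:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame
from fairlearn.reductions import ExponentiatedGradient, ErrorRateParity

# Accuracy per gender; the per-group error rate is 1 - accuracy.
clf = LogisticRegression().fit(X, y)
mf = MetricFrame(metrics=accuracy_score,
                 y_true=y,
                 y_pred=clf.predict(X),
                 sensitive_features=gender)
print(1 - mf.by_group)  # error rate for each group

# Mitigate with the ErrorRateParity constraint.
mitigator = ExponentiatedGradient(LogisticRegression(),
                                  constraints=ErrorRateParity())
mitigator.fit(X, y, sensitive_features=gender)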

[Notebook reference for ErrorRateParity.]

Bounded group loss

We can use the bounded group loss constraint with any of the reduction-based mitigation algorithms to restrict the loss for each sensitive feature group in a regression model.

The bounded group loss requires the loss across all groups to be less than a user-specified value C.

A predictor h(X) satisfies Bounded group loss for a sensitive feature A with groups a and b, and target Y, if:

E [loss (Y, h(X)) | A = a] < C

E [loss (Y, h(X)) | A = b] < C

Let us take C to be 0.05 for this example. Before mitigation, one of the sensitive feature groups has a loss greater than 0.05.

After applying Bounded group loss, the loss for both sensitive feature groups is less than 0.05.
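
A hedged sketch for a regression model; the data below is synthetic, while BoundedGroupLoss, SquareLoss, and ExponentiatedGradient are real Fairlearn classes (SquareLoss(0, 1) assumes the target is scaled to the [0, 1] range):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from fairlearn.reductions import BoundedGroupLoss, ExponentiatedGradient, SquareLoss

# Synthetic regression data with two sensitive feature groups.
rng = np.random.default_rng(2)
group = np.array(["a"] * 50 + ["b"] * 50)
X = pd.DataFrame({"x1": rng.uniform(0, 1, size=100),
                  "x2": rng.uniform(0, 1, size=100)})
y = rng.uniform(0, 1, size=100)  # target scaled to [0, 1]

# Require the expected squared loss of each group to stay below C = 0.05.
constraint = BoundedGroupLoss(SquareLoss(0, 1), upper_bound=0.05)
mitigator = ExponentiatedGradient(LinearRegression(), constraints=constraint)
mitigator.fit(X, y, sensitive_features=group)
y_pred = mitigator.predict(X)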

Comparison of fairness constraints

For classification problems, we can consider Demographic parity, TruePositiveRateParity, FalsePositiveRateParity, Equalized Odds, and Error rate parity, and it is often good to use more than one of these metrics for the same model, as shown in the sketch below. For regression, we can use Bounded group loss to bound the loss across all sensitive groups.
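
As a sketch of how several metrics can be inspected together with a single MetricFrame (reusing X, y, gender, and the fitted classifier clf from the earlier sketches):

from sklearn.metrics import accuracy_score
from fairlearn.metrics import (MetricFrame, selection_rate,
                               true_positive_rate, false_positive_rate)

metrics = {
    "selection_rate": selection_rate,
    "true_positive_rate": true_positive_rate,
    "false_positive_rate": false_positive_rate,
    "accuracy": accuracy_score,
}
mf = MetricFrame(metrics=metrics,
                 y_true=y,
                 y_pred=clf.predict(X),
                 sensitive_features=gender)
print(mf.by_group)      # one row per group, one column per metric
print(mf.difference())  # largest between-group gap for each metric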

Which metrics to use depends on what the model is predicting. Demographic parity and Equalized Odds can be considered when comparing how predicted labels are distributed across the groups of a sensitive feature. When using Equalized Odds, we should make sure that each group has enough data points with both positive and negative true labels, so that the rates can be estimated reliably.

TruePositiveRateParity should be considered when a high True Positive rate matters, that is, when false negatives are costly. For example, in a COVID-19 test, a false negative means the test shows that a patient does not have COVID-19 when they actually do. In this case the patient may not get treatment and may get worse because their disease was undetected; a false negative is potentially a health risk.

FalsePositiveRateParity matters more when a low False Positive rate is important. For example, in a model predicting whether someone is guilty of a crime, it is generally considered preferable to make a false negative, where a guilty person is found innocent, than a false positive, where an innocent person is convicted.

Error rate parity should be used when the goal is to minimize the error rate across sensitive groups.

When adding fairness constraints, it is also important to remember that balancing accuracy and loss across sensitive groups can reduce the model's overall accuracy or increase its overall loss.

Conclusion

A model's accuracy alone does not mean that the model is fair. If a group has received systematically poorer reviews because of biased managers or worse workplace conditions, a model trained on those reviews might appear accurate even though the evaluations do not reflect the applicants' true potential. In such cases, Fairlearn can be used to measure and mitigate the bias.
