Algorithm Fairness Case Study

Lintao Ma
3 min read · Apr 6, 2019


A Study on the COMPAS Recidivism Dataset from ProPublica

While all four papers (Chouldechova; Hardt et al.; Kleinberg et al.; and Corbett-Davies et al.) talk about false positives and false negatives, their definitions of the positive instances differ slightly. In "Fair Prediction with Disparate Impact," Chouldechova defines positive instances to be those members of a group who are predicted to be likely to recidivate. Kleinberg et al. define it in a similar fashion in "Inherent Trade-Offs in the Fair Determination of Risk Scores" as the set of people who truly constitute positive instances; in the case of COMPAS, these would be the people who go on to recidivate. "Equality of Opportunity in Supervised Learning" by Hardt et al. defines it as those people whose credit is high enough to qualify for a loan, denoted by a prediction Ŷ = 1. "Algorithmic Decision Making and the Cost of Fairness" by Corbett-Davies et al. gives no explicit definition of the positive and negative classes; instead, its use of the COMPAS algorithm implies that a positive instance is a case where a person is classified as more likely to recidivate than not.

In reality, calculating FPR and FNR requires knowing the probabilities of classifier outputs conditioned on the eventual outcomes. In other words, it requires knowing things like "what fraction of loan applicants were denied among those who would have repaid if approved?" Such probabilities are hard to obtain empirically and in general may require randomized trials, yet no such experiments were performed in the COMPAS analysis. Chouldechova represents the positive prediction with the risk score S, treating a score S greater than some high-risk threshold s_HR as interchangeable with a positive prediction.
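
To make this concrete, here is a minimal sketch of how group-conditional FPR and FNR can be estimated once a cutoff s_HR turns the score S into a binary high-risk prediction, with an observed recidivism label standing in for the eventual outcome. The data, variable names, and the cutoff of 4 are all illustrative assumptions, not ProPublica's actual analysis code.

```python
import numpy as np

def group_error_rates(scores, outcomes, groups, s_hr=4):
    """Estimate FPR and FNR per group, treating score > s_hr as a positive
    (high-risk) prediction and `outcomes` as observed recidivism labels."""
    rates = {}
    for g in np.unique(groups):
        m = groups == g
        pred = scores[m] > s_hr             # predicted high risk
        actual = outcomes[m] == 1           # actually recidivated
        rates[g] = {
            "FPR": pred[~actual].mean(),    # P(pred = 1 | actual = 0)
            "FNR": (~pred)[actual].mean(),  # P(pred = 0 | actual = 1)
        }
    return rates

# Toy stand-in for decile scores (1-10), two-year recidivism, and group labels.
rng = np.random.default_rng(0)
scores = rng.integers(1, 11, size=2000)
outcomes = rng.binomial(1, scores / 10.0)   # higher score -> higher observed rate
groups = rng.choice(["majority", "minority"], size=2000)
print(group_error_rates(scores, outcomes, groups))
```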

Chouldechova points out that a score that is calibrated may not have predictive parity, illustrated by a figure in her paper in which the distributions of risk scores for the majority group deviate from those for the minority group.
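
This can be reproduced numerically. The sketch below is a constructed example, not the COMPAS data: both groups receive a perfectly calibrated score, since outcomes are drawn as Y ~ Bernoulli(S), yet because the score distributions differ, the PPV above a fixed cutoff differs between the groups.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_group(alpha, beta, n=100_000):
    """Calibrated by construction: scores S ~ Beta(alpha, beta), outcomes Y ~ Bernoulli(S)."""
    s = rng.beta(alpha, beta, size=n)
    return s, rng.binomial(1, s)

def ppv(s, y, cutoff=0.5):
    """P(Y = 1 | S > cutoff): precision of the high-risk call."""
    return y[s > cutoff].mean()

s_maj, y_maj = sample_group(2, 2)   # majority: scores centred around 0.5
s_min, y_min = sample_group(4, 2)   # minority: more mass at high scores
print("PPV, majority:", round(ppv(s_maj, y_maj), 3))
print("PPV, minority:", round(ppv(s_min, y_min), 3))   # noticeably higher
```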

In addition to the individual and group fairness criteria studied in the four papers, another interesting discussion covers ways to interpret the classifiers to ensure fairness, since machine learning algorithms often operate in a black-box fashion without much information being publicly available. Chouldechova and Hardt each provide ways to further improve the classifiers. Chouldechova proposes to: "(1) allow unequal false negative rates to retain equal PPVs and achieve equal false positive rates, (2) allow unequal false positive rates to retain equal PPVs and achieve equal false negative rates, (3) allow unequal PPVs to achieve equal false positive and false negative rates" (Chouldechova). Chouldechova focuses on the broader application: a context-specific framework that is more flexible and intuitive to understand. She argues that error rates act as a better benchmark for catching potential discrimination and should therefore be emphasized when scores inform real-life decisions in practice; this ensures that no specific group ends up with a significantly higher FPR or is systematically favored. Hardt, on the other hand, introduces the notion of oblivious measures: fairness criteria that depend only on the joint distribution of the predictor, the protected attribute, and the outcome, rather than on the classifier's features or internals.
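
To see what trading one criterion for another looks like in code, here is a toy construction (not the procedure from either paper, and it only handles the error-rate half of option (1)): pick a separate cutoff per group so that the FPRs roughly match, while the FNRs are left free to differ. The function name, the target FPR, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def equalize_fpr(scores, outcomes, groups, target_fpr=0.2):
    """Choose a group-specific cutoff (the (1 - target_fpr) quantile of that
    group's observed negatives) so each group's FPR is roughly target_fpr,
    then report the resulting per-group FPR and FNR."""
    report = {}
    for g in np.unique(groups):
        m = groups == g
        neg = scores[m][outcomes[m] == 0]          # assumes each group has negatives
        cutoff = np.quantile(neg, 1 - target_fpr)
        pred = scores[m] > cutoff
        actual = outcomes[m] == 1
        report[g] = {
            "cutoff": round(float(cutoff), 3),
            "FPR": round(float(pred[~actual].mean()), 3),    # ~ target_fpr by design
            "FNR": round(float((~pred)[actual].mean()), 3),  # free to differ
        }
    return report

rng = np.random.default_rng(2)
groups = rng.choice(["A", "B"], size=5000)
scores = rng.beta(np.where(groups == "A", 2, 4), 2)   # group B's scores skew higher
outcomes = rng.binomial(1, scores)
print(equalize_fpr(scores, outcomes, groups))
```

Because the cutoffs now differ by group, the PPVs and FNRs will generally not match across groups, which is exactly the trade-off the quoted options make explicit.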

Even though race as group membership isn't an input to the scoring function, the COMPAS score turned out to be well calibrated between groups. Considering that calibration depends on the context in which a score is built and used, this is not such a surprise. In fact, Lab 5 let us explore different scenarios of data manipulation through calibration, probing data that is only seemingly fair. One plausible way to ensure the scores are calibrated is calibration by group, applying a separate calibration mapping within each group when necessary.
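
One way to read "calibration by group" operationally is to fit a separate calibration map within each group. Below is a minimal sketch under that assumption, using scikit-learn's isotonic regression as an arbitrary choice of calibration map; the function names and synthetic data are made up for illustration.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def calibrate_by_group(raw_scores, outcomes, groups):
    """Fit one isotonic map per group: raw score -> estimated probability of the outcome."""
    calibrators = {}
    for g in np.unique(groups):
        m = groups == g
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(raw_scores[m], outcomes[m])
        calibrators[g] = iso
    return calibrators

def calibrated(calibrators, raw_score, group):
    """Score a new individual with their own group's calibration map."""
    return float(calibrators[group].predict([raw_score])[0])

# Toy usage: raw scores shifted between groups, true risk is sigmoid(raw score).
rng = np.random.default_rng(3)
groups = rng.choice(["A", "B"], size=3000)
raw = rng.normal(np.where(groups == "A", 0.0, 0.5), 1.0)
outcomes = rng.binomial(1, 1 / (1 + np.exp(-raw)))
maps = calibrate_by_group(raw, outcomes, groups)
print(calibrated(maps, 0.3, "A"), calibrated(maps, 0.3, "B"))
```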

The ProPublica case gave us an interesting and reassuring example of a system of checks in which professional data practitioners and scholars use their expertise to ensure that impactful machine learning algorithms meet fairness requirements.
