Algorithmic Bias and the Confusion Matrix Dashboard
How a Confusion Matrix Behaves Under Distributions of Prediction Scores
As algorithms increasingly make decisions about human affairs, it is important that these algorithms and the data they rely on be fair and unbiased. One of the diagnostics for algorithmic bias is the Confusion Matrix. The Confusion Matrix is a table that shows what kinds of errors are made in predictions. While everyone who works with data knows what a Confusion Matrix is, it is a more subtle matter to gain intuition for how it behaves under different kinds of distributions of predictions and outcomes and the range of possible decision thresholds.
In this article, I walk through an interactive Confusion Matrix Dashboard that you can play with to explore different data sets and prediction models, and watch how the Confusion Matrix behaves. You can load your own data. Two of the included data sets are purely synthetic distributions with knobs that you can adjust. Another data set contains synthesized examples that illustrate how algorithmic bias can be distinguished from ambient imbalances. By ambient imbalances, I mean that different groups can inherently hold different distributions of features that lead to different distributions of outcomes. I propose a novel measure for prediction bias, called the Positive Prediction Ratio Score (PPRS), that is independent of the Confusion Matrix, but instead compares curves of positive outcome ratios across the range of prediction scores.
The Confusion Matrix Dashboard also includes two sets of real data about serious matters, accompanied by a few prediction models. One model of interest is the COMPAS model that is used to predict criminal recidivism. The COMPAS model has come under fire for alleged algorithmic bias. This claim is based on the way that False Positive and False Negative rates show up in the Confusion Matrix for different racial groups. There is however no single consistent way to define algorithmic bias. The Confusion Matrix Dashboard allows us to explore ways that the underlying data distributions and prediction models can give rise to allegations of bias that might be misguided.
To accompany this article, I prepared some videos that walk through the main concepts. The 2-minute promo for this article is:
Algorithmic Bias and the Confusion Matrix Dashboard — Promo
Prediction Score
To appreciate the purpose of a Confusion Matrix, consider a process that produces a prediction score for a binary outcome variable. After assigning the score, we perform an experiment and make an observation. The observation outcome is tallied as either Positive or Negative. Doing this many times, we can build two distributions of outcomes as a function of prediction score, one distribution for the Positive outcomes, and one for the Negative outcomes.
The Confusion Matrix Dashboard allows you to experiment with two different kinds of interactive synthetic distributions. One synthetic distribution defines scores for Positive and Negative outcomes to fall in bumps, or “bell curves.” You can play with the height, width, and locations of the Positive and Negative score bumps. A second kind of distribution places the Positive and Negative outcome scores more uniformly along the prediction score axis. You can play with the rise and fall of these distributions as score increases.
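For readers who want to experiment outside the dashboard, here is a minimal Python sketch (my own, not the dashboard's code) of the dual unimodal idea: Positive and Negative outcome scores drawn from two adjustable bell curves, where the means, spreads, and sample sizes play the role of the knobs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "bumps": Negative outcomes centered lower, Positive outcomes centered higher.
neg_scores = rng.normal(loc=0.35, scale=0.10, size=3000)  # Negative outcome scores
pos_scores = rng.normal(loc=0.65, scale=0.10, size=2000)  # Positive outcome scores

# Clip to the prediction score range [0, 1] and bin for plotting.
bins = np.linspace(0.0, 1.0, 21)
neg_hist, _ = np.histogram(np.clip(neg_scores, 0, 1), bins=bins)
pos_hist, _ = np.histogram(np.clip(pos_scores, 0, 1), bins=bins)
print(pos_hist, neg_hist, sep="\n")
```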
Example: Apple Snacks
Here is a realistic made-up example. The example starts simple, but it builds up to allow us to evaluate concepts of algorithmic bias and fairness.
Suppose you are packaging snack boxes for 500 kids for a school picnic. Some of the kids like apples, while others prefer something else like popcorn, a cracker, or cookie. You have two kinds of snack box, one with an apple, and one with something else. You must decide in advance which kind of box to give to each kid, which you will then label with their name and hand out to them. When each kid opens their snack box, they will either be happy with their snack and say “Yay!”, or else they will be disappointed and say “Awwww”.
To help you decide, you predict which kind of snack each kid likes based on some rules of thumb. Older kids tend to like apples while younger kids do not. Taller kids like apples while shorter kids don’t. Kids in Mr. Applebaum’s class tend to prefer apples, while kids in Ms. Popcorn’s class want something else. There are no hard and fast rules, just educated guesses.
For each kid, you give them a point score indicating the prediction that they will want an apple. A score of 10 means you are quite sure they’ll want an apple, for example a taller 10-year-old in Mr. Applebaum’s class. A score of 1 means you’re confident they will not want an apple, like a shorter 6-year-old in Ms. Popcorn’s class.
For each kid, after calculating their prediction score, you make a decision. You might set the decision threshold at the mid-point score of 5. Or, you might set the apple snack threshold higher or lower. For example, if it’s important that the kids eat fruit, you’ll set the threshold lower so that you’ll catch more kids who prefer apples. Or, if you want to err on the side of having fewer apple slices discarded in the trash, then you’ll set the threshold higher, so fewer apple snacks are handed out. In other words, your decision depends on the tradeoffs for different kinds of errors.
At the picnic, you record kids’ reactions. You write down what prediction score you gave them, whether you gave them an apple or not based on the decision threshold, and what their reaction is. This is the payoff for your decision.
We tally outcomes in a table with rows corresponding to kids’ preferences and columns corresponding to your decision to give each one an apple snack or other snack. The Confusion Matrix counts the numbers and ratios for each quadrant of the table.
The raw counts table is in the upper left, while the ratio table is on the right.
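For concreteness, here is a short, generic Python sketch (not the dashboard's implementation) of how the four quadrants and the outcome-conditioned ratios can be tallied from prediction scores, actual outcomes, and a decision threshold.

```python
import numpy as np

def confusion_counts(scores, outcomes, threshold):
    """Tally the four quadrants for a binary decision at the given threshold.

    scores:   prediction scores; outcomes: 0/1 actuals (1 = wants an apple).
    """
    scores = np.asarray(scores)
    outcomes = np.asarray(outcomes).astype(bool)
    predicted = scores >= threshold               # decision: give an apple snack

    tp = int(np.sum(predicted & outcomes))        # apple given, apple wanted
    fp = int(np.sum(predicted & ~outcomes))       # apple given, not wanted
    fn = int(np.sum(~predicted & outcomes))       # no apple, but wanted one
    tn = int(np.sum(~predicted & ~outcomes))      # no apple, none wanted
    return tp, fp, fn, tn

def ratios(tp, fp, fn, tn):
    """Ratios conditioned on the actual outcome (the ratio table)."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0    # True Positive Rate (sensitivity)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0    # True Negative Rate (specificity)
    return dict(TPR=tpr, TNR=tnr, FPR=1 - tnr, FNR=1 - tpr)
```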
Assessing Performance
Your performance depends on two main things:
- How accurately your prediction matches what each kid wants. Ideally, your predictions would distribute all of the apple-wanting kids (red) on the right side, and all of the no-apple kids (green) on the left. That way you can put the threshold in the middle and correctly predict each kid’s preference.
- How many kids actually want apples or not. This is the Base Rate, the total proportions of red apple kids or green no-apple kids.
At this point, you might take a moment to play with the Dual Unimodal distribution synthesizer in the Confusion Matrix Dashboard. Arrange the Positive (red) and Negative (green) bumps with scores that keep the distributions fully separated. Then, make the decision harder by having the distributions overlap. Slide the threshold left and right to see how the counts and ratios in the Confusion Matrix change.
There are various ways of assessing your performance. You can focus on the proportion of kids you predicted correctly (TPR and TNR). (TPR: True Positive Rate is also known as sensitivity, and TNR: True Negative Rate is also known as specificity.) Or you can focus on the kids you got wrong (FPR: False Positive Rate and FNR: False Negative Rate). Or, you can focus on what proportion of the kids you predicted would want an apple actually did want one. This is called Precision. Precision, also known as Positive Predictive Value (PPV), is one of four terms referred to by Berk et al. as Conditional Use measures [1], shown in the bottom table.
Then, these measures can be combined further into aggregate measures with names like accuracy, F1 score, and MCC (Matthews Correlation Coefficient).
The ROC (Receiver Operating Characteristic) and Precision/Recall Curves show how some of the measures trade off with one another as the decision threshold changes. You can read all about these measures on Wikipedia and many other explanatory sites.
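If you want to trace these curves yourself, scikit-learn's roc_curve and precision_recall_curve sweep the threshold for you; the toy scores below are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score

rng = np.random.default_rng(0)
outcomes = rng.integers(0, 2, size=5000)                        # 0/1 actual outcomes
scores = np.clip(rng.normal(0.4 + 0.2 * outcomes, 0.15), 0, 1)  # Positives score higher

fpr, tpr, roc_thresholds = roc_curve(outcomes, scores)          # points on the ROC curve
precision, recall, pr_thresholds = precision_recall_curve(outcomes, scores)
print("AUC:", round(roc_auc_score(outcomes, scores), 3))
```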
Although here we consider only binary decisions, multi-class predictions produce Confusion Matrices with more rows and columns, and corresponding elaborations of the performance measures.
Prediction Bias
Consider two populations of kids, one of which overall prefers apples more (higher Base Rate), while the other prefers apples less (lower Base Rate). Then it turns out, mathematically, that if the Positive and Negative distributions overlap (are not completely separable by setting the decision threshold) then the Confusion Matrix entries for these two populations cannot be the same with respect to False Positive Rate, False Negative Rate, False Discovery Rate, and False Omission Rate [2] [3]. Yet these are all valid candidates for assessing fairness and bias in a decision process. In other words, there exists no single measure for determining whether a decision process is biased for or against predicting Positive versus Negative outcomes for one population versus another.
To more deeply appreciate this fact, I generated synthetic data for the Apple Snack decision problem. A kid’s preference for an apple snack versus other is taken to be a linear function of four attributes: age, height, class (teacher), and pet. Here is the formula:

score = c_age · age + c_height · height + c_class · class + c_pet · pet
Age ranges continuously from 6 to 10. Height ranges continuously from 40 to 60 inches. Class takes one of three categorical values, {Ms. Popcorn=0, Miss Fruitdale=1, Mr. Applebaum=2}. Pet takes one of four values, {turtle=0, fish=1, cat=2, dog=3}. The coefficients c provide weightings for the attributes. For simplicity, we’ll set all coefficients to 1.0, but normalize the attribute values to the range 0 to 1. That way, all attributes count equally. Prediction score is normalized to the range, 0 to 1, then treated as a probability that the kid prefers an apple snack (1.0) or other snack (0.0). For plotting graphs, prediction scores are divided into some number of bins. For this problem, 20 bins show the distribution curves nicely.
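Here is a small Python rendering of that scoring formula under the normalization just described; the helper name apple_score is mine, not from the project repo.

```python
def apple_score(age, height, klass, pet, coeffs=(1.0, 1.0, 1.0, 1.0)):
    """Linear prediction score for preferring an apple snack, normalized to [0, 1].

    age: 6..10 years; height: 40..60 inches;
    klass: 0=Ms. Popcorn, 1=Miss Fruitdale, 2=Mr. Applebaum;
    pet: 0=turtle, 1=fish, 2=cat, 3=dog.
    """
    # Normalize each attribute to [0, 1] so that all attributes count equally.
    features = [(age - 6) / 4.0, (height - 40) / 20.0, klass / 2.0, pet / 3.0]
    raw = sum(c * f for c, f in zip(coeffs, features))
    return raw / sum(coeffs)   # normalized score, treated as P(prefers apple snack)
```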
To generate synthetic kids, their prediction scores, and preference outcomes, we sample generator distributions for the four attributes. For the two continuous attributes, age and height, let’s use a linear generator distribution that can be uniform across the range, or else tilted toward the lower or upper end. If the generator distribution for age is tilted upward, then more older pretend kids will be generated than younger. Because preference for apple increases with age, then this will skew the distribution of prediction scores higher.
Similarly, for the two categorical attributes, we could say that all teachers are assigned with equal probability, or else differently. For example, if more kids are assigned to Mr. Applebaum, then that will skew the score distribution higher because Mr. Applebaum contributes a prediction score factor of 2 (before range normalization), compared to 0 for Ms. Popcorn.
Once a kid is assigned attributes by sampling from generator distributions, we can treat their prediction score as a probability for preferring an apple snack. Then, the sample is completed by flipping a biased coin to generate a pretend preference outcome for that kid, according to the probability. Because of statistical variability, the number of Positive and Negative outcome kids for each prediction score bin will vary from trial to trial. For this reason, in the examples, I sample 100,000 pretend kids, just to smooth out the statistics.
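Here is a simplified sketch of the sampling procedure. The tilted generator for the continuous attributes is my own stand-in (a mixture of uniform and triangular draws), the categorical attributes are sampled uniformly here, and the score follows the normalization above; the generator code in the repo [4] differs in its details.

```python
import numpy as np

rng = np.random.default_rng(0)

def tilted_uniform(n, tilt=0.0):
    """Linear-ish density on [0, 1]; tilt > 0 skews high, tilt < 0 skews low."""
    if tilt == 0.0:
        return rng.uniform(0, 1, n)
    tri = rng.triangular(0, 1, 1, n) if tilt > 0 else rng.triangular(0, 0, 1, n)
    return np.where(rng.uniform(0, 1, n) < abs(tilt), tri, rng.uniform(0, 1, n))

def sample_kids(n, tilt=0.0):
    age_n = tilted_uniform(n, tilt)                 # already normalized to [0, 1]
    height_n = tilted_uniform(n, tilt)
    klass_n = rng.integers(0, 3, n) / 2.0           # teacher sampled uniformly here
    pet_n = rng.integers(0, 4, n) / 3.0             # pet sampled uniformly here

    score = (age_n + height_n + klass_n + pet_n) / 4.0   # normalized prediction score
    prefers_apple = rng.uniform(0, 1, n) < score         # biased coin flip per kid
    return score, prefers_apple.astype(int)

scores, outcomes = sample_kids(100_000, tilt=0.5)
```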
The result is a preference distribution of Positive outcomes (prefer apple snack) and a distribution of Negative outcomes (prefer other snack). The Confusion Matrix Dashboard includes five different experimental conditions with different generator distributions:
Condition 1. Uniform: All attributes are sampled uniformly across their ranges. Due to the Central Limit Theorem, this generates a bell-shaped distribution of prediction scores. As a result of generating kids’ actual preferences based on prediction score, this leads to distributions of Positive outcomes and Negative outcomes that are somewhat spatially separated. Setting a threshold at the middle shows that this distribution achieves True Positive and True Negative rates of .64, and False Positive and False Negative Rates of .36. Precision and Negative Predictive Value are both .64. The ROC curve has area .7, which is generally regarded as acceptable predictive power, but far from perfect.
In order to explore issues of algorithmic bias, I separated the generator distributions by sex. Girls get one generator distribution for each of the four attributes, while boys may get a different one. This reflects the fact that, for whatever reason, it may turn out that girls have a different age range in the population, a different height range, are assigned differentially to the teachers, or have different preferences in pet. The Uniform experiment starts off by using identical generator distributions for boys and girls. Not surprisingly, their respective prediction distributions are therefore the same.
Condition 2. Skewed Same: All attributes are sampled with generator distributions skewed toward the high side. More older and taller kids are generated than younger and shorter. More kids are placed in Mr. Applebaum’s class than Ms. Popcorn’s. More kids have dogs as pets than turtles. As a result, the snack preference skews toward preferring an apple snack over other snack. In the Skewed Same condition, girls’ and boys’ generator distributions are skewed identically. Consequently, their resulting apple snack preference distributions are the same.
Condition 3. Skewed Opposite: Girls’ attributes are sampled with generator distributions skewed toward the high side, while boys’ are skewed lower, i.e., boys are younger, shorter, more often in Ms. Popcorn’s class, and have more turtles and fish instead of cats and dogs. Consequently, the snack preferences for girls skew toward apple while boys skew toward other snack. Overall, girls prefer apple at a Base Rate of 58%, while boys prefer other at that same rate.
What’s most interesting is to compare the four quadrants of the Confusion Matrix. (I’ll use approximate numbers to account for sampling noise.) With a threshold of 10, in the middle of the prediction score range, girls show a True Positive Rate of .79 and a True Negative Rate of only .46. This is because girls’ preferences are skewed to place more apple-preferring girls (red) to the right of the preference score midpoint. For boys it’s the reverse. Similarly, girls show a False Positive Rate of .54 because slightly more than half of their green (other-snack) preference falls to the right of the threshold. Their False Negative Rate is only .21. Again, for boys this situation is reversed. Interestingly, Precision (PPV) and NPV are closer, although not identical.
One might look at these Confusion Matrices and decide that the prediction scoring and decision process is biased. Whether biased for or against girls depends on whether you like apples or not, I suppose. There is, however, a way to bring the four quadrants, TNR, FPR, FNR, and TPR, into congruity. That is to adjust the decision threshold for girls upward, and the decision threshold for boys downward, until they are separated by three bins. For example, set the girls’ threshold to 12 and the boys’ threshold to 9. As you make these adjustments in the Confusion Matrix Dashboard, you can see the decision points change on the respective ROC curves.
Due to the mathematical constraints mentioned previously, this adjustment brings the Precision (PPV) more out of alignment, and along with it the other Conditional Use measures, NPV, FDR, and FOR. You cannot have it all. You can, however, adjust the thresholds to bring the four Conditional Use terms into closer alignment if you want to, at the cost of driving girls’ and boys’ four Positive and Negative ratio measures further apart.
Generally, scholars and policy experts throw up their hands and say that since there is no single unified measure for algorithmic bias to be found in the Confusion Matrix, the determination has to be based on policy. Namely, what kinds of discrepancies in performance metrics between populations are acceptable and what are not?
Undoubtedly, policy-based tradeoff decisions are essential. But as a technical matter, in assessing predictive bias, we can do better than to rely on the Confusion Matrix alone.
Motivated by the Apple Snack experiments, let’s consider a different approach to judging whether prediction scores are biased or not. This approach focuses on the consistency of the scores to predict outcomes. For a given prediction score bin, do outcomes show girls and boys to prefer apple snack or other snacks in equal proportion, regardless of how many girls or boys happen to have that score? This is known as calibration fairness [2]. The property derives from a value called Positive Prediction Ratio, which is simply the proportion of Positive outcomes at each bin. Across the range of prediction scores, the Positive Prediction Ratio traces out a curve, drawn in tan color in the Confusion Matrix Dashboard.
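Computing the Positive Prediction Ratio curve amounts to binning the prediction scores and taking the proportion of Positive outcomes in each bin. A minimal sketch:

```python
import numpy as np

def positive_prediction_ratio(scores, outcomes, n_bins=20):
    """Per-bin proportion of Positive outcomes, plus the per-bin counts."""
    scores = np.asarray(scores)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(scores, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    positives = np.bincount(idx, weights=outcomes, minlength=n_bins)
    ratio = np.divide(positives, counts, out=np.full(n_bins, np.nan), where=counts > 0)
    return ratio, counts
```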
Positive Prediction Ratio Score
Under the Skewed Opposite generator distributions, girls’ and boys’ Positive Prediction Ratio curves are the same. Click on the checkbox under the PPRS label to overlay and compare them. This means that the prediction scores accurately reflect the preferences of the simulated kids, regardless of whether the kid is a member of a population that prefers apple snacks or other snacks.
This observation motivates a summary measure for bias based on the differences between the Positive Prediction Ratio curves for two populations. Specifically, compute a function of the area between the two curves. There are numerous ways to do this. I chose a very simple one:
This is the sum, over prediction score bins, of the squared difference in ratios r of positive outcomes to all outcomes between two populations, 1 and 2, weighted and normalized by a factor. The weighting factor is the minimum number of counts in each bin, between the two populations. The motivation for this weighting is that probabilities can vary widely under sample noise when there are few samples. A difference in Positive Prediction Ratio is not as significant when either of the populations places few samples in a bin. The PPRS measure is subject to statistical noise which depends on sample size per bin.
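Here is one plausible implementation of that description. The exact weighting and normalization used by the dashboard may differ from my reading, so the absolute scale of the result may not match the PPRS values quoted below, but the structure — squared per-bin differences weighted by the smaller of the two bin counts — is as described.

```python
import numpy as np

def pprs(ratio_1, counts_1, ratio_2, counts_2):
    """Positive Prediction Ratio Score between two populations.

    ratio_k:  per-bin Positive Prediction Ratio for population k
    counts_k: per-bin sample counts for population k
    """
    weights = np.minimum(counts_1, counts_2).astype(float)  # low-count bins count less
    valid = (weights > 0) & ~np.isnan(ratio_1) & ~np.isnan(ratio_2)
    if not np.any(valid):
        return 0.0
    sq_diff = (ratio_1[valid] - ratio_2[valid]) ** 2
    # Normalizing by the total weight is an assumption on my part.
    return float(np.sum(weights[valid] * sq_diff) / np.sum(weights[valid]))
```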
The PPRS for the Skewed Opposite condition is .01. Generally, in my estimation, a score below .2 indicates relatively aligned Positive Prediction Ratio curves and can be considered unbiased, while a score above around .6 indicates some significant deviation in the curves, and suggests prediction bias.
What does prediction bias look like? The next two generator distributions provide examples.
Condition 4. Skewed Opposite Bias Girls: Like the Skewed Opposite experiment, generator distributions are set up to skew girls’ snack preferences toward the high side, while boys’ are skewed lower. But this time, girls’ prediction scores are uniformly shifted downward by 0.1 (leftward by 2 bins).
Note how in this condition the girls’ and boys’ preference distributions superficially appear to be more similar to one another than in Condition 3. The green and red curves are more closely aligned. But by shifting girls’ prediction scores, the ratio of Positive outcomes to the number of kids in each bin is now very different between boys and girls. Girls will get fewer apples than they prefer. This discrepancy is reflected in the Positive Prediction Ratio Curves, which are now visibly separated. The gap between them delivers a PPRS of 1.0, which is a significant indicator of predictive bias.
Condition 5. Uniform Randomize Girls: This experiment starts with identical uniform generator distributions for girls and boys, the same as in Condition 1. Except this time, half of the girls are randomly selected and assigned a random prediction score. You can see that the girls’ Positive and Negative outcome distributions (red and green curves) are flattened. The ROC AUC (Area Under Curve) drops, and fewer girls get their preferred snack. The PPRS is a relatively high 0.75.
Python code for producing Apple Snack samples using your own versions of generator distributions is available in the GitHub repo for this project [4].
Real World Data: Titanic Survival
We can examine real data in the Confusion Matrix Dashboard. A well-known data set lists passengers who survived versus perished in the sinking of the Titanic in 1912. For each passenger, the data set tabulates features such as age, sex, family size, fare class, and ticket price. The Titanic survival prediction task is a starter project on Kaggle.com, and accuracies in the 80% range are typical. For this data set I built two models; their predictions are included in the Confusion Matrix Dashboard. A simple model is a Logistic Regressor, which calculates a weight factor for each of the possible feature values (after some simple feature engineering). A more sophisticated Machine Learning algorithm is a Gradient Boosting Regressor. This uses an ensemble of decision trees and can take into account complex nonlinear interactions among observed features. The GBR model performs just slightly better than the Logistic Regressor for this data set and features, with an AUC of .86 and best f1 score of .78 (at decision threshold = 4).
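As a starting point for readers who want to try this themselves, here is a hedged scikit-learn sketch. It assumes the Kaggle train.csv file and uses a deliberately minimal set of engineered features, so it will not reproduce the dashboard models' exact numbers.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Kaggle-style Titanic training file with Survived, Pclass, Sex, Age, SibSp, Parch, Fare.
df = pd.read_csv("train.csv")
df["Sex"] = (df["Sex"] == "female").astype(int)
df["Age"] = df["Age"].fillna(df["Age"].median())
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
features = ["Pclass", "Sex", "Age", "FamilySize", "Fare"]

X_tr, X_te, y_tr, y_te = train_test_split(
    df[features], df["Survived"], test_size=0.3, random_state=0)

logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gbr = GradientBoostingRegressor().fit(X_tr, y_tr)

print("Logistic AUC:", round(roc_auc_score(y_te, logit.predict_proba(X_te)[:, 1]), 3))
print("GBR AUC:     ", round(roc_auc_score(y_te, gbr.predict(X_te)), 3))
```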
The Titanic models show a number of interesting properties. The survival rate for males was 19% while for females it was 74%. Women and children to the lifeboats first! Accordingly, the prediction models display two widely separated bumps. Looking at male and female data subsets, the green Negative outcome bump occurs in the male subpopulation while the red Positive outcome bump occurs with females. At the midpoint decision threshold of 5, the True Positive rate for females (predicted to survive, and did) and the True Negative rate for males (predicted not to survive, and didn’t) are both very high. The False Positive and False Negative rates for male and female passengers are highly asymmetric (FNR = .79 (m) / .1 (f); FPR = .02 (m) / .58 (f)). This reflects that the large majority of males are predicted a lower probability of survival while the large majority of females are predicted to survive, even though there are outliers in both directions. With the relatively simple set of passenger features provided, the models are unable to reliably predict outliers. The Positive Prediction Ratio Curves and PPRS are not informative under the male/female breakdown of these Titanic survival models, because the outcome distributions have very little overlap, leaving very sparse statistics for comparing Positive Prediction Ratios in the middle-range score bins.
Real World Data: Broward County Recidivism
In May 2016, ProPublica published a study entitled “Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks.” [5]. The article assesses an algorithm called COMPAS which is used to predict recidivism in criminal cases. Accompanying the article is a GitHub repository containing both the data and the data analysis methods used to assess the COMPAS algorithm’s performance.
The ProPublica article makes the following claim:
“We also turned up significant racial disparities, just as Holder feared. In forecasting who would re-offend, the algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways.
- The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.
- White defendants were mislabeled as low risk more often than black defendants.”
Claims about algorithmic bias are concerning and this article has been heavily cited. After all, as a matter of fairness, we might expect errors in both directions to be equally distributed under an unbiased algorithm. ProPublica makes its allegations based on differences in the False Positive Rates and False Negative Rates in the Confusion Matrices for Black and White subpopulations.
However, we have seen from the Apple Snacks example that interpretation of bias or unfairness is not straightforward. There is a strong interaction between Base Rates, the Positive and Negative rates in the Confusion Matrix, and the Conditional Use Measures (the most well-known of which is Precision (PPV), but which also include Negative Predictive Value (NPV), False Omission Rate (FOR), and False Discovery Rate (FDR)). The tradeoffs among these have been discussed at length in the Criminology, Statistics, and Computer Science literature, and the fairness of the COMPAS algorithm was hotly debated [1, 6, 7, 8, 9]. The Apple Snacks examples show that differences in the distributions of preferences, or probability of outcomes, among different groups, are not in and of themselves indicators of prediction bias. Collectively, girls attending a picnic may for whatever reason actually prefer apple snacks more than boys do.
We can use the Confusion Matrix Dashboard to view the Broward County Recidivism data and test how different decision thresholds affect the Confusion Matrix and its derivative scores. Broward County’s COMPAS decision threshold is set at decile 4. The ProPublica data set provides not just recidivism predictions resulting from the COMPAS algorithm, but the actual prediction scores, which are called “decile scores” because COMPAS reports its predictions on a scale from 1 to 10. This information allows us additionally to compare Positive Prediction Ratio Curves between subpopulations, and calculate the PPRS (Positive Prediction Ratio Score).
Examining the Data Source, Broward Recidivism — COMPAS model, in the Confusion Matrix Dashboard, we verify that, indeed, at a threshold of 4, the False Positive Rate for Black defendants is .42 while for White defendants it is only .22. And the False Negative Rate for White defendants is .5 while for Black defendants it is .28.
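These numbers can be checked directly against ProPublica's published data. The sketch below assumes their compas-scores-two-years.csv file with its race, decile_score, and two_year_recid columns; whether the decision cut treats a given decile threshold as ">" or ">=" is my assumption and should be matched to the dashboard's convention.

```python
import pandas as pd

df = pd.read_csv("compas-scores-two-years.csv")   # ProPublica's published data file

def error_rates(group, threshold=4):
    # Assumption: scores strictly above the threshold decile count as predicted-positive.
    predicted = group["decile_score"] > threshold
    actual = group["two_year_recid"] == 1
    fpr = (predicted & ~actual).sum() / max((~actual).sum(), 1)
    fnr = (~predicted & actual).sum() / max(actual.sum(), 1)
    return round(fpr, 2), round(fnr, 2)

for race in ("African-American", "Caucasian"):
    print(race, "FPR/FNR:", error_rates(df[df["race"] == race]))
```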
This discrepancy must be evaluated in light of the fact that the Base Rate for Black defendants is .52, while for White defendants it is .39. This is reflected in the greater volume of red (Positive) bars in the stacked histogram for Black defendants. However, similar to what we saw in Apple Snack Experimental Condition 3, the discrepancy can be removed by adjusting the decision threshold for Black defendants upward to 5, and for White defendants downward to 3. This aligns the two populations’ TPR/FPR point on the ROC curve, although it further exacerbates the discrepancy in Precision and Negative Predictive Value. And obviously, setting different thresholds for different protected groups would be considered unfair.
The question becomes, is the discrepancy in FPR and TPR due to predictive bias, or instead due to differences in ambient characteristics between the two subpopulations? Notably, the COMPAS algorithm assigns prediction scores to the two groups consistently with respect to their actual recidivism outcomes. The Positive Prediction Ratio Curves are pretty well aligned, and the PPRS measure of the gap between them obtains a relatively small value of 0.17, indicating good calibration accuracy.
The COMPAS algorithm assigns prediction scores with a relatively even distribution across the decile range of 1–10; unlike the Apple Snacks synthetic data, the distributions are not bell shaped. The most apparent difference between Black and White defendant populations, apart from Base Rate, is that Black Defendants are assigned approximately uniformly, while White Defendants are found to be more heavily represented at lower decile scores, decreasing steadily as decile score increases. You can approximate these distributions using the Approximately Linear Data Source dropdown of the Confusion Matrix Dashboard, and see how the terms in the Confusion Matrix respond.
Independent Predictive Models for Broward County Recidivism
Another way to investigate the Recidivism data set is to build an independent model that is guaranteed to exclude race as a factor. Suppose we observe that recidivism is correlated with race or sexual orientation. Then these factors will in fact be predictors of recidivism, on a statistical basis. But in a fair society, we agree that individuals should be judged on their merits and not on the basis of identity classes they happen to belong to.
ProPublica reports that at the time that most defendants are booked in jail, they respond to a COMPAS questionnaire. This information is fed into the COMPAS algorithm. It is unclear exactly what information is used by the algorithm. If race or other protected class factors are used, then that could be unfair.
To eliminate this possibility, we can extract only data that is straightforwardly based in the criminal record, and build a model from that. As with the Titanic data, I built a Linear Regression model and a Gradient Boosting Regressor model. For comparison, both are included in the Confusion Matrix Dashboard. I extracted only the following features from the data records:
- age
- juv_fel_count
- juv_misd_count
- juv_other_count
- priors_count
- c_charge_degree
- c_charge_desc
These features pertain to age, prior convictions, and charge type. In this data set, only one charge was entered at booking time, so the raw c_charge_degree and c_charge_desc features have one and only one value.
In order to employ features like this in an algorithm, data scientists convert them into numbers that are likely to be correlated with the outcome we want to predict. Therefore, I did some feature engineering. I grouped raw age into five age buckets: {18–22, 23–33, 34–50, 51–59, 60+}. I grouped the count features into buckets of four or five categories. For the charge description, I created a “one-hot” vector. I grouped all of the drug-related charges into one combined charge. All of the charge descriptions that had fewer than 10 instances I grouped into a charge category called “Other”. Then for each of the categories, I assigned a 1 if that charge was reported in the raw ‘c_charge_desc’ feature, and 0 otherwise. Finally, I included one numeric feature that is the average of the recidivism rate for the applicable charge description.
To train a model, we note the observed value of the reported variable, ‘two_year_recid’. The outcome value (or dependent variable) is given a 1 if the defendant recidivated within two years, and a 0 if they did not.
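Here is a condensed sketch of this feature engineering and training, again assuming ProPublica's compas-scores-two-years.csv. The bucket edges are illustrative, the drug-charge grouping and the train/test split are omitted, and the details differ from the models in the repo [4].

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("compas-scores-two-years.csv")

# Bucket age and prior counts (bucket edges here are illustrative).
df["age_bucket"] = pd.cut(df["age"], [17, 22, 33, 50, 59, 200], labels=False)
df["priors_bucket"] = pd.cut(df["priors_count"], [-1, 0, 2, 5, 10, 1000], labels=False)
df["charge_felony"] = (df["c_charge_degree"] == "F").astype(int)

# Group rare charge descriptions into "Other", then one-hot encode.
df["charge"] = df["c_charge_desc"].fillna("Other")
charge_counts = df["charge"].value_counts()
rare = charge_counts[charge_counts < 10].index
df["charge"] = df["charge"].where(~df["charge"].isin(rare), "Other")
charge_dummies = pd.get_dummies(df["charge"], prefix="charge")

# Average recidivism rate for the applicable charge description.
df["charge_recid_rate"] = df.groupby("charge")["two_year_recid"].transform("mean")

X = pd.concat([df[["age_bucket", "priors_bucket", "juv_fel_count", "juv_misd_count",
                   "juv_other_count", "charge_felony", "charge_recid_rate"]],
               charge_dummies], axis=1)
y = df["two_year_recid"]              # 1 if the defendant recidivated within two years

model = GradientBoostingRegressor().fit(X, y)
prediction_scores = model.predict(X)  # continuous prediction scores
```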
At prediction time, the independent features are fed into the model to produce a prediction score. Below are the results for the GBR model, as displayed in the Confusion Matrix Dashboard. The prediction distributions are shaped differently from the COMPAS model’s. But the disparity in False Positive Rate and True Positive Rate between Black defendants and White defendants persists. And as with the COMPAS model, these Confusion Matrix entries can be brought into congruence by choosing different decision thresholds for the two subpopulations (which would be considered unfair). Finally, as with the COMPAS model, the Positive Prediction Ratio Curves are well aligned, and the PPRS measure of predictive bias is low, hovering at about 0.09. Again, this is with models that exclude race and sex as features.
One concern with predictive models is known as data leakage. Even though race, sex, or other protected features may be formally excluded from consideration, these factors can sometimes be inferred from other factors. For example, zip code redlining has notoriously been used as a proxy feature for race in setting insurance rates.
In the case of the synthetic Apple Snacks data with different feature generator distributions for girls and boys, one might interpret these seemingly benign features of age, height, teacher, and pet, as backchannel indicators for a kid’s sex. It doesn’t matter. Even if you include sex directly as a feature, the shapes of girls’ and boys’ distributions remain the same. The way we achieved predictive bias in Apple Snack conditions 4 and 5 was not by introducing sex factors in the shaping of prediction distributions, but by perturbing prediction scores from what they should be.
In the case of Broward Recidivism data, it is possible that the features used such as age, priors, charge severity, etc. reflect racial biases in the criminal justice system or in society. We cannot address these factors here, and we draw no conclusions about the fairness of the underlying data or the criminal justice system as a whole. Here we’re just focusing on predictions of future actual recorded recidivism, based on the data set collected.
My conclusion is that the Broward prediction algorithms are not biased by virtue of their differential distributions of prediction scores between these two subpopulations. To shift defendants elsewhere along the prediction score axis while preserving the Base Rate (the total counts of recidivating (red) and non-recidivating (green) defendants) would distort the respective Positive Prediction Ratio Curves and increase the PPRS measure of predictive bias. The more correct interpretation is that, according to the way data features and outcomes were captured, the Black Defendant subpopulation inherently exhibits a greater proportion of characteristics that place them toward higher prediction scores. Disparities in FPR, TPR, and Conditional Use measures are inescapable consequences of that.
Conclusion
It would be good to escape the conundrum of choosing among incompatible and perplexing measures of alleged bias based solely on the Confusion Matrix and its derived metrics. In this article I propose a new measure for predictive bias, the Positive Prediction Ratio Score (PPRS). This compares more directly the agreement of outcomes as a function of prediction score between subpopulations, regardless of the decision threshold, which governs the terms in a Confusion Matrix. The PPRS measure aligns with calibration accuracy, and it does not attempt to rectify ratio discrepancies that arise from choosing among decision thresholds when populations have different base rates and distributions. Under this measure, the observed differences between racial groups in the Broward Recidivism data are due to differences in the ambient distributions of their feature characteristics, and are not attributable to bias on the part of any predictive algorithm.
The issue of fair and appropriate use of algorithms and data is large and difficult. Society functions through myriad data collection and decision-making processes. Long before the arrival of the Information Age, rules, guidelines, and procedures evolved to achieve systematic outcomes in economics and governance. In this sense, when humans follow established policies, they are executing a form of algorithmic decision-making. Human judgments carry the benefits and costs of subjectivity. In addition to strict rules and formal guidelines, people are able to take into account factors that are not easily captured by numbers and ledgers. Consequently, they may consciously and unconsciously weigh factors that fall outside legitimate bounds. They bring empathy and insight, resentments and grudges.
One promise of formalized data collection and algorithms is simply efficiency and scale. The number of factors and the speed and efficiency at which they can be processed far surpasses the capacity of trained professionals, bureaucrats, and agents.
A second promise of algorithms is that they allow control of factors and uniformity in decision processes. The input variables may not encode human backstories that might sympathetically sway a decision in one direction or another. But they also exclude factors that can sway decisions based on unjust prejudices. The impact of formalized algorithms must be assessed at a systemic level. This includes understanding how data is collected, how features are extracted from the data, and how algorithmic decisions are made.
Because computational algorithms employ technologies that are unfamiliar to ordinary people, and sometimes, processes that are difficult for even experts to fully understand or explain (as in the case of some Machine Learning algorithms), it is natural and appropriate that they be subject to suspicion and scrutiny. When algorithms make bad decisions due to biased data or improper design, they must be called out. Conversely, algorithms should not be maligned for correctly calculating outcomes that reflect unpleasant properties of their systemic environments. To do so would remove a powerful tool that is capable of removing bias and prejudice, and increasing fairness in society.
The Confusion Matrix is a relatively simple concept, whether prediction scores are assigned by machine algorithm or some other means. But its behavior under different types of data is subtle and complex. Through the interactive Confusion Matrix Dashboard, we intend to make the behavior of both algorithmic and human predictions and decisions more comprehensible and transparent.
References
[1] R. Berk, H. Heidari, S. Jabbari, M. Kearns and A. Roth, “Fairness in Criminal Justice Risk Assessments: The State of the Art,” Sociological Methods & Research, pp. 1–42, 2017.
[2] J. Kleinberg, S. Mullainathan and M. Raghavan, “Inherent Trade-Offs in the Fair Determination of Risk Scores,” in 8th Conf. on Innovations in Theoretical Computer Science (ITCS), 2017.
[3] S. Goel, E. Pierson and S. Corbett-Davies, “Algorithmic decision making and the cost of fairness,” in 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017.
[4] E. Saund, “saund/algorithmic-bias,” 2020. [Online]. Available: https://github.com/saund/algorithmic-bias.
[5] J. Angwin, J. Larson, S. Mattu and L. Kirchner, “Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks,” May 2016. [Online]. Available: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing/.
[6] A. Flores, K. Bechtel and C. Lowenkamp, “False Positives, False Negatives, and False Analyses: A Rejoinder to “Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And it’s Biased Against Blacks.”,” Federal probation, vol. 80, 2016.
[7] W. Dieterich, C. Mendoza and T. Brennan, “COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity,” 8 July 2016. [Online]. Available: https://www.documentcloud.org/documents/2998391-ProPublica-Commentary-Final-070616.html.
[8] J. Angwin and J. Larson, “ProPublica Responds to Company’s Critique of Machine Bias Story,” 29 July 2016. [Online]. Available: https://www.propublica.org/article/propublica-responds-to-companys-critique-of-machine-bias-story.
[9] J. Angwin and J. Larson, “Bias in Criminal Risk Scores Is Mathematically Inevitable, Researchers Say,” 30 December 2016. [Online]. Available: https://www.propublica.org/article/bias-in-criminal-risk-scores-is-mathematically-inevitable-researchers-say.