TL;DR: Achieving calibration at the group level is easy. This does not necessarily imply, however, that any group is homogeneous, or that group-level probabilities apply to the individuals in each group.
1. Predicting recidivism
By now, surely everyone who is interested in fairness and transparency in machine learning is familiar with the ProPublica story about algorithms used in criminal justice. To summarize very briefly: there is a widely-used tool called COMPAS that is used to score people who have been convicted of a crime in terms of their risk of recidivism. By pulling the records of around 10,000 people (before filtering), ProPublica found that although the algorithm was equally accurate across racial groups at predicting who would be re-arrested, it tended to produce many more false positives among African Americans, and many more false negatives among whites. There are many interesting related threads to be followed up on, including a proof of why this sort of difference at the group level is practically inevitable, similar work at a much larger scale specifically focused on “failure to appear”, and many explorations of the meaning of fairness. Here, however, I want to focus on a different aspect of this story, namely: how should we think about the performance of these kinds of systems at both the group and the individual level?
To begin, we should note, as ProPublica did, that if we treat it as a classifier, the overall accuracy of the COMPAS system is quite bad. Using the scores from the system to divide people into low, medium, and high-risk categories, ProPublica converted these classes into a prediction about recidivism.¹ They then applied this to their sample, and checked to see who had in fact been re-arrested within 2 years. This procedure results in an overall accuracy of approximately 67%.
Just think about that for a second: it is basically saying that for every two people that the system is correct about, it will be wrong about one person. Moreover, accuracy by itself doesn’t directly tell us what kinds of mistakes the system is making; in this case it is a combination of both false negatives and false positives, with the racial discrepancies noted above.
In the meantime, there has been some pushback about whether this is an appropriate way to interpret the scores from the system (which I will revisit below), but ProPublica’s results are nevertheless within the ballpark of results from a previous study by the founder of Northpointe.
Further, if we look at the overall proportion of people who were re-arrested in their sample population (excluding things like traffic offenses), that number is about 46%. Thus, simply predicting that no one would be re-arrested would achieve an accuracy of 54%. For women, the overall rate of re-arrest is much lower, so the most-common-class prediction for women would result in an accuracy of 65% for that group. Indeed, it turns out that a very simple set of features can be used to produce a classification decision that achieves a similar level of performance. As Dressel and Farid reported, a shallow decision tree based on just the person’s age and their number of prior arrests is sufficient to match that level of accuracy. (There is almost certainly some amount of overfitting here, but the point is that the accuracy of the system is comparable to a very simple baseline.)
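The baseline arithmetic above is easy to reproduce. The short Python sketch below assumes only the approximate re-arrest rates quoted in the text (46% overall, and roughly 35% for women, which is the rate implied by the 65% figure); the function name is an illustrative choice, not anything from ProPublica's analysis.

```python
def majority_class_accuracy(positive_rate):
    """Accuracy obtained by always predicting the more common outcome."""
    return max(positive_rate, 1.0 - positive_rate)

# Approximate rates taken from the text above.
overall = majority_class_accuracy(0.46)  # always predict "not re-arrested"
women = majority_class_accuracy(0.35)    # rate implied by the 65% baseline

print(f"overall baseline accuracy: {overall:.2f}")
print(f"baseline accuracy for women: {women:.2f}")
```

Seen this way, the reported 67% accuracy should be compared against these baselines rather than against 50%.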
In a seemingly hastily-written response, Flores et al. push back against the claims of ProPublica, arguing that the system is actually not biased, as evidenced by the extent to which it is well-calibrated, and similarly calibrated across groups. Moreover, they assert that “COMPAS was not made to make absolute predictions about success or failure. Instead, it was designed to inform probabilities of reoffending across three categories of risk (low, medium, and high).” In other words, Flores et al. don’t think that we should convert these probabilities into predictions or talk about accuracy. Rather, they claim that the system is well-calibrated, and that if a person is assigned to one of the three risk categories, we can treat them as having a certain level of risk.
In principle, this sounds good: a reliable probability is better than a prediction, and if the system is well-calibrated, then it seems like we should be able to trust these probabilities. Indeed, there are certainly cases where what we want is precisely a well-calibrated prediction for a specific group. But there is one big problem with this: just because a system is well-calibrated doesn’t mean that its predictions can be trusted at the level of individuals.
2. Calibration and refinement
What is calibration exactly? Well, the idea is that when we make a prediction in terms of a probability, that probability should match the proportion of times that the predicted event actually happens over the long term. This is perhaps simplest to think about in terms of weather forecasting. If I say that there is a 70% chance of rain tomorrow, you can’t evaluate how good that prediction was just based on whether or not it rains tomorrow. (It will either rain or it won’t, and that won’t tell you if the probability was 70% or 80% or 5%.) However, over the long term, if you look at all the days for which I said there was a 70% chance of rain, you can ask: on how many of those days did the rain in fact materialize? If it rains on about 70% of those days, then you can say that I am well calibrated.
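To make the rain example concrete, here is a minimal sketch of how one might check calibration: group days by the stated probability and compare it with how often it actually rained. The forecasts and outcomes are made-up numbers for illustration, not real weather data, and `calibration_table` is a hypothetical helper.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Map each distinct forecast to (number of days, observed rain frequency)."""
    by_forecast = defaultdict(list)
    for prob, rained in zip(forecasts, outcomes):
        by_forecast[prob].append(rained)
    return {prob: (len(obs), sum(obs) / len(obs))
            for prob, obs in sorted(by_forecast.items())}

# Made-up record: ten days forecast at 70%, ten days forecast at 20%.
forecasts = [0.7] * 10 + [0.2] * 10
outcomes = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1] + [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

table = calibration_table(forecasts, outcomes)
for prob, (days, freq) in table.items():
    print(f"forecast {prob:.0%}: {days} days, rained on {freq:.0%} of them")
```

A forecaster is well calibrated to the extent that each stated probability is close to the observed frequency in its row. Note that nothing in this check looks at any single day.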
It turns out, however, that calibration by itself is trivially easy to achieve. All I would need to do is look at the weather over the past few years, and count the number of days on which it rained. I could then simply divide that by the total number of days, and predict the resulting probability every single day.
You’re probably thinking that that will likely be a terrible prediction. We know that the probability of rain varies with many factors, such as the time of year, and the presence of clouds in the sky, and any prediction which ignores these will fail badly on many days. And that is true! But calibration doesn’t care about that. All it cares about is the proportion of days that it actually rains over the long term.
In other words, predicting the long-run average treats all days as if they were the same, and just predicts the average for that population. Now, the fact that rain is variable does mean that we need to be sure to get a reasonably balanced sample. We couldn’t, for example, only look at days in the summer, or take statistics from a different city. But as long as the number of days of rain per year is relatively constant over time (an increasingly questionable assumption!) then the historical average will tend to be well calibrated.
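A small worked example shows why the long-run average is calibrated and still a poor forecast. The numbers are invented for illustration: 100 winter days with rain on 60 of them, and 100 summer days with rain on 10.

```python
# Invented record: (season, rained) for 200 days.
days = ([("winter", 1)] * 60 + [("winter", 0)] * 40 +
        [("summer", 1)] * 10 + [("summer", 0)] * 90)

def rain_rate(records):
    """Fraction of the given days on which it rained."""
    return sum(rained for _, rained in records) / len(records)

# The "forecaster" states the historical average every single day.
constant_forecast = rain_rate(days)

# Calibration compares the forecast with the frequency over all days on which
# it was issued. On the historical record this gap is zero by construction,
# and it stays small on future days only if the rain rate is stable.
overall_gap = abs(constant_forecast - rain_rate(days))

# Within each season, the same forecast is far from what actually happens.
winter_gap = abs(constant_forecast - rain_rate([d for d in days if d[0] == "winter"]))
summer_gap = abs(constant_forecast - rain_rate([d for d in days if d[0] == "summer"]))

print(f"constant forecast: {constant_forecast:.2f}")
print(f"gap over all days: {overall_gap:.2f}")
print(f"gap in winter: {winter_gap:.2f}, gap in summer: {summer_gap:.2f}")
```

The forecast is perfectly calibrated over the whole record, yet it is off by a wide margin within either season.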
In fact, if we have any well-defined population or random process, such as flipping a coin, the central limit theorem tells us that a relatively small number of samples will give us a pretty good estimate of the underlying mean. In other words, it doesn’t take that many samples to get a fairly well-calibrated estimate. For example, if we flip a fair coin ten times (such that the probability of heads is 50% each time), the probability that the proportion of heads is within 0.1 of the true probability (i.e., we get 4, 5, or 6 heads out of 10 flips) is approximately 66%. If we increase this to 1,000 flips, the probability of being within 0.1 goes up to more than 99.99%, and we’d have roughly a 50% chance of being within 0.01 of the true value. If the coin is biased, it actually becomes even easier to estimate this probability (i.e., we will be more accurate with fewer samples).
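These figures can be checked with an exact binomial calculation. The sketch below uses only the Python standard library; `prob_within` is a hypothetical helper, and exact fractions are used so that boundary cases such as 490 heads out of 1,000 are not lost to floating-point rounding.

```python
from fractions import Fraction
from math import comb

def prob_within(n, p, tol):
    """P(|heads/n - p| <= tol) for n independent flips with P(heads) = p."""
    p, tol = Fraction(p), Fraction(tol)
    total = Fraction(0)
    for k in range(n + 1):
        if abs(Fraction(k, n) - p) <= tol:
            total += comb(n, k) * p**k * (1 - p)**(n - k)
    return float(total)

half, tenth, hundredth = Fraction(1, 2), Fraction(1, 10), Fraction(1, 100)

print(f"10 flips, within 0.1:    {prob_within(10, half, tenth):.4f}")
print(f"1000 flips, within 0.1:  {prob_within(1000, half, tenth):.10f}")
print(f"1000 flips, within 0.01: {prob_within(1000, half, hundredth):.4f}")

# A biased coin (90% heads here, chosen arbitrarily) is easier to pin down.
print(f"10 flips of a 90% coin, within 0.1: {prob_within(10, Fraction(9, 10), tenth):.4f}")
```

The biased coin is easier to estimate because the variance of a single flip, p(1 - p), is largest at p = 0.5.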
Returning to the weather, however, a simple long-run average is clearly not what we want from a weather forecast. Rather, we want a prediction that takes into consideration the relevant factors, such as time of year, whether it rained the day before, etc. (Real weather forecasting models, of course, make use of complex environmental simulations to predict forward in time). What we are truly after is a prediction that will be confident where possible, and not confident when things are uncertain.
This idea is nicely captured by something called the calibration-refinement decomposition. Specifically, it turns out that any proper scoring rule (such as log loss or the Brier score) can be decomposed into two terms referred to as calibration and refinement. Calibration is the long-run accuracy of probability estimates, as described above. Refinement basically measures how close the predicted probabilities are to 0 or 1. Moreover, there is an obvious trade-off between these two. We can always be perfectly calibrated by always predicting the long-run average (but this will have terrible refinement), or we can obtain perfect refinement by always predicting 0.0 or 1.0 (but this will be very badly calibrated). When we optimize for a conventional objective such as accuracy, we are actually optimizing for a trade-off between these two.
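For the Brier score, the decomposition can be written out in a few lines. The sketch below uses the standard form in which days are grouped by forecast value: the calibration term is the weighted squared gap between each forecast and the observed frequency in its group, and the refinement term is the weighted variance of outcomes within each group. In this form refinement is computed from observed frequencies, which matches the sharpness of the forecasts themselves only when the forecaster is calibrated. The data are invented for illustration.

```python
from collections import defaultdict

def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

def brier_decomposition(forecasts, outcomes):
    """Return (calibration, refinement); their sum equals the Brier score."""
    groups = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    n = len(outcomes)
    calibration = refinement = 0.0
    for f, obs in groups.items():
        freq = sum(obs) / len(obs)  # observed frequency for this forecast value
        calibration += len(obs) * (f - freq) ** 2 / n
        refinement += len(obs) * freq * (1 - freq) / n
    return calibration, refinement

# Invented record of 20 days: rain on 8 of the first 10, and 2 of the last 10.
outcomes = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8

forecasters = {
    "long-run average": [0.5] * 20,
    "two groups": [0.8] * 10 + [0.2] * 10,
    "hard 0/1 calls": [1.0] * 10 + [0.0] * 10,
}
for name, forecasts in forecasters.items():
    cal, ref = brier_decomposition(forecasts, outcomes)
    print(f"{name}: calibration={cal:.3f} refinement={ref:.3f} "
          f"brier={brier_score(forecasts, outcomes):.3f}")
```

In this toy record, the long-run average has a calibration term of zero but the worst refinement term, while pushing the two-group forecasts all the way to 0 and 1 leaves the grouping unchanged and only adds calibration error.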
In other words, probabilistic classifiers try to partition the sample into groups, such that most groups have calibrated probabilities close to 0 or 1. Methods such as decision trees do this quite explicitly, by partitioning the space, and estimating the probability of outcomes for each leaf. Methods such as logistic regression do this implicitly, by mapping all points in the input space to a probability. All points which map to the same number are therefore implicitly being grouped.
In order for this to work, the classifier needs to group together individuals that are at least somewhat similar in terms of their true probabilities; otherwise the probability for each group would tend to be similar to the overall population average. Of course, since we don’t know the true probabilities for any individual, the observed features and observed outcomes in the training set are used as a surrogate. The key, however, is that even though a trained classifier may be effective at making predictions, and may even be calibrated in terms of predicted probabilities, that does not necessarily mean that we can trust the probabilities assigned to individuals.
3. Two-headed coins
To illustrate why we should think about calibration as a property of the group, not the individual, consider the following thought experiment. First, imagine trying to estimate the probability that a single coin will come up heads. This is an easy thing to estimate (for the reasons given above) by simply flipping the coin repeatedly and counting the average proportion of heads.
What if we instead had a large collection of coins? Well, if we are allowed to flip each one many times, then we can estimate the probability for each one individually in the same way as above. The probability for the group of coins as a whole would then be the average of all the individual probabilities. This would correspond to the probability that a coin chosen randomly from the group would come up heads.
But what if we are only able to flip each coin once? In that case, we cannot effectively estimate the probability of any individual coin (because we only observe a single heads or tails for each), but we can still estimate the group-level probability quite effectively, because the randomness will average out across the group.
If we knew more about this collection of coins, for example, if they all looked identical, we might assume they all had the same underlying probability, and we might apply the group-level estimate as our estimate for each individual coin. However, in the absence of such prior knowledge, if we are only allowed to flip each coin once, there is no way to distinguish the scenario in which all coins have a 50% probability of coming up heads, from the scenario in which half the coins always come up heads and half always come up tails.
If the coins are not identical, we would probably begin looking for similarities among them, for example, grouping together all coins with the same value, or all the coins from the same country, and then estimating a probability for each group separately. This might help us achieve a better (more accurate) prediction for each group, but it does not solve the problem of knowing whether or not a group is mixing together coins with different probabilities, unless we find groups of coins that always come up heads or always come up tails. The only way to be able to say something definitive about the probability for an individual coin is to flip it many times.
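The thought experiment is easy to simulate. In the sketch below, one collection contains only fair coins and the other is half always-heads and half always-tails; the collection size, the seed, and the choice of 20 repeat flips are arbitrary assumptions made for illustration.

```python
import random

def flip(head_probs, flips_per_coin, rng):
    """For each coin, return the proportion of heads over the given number of flips."""
    return [sum(rng.random() < p for _ in range(flips_per_coin)) / flips_per_coin
            for p in head_probs]

def share_extreme(proportions):
    """Fraction of coins whose flips came up all heads or all tails."""
    return sum(p in (0.0, 1.0) for p in proportions) / len(proportions)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_coins = 10_000
fair_coins = [0.5] * n_coins
trick_coins = [1.0] * (n_coins // 2) + [0.0] * (n_coins // 2)

# One flip per coin: only the group-level rate is observable.
fair_group_rate = sum(flip(fair_coins, 1, rng)) / n_coins
trick_group_rate = sum(flip(trick_coins, 1, rng)) / n_coins
print(f"group rate with one flip each: fair={fair_group_rate:.3f} trick={trick_group_rate:.3f}")

# Twenty flips per coin: the individual coins now look completely different.
fair_many = flip(fair_coins, 20, rng)
trick_many = flip(trick_coins, 20, rng)
print(f"share of coins that are all heads or all tails: "
      f"fair={share_extreme(fair_many):.3f} trick={share_extreme(trick_many):.3f}")
```

With a single flip per coin, both collections produce a group rate near 0.5 and nothing else to go on. Only repeated flips of the same coin separate the two scenarios, and that is exactly the kind of observation we never get for a person.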
To return to COMPAS, we unfortunately don’t really know how the model was constructed. However, assuming it was trained using standard statistical procedures, it was implicitly optimized for a trade-off between calibration and refinement. Specifically, it would have sought a way of categorizing people along a spectrum (or into groups), such that it was possible to make well-calibrated predictions that are as close as possible to 0 or 1 for as many people as possible.
In the case of COMPAS, the system has clearly found something slightly better than just randomly partitioning people into groups, as the empirical probability of being re-arrested for the low-risk group was found to be lower than that of the medium-risk group, and the probability for the medium-risk group was lower than that of the high-risk group.
However, all of these probabilities are very far from 0 and 1, which is why accuracy is so low overall. As such, there is little reason to believe that the groups that the system came up with are particularly homogeneous. It could be the case that every individual in each group has that same probability of being re-arrested within two years, but it seems much more likely that each group mixes together people with very different probabilities.
The main point here is that given a sample which is reasonably representative of a larger population, it is trivial to divide that population into groups (e.g. low, medium, and high-risk), and to obtain a well-calibrated probability estimate for each. Doing so, however, tells us very little about whether those groups collect together people that are actually similar to each other in a meaningful way.²
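A purely synthetic simulation illustrates the point. This is not COMPAS data, and nothing here reflects how COMPAS was built: each simulated person is given a hidden true probability, a weakly informative score, and a single observed outcome, and the cut points for the three groups are arbitrary.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: (noisy score, hidden true probability, 0/1 outcome).
people = []
for _ in range(30_000):
    true_p = rng.random()                # hidden, uniform on [0, 1)
    score = true_p + rng.gauss(0, 0.5)   # weakly informative feature
    outcome = 1 if rng.random() < true_p else 0
    people.append((score, true_p, outcome))

def risk_group(score):
    """Arbitrary cut points; nothing here is tuned."""
    if score < 0.25:
        return "low"
    if score < 0.75:
        return "medium"
    return "high"

def summarize(sample):
    """Per group: (observed outcome rate, std. dev. of hidden true probabilities)."""
    summary = {}
    for group in ("low", "medium", "high"):
        members = [(p, o) for s, p, o in sample if risk_group(s) == group]
        rate = sum(o for _, o in members) / len(members)
        spread = statistics.pstdev(p for p, _ in members)
        summary[group] = (rate, spread)
    return summary

train = summarize(people[:15_000])  # used to estimate each group's rate
test = summarize(people[15_000:])   # held out to check calibration

for group in ("low", "medium", "high"):
    print(f"{group}: estimated rate {train[group][0]:.2f}, "
          f"held-out rate {test[group][0]:.2f}, "
          f"spread of true probabilities {train[group][1]:.2f}")
```

The estimated rates hold up on the held-out half, so the groups are calibrated, and they are correctly ordered from low to high. The spread of hidden probabilities inside each group nevertheless remains wide, so the group rate says little about any one member.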
It may be a cliche, but it is nevertheless the case that all individuals are different. Reducing people to a feature vector with 137 yes/no questions, and then to a single probability, may make it look as though some people are the same or very similar to others, and it is tempting to use this similarity to group people together in ways that allow us to predict individual behaviour on average. But this naturally risks missing nuance that is not captured by the features that have been used, and it obscures the way in which that reduction to a single number collapses together people who may be very different from each other.
Predicting who will be re-arrested is clearly not an easy task, and this should not be surprising, given the complexity of forces involved. If it were easy, we would expect the accuracy of statistical models to be much higher; the ideal would be to have exactly two categories, one of which had probability 0, and the other of which had probability 1. Barring this, we must remember that the probabilities given for each risk category are properties of the group, not necessarily of the individual. Being categorized as high risk means that you are similar (on paper) to other individuals in the past, a certain proportion of whom were subsequently re-arrested. Unless we only care about the group-level performance, however, calibration is not necessarily a useful metric, as it does not tell us much about the correctness of the system at the individual level.
Obviously there are many situations where a decision must be made, and judges are clearly running some sort of algorithm in their head. Moreover, judges are presumably flawed in many ways, both in terms of bias, and in terms of being poorly calibrated (assuming they don’t receive a lot of subsequent feedback about the people that they do or do not release from jail). So there is definitely room for improvement in the system. However, we should not expect a tool such as COMPAS to be a “solution” to the various criminal justice problems that exist. Rather, we can think of it as summarizing the past behaviour of the system (including both citizens and law enforcement), and recognize that the real work that needs to be done is likely outside of the space of these sorts of decisions, specifically in the realms of issues like education, poverty, health care, and inequality.
¹ One could also create a classifier by thresholding the decile scores, but the results are similar. ProPublica also ran additional models which I do not discuss here. Please see their full discussion for more detail.
² To their credit, the authors of the Practitioner’s Guide to COMPAS Core make this same point: “Risk assessment is about predicting group behavior (identifying groups of higher risk offenders) – it is not about prediction at the individual level. … Our risk scales are able to identify groups of high-risk offenders – not a particular high-risk individual.” Nevertheless, it seems inevitable that users of COMPAS will interpret the classification of an individual as low, medium, or high risk as an actionable characterization of that individual.