This is why most criticism of risk assessment models is mistaken


The most irksome part of critiques of algorithmic risk assessments is that the appraisals typically don’t go far enough. Yes, we should keep a watchful eye on the models designed to help judges, prosecutors, and court officers assess recidivism risk. Yes, there are brilliant people out there who are rightly pushing back against the cult of significance. And yes, let’s critically evaluate the effectiveness and predictions of these methods.

Lovely.

But by focusing on the statistical significance of the models, the social significance is glossed over.

What is social significance? Here is one way to think about it: If a tree falls in a forest and nobody’s around to hear it, does it make a sound? Here is another way to think about it: If a risk assessment model makes a biased decision, but it isn’t an important part of the sentencing process, is the model socially significant?

Conflating statistical significance with social significance, the model’s accuracy with the model’s effect on reality, is an endemic mistake. A professor of mine even wrote a book about it, noting that,

Statistical significance is not equivalent to economic significance, nor to medical, clinical, biological, psychometric, pharmacological, legal, physical, nor any other kind of scientific significance — those functions of gain and loss.

Debates raging over how well the model reflects reality need to give way to debates about how the model is used in practice. Decisions at every stage of the criminal justice system are not context-free. They come with a variety of institutional proclivities, which often have bias built in. Are risk assessment models changing these outcomes?

Caleb Watney details some of those problems over at Brookings:

Not only are there racial disparities in the sentencing process, but research suggests that extraneous factors like how recently a parole board member ate lunch or how the local college football team is doing can have significant effects on the outcome of a decision. It may be that the tasks we ask judges and parole boards to carry out are simply too difficult for internal human calculus.

The right comparison, then, is between the current world and one where risk algorithms play a more important role. Of course, some have waved this problem away by saying that these risk assessment systems just further entrench bias. The current state of criticism exhibits a bizarro form of the nirvana fallacy: because bias is possible, every algorithm that gets implemented is assumed to display bias. But there is no reason we should expect risk assessment tools to embed biased decisions. Indeed, critics of models should appreciate this nuance more than anyone else. One of the most important lessons of confounding variables is that introducing a new variable can flip the sign of the other estimated effects; a relationship that once looked negative can turn positive when the new variable enters the model, as the sketch below illustrates. In practice, the additional information provided by these tools might not do much to change minds one way or the other.
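As a rough illustration (not from any study cited here, and with entirely hypothetical variable names), here is a minimal sketch of that sign flip: when a confounder that drives both the risk score and the outcome is left out of a regression, the score’s estimated effect points one way; once the confounder is included, it points the other.

```python
# Minimal sketch of an omitted-variable sign flip, with hypothetical names.
# "prior_record" drives both the risk score and the outcome; leaving it out
# of the regression flips the sign of the score's estimated effect.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

prior_record = rng.normal(size=n)
risk_score = 2.0 * prior_record + rng.normal(size=n)
reoffend = -1.0 * risk_score + 3.0 * prior_record + rng.normal(size=n)

def slopes(X, y):
    """Ordinary least squares slopes (intercept fitted, then dropped)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

# Confounder omitted: the score looks positively related to reoffending (~ +0.2).
print(slopes(risk_score.reshape(-1, 1), reoffend))

# Confounder included: the score's coefficient flips to roughly -1.0.
print(slopes(np.column_stack([risk_score, prior_record]), reoffend))
```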

To be blunt, few researchers have done the tough ground work needed to understand how these tools are used in practice. Angèle Christin is an exception. A professor at Stanford, she studies the use of statistics in courts and has been talking to people at all levels of the criminal justice system to understand how these tools are actually deployed. As she recently pointed out, top administrators often praise the new methods as mathematically rigorous, and yet the judges, prosecutors, and court officers in those same systems don’t:

Yet it is unclear whether these risk scores always have the meaningful effect on criminal proceedings that their designers intended. During my observations, I realized that risk scores were often ignored. The scores were printed out and added to the heavy paper files about defendants, but prosecutors, attorneys, and judges never discussed them. The scores were not part of the plea bargaining and negotiation process. In fact, most of judges and prosecutors told me that they did not trust the risk scores at all. Why should they follow the recommendations of a model built by a for-profit company that they knew nothing about, using data they didn’t control? They didn’t see the point. For better or worse, they trusted their own expertise and experience instead.

An astute observer will note that institutional momentum changes course slowly, so perhaps these models will become important in the future. While true, the criminal justice system is no stranger to these techniques; some assessment tools go back nearly 40 years.

In other words, the criminal justice system seems to have already adapted to risk assessment tools, and its practitioners don’t seem to put much trust in them.

Consider the case of Eric Loomis. This write-up in WIRED follows the typical narrative trajectory:

In the case of Wisconsin v. Loomis, defendant Eric Loomis was found guilty for his role in a drive-by shooting. During intake, Loomis answered a series of questions that were then entered into Compas, a risk-assessment tool developed by a privately held company and used by the Wisconsin Department of Corrections. The trial judge gave Loomis a long sentence partially because of the “high risk” score the defendant received from this black box risk-assessment tool. Loomis challenged his sentence, because he was not allowed to assess the algorithm. Last summer, the state supreme court ruled against Loomis, reasoning that knowledge of the algorithm’s output was a sufficient level of transparency…

With these facts, or lack thereof, how does a judge weigh the validity of a risk-assessment tool if she cannot understand its decision-making process? How could an appeals court know if the tool decided that socioeconomic factors, a constitutionally dubious input, determined a defendant’s risk to society? Following the reasoning in Loomis, the court would have no choice but to abdicate a part of its responsibility to a hidden decision-making process.

On the ground, courts don’t abdicate this power. They merely dismiss the findings of the report. Why would a judge bow to the decision of some external gauge? To be fair, the Loomis case is important because the defendant wasn’t allowed to understand how the tool actually makes an assessment. Rightly, there are serious due process concerns, which Watney explores, as do other experts. Still, most have seized on the algorithm’s accuracy, a question entirely different from the tool’s power to sway the sentencing decision. An important kicker to this story seems to be lost, as the New York Times noted: “Mr. Loomis would have gotten the same sentence based solely on the usual factors.”

While algorithmic tools offer the promise of better outcomes, critiques reach for the low-hanging fruit. Far too often, the discussion about risk assessment has been tied to questions of accuracy. But the attention needs to coalesce around its real-world importance, its social significance. Instead of asking whether a model merely reflects the world, we should probe whether it drives the decisions made in the world.