A Reality Check: Algorithms in the Courtroom

Can imperfect algorithms help address systemic inequalities in the criminal justice system? Three researchers recently took to the New York Times to argue, essentially, yes.

I admire their work and agree in principle that under certain circumstances, predictive tools that face significant limitations may nonetheless play an immediate harm-reducing role in the courtroom, especially at bail. But I worry that their essay could leave some readers with mistaken impressions about how pretrial risk assessment works today. The challenges are significantly more daunting than a casual reader of their essay might be left to imagine.

https://www.nytimes.com/2017/12/20/upshot/algorithms-bail-criminal-justice-system.html

Specifically, I want to take a closer look at three critical questions: How well does pretrial risk assessment work in practice, what do the tools actually measure, and how are the tools related to the life-shaping decisions reformers care most about?

1. Early evidence about pretrial risk assessment in practice is mixed at best.

The authors argue that implementation of a risk assessment algorithm “often yields immediate and tangible benefits: Jail populations, for example, can decline without adversely affecting public safety.”

To date, the most rigorous study on the effects of pretrial risk assessment is by Megan Stevenson on Kentucky’s much-lauded bail reforms. Stevenson found that the introduction of pretrial risk assessment at bail hearings in Kentucky did not meaningfully increase the fraction of accused people who are released before trial.

The authors also point to New Jersey, a frequently cited success story, where they argue that “algorithmic tools … contributed to a 16 percent drop in its pretrial jail population, again with no increase in crime.” The new algorithm may indeed have “contributed,” but the pretrial jail population in New Jersey was already decreasing, even before new software was deployed. For example, from June 30, 2015 to January 1, 2017, New Jersey’s non-sentenced pretrial population decreased by 19 percent.

Similar stories — where risk assessment was introduced, but pretrial jailing was already declining before implementation — can be told in San Francisco, CA and Mecklenburg County, NC.

Further, the simple act of providing a judge with an algorithmically derived score does not, in and of itself, combat judicial “arbitrariness.” Instead, the early evidence indicates a high rate of judicial overrides, where judges depart from the recommendations of a risk assessment tool. For example, a report by the Cook County Sheriff’s Office found that judges diverged from the recommendations of their risk assessment tool more than 80 percent of the time. Stevenson’s study in Kentucky, likewise, found that judges deviated from the recommendations of their risk assessment system more often than not.

Nor are the judges’ deviations random: Detain recommendations are usually followed, while release recommendations are often overridden. Take Santa Cruz, CA as an example. There, judges departed from the release recommendations a little more than half of the time (53 percent) but only departed from detain recommendations 16 percent of the time.

Risk assessment could turn out to be a force for decarceration. But the early evidence we have is mixed at best. As the researchers noted, better algorithms might help “counter the biases and inconsistencies of unaided human judgements.” But, again, we have little evidence to suggest that this is true today for bail decisions.

2. Measurement is important, but it’s not always clear what we’re measuring.

“First, measurement matters.”

I couldn’t agree more. Given that bail has always been a prediction, it’s important to critically examine what we are asking judges to predict. As the authors explain, simply being arrested for an offense “is not the same as committing that offense.”

But it is overly generous at our present moment to say, as the authors do, that “many jurisdictions — though not all — have decided to focus on a defendant’s likelihood of being arrested in connection with a violent crime.” It’s certainly true that during the 1970s and 1980s bail statutes across the nation were rewritten to have judges predict whether or not an arrestee presented a danger to the public. But, I don’t think it’s quite accurate to state that when it comes to pretrial risk assessment and decision-making today, jurisdictions are focused on violence.

In fact, as others have observed, the majority of today’s risk assessment tools produce a single composite score that combines the risk of a defendant (1) failing to appear and (2) being rearrested while released. That means, if you’re a judge, the score you see tells you the likelihood of either event occurring — not just the specific likelihood of future violent crime. Even for the most widely used pretrial risk assessment tool — the Arnold Foundation’s Public Safety Assessment — it’s not clear that its focus is on violent crime. The tool’s three outcome measures are failure to appear, new criminal activity, and new violent criminal activity, but in most places we have seen, it is the tool’s rearrest and failure to appear predictions (rather than the violence prediction) that jurisdictions use to develop their advice for judges.
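To make the point about composite scores concrete, here is a minimal sketch in Python. The probabilities are invented, the function name `composite_risk` is hypothetical, and the independence assumption is a simplification; real tools differ in how they compute their scores. The sketch shows only that two people with opposite risk profiles can end up with the same "either event" number.

```python
def composite_risk(p_fta: float, p_rearrest: float) -> float:
    """Probability that at least one of the two events occurs.

    Assumes the two events are independent, which is a simplification
    made only for this illustration.
    """
    return 1 - (1 - p_fta) * (1 - p_rearrest)


# Two hypothetical people with opposite risk profiles (invented numbers).
mostly_missed_court = composite_risk(p_fta=0.30, p_rearrest=0.02)
mostly_rearrest = composite_risk(p_fta=0.02, p_rearrest=0.30)

# Both composites come out the same, so the combined number alone
# cannot tell a judge which of the two risks is driving it.
print(round(mostly_missed_court, 3), round(mostly_rearrest, 3))
```

A judge who sees only the combined number has no way to distinguish the person likely to miss a hearing from the person likely to be rearrested.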

And while the authors are careful about arrest data, they should be (but are not) similarly skeptical of failure to appear data. They write that “the act of skipping trial can be perfectly observed, which circumvents the potential for biased measurement of behavior.”

Technically, this is true. We can perfectly observe who does and does not show up for a court hearing. But a common problem with understanding “failure to appear” data is that there is a major difference between missing a court date accidentally and fleeing the jurisdiction intentionally. Someone might fail to appear for a court date because they didn’t have enough money for transportation, were never reminded of the court date by the jurisdiction, couldn’t arrange childcare, couldn’t afford to take time off from work, or for many other mundane reasons. Some might even fail to appear because county officials gave them the wrong date to appear, as is happening in Harris County.

Just as it’s important to be mindful of what rearrest does or doesn’t mean, it’s also important to be mindful of what failure to appear does or doesn’t mean.

3. Risk assessment algorithms just provide numbers — what we do with those numbers is up to us.

If a jurisdiction chooses to center its bail system on a pretrial risk assessment system, it will have to answer the following question: “How much risk should our community tolerate?” That answer will more than likely get translated into something called a decisionmaking framework.

Essentially, these frameworks recommend a course of action for a judge, based on a risk assessment score. The important takeaway is that no matter how well-designed or well-tested the prediction is, the tool just spits out a number. That number will have to be converted into some recommended course of action for the judge. If pretrial risk assessment systems are to drive decarceration and reduce arbitrary judicial decisionmaking, this is where policymakers can force it to happen.
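A rough sketch of how such a framework turns a score into advice, written in Python. Everything here is hypothetical: the 1 to 6 score scale, the cut points, the action labels, and the names `LENIENT`, `STRICT`, and `recommend` are invented for illustration and do not describe any jurisdiction's actual framework. The point is only that the same score can produce different recommendations depending on where policymakers draw the lines.

```python
from bisect import bisect_right

# Hypothetical recommendations, ordered from least to most restrictive.
ACTIONS = ["release", "release with conditions", "recommend detention hearing"]

# Hypothetical frameworks. Each list holds the lowest score at which
# the next, more restrictive action in ACTIONS applies.
LENIENT = [4, 6]  # scores 1-3 release, 4-5 conditions, 6 detention hearing
STRICT = [2, 4]   # score 1 release, 2-3 conditions, 4-6 detention hearing


def recommend(score: int, cutoffs: list[int]) -> str:
    """Map a risk score (assumed to run from 1 to 6) to a recommendation."""
    if not 1 <= score <= 6:
        raise ValueError("score must be between 1 and 6")
    # bisect_right counts how many cut points the score has reached.
    return ACTIONS[bisect_right(cutoffs, score)]


# The same score leads to different advice under different frameworks.
print(recommend(4, LENIENT))
print(recommend(4, STRICT))
```

Nothing about the prediction changes between the two calls. Only the policy choice encoded in the cut points does.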

Of course, these decision recommendations can change (in favor of more release, less release, or release with further conditions) as a result of political winds. For example, New Jersey’s Attorney General recently modified its guidance so that even lower risk assessment scores can trigger the presumption of pretrial detention.

The authors rightly observe that algorithms are not “a substitute for sound policy, which demands inherently human, not algorithmic, choices.” That’s true, but I wish they had not passed up the opportunity to highlight the importance of these human choices. As my colleague David Robinson recently observed, “[f]or all our debate over the raw numbers that come out from these risk tools, there’s very little public debate (or even public knowledge) about the frameworks, which are really where the rubber of risk numbers meets the road of actually recommending what to do for a person.”

Conclusion

Overall, I agree with much of what the authors argue for. We must be careful in constructing risk assessment algorithms, and be clear-eyed both in our choice of outcome measures, and in our knowledge of what those measures represent.

But while their analysis of how algorithms might best be designed is careful and even-handed, their description of how pretrial risk assessment tools work today is overly optimistic relative to the evidence we have so far. These tools might be a force for decarceration and judicial regularity, but there’s no guarantee that this will happen.