Pre-trial algorithms deserve a fresh look, study suggests


What if a predictive tool used at the moment of arraignment could simultaneously reduce the number of people sent to jail before trial, reduce crime, and also reduce racial disparities in incarceration? Such a tool could be a game changer in criminal justice reform. There’s good reason to be skeptical, however: Several high-profile analyses of COMPAS — a popular, commercially available “risk assessment” tool — have claimed that the tool overestimates the risk posed by black defendants more often than it overestimates the risk posed by white defendants.

But a new large-scale study — Human Decisions and Machine Predictions — demonstrates that it’s possible to build a predictive tool that simultaneously accomplishes three desirable goals: reducing pre-trial detention rates, reducing re-arrest rates of those released pending trial, and reducing racial disparities in which defendants are jailed. Policymakers across the ideological spectrum should be interested in this result and encourage further study.

The scenario: Before trial, judges make predictions about flight risk and public safety

After an arrest, a defendant appears at a bail hearing before a judge. The judge then has an important decision to make: the judge can release the defendant, release the defendant under certain conditions (like paying a bail amount or requiring that the defendant be monitored), or detain the defendant without the possibility of release. When making these decisions, judges generally weigh two concerns: flight risk and public safety. A judge would ask herself: Would the defendant fail to appear in court? Would he commit another crime if released? Would he be a threat to the public? All of these assessments rely on prediction.

There’s been a long-standing debate about whether pre-trial detention, cash bail, or predictions of a defendant’s future dangerousness are ever appropriate, and whether they’re unconstitutional forms of punishment. In fact, these questions were vigorously debated and litigated during the bail reforms of the 1960s, ’70s, and ’80s.

But this paper focuses on a more incremental and quantitatively driven question: how might our existing pre-trial decision-making process — based on making predictions about a person’s flight risk and future dangerousness — be made more equitable and fair?

What the researchers did

The researchers examined a dataset of more than 550,000 defendants who were arrested in New York City between 2008 and 2013.

The researchers built a model using the data about released defendants because these data contain both case characteristics (like the current charge, the details of the defendant’s criminal history, and the defendant’s age) and case outcomes (such as whether a defendant failed to appear or was re-arrested). Once they fit the model on the training data, the researchers predicted the risk for each defendant in their test data. Then they compared their risk predictions against the real decisions that judges made.
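The pipeline described above can be sketched in a few lines. This is a deliberately simplified toy, not the researchers’ actual model (they used gradient-boosted trees on hundreds of thousands of cases); the field names and data here are hypothetical. The key structural point it illustrates is that the model can only be fit on *released* defendants, because only their outcomes are observed.

```python
# Minimal sketch of the setup (hypothetical data and feature names):
# fit a risk model on released defendants, whose failure-to-appear (FTA)
# outcomes we can observe, then score new cases with it.
from collections import defaultdict

def fit_bucket_model(train):
    """Estimate the empirical FTA rate per (charge, priors) bucket."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [FTA count, total]
    for case in train:
        key = (case["charge"], min(case["priors"], 3))  # cap priors at 3+
        counts[key][0] += case["fta"]
        counts[key][1] += 1
    return {k: fta / n for k, (fta, n) in counts.items()}

def predict(model, case, default=0.5):
    """Predicted FTA risk, falling back to a default for unseen buckets."""
    key = (case["charge"], min(case["priors"], 3))
    return model.get(key, default)

# Toy training data: outcomes exist only because these people were released.
train = [
    {"charge": "theft", "priors": 0, "fta": 0},
    {"charge": "theft", "priors": 0, "fta": 1},
    {"charge": "theft", "priors": 2, "fta": 1},
    {"charge": "assault", "priors": 1, "fta": 0},
]
model = fit_bucket_model(train)
risk = predict(model, {"charge": "theft", "priors": 0})
print(risk)  # 0.5: one of the two similar released defendants failed to appear
```

A real model would use far richer features and a proper learner, but the train-on-released / score-everyone shape is the same.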

Judges often detain low-risk defendants and release the riskiest

Perhaps unsurprisingly, judges often make mistaken predictions. The researchers found that though judges correctly release most individuals predicted to be low-risk, among “the [predicted] riskiest 1% of released defendants . . . the judge[s] release them at a rate of fully 48.5%.” Judges, it seems, were treating many individuals predicted to be very high risk as if they were low risk.

One explanation might be that judges set strict bars for pre-trial detention to ensure that only the very riskiest defendants are detained. But the researchers’ simulations found another surprising result: Judges who set a higher bar for detention actually detained defendants across the predicted risk distribution and chose “to jail many low-risk defendants ahead of those with higher predicted risk.” Indeed, “[judges] instead seem to be mis-ranking defendants when deciding whom to detain.”

Taken together, these findings show that judges not only release some of the riskiest individuals, but also over-detain many less risky ones.

Should we put the algorithm in charge?

Given these results, how could algorithmic predictions help inform pre-trial decision-making?

The researchers propose an “auto pilot” scenario for bail decisions where “a full re-ranking policy . . . let[s] the risk tool rank-order all defendants by predicted risk and make recommendations for all of the jail and release decisions.” Today, we know what it’s like when human judges make bail decisions on their own. But what if just an algorithm made bail decisions? How would things change? (In reality, it’s unlikely we’d ever fully automate bail decision-making and take judges out of the process. But the point of the paper is to demonstrate the scope of what’s possible.)

What the “auto pilot” scenario would look like

In order for an auto-pilot re-ranking simulation to work, the researchers have to confront some difficult counterfactuals. For example, a re-ranking simulation will necessarily shuffle some decisions: some people whom judges actually released will be detained by the simulation, and some people whom judges detained will be released. Evaluating the effect of the first case is easy to understand — another person is detained, which increases the detention rate. But it’s difficult to evaluate the second case: What would individuals — actually detained in real life, but released in the simulation — do? It’s this unknowable, alternate reality that plagues a lot of pre-trial research.

But the researchers devised a clever workaround. They “assign[ed] the same crime outcomes to . . . observationally equivalent defendants whom the judges actually released.” Essentially, they impute what would occur in their simulation based on what similarly situated released individuals actually did. (Obviously, this requires some big assumptions.)
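The imputation idea can be sketched as a simple matching step. This is a toy illustration with hypothetical fields, not the paper’s actual procedure (which matches on a much richer set of observables): for someone the judge detained but the simulation releases, borrow the outcome of a released defendant who looks the same on the recorded characteristics.

```python
# Sketch of outcome imputation by matching (hypothetical fields): when the
# simulation releases a defendant whom judges actually detained, assume they
# would have behaved like an observationally equivalent released defendant.
def impute_outcome(detained_case, released_pool):
    """Return the FTA outcome of the first matching released defendant."""
    for released in released_pool:
        if (released["charge"] == detained_case["charge"]
                and released["priors"] == detained_case["priors"]
                and released["age"] == detained_case["age"]):
            return released["fta"]  # big assumption: like cases behave alike
    return None  # no match found: the counterfactual stays unknown

released_pool = [
    {"charge": "theft", "priors": 1, "age": 25, "fta": 0},
    {"charge": "assault", "priors": 0, "age": 40, "fta": 1},
]
detained = {"charge": "theft", "priors": 1, "age": 25}
print(impute_outcome(detained, released_pool))  # 0
```

The “big assumptions” the article mentions live in that one `return` line: conditional on the observables, detained and released defendants are assumed to be interchangeable.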

Turning back to the simulation, first, the researchers used their algorithm to predict each defendant’s failure-to-appear risk. (We’re focused on failure to appear because, unlike the vast majority of states, New York state law prohibits judges from making discretionary decisions to deny bail based on public safety concerns. They can only consider flight risk.) Next, they rank-ordered all defendants by their predicted risk. Going forward, keep this incredibly simplified bar in mind:

[Figure: a bar of predicted risk, running from lowest risk on the left to highest risk on the right]

The algorithm simply decides to detain those defendants that are predicted to be the riskiest — meaning those who are most likely to fail to appear. So, the tool would start detaining on the right side and work to the left.
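The detain-from-the-riskiest-end rule is simple enough to state in code. A minimal sketch, with made-up risk scores and a target detention rate as the only inputs:

```python
# Sketch of the "auto pilot" re-ranking rule: sort defendants by predicted
# FTA risk and detain from the riskiest end until the target detention rate
# is reached. The risk scores below are hypothetical.
def rerank_detain(risks, detention_rate):
    """Return the indices of defendants the rule detains (riskiest first)."""
    n_detain = round(len(risks) * detention_rate)
    order = sorted(range(len(risks)), key=lambda i: risks[i], reverse=True)
    return set(order[:n_detain])

risks = [0.05, 0.60, 0.30, 0.90, 0.10]
detained = rerank_detain(risks, detention_rate=0.4)  # detain 2 of 5
print(sorted(detained))  # [1, 3]: the two highest-risk defendants
```

Holding `detention_rate` at the judges’ current rate reproduces the first bullet below (same jail population, fewer failures to appear); lowering it until the failure-to-appear rate matches the status quo reproduces the second.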

Here are the results of their full re-ranking “auto pilot” approach based on NYC data:

  • If the algorithm’s release rule matches the rate at which judges presently release defendants, the algorithm reduces the failure-to-appear rate by 24.7 percent.
  • If the failure-to-appear rate is held constant, the algorithm could reduce the jail detention rate from 26.4 percent to 15.4 percent — a 42 percent reduction.

That’s an improvement, but as the researchers note, we don’t care about these rates in the abstract. We care whether these decisions are fair and lead to a more equitable, racially just system.

Technical constraints could help ensure racial equity

The researchers then showed that it’s also possible to explicitly constrain the algorithm to “ensure no increase in racial disparities in . . . jail with very little impact on the algorithm’s performance in reducing crime.”

A quick orientation: The first row of the table details the racial composition of the defendant pool as they appear before judges — the “base rate.” The second row details the rate at which judges detained defendants in real life. The third row reproduces the results of the researchers’ algorithm: as mentioned, it’s able to outperform judges in lowering the crime rate by 24 percent. The final three rows test how different algorithmic policy constraints might lead to reduced racial disparities in incarceration. I’ll only focus on the last row — though the other two offer interesting alternative approaches.

Recall the blue sliding scale of risk. The tool simply makes decisions by risk in descending order. But, as the algorithm proceeds, it is now bound by new policy constraints. The final constraint, in the sixth row (Match Lower of Base Rate or Judge), works like this: as decisions are made about whom to detain or release, the tool ensures that the share of the detained population that’s either Black or Hispanic is no higher than the lower of:

  • the share of minorities that judges detain today; and
  • the overall share of minorities within the given defendant pool today.

Essentially, the tool stops detaining members of a group once it hits whichever rate is lower. (So, for Black defendants that’s their base rate, and for Hispanic defendants that’s the rate at which judges detain. The purple boxes above call out these lower rates.)

This policy constraint not only leads to a reduction in failures to appear by 22.74 percent, but also lowers the minority share of the jailed population from 88.92 percent to 80.39 percent — a 9.57 percent reduction. That’s a substantial change from the status quo.

Percentages sometimes obscure raw human impact, so it’s worth remembering we’re talking about thousands of individuals that could be safely released pre-trial. Assuming there are 100,000 cases per year, this would mean approximately 9,000 cases per year where someone who would have been detained is now released.

This isn’t a silver bullet — it won’t solve other troubling underlying disparities — but it could help.

Despite these potential improvements, the disparity that many advocates are more concerned about is the final number in the first row — that the given set of defendants is 81.95 percent minority. And even in the constrained model in the last row, the jailed population is still 80.39 percent minority — an improvement over the status quo, but still unfairly high. These percentages are troubling and reflect systemic biases, to be sure.

This research isn’t focused on what might lead someone to be at a bail hearing — a critically important question, but one that’s outside the scope of this paper. Instead, it lays out how we might address the racial disparities that are exacerbated by pre-trial decision-making. And that’s important in and of itself: research has consistently shown that pre-trial detention is strongly associated with a higher probability of conviction for all charge types, higher rates of detainees accepting plea deals, and increased minimum sentence lengths. Further, new evidence suggests that pre-trial release increases an individual’s probability of employment three to four years after a bail hearing.

So to the extent that pre-trial decisions about whom we release and whom we detain significantly affect case and life outcomes, we shouldn’t write off these potential reforms just because still larger reforms remain to be made.

Don’t forget: We’re only talking about pre-trial decisions.

One thing that’s already happening is “risk assessment decision creep”: as these types of risk assessment tools become widely adopted pre-trial, there’s a temptation to use these tools for other decisions, like in sentencing. (In Loomis v. Wisconsin, a case that’s piqued the Supreme Court’s interest, that’s pretty much exactly what happened.) But Human Decisions and Machine Predictions offers a strong argument as to why it might make less sense to use a machine learning risk assessment tool to “improve” sentencing decisions:

Key to our analysis is a focus on a specific decision — and, importantly, on a decision that relies on a prediction … [C]onside[r] a different decision that on its surface seems very similar to bail: sentencing. Recidivism, which is one of the inputs to deciding the punishment for someone who has been found guilty, can be predicted. Yet many other factors enter this decision — deterrence, retribution, remorse — which are not even measured. It would be foolish to conclude that we can improve on sentencing decisions simply because we predicted recidivism.

At sentencing, our society tries to balance a number of competing goals and principles. Just because we could theoretically predict or measure one of those goals — countering recidivism — doesn’t mean it should be elevated in the sentencing calculus.


This study demonstrates how, in theory, a machine learning algorithm deployed at bail hearings could be socially beneficial by simultaneously reducing the number of individuals incarcerated pre-trial, reducing failures-to-appear, and lowering the proportion of minorities detained pre-trial. It’s the first study I’m aware of to demonstrate this possibility.

The researchers still have another hold-out dataset of 200,000 cases from New York City to further test their findings, and I’m eager to see those results. In the meantime, policymakers should encourage further study of the researchers’ results and real-world examination of currently used pre-trial tools.