Let Computers Be the Judge
The case for incorporating machine learning into the U.S. criminal justice process
Judges make myriad decisions that are, at their core, predictions. For instance, when someone is arrested, a judge decides whether to detain or release the arrestee until the court date. This involves predicting the person’s likelihood of showing up for future court dates or committing a crime if released. The pretrial detention decision has important consequences, and, historically, we’ve trusted (human) judges to decide based on their wisdom and experience. But could computers do a better job?
Many believe the answer is yes. They point to methods such as machine learning that enable computers to use past behavior to “train” algorithms to predict future outcomes. In the case of pretrial detention, a computer would consider all data available about a person and generate a risk score based on past experience with similar defendants. The potential of machine learning is exciting, and this method is used in a variety of criminal justice contexts, from hot-spot policing to sentencing.
Others note, however, that the data computer algorithms use could bake existing biases into future court actions. For instance, if black men are more likely than white men to be arrested for criminal (or noncriminal) activity, then algorithms that consider arrest records will amplify that bias. Furthermore, algorithms are typically developed by private firms that don’t reveal their methods, so we can’t even see what information is used in the predictions. Firms aren’t allowed to use race as a predictive variable, but what if they use information that is highly correlated with race, such as employment status? Does that put black defendants at an unfair disadvantage? Social justice advocates spend lots of time wringing their hands over whether machines are biased.
Human Biases Emerge in Surprising Ways
But that discussion misses an important point: Humans are biased. We routinely get things wrong due to an array of cognitive biases. Judges’ decisions are affected by factors such as whether they’re hungry (decisions are far less favorable just before lunch) and whether their local football team won that weekend (unexpected losses result in longer sentences, especially for black defendants). Judges surely consider much of the same data, like arrest records and employment status, that advocates worry about. And like other actors in the criminal justice system, including police officers and juries, judges can be racially biased.
David Abrams, Marianne Bertrand, and Sendhil Mullainathan showed that race plays a significant role in judges’ sentencing decisions. In a context where defendants were randomly assigned to judges, the defendants’ characteristics should have been similar across judges, and so any black-white gap in guilt and risk levels (and, consequently, sentencing) should have been similar as well. But the researchers found that black-white differences in incarceration rates varied significantly depending on the judge. At least some judges were clearly considering race when deciding to incarcerate someone.
Along similar lines, Crystal Yang showed that when judges were given more discretion in sentencing after United States v. Booker made the federal sentencing guidelines advisory rather than mandatory, racial disparities in sentencing increased. After Booker, black defendants received an average of two more months in prison than their white counterparts. This implies that limiting judges’ discretion helped achieve more racially equal outcomes.
Reducing Bias with Algorithms
It’s possible that an experienced, thoughtful judge can overcome these biases and get it right more often than a computer would, particularly when the data about someone’s history and crime don’t tell the full story. So, it’s not obvious that computers will make better predictions than judges currently do — but there’s undoubtedly room for improvement. Given the many documented flaws in human decision-making, the relevant question is not whether machines are biased. It’s whether machines are less biased than humans — that is, whether they’re able to make more accurate predictions. And the answer to that question appears to be yes.
In new research, Jon Kleinberg and colleagues created a machine-learning algorithm to predict which arrestees in New York City should be detained pretrial. The researchers then considered whether their algorithm got it right more often than real-world judges did. This sounds like a straightforward exercise, but it’s actually pretty complicated, because the data don’t tell us what would have happened if a person who was detained had been released instead.
Simply comparing individuals who were detained with similar-looking people who were released won’t address this problem: There’s probably a reason the person was detained, even if we can’t see it in the data. Perhaps he was more confrontational in the courtroom, or the facts of his case were slightly worse. It’s possible the judge used this additional information to correctly sort defendants based on their likelihood of behaving well (or badly) if released.
So, how can we tell if a detained person should have been released and, therefore, whether the judge (or computer) got it right? We need some way of knowing the “counterfactual” for those who were detained — what would they have done if released? The problem for the researcher is that detention decisions aren’t random. If pretrial detention had been randomly assigned, we could use released defendants as counterfactuals for detainees without worrying about any unobservable differences between them that drove the detention decision.
It turns out that in New York City, defendants were randomly assigned to judges. Judges vary widely in their likelihood of detaining people pretrial — some judges are more lenient than others — so random assignment to judges is almost as good as random assignment to pretrial detention. If one defendant was assigned to a lenient judge and released, while a similar-looking defendant was assigned to a harsh judge and detained, we can reasonably assume that the difference in their detention decisions was due to the judges, not to something about the two defendants that we can’t see. We can then use released defendants as counterfactuals for similar-looking detained defendants.
With this in mind, the researchers compared predictions for similar-looking defendants who were assigned to different judges to estimate what would have happened if detained defendants had been released instead.
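To make the logic of random assignment concrete, here is a minimal sketch in Python. It is not the researchers' code or data: the case records, the judge labels, and the `prior_arrests` field are all hypothetical. The idea is simply that, under random assignment, judges should see similar caseloads on observable characteristics, so differences in their release rates can be read as differences in leniency.

```python
from collections import defaultdict

def summarize_by_judge(cases):
    """Return each judge's release rate and the mean prior arrests of their caseload.

    `cases` is a list of (judge, released, prior_arrests) tuples.
    All three fields are hypothetical, chosen only for illustration.
    """
    totals = defaultdict(lambda: {"n": 0, "released": 0, "prior_arrests": 0})
    for judge, released, prior_arrests in cases:
        t = totals[judge]
        t["n"] += 1
        t["released"] += int(released)
        t["prior_arrests"] += prior_arrests
    return {
        judge: {
            "release_rate": t["released"] / t["n"],
            "mean_prior_arrests": t["prior_arrests"] / t["n"],
        }
        for judge, t in totals.items()
    }

# Made-up caseloads: similar defendants, different release decisions.
cases = [
    ("lenient", True, 0), ("lenient", True, 2), ("lenient", True, 4), ("lenient", False, 2),
    ("harsh", True, 1), ("harsh", False, 3), ("harsh", False, 2), ("harsh", True, 2),
]
summary = summarize_by_judge(cases)
```

In this toy example both judges see defendants with the same average number of prior arrests, yet one releases a larger share of them. A real analysis would need far more cases and a formal balance check before treating the gap as leniency alone.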
The researchers first used their machine-learning algorithm to generate risk scores for both released and detained defendants. They trained the algorithm on the released defendants’ outcomes, since those are the only outcomes available in the data, but random assignment to judges allowed them to use this algorithm to predict the risk of detained defendants as well. They then considered which defendants specific judges detained. If judges are good at predicting future behavior but simply differ in their leniency, we would expect a slightly harsher judge to detain the same high-risk people that a more lenient judge detained, but also some lower-risk people. That is, if we imagine all defendants ranked by risk level, we’d expect harsher judges to detain people farther down the list. In fact, however, the additional people detained by a slightly harsher judge were pulled from throughout the risk spectrum, seemingly at random.
Judges seemed very bad at predicting defendants’ risk levels. They routinely detained and released the wrong people.
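The ranking test described above can be illustrated with a simplified sketch in Python. It is a toy version, not the study's procedure: it assumes we already have a predicted risk score for every defendant in a judge's caseload, and the function name and numbers are hypothetical. If a judge who detains k people were sorting by risk, those k people would be the k highest-risk defendants on the list; the function measures how close the judge comes to that.

```python
def share_detained_from_top(risk_scores, detained):
    """Fraction of a judge's detainees who are among the k highest-risk
    defendants in the caseload, where k is the number the judge detained.

    Returns None if the judge detained no one.
    """
    k = sum(detained)
    if k == 0:
        return None
    # Indices of defendants, highest predicted risk first.
    order = sorted(range(len(risk_scores)), key=lambda i: risk_scores[i], reverse=True)
    top_k = set(order[:k])
    hits = sum(1 for i in top_k if detained[i])
    return hits / k

# Hypothetical risk scores for six defendants, already sorted for readability.
risk = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
by_risk = [True, True, True, False, False, False]   # detains straight down the list
scattered = [True, True, False, False, True, False]  # third detainee is low risk

consistent_share = share_detained_from_top(risk, by_risk)
scattered_share = share_detained_from_top(risk, scattered)
```

A judge who detains straight down the list scores 1.0. A judge whose additional detainees come from elsewhere in the risk spectrum scores lower, which is the pattern the researchers reported.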
The researchers estimated that replacing judges’ decisions with the computer’s risk assessments could substantially improve outcomes without detaining any more people: The computer’s detention decisions would have resulted in 25 percent fewer crimes committed by defendants. Alternatively, we could maintain current crime rates while detaining 42 percent fewer people, thus reducing incarceration rates. These improvements were possible even when racial disparities were held at or below current levels. All around, the algorithm was better than human judges at predicting who should be detained and could be used to reduce racial inequities.
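The two trade-offs in that estimate (fewer crimes at the same detention rate, or fewer detentions at the same crime rate) can be sketched with a toy calculation in Python. This is an illustration under strong assumptions, not the study's method: it treats each hypothetical risk score as the true probability that a released defendant offends, and it ignores the counterfactual problem the researchers had to solve. The 25 and 42 percent figures come from the research, not from this example.

```python
def expected_crimes(risk_scores, detained):
    """Expected offenses among released defendants, assuming each risk score
    is the probability that the defendant offends if released."""
    return sum(r for r, d in zip(risk_scores, detained) if not d)

def detain_highest_risk(risk_scores, k):
    """Detention decisions that hold the k highest-risk defendants."""
    order = sorted(range(len(risk_scores)), key=lambda i: risk_scores[i], reverse=True)
    chosen = set(order[:k])
    return [i in chosen for i in range(len(risk_scores))]

def fewest_detentions_within(risk_scores, crime_budget):
    """Smallest number of risk-ranked detentions that keeps expected crimes
    at or below crime_budget."""
    for k in range(len(risk_scores) + 1):
        if expected_crimes(risk_scores, detain_highest_risk(risk_scores, k)) <= crime_budget:
            return k
    return len(risk_scores)

# Hypothetical caseload and a judge who detains three people, two of them low risk.
risk = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
judge = [True, False, False, True, True, False]

judge_crimes = expected_crimes(risk, judge)
same_count = detain_highest_risk(risk, sum(judge))        # same number detained
rule_crimes = expected_crimes(risk, same_count)           # lower than judge_crimes here
needed = fewest_detentions_within(risk, judge_crimes)     # fewer than sum(judge) here
```

In this made-up caseload, detaining by risk rank lowers expected crimes when the number detained is held fixed, and it matches the judge's expected crime level with fewer detentions. How large those gains are in practice is an empirical question that depends on how accurate the risk scores are.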
This is very different from the conclusion of a recent article published by ProPublica, which boldly claimed that a popular risk-assessment algorithm is wildly inaccurate and biased against black defendants. Many people, including me, criticized the piece. The biggest problem was that the authors compared individuals’ risk scores with post-sentencing outcomes but didn’t know the counterfactual: What would have happened if a “high-risk” defendant had been given a low risk score and set free? If a person labeled high risk did not reoffend, perhaps it was because he was detained pretrial or received a longer sentence. In that context, there’s no way to know if the algorithm got it right or wrong or if it did worse than a judge would have. Kleinberg et al. get us closer to that answer, and their conclusion is much rosier.
In actual courtrooms, these algorithms aid judges’ decisions, rather than replace them. So what happens when a judge uses the risk score as part of the decision process? Does it change the judge’s behavior for the better — and if so, how much? We don’t know yet. To test this, we’ll need random assignment of this policy tool across courtrooms to see how outcomes like recidivism and racial disparities differ across judges who saw the risk score and those who did not. The latest research shows that computers can predict defendants’ behavior better than judges can. Let’s find out how much this helps in practice.