Ethics for powerful algorithms (1 of 4)

COMPAS has been in the news a lot lately. It’s a proprietary algorithm widely used by judges and parole officers to set bail, adjust sentences, and determine terms for parole. If you are arrested in the U.S. today, COMPAS or an algorithm like it will likely influence if, when and how you walk free.

The idea of algorithms sitting in judgement on human prisoners is unsettling, to say the least. We’re not yet used to machines exercising that kind of power. So it’s no surprise that COMPAS landed on the front page when ProPublica published a detailed investigative article with the tagline, “There’s software used across the country to predict future criminals. And it’s biased against blacks.”

I’m a professional data scientist — I build algorithms and data systems for a living. This series is an inside look at the ethics of powerful algorithms. Here’s the roadmap:

  1. Part one is a re-examination of ProPublica’s analysis, because COMPAS isn’t biased — at least, not in the way they make out.
  2. Part two is a crash course on the values at stake in the design of predictive algorithms, taking COMPAS as a case study.
  3. Part three is a handbook for navigating tradeoffs between values when using predictive modeling in criminal justice.
  4. Part four takes a wider look at the increasing influence that algorithms have in our society, and a playbook for ensuring that they make our world better, not just more efficient.

Let’s start with transparency.

It’s fantastic that ProPublica went to the effort to vet COMPAS and start a conversation around algorithmic justice. Northpointe Inc., the company that sells the COMPAS software, didn’t make it easy. Although Northpointe published a 2009 paper on the algorithm’s effectiveness, that paper keeps the inner workings of the algorithm secret and addresses race only indirectly. When approached by the ProPublica team, Northpointe refused to share data, and disputed the findings.

As a result, the ProPublica team had to jump through a lot of hoops to test the COMPAS algorithm for bias. They found a cooperative sheriff’s office in Broward County, FL, filed a public records request for two years of COMPAS scores, downloaded public criminal records, and then cleaned, matched, and analyzed data for 18,610 people. This is exactly how investigative journalism should work — diving deep to find the truth in areas of society that rarely see much sunshine.

Scrappy but resourceful reporters. Shadowy, stonewalling corporation. The setup seems like a John Grisham novel, and it’s clear which side we’re supposed to sympathize with.

That’s why I was shocked when I retraced the steps of ProPublica’s analysis and discovered the truth: COMPAS isn’t statistically biased against African Americans.

COMPAS isn’t biased.

To ProPublica’s great credit, they shared all their data and analysis code on GitHub.

That meant that it took me less than an hour to download the Broward County data and run some basic checks for racial bias. At the time, I wasn’t expecting to refute the ProPublica analysis. I was just trying to create the one graph that would provide the smoking gun to convict a statistically biased algorithm: a conditional effects plot.

I had been surprised that ProPublica left this plot out of their analysis, because it’s a simple and effective way of highlighting the bias in an algorithm. A 2016 paper by Skeem and Lowenkamp analyzing a similar algorithm (and cited in the ProPublica analysis) includes such a graph as Figure 1.

A conditional effects plot from Skeem and Lowenkamp (2016)

Let’s take a second to learn how to read these plots, so that you can evaluate the evidence for yourself.

In these conditional effects plots, offenders’ risk scores are grouped on the x-axis. Probability of a second arrest is shown on the y-axis. Four separate lines show the average rates of recidivism for all black and white offenders, and black and white violent offenders.

An algorithm is racially unbiased if and only if black and white offenders with the same risk scores recidivate at the same rate. That means that the conditional effects lines for black and white prisoners must fall nearly on top of each other. The algorithm studied by Skeem and Lowenkamp is essentially unbiased, so their conditional effects plot follows that pattern.
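For readers who would rather see the idea as code, here is a minimal sketch of the numbers behind a conditional effects plot: group people by race and risk score, then compute the share in each cell who were re-arrested. The function name `conditional_rates`, the record layout, and the toy records are all made up for illustration; this is not ProPublica’s code or the Broward County data.

```python
from collections import defaultdict


def conditional_rates(records):
    """Return {(group, score): share of people in that cell who were re-arrested}."""
    counts = defaultdict(lambda: [0, 0])  # (group, score) -> [re-arrests, total]
    for group, score, rearrested in records:
        cell = counts[(group, score)]
        cell[0] += int(rearrested)
        cell[1] += 1
    return {key: hits / total for key, (hits, total) in counts.items()}


# Invented toy records, NOT real data: (group, risk score, re-arrested?)
toy_records = [
    ("black", 2, False), ("black", 2, False), ("black", 2, True), ("black", 2, False),
    ("white", 2, False), ("white", 2, True), ("white", 2, False), ("white", 2, False),
    ("black", 8, True), ("black", 8, True), ("black", 8, False), ("black", 8, True),
    ("white", 8, True), ("white", 8, False), ("white", 8, True), ("white", 8, True),
]

rates = conditional_rates(toy_records)
for (group, score), rate in sorted(rates.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(f"score {score}, {group}: {rate:.2f}")
```

In this toy example the two groups have identical re-arrest rates at each score, which is the pattern the definition above calls unbiased. Plotting each group’s rates against score would give two lines lying on top of each other.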

If the algorithm had been biased against blacks, we would have seen a signature like this. (Let’s leave out the two extra lines for violent crime — they would follow the same pattern, with different slopes.)

Here the line for black prisoners is pushed down and to the right of the line for white prisoners, indicating that black and white defendants with the same risk scores recidivate at different rates. For any given level of risk, the algorithm assigns higher scores to black prisoners than white prisoners.

Most of the anecdotes in the ProPublica article paint this sort of picture: black and white criminals who appear similar but were assigned different risk scores. If that were happening systematically, it would be strong evidence of racial bias in the COMPAS algorithm.

Is that the case?

Here’s the conditional effects plot for COMPAS, generated from ProPublica’s own data:

There’s no evidence of racial bias here. The COMPAS-to-recidivism lines for black and white offenders are closely aligned over the full range of scores. If anything, there’s a slight bias against white offenders, but it’s within the margin of error for each group of COMPAS scores.
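One way to make “within the margin of error” concrete is sketched below. It compares the re-arrest rates of two groups at a single score using a normal-approximation 95% interval on the difference. This is an assumption about how such a check could be done, not necessarily the method behind the plot, and the function names and counts are invented for illustration.

```python
import math


def rate_and_stderr(rearrests, total):
    """Observed rate and its binomial standard error (normal approximation)."""
    rate = rearrests / total
    stderr = math.sqrt(rate * (1 - rate) / total)
    return rate, stderr


def gap_within_margin(a, b, z=1.96):
    """a and b are (rearrests, total) for two groups at the same risk score.

    Returns (gap, margin, True if |gap| <= margin), with margin = z * SE of the gap.
    """
    rate_a, se_a = rate_and_stderr(*a)
    rate_b, se_b = rate_and_stderr(*b)
    gap = rate_a - rate_b
    margin = z * math.sqrt(se_a ** 2 + se_b ** 2)
    return gap, margin, abs(gap) <= margin


# Invented counts, NOT the Broward County data.
small_gap, small_margin, small_ok = gap_within_margin((60, 100), (55, 100))
big_gap, big_margin, big_ok = gap_within_margin((600, 1000), (500, 1000))
print(small_ok, big_ok)
```

With only 100 people per group, a five-point gap is well inside the margin. With 1,000 per group, a ten-point gap is not. The same logic applies score by score along the plot.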

(Here’s a link to my source code, for others to replicate my analysis.)

What next?

I was very surprised at this result. Instead of finding a clearer way to illustrate ProPublica’s claim, I had inadvertently found evidence that the claim (“COMPAS is biased against blacks”) simply isn’t true.

So I checked and double-checked my analysis. I ran it past several friends and co-workers who are well-versed in statistics. In the end, we all agreed: conditional means are the right way to test for statistical bias, and the COMPAS algorithm is unbiased.

In a way, that’s fortunate, because it creates an opportunity to look at how powerful algorithms can be deeply unfair, even when they’re statistically unbiased.

Let me say that again, in bold. Because this is the conversation that we should be having on the ethics of powerful algorithms:

Powerful algorithms can be harmful and unfair, even when they’re unbiased in a strictly technical sense.

In my next post, I’ll take a look at three ways that may be happening with COMPAS.

P.S. If you want to understand how ProPublica got biased-looking numbers even though COMPAS isn’t biased, check out this Google Doc. Long story short, it’s a variation on Simpson’s paradox that can show up when you split a numerical risk score into arbitrary “high,” “medium,” and “low” groups.
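Here is a small sketch of how that can happen. It is my own illustration, not the argument from the linked doc, and every number in it is hypothetical. Both groups share exactly the same re-arrest probability at each score (score divided by 10), but their scores are distributed differently. Once the score is cut into “high” versus “not high” at an arbitrary threshold, the share of non-reoffenders labeled high risk differs between the groups.

```python
def false_positive_rate(score_weights, cutoff):
    """Share of people who are NOT re-arrested but have a score >= cutoff.

    score_weights maps score -> relative number of people with that score.
    Assumes, hypothetically, P(re-arrest | score) = score / 10 for every group.
    """
    flagged = 0.0
    all_not_rearrested = 0.0
    for score, weight in score_weights.items():
        not_rearrested = weight * (1 - score / 10)
        all_not_rearrested += not_rearrested
        if score >= cutoff:
            flagged += not_rearrested
    return flagged / all_not_rearrested


scores = range(1, 11)
group_a = {s: 11 - s for s in scores}  # hypothetical: mostly low scores
group_b = {s: 1 for s in scores}       # hypothetical: scores spread evenly

fpr_a = false_positive_rate(group_a, cutoff=5)
fpr_b = false_positive_rate(group_b, cutoff=5)
print(f"group A: {fpr_a:.3f}, group B: {fpr_b:.3f}")
```

The score means the same thing for both groups at every level, yet the binned error rates come out different purely because of where each group’s scores sit relative to the cutoff.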

PPS: The doc isn’t mine — I found it linked from the comments of the ProPublica piece. (Yes, I read the comments.)

PPPS: Northpointe published an academic-style rebuttal to ProPublica’s article. It’s very dense, 37 pages long, and appears to be largely correct. In response, ProPublica doubled down on their original claims. When I reached out to Julia Angwin, the lead author on the ProPublica piece, her response was a pleasant but perfunctory, “We have, in fact, thought deeply about the statistical analysis and whether we applied the correct techniques.” Not very persuasive…