PART 3: Predictive Policing, and Algorithmic Transparency as Anti-Discrimination

The problem with hard-coding and relying on historical data: unpacking case studies

At a general level, it is apparent that algorithms, regardless of their complexity, require designers to “hard code” some classifier or classifiers. Officers deciding where to patrol, for example, often use less sophisticated techniques to confirm which hot spots are actionable. One heuristic they may use is to “eyeball” those areas, drawing on their prior experience with crime in specific neighborhoods. Regressions, which establish relationships between input variables and the outcomes they forecast, also need to limit and specify the number of factors they consider; otherwise, they would output noise rather than any meaningful relationship between their variables. These algorithms might use stepwise regression to maximize their accuracy (forgoing any claim about the causality of the variables included in the model).[21]
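As a rough illustration of the stepwise idea, the sketch below runs forward selection on synthetic data: at each step it adds whichever variable most reduces the residual sum of squares, and it stops once the improvement becomes small. The variable names, the synthetic data, and the `min_gain` stopping threshold are all assumptions made for this example; none of it is taken from an actual policing tool.

```python
import random


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(rows, y, cols):
    """Residual sum of squares of a least-squares fit using an intercept plus `cols`."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, t in zip(design, y))


def forward_stepwise(rows, y, names, min_gain=0.10):
    """Greedily add the variable that most reduces RSS; stop when the relative gain is small."""
    chosen = []
    best = rss(rows, y, chosen)
    while len(chosen) < len(names):
        trials = {c: rss(rows, y, chosen + [c])
                  for c in range(len(names)) if c not in chosen}
        candidate = min(trials, key=trials.get)
        if (best - trials[candidate]) / best < min_gain:
            break
        chosen.append(candidate)
        best = trials[candidate]
    return [names[c] for c in chosen]


# Synthetic data (an assumption): two variables drive the outcome, two are pure noise.
random.seed(0)
names = ["signal_a", "signal_b", "noise_a", "noise_b"]
rows = [[random.gauss(0, 1) for _ in names] for _ in range(200)]
y = [3.0 * r[0] + 1.5 * r[1] + random.gauss(0, 0.5) for r in rows]

selected = forward_stepwise(rows, y, names)
print(selected)
```

The procedure keeps a variable purely because it improves the fit. Nothing in it asks whether the variable causes the outcome, which is the limitation the parenthetical above points to.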

More complex models, such as classification[22] and clustering[23] models, may rely on combining simple predictive models (like linear regressions), or on more computationally intensive data mining to reduce the number of input variables to a select few. The complex predictive policing algorithms that we may call black boxes sit at the extreme end of this data-mining, ensemble-model spectrum; their outputs are less interpretable by a human, save for a “limited readout” of which variables had lighter or heavier weights in the model[24].

Questionnaires are a simpler, more static kind of algorithm for assessing risk, and they demonstrate this emphasis on “hard coding” and its fallibility. One approach that such questionnaires draw upon is the Risk-Needs-Responsivity (RNR) model[25], the leading paradigm in risk-needs assessment in the United States. It can potentially be used at all stages of the criminal justice lifecycle, from pretrial detention to parole boards and releasing authorities, and relies on three principles to accomplish its objective of assessing the causes of criminal behavior and prescribing appropriate correctional treatment:

  1. Risk principle: Criminal behavior can be predicted, and the intensity of treatment should match the offender’s assessed risk.
  2. Needs principle: Productive treatment will focus on the needs and risk factors that strongly correlate with criminal conduct.
  3. Responsivity principle: Rehabilitation programs should match the offender’s ability and learning style, and should take the offender’s relevant characteristics (e.g., motivation, gender, ethnicity) into consideration.

The inability of such models to ascribe predictive power to all potential variables, however, is what leads many of the elements incorporated into these algorithms to act as forbidden proxies for race. One of the main criticisms of the RNR model is that the “inclusion of a bevy of dynamic risk factors has diluted the ability of risk and needs assessment instruments to classify cases accurately”; critics argue that risk and needs should be separated in such algorithms for improved efficiency[26]. Similarly, improperly scoping hot-spot size can exclude areas of interest, or produce a hot spot too large to be patrolled[27].

It is thus by necessity that designers deliberately scope their variables to include only the most relevant and the most predictive: the inputs that promise the most accuracy for their models. This creates a trade-off. Restricting the variables to the most relevant ones increases interpretability and accuracy, while a larger number of variables makes the output more complex, less interpretable, and more opaque.

But what determines relevancy? If relevancy is defined by the predictive power of the variables, and their ability to improve the accuracy of the forecast, such constraints open the door to the model’s over-reliance on group characteristics, introducing latent discrimination. The RNR model is criticized for exactly this. Some may argue that the data it produces can help criminal justice professionals assess risk, but the core concern remains the “[impossibility of making] a determination about how individual members of a group will behave” based solely on generalized predictive group characteristics[28].

Diagnostics grounded in cut-off rates for white inmates, for example, might under- or over-classify minorities; the fate of the latter, who have different needs because of their lived experiences, hinges on this classification and prediction. As one researcher puts it:

“The logic of risk assessments is premised on a set of assumptions that underestimates and devalues social and racial inequality, while simultaneously decontextualizing the criminal act, the offender, and his or her history, and needs to be carefully considered before these tools become further embedded into sentencing practices […]both static and criminogenic risk variables cannot be easily abstracted from the sociopolitical, economic, and cultural specificity of individuals”.[29]

This reliance on a potentially biased methodology and dataset, restricted to only relevant variables, is concerning. Historically, arrests and the use of force by police officers have disparately and negatively affected people of color. This consequently distorts crime statistics, a warping reflected both in high-profile cases (Philando Castile, among other recent shootings of unarmed black men) and in countless other, nameless instances. In August 2016, the Justice Department noted in a lengthy report on Baltimore law enforcement that the black citizens of Baltimore were disproportionately affected by arrests on discretionary offenses[30]. A Stanford study showed similar patterns of disparate treatment in traffic stops and handcuffings in Oakland in 2013 and 2014[31].

However, increasing the number of variables that an algorithm relies upon — and conversely, decreasing interpretability — is similarly not a solution to the race problem.

Richard Berk might argue that avoiding the prickly topic of race as a factor makes the science of predictive policing an impossibility. His position is that it does not matter whether race is “hard-coded” into the algorithm, because the decisions that it makes are “too complicated” anyway. Speaking about the possibility of a racial element in capital punishment, he further implied that an algorithm’s decision to incorporate race as a variable would carry higher predictive power given more data[32]. The Shanghai Jiao Tong University researchers also tout this infallibility of data, saying that:

“Unlike a human examiner/judge, a computer vision algorithm or classifier has absolutely no subjective baggages, having no emotions, no biases whatsoever due to past experience, race, religion, political doctrine, gender, age, etc., no mental fatigue, no preconditioning of a bad sleep or meal.”[33]

The above line of reasoning approaches the form of our previously mentioned “could-is” fallacy. Researchers and programmers presumably vet their datasets, and the algorithm is designed to be as efficient and as optimizing as possible. What’s the issue? Are critics seeing issues in the algorithm where there are none, merely because of their preconceived notions that race and gender, among other prohibited characteristics, are utilized to assess risk? What if these outputs increase accuracy overall, and by altering these models we undermine the progress made in better sentencing?

Though it is true that subjectivity sometimes gets the best of our judges[34], the idea of the “objective” algorithm disregards the possibility that its designers, who trained it on pre-selected predictors, are fallible and prone to their own bias. The seduction of large-scale data predisposes its users and designers towards seeing humans as objects or data points first, and creates a false sense of security in data and the algorithm. The intuitions established on the basis of this data and the conclusions drawn from these algorithmic outputs are all predicated on a false and implicitly biased sense of objectivity.

Consider that artificial intelligence systems “learn” based on the data that they are fed. These systems make recommendations regarding “hot spots” and likely re-offenders by finding statistical patterns and building regression and classification models (Perry et al., p. 22, 35–36). The patterns these systems generate to identify hot spots, and consequently to direct patrols, can themselves be informed by implicit bias. While historical patterns might be useful for a system that identifies housebreaking patterns[35], they are dangerous in the context of hot spots, and violate the rights of the individuals in these zones.

Such hot spots are often designated by officers in advance to build these regression and classification models (Perry et al., p. 22). It should be evident why such a design choice is concerning, especially given that some officers already have a “predictive map stored in [their] brain[s]”[36]. The example of the Chicago PD is perhaps even more alarming; they constructed a “Strategic Subject List” of people likely to be victims or perpetrators of violent crimes, based on network analysis of the people they associate with[37]. Given that social media information might even be tapped as useful for training predictive policing AI systems[38], individuals might feel as though they must control their behavior everywhere, simply by virtue of the lack of transparency regarding what information is being collected on them, and how it is being used against them.

Risk assessment algorithms are thus not impervious to bias, and the interpretability of their outputs matters for holding them accountable. Previously, we discussed the challenge of constraining the number of elements that factor into a risk assessment model to create clear and transparent results. Consider that even without directly drawing upon prohibited characteristics like race or class, criminogenic factors act as heavy proxies for them anyway, to the point where the model cannot help but reflect these biases (Figure 1). If we cannot understand the procedural outcomes because of their opacity, we lose another way to keep algorithms accountable and fair.

(Figure 1. An online interactive visualization from FiveThirtyEight demonstrating how age and education proxy with group characteristics like race and income[39]).
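The proxy effect can be reproduced with a few lines of arithmetic. In the sketch below, a score is computed from neighborhood alone and never sees group membership, yet the average score differs sharply by group because the two groups live in different places. The population counts, the group labels, and the choice of neighborhood as the stand-in feature are all made up for illustration; they are not drawn from the FiveThirtyEight visualization or from any real dataset.

```python
from collections import defaultdict

# Hypothetical toy population: (group, neighborhood, prior_arrest) tuples.
records = (
    [("A", "north", 1)] * 10 + [("A", "north", 0)] * 70
    + [("A", "south", 1)] * 8 + [("A", "south", 0)] * 12
    + [("B", "north", 1)] * 2 + [("B", "north", 0)] * 18
    + [("B", "south", 1)] * 32 + [("B", "south", 0)] * 48
)

# "Training": the score is just the historical arrest rate of each neighborhood.
# Group membership is deliberately ignored here.
totals = defaultdict(int)
arrests = defaultdict(int)
for _group, hood, arrested in records:
    totals[hood] += 1
    arrests[hood] += arrested
score_by_hood = {hood: arrests[hood] / totals[hood] for hood in totals}


def mean_score(group):
    """Average neighborhood-based score received by members of one group."""
    scores = [score_by_hood[hood] for g, hood, _ in records if g == group]
    return sum(scores) / len(scores)


def arrest_rate(group, hood):
    """Observed arrest rate for one group inside one neighborhood."""
    outcomes = [a for g, h, a in records if g == group and h == hood]
    return sum(outcomes) / len(outcomes)


mean_a = mean_score("A")
mean_b = mean_score("B")
print("score by neighborhood:", score_by_hood)
print("mean score, group A:", round(mean_a, 3))  # 0.176
print("mean score, group B:", round(mean_b, 3))  # 0.344
print("south arrest rate, A vs B:", arrest_rate("A", "south"), arrest_rate("B", "south"))
```

Within the "south" neighborhood the two groups have the same arrest rate in this toy data, but group B's average score is nearly double group A's, simply because most of group B lives there. Removing the prohibited attribute from the inputs does not remove its influence.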

The issue of transparency is thus a significant one. Its absence prevents the defendant from being aware of the factors a risk assessment algorithm considers to assess said defendant’s risk. Transparency ensures that, at the very least, an algorithm’s users, subjects, and designers meet some minimum threshold of accountability, an especially significant marker if the algorithm contributes towards larger steps in the criminal justice cycle, amplifying its effect.

Eric Loomis, a defendant in Wisconsin who pled guilty to automobile theft and evading police in 2013, advocated for exactly this. Judge Scott Horne increased his sentence by two and a half years on the basis that “[his] history, [his] history on supervision, and the [COMPAS] risk assessment tools that have been utilized, suggest that [he is at an] extremely high risk to re-offend”[40]. The court argued that the score produced by the Northpointe risk assessment tool alone was not “determinative”, and that, given the use of other “independent factors” (listed above), Loomis’s right to due process was not violated.

The issue here with Northpointe and Loomis’s case is again one of transparency and of influence. While the court may have successfully established its refusal to rely solely on the proprietary software, it was also charmed in part by the ostensibly scientific validity of the Northpointe algorithm. The court argues that a consideration of COMPAS is permissible, but not a “reliance” on COMPAS (taken to mean sole reliance). Its commentary reflects both an enchantment with the algorithm’s “efficiency” and a lack of the knowledge needed to seriously consider its limitations, beyond an acknowledgement that COMPAS had them:

“Likewise, there is a factual basis underlying COMPAS’s use of gender in calculating risk scores. Instead, it appears that any risk assessment tool which fails to differentiate between men and women will misclassify both genders. […] Thus, if the inclusion of gender promotes accuracy, it serves the interests of institutions and defendants, rather than a discriminatory purpose.”

“[T]his court’s lack of understanding of COMPAS was a significant problem in the instant case. At oral argument, the court repeatedly questioned both the State’s and defendant’s counsel about how COMPAS works. Few answers were available.”[41]

Loomis’s fear that the algorithm has indelibly impacted his sentencing is somewhat validated by corroborating accounts of its influence. Consider, for example, Paul Zilly, who had his plea deal overturned by Judge James Babler and turned into a two-year sentence[42] on the basis that his Northpointe recidivism score was “about as bad as it could be”[43]. Such stories reflect the gap in knowledge and transparency that keeps us from seriously critiquing these algorithms, and that leaves it to outside actors to conduct audits on our behalf.

When the results of such assessments are made public, they can often be alarming. ProPublica’s audit of the COMPAS algorithm, for example, found that the false negative rate, the share of re-offenders wrongly marked low risk, was markedly higher for white defendants than for black defendants (48 to 28 percent). For “violent recidivism”, black defendants were twice as likely as white defendants to be falsely classified as higher-risk recidivists, and white recidivists were falsely classified as low-risk 63.2 percent more often than black defendants[44].
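To make the two kinds of misclassification concrete, the sketch below computes a false positive rate (people flagged high risk who did not re-offend) and a false negative rate (people marked low risk who did re-offend) for two groups. The counts are invented for illustration and are not ProPublica's figures; the `Confusion` class and the group names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of risk labels against what actually happened, for one group."""
    true_pos: int   # labeled high risk, did re-offend
    false_pos: int  # labeled high risk, did not re-offend
    true_neg: int   # labeled low risk, did not re-offend
    false_neg: int  # labeled low risk, did re-offend

    def false_positive_rate(self) -> float:
        # Among people who did NOT re-offend, the share wrongly flagged high risk.
        return self.false_pos / (self.false_pos + self.true_neg)

    def false_negative_rate(self) -> float:
        # Among people who DID re-offend, the share wrongly marked low risk.
        return self.false_neg / (self.false_neg + self.true_pos)

    def accuracy(self) -> float:
        total = self.true_pos + self.false_pos + self.true_neg + self.false_neg
        return (self.true_pos + self.true_neg) / total


# Invented counts (assumption): both groups get the same overall accuracy.
groups = {
    "group_x": Confusion(true_pos=70, false_pos=40, true_neg=60, false_neg=30),
    "group_y": Confusion(true_pos=50, false_pos=20, true_neg=80, false_neg=50),
}

for name, c in groups.items():
    print(name,
          "accuracy", round(c.accuracy(), 2),
          "FPR", round(c.false_positive_rate(), 2),
          "FNR", round(c.false_negative_rate(), 2))
```

In this toy data both groups are scored with 65 percent accuracy, yet group_x is wrongly flagged twice as often (0.4 versus 0.2) while group_y is wrongly cleared more often (0.5 versus 0.3). A single accuracy figure can therefore hide errors that fall in opposite directions on different groups, which is why an audit has to break the rates out by group.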



References

[21] Perry, Walter et al. Predictive Policing: The Role Of Crime Forecasting In Law Enforcement Operations. RAND Corporation, 2013. p. 27–30.

[22] Predict a category for an outcome. Ibid, p. 35.

[23] Predict by “clustering” a future situation with a pre-identified set of situations. Ibid, p. 35.

[24] Ibid, p. 36.

[25] James, Nathan. Risk and Needs Assessment in the Criminal Justice System. Rep. no. R44087. Congressional Research Service, 13 Oct. 2015. <https://fas.org/sgp/crs/misc/R44087.pdf>.

[26] Ibid, p. 10.

[27] Perry, Walter et al. Predictive Policing: The Role Of Crime Forecasting In Law Enforcement Operations. RAND Corporation, 2013. p. 20.

[28] Ibid, p. 9.

[29] Hannah-Moffat, Kelly. “Actuarial sentencing: An “unsettled” proposition.” Justice Quarterly 30.2 (2013): 270–296. <http://www.tandfonline.com/doi/abs/10.1080/07418825.2012.682603>.

[30] For more on this, read the Justice Department’s report on the Baltimore Police Department’s practices. A salient statistic:

African Americans accounted for 91 percent of the 1,800 people charged solely with “failure to obey” or “trespassing”; 89 percent of the 1,350 charges for making a false statement to an officer; and 84 percent of the 6,500 people arrested for “disorderly conduct.” (7)

Investigation of the Baltimore City Police Department. Rep. U.S. Department of Justice, 10 Aug. 2016. Web. <https://www.justice.gov/opa/file/883366/download>.

[31] Parker, Clifton B. “Stanford big data study finds racial disparities in Oakland, Calif., police behavior, offers solutions.” Stanford News. Stanford University, 15 June 2016. <http://news.stanford.edu/2016/06/15/stanford-big-data-study-finds-racial-disparities-oakland-calif-police-behavior-offers-solutions/>.

[32] Johnson, Greg. “Q&A with Richard A. Berk.” Penn Current. University of Pennsylvania, 15 Dec. 2011. Web. <https://penncurrent.upenn.edu/2011-12-15/interviews/qa-richard-berk>.

[33] Wu, Xiaolin, and Xi Zhang. “Automated Inference on Criminality using Face Images.” arXiv preprint arXiv:1611.04135 (2016). <https://arxiv.org/pdf/1611.04135.pdf>.

[34] Deruy, Emily. “Judge’s Football Team Loses, Juvenile Sentences Go Up.” The Atlantic. The Atlantic Monthly Group, 7 Sept. 2016. <https://www.theatlantic.com/education/archive/2016/09/judges-issue-longer-sentences-when-their-college-football-team-loses/498980/>.

[35] Rudin, Cynthia. “Predictive Policing: Using Machine Learning to Detect Patterns of Crime.” WIRED. Conde Nast, n.d. <https://www.wired.com/insights/2013/08/predictive-policing-using-machine-learning-to-detect-patterns-of-crime/>.

[36] Chammah, Maurice. “Policing the Future.” The Marshall Project. The Marshall Project, 3 Feb. 2016. <https://www.themarshallproject.org/2016/02/03/policing-the-future#.ejqhOmft7>.

[37] Hvistendahl, Mara. “Can ‘predictive policing’ prevent crime before it happens?” Science. American Association for the Advancement of Science, 28 Sept. 2016. <http://www.sciencemag.org/news/2016/09/can-predictive-policing-prevent-crime-it-happens>.

[38] Young, Sean. “Social Media Will Help Predict Crime.” The New York Times. The New York Times Company, 18 Nov. 2015. <http://www.nytimes.com/roomfordebate/2015/11/18/can-predictive-policing-be-ethical-and-effective/social-media-will-help-predict-crime>.

[39] Barry-Jester, Anna Maria, Ben Casselman, and Dana Goldstein. “Should Prison Sentences Be Based On Crimes That Haven’t Been Committed Yet?” FiveThirtyEight. ESPN, 4 Aug. 2015. <https://fivethirtyeight.com/features/prison-reform-risk-assessment/>.

[40] State v. Loomis. Supreme Court of Wisconsin. 13 July 2016. FindLaw. Web. <http://caselaw.findlaw.com/wi-supreme-court/1742124.html>.

[41] Ibid.

[42] After Northpointe founder Tim Brennan testified, the sentence was reduced to 6 months.

[43] Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. “Machine Bias.” ProPublica. Pro Publica Inc., 23 May 2016. <https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing>.

[44] Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. “How We Analyzed the COMPAS Recidivism Algorithm.” ProPublica. Pro Publica Inc., 23 May 2016. <https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm>.

Story by David Chang.