Ethical Hurdles to Combating Racially Biased Police Algorithms
Adjusting the weight of variables for algorithms used in criminal justice could be promising, or problematic.
At the end of 2019, Jeff Pastor, Councilman for the City of Cincinnati, Ohio, called for a sweeping review of the city’s racial disparities in policing. Pastor implored the City to conduct an evaluation in the wake of a much-publicized report from the Stanford Open Policing Project. The report found significant discrepancies in traffic stops in the city, including a finding that “Cincinnati police make 120% more total stops per resident in predominantly black neighborhoods than white ones.”
The call for review reinvigorated debate among police accountability activists and public policy circles alike. A chief concern, and one that continues to evolve, is what becomes of potentially biased police data as predictive policing and other data-driven strategies become more commonplace. If police interactions are skewed, that data, once fed into a predictive model, can produce biased output, directing police toward biased behavior even without an officer’s knowledge or intent. As the now cliché saying goes, “garbage in, garbage out”.
Fixes Aren’t Exactly Simple
Academics, activists, and hobbyists have already made great strides in bringing the issue of algorithmic fairness to the forefront of mainstream attention. The proliferation of facial recognition has been halted in various locales until reasonable safeguards can be developed. This came after racial disparities were publicized, owing in part to AI being trained on homogeneous data sets that did not include individuals with darker skin tones.
But creating inclusive data sets is not always an applicable fix, especially when it comes to predictive policing based on historical police data. Creating an ethical framework is difficult. With predictive policing, it is not data scientists who create the data sets. Nor is it solely an issue of a set that is too homogeneous; rather, the problem is that the available historical data is potentially tainted before it is ever fed into a predictive model. A statistical model could adhere to the highest standards of scientific scrutiny, but that might not matter when individuals on the ground can introduce biased data.
Built In Expectations
This raises the question of how much data scientists should be playing “preemptive defense,” so to speak.
Some months ago, the Students of Color in Public Policy symposium was held at the UC Berkeley Goldman School of Public Policy, with the topic of Race in Artificial Intelligence as one of the headlining sessions. The panel consisted of policy experts investigating the potential harms of algorithmic unfairness and the ongoing efforts on the part of activists to make government systems more transparent.
The panel discussed a 2017 Stanford study of racial disparities in Oakland police officers’ use of language. Researchers in this study used a computational linguistic model to identify speech patterns. The key point was that the model could detect, with considerable accuracy, whether an officer was speaking to a black or white resident, based exclusively on a transcript of the conversation.
One of the data scientists on the panel suggested that such a study was cause for optimism. Her argument was that if we could identify the rate of racial bias, as this study seemingly did, we could then use an adjustment function to underweight the scores assigned to the demographics being discriminated against.
There are downsides, but prevailing techniques such as statistical parity, true positive rate comparisons, and other “fairness metrics” show promise in identifying bias. In practice, however, applying them is problematic.
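To make these metrics concrete, here is a minimal sketch, using a hypothetical set of binary risk predictions and a hypothetical two-group label, of how statistical parity and true positive rate comparisons are typically computed. The arrays and the 0/1 group coding are illustrative assumptions, not data from any of the studies discussed here.

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Gap in positive-prediction (flag) rates between group 1 and group 0."""
    return y_pred[group == 1].mean() - y_pred[group == 0].mean()

def true_positive_rate_gap(y_true, y_pred, group):
    """Gap in true positive rates between group 1 and group 0 (equal opportunity)."""
    def tpr(y_t, y_p):
        positives = y_t == 1
        return y_p[positives].mean()
    return tpr(y_true[group == 1], y_pred[group == 1]) - tpr(
        y_true[group == 0], y_pred[group == 0]
    )

# Hypothetical predictions for eight individuals split across two groups.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0])
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(statistical_parity_difference(y_pred, group))   # difference in flag rates
print(true_positive_rate_gap(y_true, y_pred, group))  # difference in TPRs
```

Large gaps on either metric flag a disparity worth investigating; on their own, they say nothing about why the disparity exists.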
Does Introducing Weights Excuse Racism?
Imagine police are using “hotspot” software that takes time, location, and historical reports of crime as variables to forecast the risk of a future offense. One possible strategy would be to first identify the propensity for biased policing. Suppose it was determined that a particular grid of a city had a high relative density of racial minorities and was overpoliced. One might, therefore, underweight the variables associated with that city grid, suggesting that those observations in the data set are “less trustworthy”.
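As a rough illustration of that strategy, here is a minimal sketch under the assumption that reports from a flagged grid cell are simply given a lower sample weight when fitting a forecasting model. The grid IDs, the 0.5 weight, and the use of a plain logistic regression are all hypothetical choices made for illustration; no vendor’s actual hotspot software is being described.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historical reports: grid-cell ID, hour of day,
# and whether a subsequent offense was recorded.
grid_id = np.array([3, 3, 3, 7, 7, 12, 12, 12])
hour    = np.array([22, 23, 1, 14, 15, 20, 21, 22])
label   = np.array([1, 1, 1, 0, 0, 1, 0, 1])

# Suppose grid 3 was judged to be overpoliced; treat its reports as
# "less trustworthy" by halving their weight in the fit.
OVERPOLICED_GRIDS = [3]
trust_weight = np.where(np.isin(grid_id, OVERPOLICED_GRIDS), 0.5, 1.0)

features = np.column_stack([grid_id, hour])
model = LogisticRegression()
model.fit(features, label, sample_weight=trust_weight)  # downweighted observations
```

The difficulty, as the rest of this section argues, is not writing such code; it is that choosing the weight amounts to quantifying, and thereby accepting, a level of bias.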
In a scenario where many predictor variables are used, similar to contemporary risk terrain forecasts (as opposed to the “What? Where? When?” hotspot model), differing weights could be assigned to the variables. While there are no current examples of pre-offense predictive policing designed to build a forecast for members of a particular demographic, the idea of adjusting variable weights seems welcomed as a way of addressing bias in recidivism scores.
Over the years, a growing number of jurisdictions across the United States have incorporated “risk assessment scores”, used primarily in parole or bail hearings, to aid in determining the likelihood of a defendant reoffending. A 2016 ProPublica investigation found that:
In forecasting who would re-offend, the [risk assessment] algorithm made mistakes with black and white defendants at roughly the same rate but in very different ways.
The formula was particularly likely to falsely flag black defendants as future criminals, wrongly labeling them this way at almost twice the rate as white defendants.
White defendants were mislabeled as low risk more often than black defendants.
Looking at this, a huge temptation, and one that could be inferred from the testimony of the aforementioned panelist, is: “If Demographic X is being discriminated against at a known Bias Rate Y, and we have a predictive model that includes demographic information, we can adjust the weight of the variables for Demographic X to compensate for the known Bias Rate Y.”
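Expressed as code, the temptation looks deceptively simple. The sketch below assumes a raw risk score and a single estimated bias rate for Demographic X, and simply shrinks the score by that rate; the function name, the 45% figure, and the adjustment formula are all hypothetical and stand in for whatever adjustment a real system might use.

```python
def adjusted_risk_score(raw_score, demographic, bias_rate_by_group):
    """Hypothetical adjustment: shrink the raw risk score for a demographic
    believed to be over-flagged at a known bias rate.

    bias_rate_by_group maps a demographic label to an estimated over-flagging
    rate, e.g. {"X": 0.45} means scores for Demographic X are assumed to be
    inflated by 45%.
    """
    bias_rate = bias_rate_by_group.get(demographic, 0.0)
    return raw_score / (1.0 + bias_rate)

# Example: a raw score of 8.0 for Demographic X with an assumed 45% bias rate.
print(adjusted_risk_score(8.0, "X", {"X": 0.45}))  # roughly 5.5
```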
Despite the best of intentions, there is a serious risk that such a reform implicitly accepts a degree of racism. In both cases, whether underweighting a city grid or underweighting the scores of a particular demographic, a degree of bias is accepted and incorporated into the model.
This is liable to create a false sense that the issue of bias has been “solved”, which could lead to less emphasis being placed on bias training. Law enforcement or the state could come to rely on the model itself to correct for perceived problems. Individuals involved in the justice system may feel less responsible if they think that, in the end, the model will correct for their actions.
Moreover, this introduces new ethical problems. In the case of pre-offense predictive policing, what if there are changing sentiments in law enforcement that outpace the predictive algorithm’s ability to adapt? Predictive models can be poor at accounting for drastic new changes. Suppose that, in a particular instance, law enforcement is more biased than the assigned weight assumes. In that case, will the tendency be to discount that degree of bias because it does not conform to the norm? Suppose that law enforcement is less biased than the weight assumes. In that instance, does the built-in compensation for bias become an unethical privilege for a particular demographic?
Simply put, the rate of bias could also be wrong. These algorithms use historical data, going as far back as ten years in some instances. Is the rate of bias static across time and place? That appears doubtful. Does this then require each jurisdiction that uses algorithms in matters of justice to also quantify how biased its data is? If so, how frequently should this number be audited?
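If jurisdictions did take on that burden, the audit itself could be as simple as recomputing a disparity rate per jurisdiction and per time window, along the lines of the sketch below. The stop records, the demographic labels, and the choice of a yearly window are hypothetical; the point is only that the number would have to be recomputed regularly rather than treated as a constant.

```python
import pandas as pd

# Hypothetical stop records: year, jurisdiction, and the resident's demographic group.
stops = pd.DataFrame({
    "year":         [2012, 2012, 2015, 2015, 2019, 2019, 2019, 2019],
    "jurisdiction": ["A",  "A",  "A",  "A",  "A",  "A",  "B",  "B"],
    "group":        ["X",  "Y",  "X",  "X",  "X",  "Y",  "Y",  "X"],
})

# Share of stops involving group X, per jurisdiction and year: a crude stand-in
# for a periodically audited "bias rate" that is unlikely to be static.
audit = (
    stops.assign(is_group_x=stops["group"].eq("X"))
         .groupby(["jurisdiction", "year"])["is_group_x"]
         .mean()
)
print(audit)
```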
Should We, or Should We Not, Include Demographic Information?
In response to accusations of perpetuating racial bias, PredPol, one of the largest vendors of predictive policing software, has made it a point to emphasize that it does not use certain demographic information, such as race (though that does little to address complaints about racial disparities that arise for other reasons). This might be for the better, but the jury is still out.
Identifying racial disparities with data has proven useful. But, as many activists and scholars have argued (those in the tradition of Foucault’s work on governmentality chief among them), demographic data also risks putting individuals into a historical box and classifying them in bureaucratic categories. Census data, for example, was used in the United States to carry out Japanese internment; in Nazi Germany, it was used to track down the Jewish population. Even if not employed for explicitly malevolent purposes, demographic data can still reinforce a separation. It cements the existence of an “other”.
The question then becomes: what if bias persists even without demographic data, and, by eliminating the data, we have irreparably harmed our collective ability to identify bias?
Conclusion
The use of algorithms in law enforcement does not appear to be a declining fad. As predictive policing, in both pre-offense forecasting and risk assessment, continues to grow, the data science and social science fields must remain vigilant. It cannot be forgotten that there are humans on the other side of the model. Data should be interrogated, best practices followed, and a dialogue shared.