‘Less Bad’ Bias: An analysis of the Allegheny Family Screening Tool

Artificial Intelligence is commonly used to alleviate the cognitive burden of everyday decision-making. But what happens when those decisions grow more consequential, having a direct effect on the personal freedom of others? We were asked to analyze a tool being used locally by the Allegheny County Department of Human Services to determine a child’s risk of being abused, and defend or protest its implementation.

Imagine being on trial. You learn that, as a matter of course, an assessment from something other than the judge and jury will factor significantly into your sentencing: that of a computer.

This is the reality — and an incredibly problematic one — in states like Wisconsin, where algorithmic recidivism forecasting is typically incorporated into sentencing proceedings. There, the tools are proprietary to their creators, and despite persistent criticism, their inner workings remain wholly obfuscated with minimal public oversight (Israni, 2017).

Outsourcing crucial decisions in this manner appears to violate universally adopted human rights laws: it seems unacceptable to leave such pivotal thought processes to a computer when there’s little, if any, indication of its impartiality.

These types of assessments are not exclusive to the world of criminal justice — we are subjected to the outcomes of algorithmic bias with nearly every brush with a computer. Such outcomes go mostly unnoticed, and invite far less outside scrutiny and iterative input than their ability to profoundly impact the world warrants. A notable exception, however, can be found in the Allegheny Family Screening Tool (AFST), whose creators take every possible step to stay open and transparent. If we are to learn anything about the way algorithms used at scale should be implemented, the AFST is the one to teach it. This paper will of course address the ways the AFST, algorithms, and data collection and analysis at large fall short of their perceived impartiality, but its purpose is to walk through the direct and indirect benefits of the tool's implementation, while underscoring the idea that persistent outside skepticism and a system of checks and balances are foundational to the ethical application of algorithms.

To fully grasp the AFST's positive impact, it's important to first note the circumstances that brought the tool into existence (a diagram of the process and the biases involved is provided on page 5). The Allegheny County Department of Human Services receives more than 15,000 calls per year reporting the endangerment of a child. With just over 30 screeners on staff to field these reports (Allegheny County, 2017), the time available to examine the records related to each case restricts their ability to make fully informed recommendations (see Fig. 1, Step 02). Conclusions are therefore largely made by gut check, based on a fraction of the available evidence and on personal bias partially formed by previous screening experience (Hurley, 2018). Despite the best of intentions, this leaves screeners prone to inaccurate judgments. The AFST produces a risk score for the child derived from all available evidence, providing screeners with an informed second opinion before they decide whether to launch a full investigation.

This image was created to accompany an article about the AFST appearing in the Pittsburgh Post-Gazette. I wonder what emotions it was intended to conjure.

One cause for concern about the AFST is that the scores it produces have an extremely limited audience — the families it affects have no opportunity to view their score or understand how it is produced. But this restricted access is also a crucial contributor to the tool's ethical soundness: the score can be viewed only by the screening staff, and is not provided even to caseworkers or judicial parties should the case continue to move through the system (Allegheny County, 2017). This ensures that judicial decisions are not influenced by the score (see Fig. 1, Step 04), and that those decisions aren't then fed back into the tool as direct input for future assessments.

This stands in contrast to the Wisconsin recidivism forecaster described above, which is used specifically by the sentencing judge, and whose sentencing outcomes directly contribute to the statistics used to produce future scores. This creates an accelerated positive feedback loop, in which the algorithm's forecasts become self-fulfilling and harmfully biased. Although this particular facet of the AFST lacks transparency out of necessity, the tool otherwise operates under heavy public oversight, in both its use and its development. The Allegheny County Department of Human Services staff are always cognizant of the gravity of taking a child out of their home, and at no point has the decision to implement this tool in that process been taken for granted. In development, it was subject to input from stakeholders at all levels of the child protection system (Hurley, 2018), and studies of the tool's impact are ongoing as of 2018 (Allegheny County, 2017), meaning that since its inception, it has been under constant scrutiny.
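The difference between these two designs can be illustrated with a purely hypothetical toy model — not the actual Wisconsin or Allegheny algorithm, and every name and parameter below is invented for illustration. Two groups share the same true underlying risk, but one begins with a slightly inflated score from pre-existing bias in the historical data. When recorded outcomes are influenced by the score and then fed back in (a closed loop, as in the sentencing case), the initial gap amplifies; when the score is kept out of the record that trains future assessments (an open loop, closer to the AFST's design), the gap washes out.

```python
def run(self_fulfilling: bool, steps: int = 500, lr: float = 0.05) -> float:
    """Expected-value toy model of a risk-score feedback loop.

    Groups A and B have identical true risk; B starts with a slightly
    inflated historical score due to pre-existing bias in the data.
    Returns the final score gap between the two equal-risk groups.
    """
    true_risk = 0.50
    scores = {"A": 0.50, "B": 0.55}  # learned score per group
    for _ in range(steps):
        for g in scores:
            if self_fulfilling:
                # Closed loop: higher scores produce harsher outcomes,
                # which inflate the recorded rate fed back into the score.
                recorded = min(1.0, max(0.0, true_risk + 1.2 * (scores[g] - true_risk)))
            else:
                # Open loop: recorded outcomes reflect behavior only.
                recorded = true_risk
            scores[g] += lr * (recorded - scores[g])
    return scores["B"] - scores["A"]

print(f"score gap, closed loop: {run(True):.2f}")   # initial bias amplifies
print(f"score gap, open loop:   {run(False):.2f}")  # initial bias washes out
```

The only structural difference between the two runs is whether the score can influence the data it later learns from — a small design choice with a runaway consequence.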

In contrast to the Post-Gazette image above, actual implementation of the tool is far less futuristic—even boring—with heavy human oversight. (source)

This abundance of oversight is also evident in the tool's constant deference to the human screener. As the AFST's frequently asked questions document explains, the tool was never intended to override human judgment, but rather to aid the decision-making process by lowering cognitive load. Even when the tool recommends a "mandatory screen-in," the human screener retains the option to override that recommendation and provide feedback to improve the tool (Allegheny County, 2017).

These measures help make the AFST less susceptible to the flagrant stereotyping often present in the results of widely used tools like Google Search, which are based on algorithms with tight, often unregulated feedback loops. Safiya Umoja Noble reveals the disturbing results of unchecked algorithms whose patterns of assessment are based purely on positive feedback as input: analyzing Google search results for terms related to minority groups, she explains how results presented in an authoritative design language lead uninformed public opinion to be taken as fact, reinforcing the very stereotypes from which those results stemmed (2018).

Unlike Google, the AFST was designed with humans at the heart of the algorithm's operation, and its developers, extremely cautious and mindful of the harm algorithms can cause, are proactive in their efforts to minimize it. Undeniably, some families endure upheaval as the result of miscategorization by the tool, being subjected to unwarranted investigations. But given the measures noted above, this algorithm is likely held to a higher level of scrutiny than any other in place today, and perhaps a higher one than many call screeners.

This chart outlines the steps involved in a child welfare call in Allegheny County, with the largest sources of bias noted as part of the process. The intent is that it can be applied to other highly biased decision-making processes, including those involving algorithms, as a method of critique and evaluation.

Despite all of these checks and balances in place to help ensure that the AFST's algorithm is fair and accurate, it will likely always be subject to criticism based on the data it uses to make its assessments. The tool relies solely on publicly available data, raising concerns that those who depend on public services are more likely to be subject to investigations. This is undoubtedly true, but without the AFST, the county would rely on the same data set, interpreted through a single, unchecked human bias. Furthermore, use of public services doesn't correlate with a worse score: according to the tool's frequently asked questions, for example, a beneficiary of the SNAP program (i.e., a family that receives a welfare allotment for the purchase of food) is more likely to receive a lower AFST score (Allegheny County, 2017).

Although the tool's creators can control which data is utilized and how it is interpreted, as Mimi Onuoha points out, all data, even data collected with the best intentions, is inherently biased and prone to unforeseen reinterpretation. Data collection, by its very nature, sets out to categorize and find patterns (2016, 2017), much like the act of stereotyping. Thus, even though the creators of the AFST take precautions to remove the more obvious demographic biases, and recognize those same biases reflected in the metrics the tool uses, the very fact that it uses data at all means it will always carry some form of partiality. The best we can hope is that the tool is "less bad" (Hurley, 2018) than its 100% human counterparts.

The universal truth is that judgment of others, in any form, is always fraught with inaccuracy, yet it is a necessary element of being a member of a society. The Allegheny Family Screening Tool is striving to do this in the best way possible. The child welfare system in Allegheny County is aware of its failings, and as far as interventions within a system go (Meadows, 1999; Dahle, 2018), the fact that the county chose to fundamentally change the system's structure by adding the tool means it is receptive to monumental shifts for the better.

Thus the tool should continue to be used, and continue to be intensely scrutinized — not only for the sake of the families entangled with it, but as an example for all practices surrounding the handling of algorithms and data. Perhaps one day it will be a flag bearer of ethical implementation, helping to set a national or industry standard. In many ways, this is the perfect situation in which to set such a precedent, where the potential pitfalls and stakes are so obvious, so that more of the world becomes cognizant of the power algorithms wield in less visible places.

Currently, United States policymakers are largely ill-equipped to have important conversations about algorithms and data, likely because so much of the technology is guarded in black boxes. This was painfully evident in the Cambridge Analytica hearings, where senators' questions to Mark Zuckerberg often showcased a complete misunderstanding of the purpose of the hearings. As the lines distinguishing public and private continue to blur, these conversations only grow more important. With the AFST playing a role, we finally have the basis for a quality conversation about what defines an algorithmic tool done right, or rather, "less bad."

Sources Cited

Allegheny County Department of Human Services. (2017, July 20). Allegheny Family Screening Tool frequently asked questions.

Courtland, R. (2018). Bias detectives: The researchers striving to make algorithms fair. Nature, 558(7710), 357–360. doi:10.1038/d41586-018-05469-3

Dahle, C. (2018). Designing for transitions: Addressing the problem of global overfishing. Cuadernos, 213–33.

Hurley, D. (2018, January 02). Can an algorithm tell when kids are in danger?

Israni, E. T. (2017, October 26). When an algorithm helps send you to prison.

Meadows, D. (1999). Leverage points: Places to intervene in a system.

Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. New York: New York University Press.

Onuoha, M. (2016, February 10). The point of collection — Data & Society: Points.

Onuoha, M. (2017). How we became machine readable.