What happens when human workers oversee algorithmic tools?

Kenneth Holstein
5 min read · May 25, 2022


Hao-Fei Cheng, Kenneth Holstein, Anna Kawakami, Venkatesh Sivaraman, Logan Stapleton, Steven Wu, & Haiyi Zhu

This blog post is a follow-up to a recently published AP article, which cites findings from our recent research. The two research projects that are referenced in the AP article as “CMU research” were conducted separately by two research teams, spanning multiple universities. This blog post is written by researchers from just one of those teams, but all relevant research papers are linked at the bottom of this post. Although our teams’ data analyses were conducted separately and involved different analysis choices, it is striking that both teams arrived at some of the same overall conclusions, discussed below.

We appreciate the AP reporters’ efforts to convey our findings as part of their broader story. However, as sometimes happens in journalism that covers academic research, much of what we discussed with journalists over the course of several phone calls seems to have been lost in translation or cut for space. As a result, we believe that the article leaves out context that is necessary for interpreting our findings. The framing and presentation of academic research in mass media can systematically affect how audiences come to understand the findings, and can impact important decisions about science and technology policy. Therefore, we think it is critical to clarify our research findings for a general audience. Below, we address a few key points that we hope will be helpful.

What was the focus of our research?

In public agencies across the country, algorithms are being used to guide important decisions about individuals and families. But amid the controversy over the use of algorithms, the humans who operate these algorithms and make final decisions have received relatively little attention.

We have been studying the adoption of an algorithmic tool called the Allegheny Family Screening Tool (AFST), which has been used to assist child maltreatment hotline workers’ screening decisions at Allegheny County’s Office of Children, Youth, and Families (CYF) since 2016.

In our research, we investigated how call workers weigh algorithmic recommendations alongside their own judgment when deciding whether to investigate possible cases of child maltreatment. In cases where call workers do disagree with the algorithm, we examined how the quality and demographic disparities of the resulting decisions may be affected. Our research found that call workers help to mitigate the risks of algorithmic errors and reduce demographic disparities in the resulting decisions. Our work demonstrates the importance of empowering workers to disagree with algorithmic recommendations.

What does “hypothetical” mean in our study designs?

The AP article is written in a way that seems to imply that we (researchers) are in disagreement with representatives from Allegheny County regarding whether our analysis is “hypothetical” or not. There is no actual disagreement here. The algorithm is part of a human decision process, and the output of the algorithm was just one of many sources of information that call workers had access to. The actual phone call, which is a fundamental element of the decision process, is not observed by the algorithm.

Our research intentionally compared workers’ actual decisions with what would have happened in a hypothetical scenario in which workers uncritically agreed with every recommendation from the tool. We find this analysis important because, even though the county instructs call workers to use the algorithm only as an additional source of complementary information, there is a risk that call workers could indiscriminately adhere to recommendations rather than exercise appropriate discretion. Additionally, in cases where the algorithm’s risk score exceeds a certain threshold, call workers require approval from a supervisor to override the algorithm’s recommendation, which creates an additional source of friction for disagreement in these cases. Both Allegheny County officials and our research team think an “autonomous algorithm” scenario is a bad idea.
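To make the contrast concrete, here is a minimal sketch (in Python, with entirely made-up data, threshold, and column names that do not reflect the AFST’s actual configuration or our analysis code) of what the hypothetical “uncritical agreement” scenario means as an analysis construct:

```python
import pandas as pd

# Hypothetical example data: one row per referral.
# "risk_score" stands in for the tool's output; "worker_screened_in"
# is the call worker's actual decision. All values are illustrative.
referrals = pd.DataFrame({
    "risk_score": [3, 18, 12, 7, 20, 15],
    "worker_screened_in": [0, 1, 0, 1, 1, 0],
})

# Hypothetical scenario: workers adopt the tool's recommendation wholesale,
# screening in whenever the score meets an assumed (made-up) threshold.
ASSUMED_THRESHOLD = 14
referrals["algorithm_screened_in"] = (
    referrals["risk_score"] >= ASSUMED_THRESHOLD
).astype(int)

# Cases where the worker's judgment departs from the recommendation.
disagreements = referrals[
    referrals["worker_screened_in"] != referrals["algorithm_screened_in"]
]
print(disagreements)
```

The point of this sketch is only to show how observed decisions can be contrasted with a fully algorithm-driven counterfactual; it is not our actual analysis pipeline.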

In fact, our results demonstrated the value of the County’s staff of trained hotline workers, whom we spoke with extensively as part of our research. Had these workers uncritically followed the algorithm instead of applying their discretion, differences in screen-in rates across demographic groups would likely have been greater.

What disparities were we measuring?

As discussed in our paper (linked below), the primary disparity measure we explored is the difference in the screen-in rate between Black children versus white children. This corresponds to one of the simplest and most popular algorithmic fairness notions, called demographic parity.
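For readers less familiar with this metric, here is a minimal illustrative sketch (in Python, with made-up numbers and column names) of how such a screen-in rate gap can be computed, both for observed worker decisions and for a hypothetical algorithm-only rule like the one described above:

```python
import pandas as pd

# Made-up example: compare the Black/white screen-in rate gap under
# workers' actual decisions versus hypothetical fully algorithmic
# decisions. None of these values are real data.
decisions = pd.DataFrame({
    "group": ["Black", "Black", "Black", "white", "white", "white"],
    "worker_screened_in":    [1, 0, 1, 1, 0, 0],
    "algorithm_screened_in": [1, 1, 1, 1, 0, 0],
})

def parity_gap(df: pd.DataFrame, column: str) -> float:
    """Difference in screen-in rates between Black and white children."""
    rates = df.groupby("group")[column].mean()
    return rates["Black"] - rates["white"]

print("Gap under worker decisions:    ", parity_gap(decisions, "worker_screened_in"))
print("Gap under algorithm-only rule: ", parity_gap(decisions, "algorithm_screened_in"))
```

A smaller gap under the worker decisions than under the algorithm-only rule corresponds to the pattern we describe next.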

Our results demonstrate that call workers’ disagreements with algorithmic recommendations served to reduce disparities between Black and white children, compared to a hypothetical scenario where they had uncritically agreed with the algorithm. This is an interesting and important finding in itself, in light of prior research (conducted in other settings, such as criminal justice) that suggests human discretion in algorithm-assisted decision-making often increases disparities. However, it is important to note that disparities in screening rates, on their own, do not necessarily imply unfairness. For instance, a higher screen-in rate for one group may be justified if there is truly higher need or higher risk of maltreatment among children in that group. For this reason, naively trying to equalize screening rates across different demographic groups without regard for actual children’s needs and safety could be harmful and counterproductive.

It is clear, however, from our interviews and observations of call workers and their supervisors that this naive approach is not how workers reduced disparities in algorithmic recommendations. Rather, workers engaged in holistic assessments of child risk and safety based on all of the information available to them, while attempting to compensate for limitations they perceived in the algorithm.

Why does this matter?

Our research, though conducted specifically in a child welfare setting, speaks to broader questions around the use of algorithms in public decision-making. In the best case, combining algorithms and human experts would result in better decisions — but in many cases, it’s more complicated than that. In fact, prior research in settings such as criminal justice has often shown just the opposite: humans may selectively apply algorithmic recommendations, resulting in decisions that perpetuate or even amplify existing biases.

To some strong proponents of the use of algorithms in high-stakes decision-making, findings like these might raise the question: if workers’ oversight may drive down the quality of some decisions, how much discretionary power should workers really have? Our work, along with other research that was referenced in the AP article, suggests caution in overgeneralizing prior results to all decision contexts involving algorithms. At least in the case of child maltreatment hotline screening, workers seem to play an essential role in moderating negative impacts of algorithmic recommendations.

The contrast between our findings and those of prior studies might be due to a number of factors, including the expertise of the decision makers involved, the information workers have access to, organizational policies surrounding the use of algorithmic recommendations, the way workers are trained to use an algorithm, and so on. Although we agree with arguments that human oversight alone is not a panacea, our research provides important evidence that supporting workers’ discretionary power can be critical for the responsible deployment of algorithms in high-stakes settings.

Where can you go to learn more?

You can find our research paper linked below:

The second research paper that is referenced in the article (conducted by a separate research team) can be found here:

If you are interested in learning more about call workers’ and supervisors’ experiences and challenges in working with the AFST, you may also be interested in this paper:

Kenneth Holstein

Asst Prof @ CMU’s Human-Computer Interaction Institute