A Case for Humans-in-the-Loop:
Decisions in the Presence of Erroneous Algorithmic Scores
Risk assessment tools are increasingly being incorporated into expert decision-making pipelines across domains such as criminal justice, education, health, and public services [1,2,3,4]. These tools, which range from simple regressions to more complex machine learning models, distill available information on a given case into a risk score reflecting the likelihood of one or more adverse outcomes.
Most often, these tools are not meant to make autonomous decisions, but rather to provide a recommendation or a summary of potentially relevant information that an expert can incorporate into their decision-making process. In these contexts, the successful deployment of the tool relies on the complementarity of humans and algorithms. Yet much remains understudied about how decision-making changes when risk assessment tools are introduced, and about which elements of tool design and deployment produce improved decision outcomes.
Using data from before and after the deployment of a risk assessment tool in a child welfare decision-making setting, our work studies the question:
Are decision-makers (here, child abuse hotline call workers) able to identify cases where the tool significantly misestimates risk and to dismiss such information? Our analysis relies on a technical issue that arose in deployment, which resulted in input features, and hence also model outputs, being miscalculated in real time.
We find that call workers altered their behavior after the tool was deployed, and that they were less likely to adhere to the machine’s recommendation when the displayed score was misestimated. This behavior persists even when overriding the recommendation requires supervisory approval.
The Allegheny Family Screening Tool
In the summer of 2016, Allegheny County implemented an algorithmic decision-support tool for use in child abuse call screening. The tool, known as the AFST (Allegheny Family Screening Tool), was developed in an effort to better concentrate resources on investigating cases where the children are at greatest risk of adverse child welfare outcomes [6,7]. Many were hopeful that deploying such a tool could help more children who are in danger. Others expressed significant concern that such tools would lead to biased decision-making that subjects families experiencing poverty to heightened scrutiny.
The AFST was designed to assist call workers in deciding which calls to the child abuse hotline should be screened in for further investigation, and which should be screened out. It is a tool meant for recommendation, not automation. The score produced by the tool is based on an estimate of the likelihood that the children involved in a given call will experience adverse child welfare system events in the near future.
During a call, the worker is presented with a risk score (1–20, low to high risk) that is estimated using features queried from multi-system administrative data. The intended use of the tool is to help workers identify high-risk cases in instances where the information communicated in the call may be insufficient, inconclusive, or otherwise incomplete in reflecting the long-arc risk of the children. Usage guidelines were created to strongly encourage screen-ins (investigations) for the highest scoring cases. Figure 1 shows the decision pipeline.
Some time after deployment, it was discovered that a technical glitch had resulted in a subset of model inputs being incorrectly calculated in real time. This in turn led to misestimated risk scores being shown for some cases. While the misestimation was often mild, and the shown score generally provided reasonable risk information, the glitch afforded us a rare opportunity to investigate real-world decision making in the presence of misestimated risk. Figure 2 shows the magnitude of the glitch.
Before proceeding, we pause to make an important point. These types of technical issues are not uncommon. What is uncommon is for organizations to choose to be transparent about their occurrence. We recognize Allegheny County for their transparency and hope that this approach will become the norm in the deployment of algorithmic systems in sensitive societal domains.
Algorithm aversion and automation bias
There are two competing tendencies that have been observed in the literature on human decision-making in the presence of automated decision-support systems: algorithm aversion and automation bias.
- Algorithm aversion is the tendency to ignore a tool’s recommendations after seeing that they can be erroneous. A lack of agency, low algorithmic transparency, and low accuracy all decrease users’ reliance on the system. Algorithm aversion has been widely discussed in the context of recidivism prediction.
- Automation bias is the tendency to follow recommendations despite available information indicating that the recommendation is wrong. A prominent example is that of crews in high-tech cockpits, who have been found prone to relying blindly on automated cues as a heuristic replacement for vigilant information seeking.
Algorithm aversion has mostly been discussed in prognostic tasks, where predictions pertain to future outcomes, the levels of automation are often lower, and the degree of uncertainty is higher. Automation bias, in contrast, has mostly been found in diagnostic tasks with high levels of automation and a low degree of uncertainty, where it can be assumed that a “ground truth” is in principle available to humans.
Did algorithm aversion lead to the tool being ignored altogether?
Our analysis of the data demonstrates that call workers did alter their behavior in the post-deployment period.
Figure 3 shows the screen-in rates across values of the assessed risk score. The steeper slope of the post-deployment curve, particularly for very high and very low risk cases, indicates that post-deployment screen-ins are better aligned with the score. We see a pronounced increase in the screen-in rate for the highest-risk cases, and a pronounced decrease in the screen-in rate for low- and moderate-risk cases. Meanwhile, the overall screen-in rate did not change between the pre- and post-deployment periods, likely due to resource constraints, remaining around 45%.
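As a minimal sketch of the comparison behind Figure 3, the snippet below computes screen-in rates per (period, score) cell. The field names (`period`, `score`, `screened`) and the toy records are hypothetical illustrations, not the study's actual data schema.

```python
from collections import defaultdict

def screen_in_rates(cases):
    """Compute the screen-in rate per (period, score) from case records.

    Each case is a dict with hypothetical keys:
      'period'   -- 'pre' or 'post' deployment
      'score'    -- assessed risk score, 1-20
      'screened' -- True if the call was screened in
    """
    counts = defaultdict(lambda: [0, 0])  # (period, score) -> [screen-ins, total]
    for c in cases:
        key = (c["period"], c["score"])
        counts[key][0] += int(c["screened"])
        counts[key][1] += 1
    return {k: s / n for k, (s, n) in counts.items()}

# Toy illustration: post-deployment decisions align more closely with the score.
cases = [
    {"period": "pre", "score": 20, "screened": True},
    {"period": "pre", "score": 20, "screened": False},
    {"period": "post", "score": 20, "screened": True},
    {"period": "post", "score": 20, "screened": True},
    {"period": "post", "score": 2, "screened": False},
]
rates = screen_in_rates(cases)
```

Plotting `rates` against the score, separately for the pre and post periods, yields curves of the kind shown in Figure 3; a steeper post-deployment curve indicates better alignment with the score.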
Did automation bias lead to blind adherence to the tool’s recommendations?
We find that call workers were less likely to adhere to erroneous recommendations.
For this analysis, we use the fact that the shown score did not always correspond to the assessed score during the analyzed post-deployment period. Therefore, in some cases the score shown corresponded to an underestimation of risk, while in others it corresponded to an overestimation.
Figure 4 (left) shows the percentage of screen-ins for a binned version of the shown score. Correct indicates that the assessed score equals the shown score, underestimation means that the shown score was lower than the assessed score, and overestimation means that the shown score was higher than the assessed score. If call workers blindly followed the risk tool, the screen-in rates within each score bucket would be the same across all three classes. We observe something very different: among cases with similar shown scores, those for which the score was underestimated are screened in at much higher rates than the others. This means that humans are, in the aggregate, able to identify that risk is being underestimated for these cases. It suggests that call workers make use of other pieces of information, either from the call or directly from the administrative data system, and respond appropriately by screening in with higher probability.
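The classification and per-bucket comparison described above can be sketched as follows. The field names (`shown`, `assessed`, `screened`), the bucket width, and the toy records are hypothetical illustrations under the stated assumptions, not the study's actual pipeline.

```python
from collections import defaultdict

def miscalibration_class(shown, assessed):
    """Label the shown score relative to the correctly assessed score."""
    if shown < assessed:
        return "underestimation"
    if shown > assessed:
        return "overestimation"
    return "correct"

def rates_by_class(cases, bucket_size=5):
    """Screen-in rate per (shown-score bucket, miscalibration class).

    Each case is a dict with hypothetical keys 'shown' and 'assessed'
    (scores 1-20) and 'screened' (bool).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [screen-ins, total]
    for c in cases:
        bucket = (c["shown"] - 1) // bucket_size  # scores 1-20 -> buckets 0..3
        key = (bucket, miscalibration_class(c["shown"], c["assessed"]))
        counts[key][0] += int(c["screened"])
        counts[key][1] += 1
    return {k: s / n for k, (s, n) in counts.items()}

# Toy illustration: an underestimated low shown score is still screened in.
cases = [
    {"shown": 3, "assessed": 15, "screened": True},  # glitch: risk underestimated
    {"shown": 3, "assessed": 3, "screened": False},  # shown score correct
]
rates = rates_by_class(cases)
```

Blind adherence to the tool would make the rates within a bucket identical across the three classes; a systematically higher rate for the underestimation class is the pattern reported for Figure 4 (left).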
While Figure 4 (left) allows us to observe certain patterns, it does not differentiate between over- and underestimations of different magnitude. For that reason, we zoom in on the decisions made for assessed mandatories, i.e., cases that would have been flagged as very high risk in the absence of the glitch. Figure 4 (right) shows the proportion of cases with an assessed mandatory screen-in that were actually screened in, for each bucket of shown score. This rate is approximately constant across all buckets, suggesting that the underestimation of the score had no effect on the call workers’ actions and that they were able to make use of other information to identify these high-risk cases. Even when the shown placement score was more than 12 points lower than the assessed placement score, a screen-in was still as likely as for a shown mandatory case. Screen-in rates are uniformly very high for cases that should have been mandatory screen-ins, irrespective of the score shown.
Lessons for future design and deployment of recommendation systems
In our workshop paper, we reflect on the findings of our analysis as well as on past studies of the AFST [6,7]. We identify two sets of characteristics of the deployment that may inform the design of future systems and research studies.
Deployment of the tool:
- Task design and instructions: call workers had received explicit instructions to treat the score as complementary information and not as a replacement of their own judgement.
- Expertise of the user: call workers had extensive experience in decision making before the deployment of the tool and remained the ultimate decision makers.
- Availability of data sources: call workers had access to the administrative data system, which was not affected by the glitch. They had practice incorporating information from the data system into their decision-making process.
- Absence of explanations: while explanations are typically thought of as desirable, and might have led to the glitch being detected earlier, they could also have misled users into over-relying on the tool, making them less likely to seek out further information from the data system.
- Audit of the risk assessment tool: the discovery of the glitch was possible because the team responsible for the deployment of the tool had engaged in continuous auditing of the system.
- Audit of the decision process: periodic — instead of one-time — audits of the tool should consider the framework into which the tool is embedded to evaluate how users are interpreting and using its recommendations.
- Audit of stakeholders: it is possible that the recurrent audits and engagement of stakeholders increased the vigilance of call workers to potential mistakes of the tool.
Due to the observational nature of the data and the systematic nature of the glitch, we are unable to definitively state which of these factors most greatly contributed to the positive observed decision behavior. Further research is needed to experimentally assess the effects of different algorithm design and deployment factors on the overall success of algorithm-in-the-loop decision-making systems.
References

[1] Amanda Kube, Sanmay Das, and Patrick J. Fowler. 2019. Allocating interventions based on predicted outcomes: A case study on homelessness services. In Proceedings of the AAAI Conference on Artificial Intelligence.
[2] Danielle Leah Kehl and Samuel Ari Kessler. 2017. Algorithms in the criminal justice system: Assessing the use of risk assessments in sentencing.
[3] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1721–1730.
[4] Vernon C. Smith, Adam Lange, and Daniel R. Huston. 2012. Predictive modeling to forecast student outcomes and drive effective interventions in online community college courses. Journal of Asynchronous Learning Networks 16, 3 (2012), 51–61.
[5] Maria De-Arteaga*, Riccardo Fogliato*, and Alexandra Chouldechova. 2020. A case for humans-in-the-loop: Decisions in the presence of erroneous algorithmic scores. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI). ACM.
[6] Alexandra Chouldechova, Diana Benavides-Prado, Oleksandr Fialko, and Rhema Vaithianathan. 2018. A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In Conference on Fairness, Accountability and Transparency. 134–148.
[7] Anna Brown, Alexandra Chouldechova, Emily Putnam-Hornstein, Andrew Tobin, and Rhema Vaithianathan. 2019. Toward algorithmic accountability in public services: A qualitative study of affected community perspectives on algorithmic decision-making in child welfare services. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, 41.
[8] Joa Sang Lim and Marcus O’Connor. 1995. Judgemental adjustment of initial forecasts: Its effectiveness and biases. Journal of Behavioral Decision Making 8, 3 (1995), 149–168.
[9] Michael Yeomans, Anuj Shah, Sendhil Mullainathan, and Jon Kleinberg. 2017. Making sense of recommendations. Journal of Behavioral Decision Making (2017).
[10] Kun Yu, Shlomo Berkovsky, Dan Conway, Ronnie Taib, Jianlong Zhou, and Fang Chen. 2016. Trust and reliance based on system accuracy. In Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization. ACM, 223–227.
[11] Matthew DeMichele, Peter Baumgartner, Kelle Barrick, Megan Comfort, Samuel Scaggs, and Shilpi Misra. 2018. What do criminal justice professionals think about risk assessment at pretrial? Available at SSRN 3168490 (2018).
[12] Kathleen L. Mosier, Linda J. Skitka, Susan Heers, and Mark Burdick. 1998. Automation bias: Decision making and performance in high-tech cockpits. The International Journal of Aviation Psychology 8, 1 (1998), 47–63.
[13] Riccardo Fogliato*, Maria De-Arteaga*, and Alexandra Chouldechova. 2020. Lessons from the deployment of an algorithmic tool in child welfare. In Fair & Responsible AI Workshop at CHI 2020.