Human-in-the-Evaluation: AI Roadmap to Culturally Responsive Evaluation

Published in Fidutam · Apr 19, 2024

Authored by: Maya Sherman, Editorial Writer, Fidutam
Edited by: Leher Gulati, Editorial Director, Fidutam

Recent times have presented a surrealist vision of data-driven reality, saturated with generative AI and open-source interfaces. The symbiotic intersection of evidence-based policy, evaluation, and data science techniques is not novel to the scholarly community. Nonetheless, the current AI revolution offers unprecedented socio-cultural opportunities for responsible evaluation in the international community. Here, we discuss the promise and peril of generative AI in the design of culturally responsive evaluation (CRE), considering the multifaceted role of data innovation in human decision-making. This article positions generative AI as a socio-cultural accelerator for evaluators while confronting its loopholes, in a bid to promote responsible practices of evaluative synthesis.

To do so, we analyse the potential role of Artificial Intelligence (AI) in redesigning the discipline while acknowledging its ambiguous impact on evaluative synthesis. The article concludes with suggested solutions that promote responsible evaluation through AI-driven feedback and evaluation prototyping. These solutions underline the growing need for evaluators to maintain reliable, high-quality outputs while operating at algorithmic scale without sacrificing cultural sensitivity.

The design of CRE relies on sophisticated human thought and judgement, as the process requires advanced analytical steps and socio-cultural reasoning beyond the initial phases of data collection and categorization. When deployed responsibly, AI has the potential to expand the scope of evaluators, diversify their analytical perspectives, and mitigate otherwise inevitable biases through built-in AI-driven feedback mechanisms. This can add another layer of analytical thought to human evaluation, exceeding the automatic function of content generation and thereby fulfilling the vision of human-machine collaboration in the scientific community.

Theoretical Framework

The scholarly discourse concerning evaluation methods highlights the importance of cultural and cross-regional sensitivity. The increased number of global conflicts, crises, and catastrophes in recent years requires a transformative approach to the evaluation of planetary scenarios in agile and contextual settings.

Cultural factors are historically known for their methodological complexity and collective nature, conferring a sense of belonging on their members. Due to the primary influence of the dominant culture on educational assessment, scholars recommend strengthening multicultural content and strategies in pedagogical frameworks (Taylor & Nolen, 2022).

Discussion

The adoption of data science methods in evaluation is crucial to scaling the scope and capacity of educators and evaluators globally. Notable examples include the analysis of imagery data in power mapping and impact evaluation (World Bank Group, n.d.). Most studies suggest that combining manual and automated approaches to evaluation can deliver significant benefits while maintaining data accuracy in text extraction and categorization.

Critically, AI-driven methods foster a more systematic and efficient analysis of datasets and expand the traditional human scope in real time. Common AI-driven use cases in evaluation include, but are not limited to, risk identification, automated coding of core delivery challenges, and impact evaluation (Rona-Tas et al., 2019; Abdellatif et al., 2015; Cimiano et al., 2005; Tanguy et al., 2016). Manual categorization tends to be time-consuming, forcing evaluators to focus on smaller datasets or fewer factors. Moreover, AI methods can enhance manual classification by automatically adding documents and inputs to the data models.
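
To illustrate this hybrid workflow, consider the minimal sketch below (in Python, using scikit-learn): a simple classifier pre-labels evaluation documents, and anything below a confidence threshold is routed back to a human evaluator for manual coding. The documents, labels, and the 0.8 threshold are hypothetical placeholders, not a prescribed pipeline.

```python
# A minimal sketch of hybrid manual + automated coding: the model
# auto-labels confident cases and flags uncertain ones for humans.
# All documents, labels, and the threshold are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical seed set coded manually by evaluators.
seed_docs = [
    "Procurement delays stalled delivery of the health intervention.",
    "Stakeholder consultations improved uptake in rural districts.",
    "Budget overruns forced the programme to cut its second phase.",
    "Local partners adapted training materials to community needs.",
]
seed_labels = ["delivery_challenge", "success_factor",
               "delivery_challenge", "success_factor"]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(
    vectorizer.fit_transform(seed_docs), seed_labels)

def triage(doc: str, threshold: float = 0.8):
    """Auto-label a document, or flag it for human review."""
    probs = model.predict_proba(vectorizer.transform([doc]))[0]
    label = model.classes_[probs.argmax()]
    if probs.max() >= threshold:
        return label, "auto-coded"
    return label, "flagged for manual review"

print(triage("Supply chain problems delayed the vaccination drive."))
```

The threshold is the key design choice here: it tunes how much work stays with humans, so culturally sensitive categories can simply be given a higher bar.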

Nevertheless, socio-cultural biases tend to recur in AI models, especially when the underlying datasets remain unrepresentative. AI’s inherent fallacies, which mimic historical and social biases, can therefore exacerbate discriminatory patterns against vulnerable groups. A Brazilian case study, in which the improper translation of Meta’s features had violent repercussions for indigenous communities, demonstrates this lack of linguistic adaptation (Hagerty & Rubinov, 2019). Enhancing the representativeness and linguistic coverage of data-driven models is therefore required to safeguard the analytical diversity of CRE.
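
One practical first step towards such safeguards is a representativeness audit. The short sketch below checks how evaluation records are distributed across languages before any model is trained on them; the records and the 20% floor are illustrative assumptions rather than an established standard.

```python
# A minimal sketch of a dataset representativeness audit: count
# records per language and flag underrepresented groups before
# training. Records and the 20% floor are hypothetical.
from collections import Counter

records = [
    {"id": 1, "language": "pt"}, {"id": 2, "language": "pt"},
    {"id": 3, "language": "pt"}, {"id": 4, "language": "guarani"},
    {"id": 5, "language": "pt"}, {"id": 6, "language": "pt"},
]

counts = Counter(r["language"] for r in records)
total = sum(counts.values())
for language, n in counts.items():
    share = n / total
    status = "OK" if share >= 0.20 else "UNDERREPRESENTED - collect more data"
    print(f"{language}: {n}/{total} ({share:.0%}) {status}")
```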

Importantly, contemporary evaluators have already piloted data science techniques in their work and experimented with a range of AI methods (Bravo et al., 2023; Franzen et al., 2022). Potential use cases include text editing and summarisation, code learning, search recommendations and brainstorming, and potentially input grading and feedback (IEG, n.d.). This reinforces the role of generative AI as a human amplifier and assistant, not a replacement.

Alongside these benefits, this article highlights two core risks of generative AI: reduced data quality and diminished human trust. First, while increased automation can render certain manual functions obsolete, junior evaluators might use automated outputs to produce large numbers of evaluations with little review of their content, or grow overly dependent on the automated output. This creates an urgent need to guarantee the quality of CRE at scale once human evaluators adopt AI-driven tools in their work.

Second, because these tools are still novel, the level of trust in the machine remains relatively low, particularly in critical decision-making. The full rationale exceeds this article’s scope and belongs to the legal discourse on the multifaceted phases of human-machine interaction. Nonetheless, the article posits that better feedback mechanisms and transparency for automated and predictive evaluations are imperative if AI-driven tools are to be considered reliable sources.
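
One way such a feedback and transparency mechanism might look in practice is sketched below: each AI-assisted finding carries provenance metadata and counts as reliable only after human sign-off. The class and field names are hypothetical illustrations, not an existing standard.

```python
# A minimal sketch of provenance-tracked AI findings: every
# AI-assisted output records which model produced it and whether a
# human has reviewed it. All names here are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class AuditedFinding:
    text: str
    model_name: str               # which model produced the draft
    confidence: float             # model-reported confidence estimate
    human_reviewer: Optional[str] = None   # filled in after review
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def is_trusted(self) -> bool:
        # A finding counts as reliable only after human sign-off.
        return self.human_reviewer is not None

finding = AuditedFinding(
    text="The intervention improved school attendance in district A.",
    model_name="example-llm-v1", confidence=0.72)
finding.human_reviewer = "m.sherman"
print(finding.is_trusted(), finding.created_at)
```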

Between Constitutional AI and the Evaluation Prototype

This article posits that, to handle automated content generation and cross-cultural intersections, human-machine interaction is crucial to supporting CRE and, more broadly, a culturally responsive society, enabling the exchange of human narratives over a technological bridge. It suggests deploying constitutional AI and evaluative prototyping within the evaluator community to moderate socio-cultural biases across human practitioners and algorithms.

The accelerated exposure to generative AI requires a critical review of AI loopholes in evaluations, particularly fabricated content and frequent data hallucinations. Malign actors can use generative AI to generate fake news, foster public chaos, and distort the evidence evaluators rely on. Notably, AI’s complex calculations and predictions require human-machine collaboration, increasing the scope of CRE and fostering a critical review of both the human and the artificial to mitigate socio-cultural and technical biases.

By deploying constitutional AI, evaluators can train models on evaluative frameworks and enforce quality thresholds for automated outputs. Establishing a guiding constitution for algorithms has the potential to moderate the biased outputs of generative language models by anchoring them to reliable guidelines. Tech practitioners and evaluators can fold indigenous data and bottom-up collection into these feedback models, enabling reliable evidence for cross-cultural analysis.
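
The sketch below shows what such a constitution-guided critique-and-revision loop could look like. The `generate` function is a placeholder for any text-generation API, and the three CRE principles are illustrative assumptions rather than an agreed framework.

```python
# A minimal sketch of a constitutional-AI loop: a draft evaluation is
# critiqued against an explicit "constitution" of CRE principles and
# revised before release. The principles are illustrative assumptions.
CONSTITUTION = [
    "Findings must not generalise about a cultural group without evidence.",
    "Indigenous and minority-language sources must be cited where used.",
    "Recommendations must name who in the community was consulted.",
]

def generate(prompt: str) -> str:
    # Placeholder for a call to a real language model API.
    return f"[model output for: {prompt[:60]}...]"

def constitutional_revision(draft: str) -> str:
    """Critique and revise a draft against each principle in turn."""
    revised = draft
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique this evaluation against the principle "
            f"'{principle}':\n{revised}")
        revised = generate(
            f"Revise the evaluation to address this critique:\n"
            f"Critique: {critique}\nEvaluation: {revised}")
    return revised

print(constitutional_revision(
    "The programme failed because local culture resists change."))
```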

To enhance the feedback mechanism of hybrid, AI-driven evaluation, this article highlights the deployment of Evaluation Prototyping, defined here as “a human-centred design approach that aims to improve the process and outcomes of evaluations”. While this taxonomy is borrowed from product design, it has already been tested in policymaking to predict the potential impact of regulatory modifications. In this context, AI methods can be incorporated to examine the possible inputs and outputs of evaluators, and the reactions they might receive, before they occur de facto. This would turn the AI function from an operational assistant into an evaluative multiplier, fostering responsible practices of Predictive Evaluation.

More importantly, this suggests a proactive approach to evaluation, potentially preventing harms and biases by examining hypothetical cultural scenarios before a specific policy instrument or intervention is actually delivered. Evaluation prototyping mimics the evaluator’s ecosystem: a supporting agent that examines whether a given intervention is needed and how an evaluator might react from multifaceted geographical perspectives, based on previous evaluations.
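
A minimal sketch of such a prototype appears below: before an intervention is rolled out, a model stands in for evaluator personas in different regions and returns simulated reactions. The personas, the intervention, and the `generate` placeholder are hypothetical; a real prototype would call an actual language model and validate its outputs with human evaluators.

```python
# A minimal sketch of evaluation prototyping: simulate how evaluators
# in different cultural contexts might react to an intervention before
# it is delivered. Personas and intervention are hypothetical.
PERSONAS = {
    "rural Brazil": "an evaluator fluent in Portuguese and Guarani",
    "urban India": "an evaluator focused on multilingual school systems",
}

def generate(prompt: str) -> str:
    # Placeholder for a call to a real language model API.
    return f"[simulated reaction to: {prompt[:50]}...]"

def prototype_evaluation(intervention: str) -> dict:
    """Collect simulated, region-specific reactions before rollout."""
    return {
        region: generate(
            f"As {persona}, assess the cultural risks of: {intervention}")
        for region, persona in PERSONAS.items()
    }

reactions = prototype_evaluation(
    "a mobile-first literacy programme delivered only in English")
for region, reaction in reactions.items():
    print(region, "->", reaction)
```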

Conclusion

The above analysis of AI-driven CRE points to a holistic approach to human-machine evaluation. It aspires to provide a set of evaluative phases empowered by generative AI, simulating the full analytical cycle and providing feedback. Human evaluation using generative AI can foster multiculturalism across sectors and, more importantly, a responsible evaluator community. The hybrid evaluative synthesis prioritises neither human nor machine supremacy but channels each entity’s functionality.

Using these evaluative approaches can help anticipate unusual patterns in the international community and allocate the relevant safeguards and interventions to prevent them, or to moderate the harms arising from socio-cultural biases. Critically, this article positions generative AI as a socio-cultural accelerator for evaluators while confronting its loopholes, in a bid to promote responsible practices of evaluative synthesis.

References

Abdellatif, M., Atherton, W., Alkhaddar, R., & Osman, Y. (2015). Flood Risk Assessment for Urban Water System in a Changing Climate Using Artificial Neural Network. Natural Hazards, 79(2), 1059–77.

Bravo, L., Hagh, A., Joseph, R., Kambe, H., Xiang, Y., & Vaessen, J. (2023). Machine Learning in Evaluative Synthesis: Lessons from Private Sector Evaluation in the World Bank Group. Independent Evaluation Group (IEG), World Bank Group.

Cimiano, P., Pivk, A., Schmidt-Thieme, L., & Staab, S. (2005). Learning Taxonomic Relations from Heterogeneous Sources of Evidence. In Ontology Learning from Text: Methods, Evaluation and Applications (Frontiers in Artificial Intelligence and Applications, Vol. 123).

Franzen, S., Quang, C., Schweizer, L., Budzier, A., Gold, J., Vellez, M., Ramirez, S., & Raimondo, E. (2022). Advanced Content Analysis: Can Artificial Intelligence Accelerate Theory-Driven Complex Program Evaluation? IEG Methods and Evaluation Capacity Development Working Paper Series. Independent Evaluation Group. Washington, DC: World Bank. https://ieg.worldbankgroup.org/methods-resource/advanced-content-analysis-can-artificial-intelligence-accelerate-theory-driven-complex/

Hagerty, A., & Rubinov, I. (2019). Global AI ethics: a review of the social impacts and ethical implications of artificial intelligence. arXiv preprint arXiv:1907.07892.

IEG. (n.d.) AI Assistance content checklist. Available at: https://www.betterevaluation.org/sites/default/files/2023-10/YEE%20-%20AI%20Assistance%20content%20checklist%20%282%29.pdf

Rona-Tas, A., Cornuéjols, A., Blanchemanche, S., Duroy, A., & Martin, C. (2019). Enlisting Supervised Machine Learning in Mapping Scientific Uncertainty Expressed in Food Risk Analysis. Sociological Methods & Research, 48(3), 608–41.

Tanguy, L., Tulechki, N., Urieli, A., Hermann, E., & Raynal, C. (2016). Natural Language Processing for Aviation Safety Reports: From Classification to Interactive Analysis. Computers in Industry, 78, 80–95.

Taylor, C. S., & Nolen, S. B. (2022). Culturally and Socially Responsible Assessment: Theory, Research, and Practice. Teachers College Press.

World Bank Group. (n.d.). Machine Learning in Evaluative Synthesis. Available at: https://ieg.worldbankgroup.org/evaluations/machine-learning-evaluative-synthesis/chapter-1-machine-learning-applications

Follow Fidutam for more insights on responsible technology!
