Are current benchmarks sufficient? Assessing AI performance in Global Health applications
While generative AI offers the possibility of revolutionizing healthcare planning and decision making in low- and lower-middle-income countries (LMICs), there are valid concerns over privacy, ethics, data security, and data misuse.
While there are open-source LLM options, their computing requirements exceed the resources available in most LMICs.
Further, these LLMs are useful in general settings but can struggle with very specific use cases (e.g., HIV program planning in Malawi) or with domain jargon and acronyms. This means bespoke solutions, such as uptraining smaller LLMs for deployment in LMICs and pairing them with retrieval-augmented generation (RAG), could thread the needle of feasibility while offering high performance to support decision makers with limited time and resources.
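As a rough illustration of what such a pairing might look like, the minimal sketch below retrieves the passages most relevant to a query from a small document store and prepends them to the prompt sent to the model. It assumes scikit-learn is available for retrieval; the documents and the generate() function are hypothetical stand-ins for whatever guidance corpus and locally hosted small LLM are actually deployed.

```python
# Minimal RAG sketch: TF-IDF retrieval over a toy document store,
# followed by prompt construction for a locally hosted small LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for country- and program-specific guidance documents.
documents = [
    "National HIV program targets for antiretroviral therapy coverage.",
    "District-level guidance on index testing and partner notification.",
    "Supply chain procedures for test kits at primary health facilities.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (TF-IDF cosine similarity)."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

def generate(prompt: str) -> str:
    """Hypothetical placeholder for a call to a locally hosted small LLM."""
    return f"[model response to a prompt of {len(prompt)} characters]"

query = "How should a district plan HIV index testing?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
print(generate(prompt))
```

In practice the TF-IDF retriever would likely be swapped for dense embeddings, and the toy corpus for real program documents, but the overall shape (retrieve, assemble context, generate) stays the same.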
Currently, there is an apparent dearth of appropriate benchmarks for AI applications in global health and development. As we've pursued our own AI projects, we have been reviewing available benchmark datasets to inform our strategy for evaluating them.
We searched arXiv, Papers with Code, and Hugging Face for papers on generative LLMs to catalog the benchmarks used by the largest and most commonly used models. We supplemented this with snowball sampling, reviewing popular blog posts, and searching GitHub for additional options.
We briefly categorized each benchmark by its type of assessment, such as reading comprehension, commonsense reasoning, or safety and truthfulness. We further labeled each by topic area, such as medical, statistical, or legal information.
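The sketch below shows roughly how such a catalog can be structured and tallied; the entries and labels are illustrative examples, not our actual catalog.

```python
# Illustrative benchmark catalog with assessment-type and topic-area labels.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Benchmark:
    name: str
    assessment_type: str  # e.g., reading comprehension, commonsense reasoning
    topic_area: str       # e.g., medical, statistics, legal, general

catalog = [
    Benchmark("MMLU", "knowledge and reasoning", "general"),
    Benchmark("TruthfulQA", "safety and truthfulness", "general"),
    Benchmark("MedQA", "domain knowledge", "medical"),
]

# Quick tallies make coverage gaps visible (e.g., no global health topic area).
print(Counter(b.topic_area for b in catalog))
print(Counter(b.assessment_type for b in catalog))
```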
After identifying nearly 60 benchmarks, we confirmed our earlier suspicion that there was no extant benchmark or dataset to adequately test LLM performance in global health settings.
This has prompted us to begin developing our own bespoke benchmarks for AI applications in Africa, focused on primary health care and HIV-specific use cases. We hope these newly developed benchmarks will be useful not only for our own model deployments but also for others doing the same.
Please find a link to our repository here. We know our search was not exhaustive, and with the rapid pace of AI advances, we hope that others will direct us to relevant benchmarks we may have missed. Please let us know!