References for LLMs as Annotators

Contains sources for "Can LLMs be used to label data?"

Maja Pavlovic
4 min read · Apr 2, 2024

Studies from the Tables

[1] Can Large Language Models Capture Dissenting Human Voices? Noah Lee, Na Min An, James Thorne

[2] Whose Opinions Do Language Models Reflect? Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto

[3] Can large language models transform computational social science? Caleb Ziems, William Held, Omar Shaikh, Jiaao Chen, Zhehao Zhang, Diyi Yang

[4] Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks Yiming Zhu, Peixian Zhang, Ehsan-Ul Haq, Pan Hui, Gareth Tyson

[5] ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks Fabrizio Gilardi, Meysam Alizadeh, Maël Kubli

[6] ChatGPT-4 Outperforms Experts and Crowd Workers in Annotating Political Twitter Messages with Zero-Shot Learning Petter Törnberg

[7] Are large language models good annotators? Jay Mohta, Kenan Emir Ak, Yan Xu, Mingwei Shen

[8] Is GPT-3 a Good Data Annotator? Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Shafiq Joty, Boyang Li, Lidong Bing

[9] AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators Xingwei He, Zhenghao Lin, Yeyun Gong, A-Long Jin, Hang Zhang, Chen Lin, Jian Jiao, Siu Ming Yiu, Nan Duan, Weizhu Chen

[10] Is ChatGPT better than Human Annotators? Potential and Limitations of ChatGPT in Explaining Implicit Hate Speech Fan Huang, Haewoon Kwak, Jisun An

[11] LLMs Accelerate Annotation for Medical Information Extraction Akshay Goel, Almog Gueta, Omry Gilon, Chang Liu, Sofia Erell, Lan Huong Nguyen, Xiaohong Hao, Bolous Jaber, Shashir Reddy, Rupesh Kartha, Jean Steiner, Itay Laish, Amir Feder

[12] Want To Reduce Labeling Cost? GPT-3 Can Help Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, Michael Zeng

Additional Resources:

Towards Measuring the Representation of Subjective Global Opinions in Language Models Esin Durmus, Karina Nguyen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, Deep Ganguli

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models Aarohi Srivastava et al.

The illusion of artificial inclusion William Agnew, A. Stevie Bergman, Jennifer Chien, Mark Díaz, Seliem El-Sayed, Jaylen Pittman, Shakir Mohamed, Kevin R. McKee

A Comprehensive Overview of Large Language Models Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, Ajmal Mian

Exploring the Sensitivity of LLMs’ Decision-Making Capabilities: Insights from Prompt Variation and Hyperparameters Manikanta Loya, Divya Anand Sinha, Richard Futrell

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting Melanie Sclar, Yejin Choi, Yulia Tsvetkov, Alane Suhr

Language Models are Few-Shot Learners Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang

Do LLMs exhibit human-like response biases? A case study in survey design Lindia Tjuatja, Valerie Chen, Sherry Tongshuang Wu, Ameet Talwalkar, Graham Neubig

Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies Gati Aher, Rosa I. Arriaga, Adam Tauman Kalai

Quantifying the Persona Effect in LLM Simulations Tiancheng Hu, Nigel Collier

Community Additions & Later Add-ons:

If you know of a study that covers labelling with LLMs, please let me know in the comments here or on the main article, and I will add it to the resources below for others to find. The focus is on classification tasks, but I am happy to add broader studies too, as long as they fit the overall topic.

TOXIGEN: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, Ece Kamar

The Importance of Human-Labeled Data in the Era of LLMs Yang Liu

Large Language Models as Financial Data Annotators: A Study on Effectiveness and Efficiency Toyin Aguda, Suchetha Siddagangappa, Elena Kochkina, Simerjot Kaur, Dongsheng Wang, Charese Smiley, Sameena Shah

Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting Tilman Beck, Hendrik Schuff, Anne Lauscher, Iryna Gurevych

Open-Source Large Language Models Outperform Crowd Workers and Approach ChatGPT in Text-Annotation Tasks Meysam Alizadeh, Maël Kubli, Zeynab Samei, Shirin Dehghani, Juan Diego Bermeo, Maria Korobeynikova, Fabrizio Gilardi

LLMAAA: Making Large Language Models as Active Annotators Ruoyu Zhang, Yanzeng Li, Yongliang Ma, Ming Zhou, Lei Zou

You Are An Expert Linguistic Annotator: Limits of LLMs as Analyzers of Abstract Meaning Representation Allyson Ettinger, Jena Hwang, Valentina Pyatkin, Chandra Bhagavatula, Yejin Choi

