How to Build Reliable Human Annotation Guidelines with LLMs
Paper review, practical applications at TR Labs and implications for AI practitioners
Human evaluation is considered the benchmark for measuring the quality of Natural Language Generation (NLG) systems, and the annotation guideline is what makes those evaluations reproducible. At Thomson Reuters, where editors’ expertise is a valuable asset, drawing on that knowledge in AI annotation tasks to create gold-standard data is common practice. The annotations are then typically used to train models or assess model performance. However, there is a lack of best practices for creating reliable human annotation guidelines.
A recent outstanding paper accepted by NAACL 2024, titled “Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation,” addresses this crucial issue. The study identifies eight common vulnerabilities in human annotation guidelines and proposes a principled approach using LLMs to create more reliable guidelines with minimal vulnerabilities.
Effective annotation guidelines should avoid ambiguous definitions, unclear rating systems, and assumptions about annotators’ prior knowledge. They should use clear, simple language to ensure that annotators fully understand task requirements and scoring standards without bias towards particular labels. This review will explore the paper’s findings and their implications for improving human evaluation practices in NLG.
Paper Review
Vulnerabilities in Human Annotation Guidelines
Researchers from Peking University conducted a comprehensive study to identify vulnerabilities in human annotation guidelines, analyzing both authentic and synthetic examples, as explained below. This investigation is crucial for understanding the current state of evaluation practices in the field of NLG.
- Authentic Guidelines: The study examined 227 authentic guidelines obtained from papers accepted at major NLP conferences (ACL, EMNLP, and NAACL) between 2020 and 2022. A striking finding was that only 29.84% of papers involving human evaluation actually released their guidelines. This low percentage is concerning, as human evaluation guidelines are essential for ensuring the reliability and reproducibility of assessment results.
- Synthetic Guidelines: To complement the analysis of authentic guidelines, the researchers generated 239 synthetic guidelines using GPT-3.5-Turbo. They employed five different prompting strategies to create a diverse set of guidelines, allowing them to explore how different prompting techniques influence the quality and vulnerability of AI-generated guidelines.
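To make this concrete, below is a minimal sketch of how a synthetic guideline might be drafted with a “structured instructions with evaluation aspects” style of prompt. It assumes the OpenAI Python client; the prompt wording, example task, and aspects are illustrative assumptions, not the authors’ exact prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def generate_guideline(task: str, aspects: list[str]) -> str:
    """Draft an evaluation guideline using structured instructions with explicit aspects."""
    prompt = (
        f"Write a human evaluation guideline for the task: {task}.\n"
        "The guideline must include:\n"
        "1. A precise task definition with no ambiguous terms.\n"
        f"2. A definition for each evaluation aspect: {', '.join(aspects)}.\n"
        "3. A 1-5 rating scale with an explicit description of every point.\n"
        "4. Instructions for edge cases, ethical concerns, and avoiding bias.\n"
        "Assume annotators have no prior knowledge of the task or domain."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the paper generated synthetic guidelines with GPT-3.5-Turbo
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Example: draft a guideline for an illustrative NLG task.
draft = generate_guideline(
    "summarization of legal case documents",
    ["fluency", "coherence", "factual consistency", "relevance"],
)
print(draft)
```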
Taxonomy of Guideline Vulnerability
Through a thorough analysis of both authentic and synthetic guidelines, the researchers identified and defined a comprehensive typology consisting of eight common vulnerabilities in human annotation guidelines:
- Ethical Issues (EthI): Guidelines often fail to address potential ethical implications related to the evaluation process. This includes considerations such as privacy, cultural sensitivity, accessibility, and the potential misuse of evaluation results.
- Unconscious Bias (UncB): Some instructions may inadvertently favor or disadvantage certain results, introducing unintended bias into the evaluation process. This can skew the outcomes and compromise the validity of the assessment.
- Ambiguous Definition (AmbD): When task definitions are unclear, vague, or imprecise, they can be interpreted in multiple ways. This ambiguity can lead to inconsistent evaluations across different annotators, reducing the reliability of the results.
- Unclear Rating (UncR): Guidelines that lack standardized criteria for evaluating aspects or fail to clearly define each point on a rating scale can result in inconsistent ratings. This vulnerability undermines the comparability and reliability of the evaluation outcomes.
- Edge Cases (EdgC): Instructions that don’t specify how to handle exceptional situations or edge cases that don’t fit neatly into usual categories or criteria can lead to confusion and inconsistent handling of such instances.
- Prior Knowledge Assumption (PriK): Some guidelines assume that evaluators possess certain background knowledge or familiarity with specific subject matter, tools, or principles. This assumption can lead to misunderstandings or inconsistencies in evaluation, especially when evaluators have diverse backgrounds.
- Inflexible Instructions (InfI): Overly complex or rigid instructions can make it difficult for evaluators to follow the guidelines and adapt to variations in data or task requirements.
- Others (OthE): A catch-all category for vulnerabilities that do not fit the seven types above, included to make the typology complete.
Data Annotation Process and Results
The researchers employed a rigorous annotation process to identify vulnerabilities in both authentic and synthetic guidelines. Key findings include:
- 77.09% of authentic guidelines exhibited notable vulnerabilities.
- 57.3% of synthetic guidelines showed notable vulnerabilities.
- Ambiguous Definition and Unclear Rating were the most frequent vulnerabilities in both sets.
- Structured instructions with evaluation aspects produced the lowest vulnerability ratio among the five prompt variations (see figure below) used to create synthetic guidelines.
Identifying Vulnerabilities using LLMs
The researchers conducted experiments to assess the capability of various LLMs in automatically detecting vulnerabilities in evaluation guidelines. They compared open-source and closed-source LLMs, as well as traditional baseline models, including BERT, XLNet and ALBERT.
The problem is framed as a multi-label classification task: the model takes an annotation guideline as input and outputs the vulnerability types it contains, chosen from the eight labels above. The researchers fine-tuned LLMs and baseline models using the labeled data obtained from their data annotation study.
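As a concrete illustration of this framing, the sketch below asks an LLM to flag vulnerability labels for a single guideline. The prompt wording, model choice, and answer-parsing convention are my own assumptions rather than the paper’s exact setup; the paper’s best-performing configuration also prepends few-shot examples and vulnerability descriptions.

```python
from openai import OpenAI

client = OpenAI()

# The eight vulnerability labels from the paper's taxonomy.
LABELS = ["EthI", "UncB", "AmbD", "UncR", "EdgC", "PriK", "InfI", "OthE"]


def detect_vulnerabilities(guideline: str) -> list[str]:
    """Multi-label detection: return the vulnerability types an LLM finds in a guideline."""
    prompt = (
        "You are auditing a human evaluation guideline for NLG.\n"
        f"Possible vulnerability labels: {', '.join(LABELS)}.\n"
        "Reason step by step about each label, then end with a final line of the form\n"
        "ANSWER: <comma-separated labels, or NONE>\n\n"
        f"Guideline:\n{guideline}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model; the paper's best results used TEXTDAVINCI-003
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # Parse the final ANSWER line into the subset of known labels.
    answer = text.strip().splitlines()[-1].removeprefix("ANSWER:")
    return [label for label in LABELS if label in answer]
```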
Experiment Results:
- TEXTDAVINCI-003 with Chain-of-Thought (CoT) and vulnerability description prompt strategy in the few-shot scenario demonstrated superior performance, outperforming the baseline models by 20% in macro-F1 score across both authentic and synthetic guidelines. Notably, TEXTDAVINCI-003 can not only generate completely correct answers, but also narrow down the scope of vulnerabilities in its reasoning, facilitating the correction of identified vulnerabilities in the guidelines.
- Open-source LLMs struggled to generate answers as instructed, possibly due to limited training data and excessive instruction length.
A Principled Way to Create Reliable Human Annotation Guidelines with LLMs
One of the most significant contributions of this paper is its easy-to-implement, principled method for creating reliable human annotation guidelines with minimal vulnerabilities, aided by LLMs. It consists of three main steps (a sketch of the workflow follows the list):
- Draft human evaluation/annotation guidelines with LLMs by providing specific requirements, since the authors found that the proportion of vulnerabilities in authentic guidelines is much higher than in LLM-generated synthetic ones.
- Manually modify the guideline draft adhering to principles that address common vulnerabilities.
- Use TEXTDAVINCI-003 with Chain-of-Thought (CoT) in the few-shot scenario to automatically identify potential vulnerabilities and further revise the guideline.
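Tying the three steps together, a minimal orchestration might look like the sketch below. It reuses the illustrative generate_guideline and detect_vulnerabilities functions from the earlier sketches, and human_review is a placeholder for the manual SME revision step, which happens offline.

```python
def human_review(guideline: str) -> str:
    """Placeholder: in practice, SMEs edit the draft against the vulnerability principles."""
    return guideline


def build_guideline(task: str, aspects: list[str], max_rounds: int = 3) -> str:
    # Step 1: draft the guideline with an LLM (see the generation sketch above).
    guideline = generate_guideline(task, aspects)
    for _ in range(max_rounds):
        # Step 2: manual revision by SMEs, guided by the taxonomy of vulnerabilities.
        guideline = human_review(guideline)
        # Step 3: LLM-based detection; stop once no vulnerabilities are flagged.
        issues = detect_vulnerabilities(guideline)
        if not issues:
            break
        print(f"Vulnerabilities still present: {issues}; revise and re-check.")
    return guideline
```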
Conclusion
This paper not only educates us about common vulnerabilities in human annotation guidelines but also equips us with practical methods to detect and mitigate them. The proposed method for creating reliable guidelines with LLMs represents a significant step forward in enhancing the quality of human evaluations in NLP. As we continue to advance in this field, the integration of AI tools like LLMs with human expertise will be crucial in ensuring the integrity and reproducibility of our research findings.
Learnings from the Paper
Further Thoughts and Practical Application at TR Labs
This paper resonated with me because I had just finished assigning an annotation task to our legal subject matter experts (SMEs) at TR Labs. The task asks SMEs to grade Causes of Action (CoA) generated by different LLMs by examining case snippets that contain headnotes and synopses from different cases (a task example is shown in the figure below). Many of the vulnerabilities identified in the paper also appeared in my annotation task.
CoA classification can be very subjective and requires SMEs’ legal expertise and prior knowledge. Many CoAs are also very similar, so different SMEs may confuse two CoAs or interpret the same CoA in different ways. For example, “Copyright Crimes > Infringement” is a CoA that should only be used for a criminal charge of infringement, not civil cases, while “Copyright Violation > General Infringement” is similar enough that SMEs may be unsure when to use which. Hence, detailed inclusion and exclusion criteria for each ambiguous CoA need to be spelled out. Luckily, we were able to obtain such a comprehensive CoA taxonomy from other teams, which largely avoided the “ambiguous definition” and “edge case” vulnerabilities.
Another guideline vulnerability I encountered is “unclear rating.” My annotation task uses a four-point grading scale: A (perfect), C (not correct, but understandable), D (wrong), and F (embarrassing). Such a scale is still too high-level; each point needs a clearer, more detailed explanation, and that explanation differs from task to task. For my task, the most unclear rating was C. SMEs initially assumed that only CoAs that are inaccurate yet still inferable from the headnote and synopsis could be graded C. After discussion, we decided to make this rating less strict: even if the CoA cannot be inferred from the case snippet, it can still be graded C as long as it relates to the issues or similar facts discussed. Again, detailed, clear, and unambiguous definitions and ratings are critical for a reliable human annotation study, as the paper argues.
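For reference, the clarified rubric might be written into the guideline roughly as below. The wording is my paraphrase of the definitions we converged on, not the exact guideline text, and the A, D, and F descriptions in particular are shortened, illustrative versions.

```python
# Paraphrased four-point rubric for CoA grading; exact guideline wording differs.
RUBRIC = {
    "A": "Perfect: the CoA is accurate and supported by the headnote and synopsis.",
    "C": "Not correct, but understandable: the CoA is not inferable from the case "
         "snippet, yet it relates to the issues or similar facts discussed.",
    "D": "Wrong: the CoA does not fit the case.",
    "F": "Embarrassing: the CoA is clearly and badly wrong.",
}
```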
Implications for AI Practitioners
The research findings, together with our practical experience, suggest several recommendations for any AI department or individual that relies on data annotation to obtain gold-standard data for improving their machine learning systems:
- Current Practices Review: Conduct a thorough review of your existing annotation guidelines across different projects at your organization. This review should assess your guidelines against the eight vulnerability types identified in the paper.
- Guideline Creation Process: Integrate the paper’s proposed three-step method for creating reliable guidelines using LLMs into your workflow. This would involve:
- Using LLMs to draft initial guidelines.
- Having your project leads and SMEs review and refine these drafts.
- Employing LLMs again to detect potential vulnerabilities in the refined guidelines.
- Standardization: Develop a standardized template for annotation guidelines that addresses common vulnerabilities, ensuring consistency across different projects and teams (a skeleton of such a template is sketched after this list).
- Training and Awareness: Conduct training sessions for your teams involved in creating annotation guidelines, raising awareness about these vulnerabilities and how to avoid them.
- Continuous Improvement: Implement a feedback loop where annotators can report ambiguities or issues with guidelines, allowing for continuous refinement.
- Internal Knowledge Dissemination: Increase the internal accessibility of annotation guidelines to promote knowledge sharing and enhance the reproducibility and reliability of AI development processes within your organization.
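On the standardization point, a hypothetical template skeleton might look like the sketch below, with each section designed to counter one or more of the paper’s vulnerability types. The section names and placeholders are assumptions for illustration, not an established TR Labs artifact.

```python
# Hypothetical guideline template skeleton; each section targets vulnerability
# types from the paper's taxonomy.
GUIDELINE_TEMPLATE = """\
{task_name} - Human Annotation Guideline

1. Task definition (counters AmbD): define every key term precisely.
{task_definition}

2. Background for annotators (counters PriK): assume no prior domain knowledge.
{background}

3. Rating scale (counters UncR): describe every point on the scale.
{rating_scale}

4. Edge cases (counters EdgC): state how to handle unusual or borderline inputs.
{edge_cases}

5. Neutrality and ethics (counters UncB and EthI): note bias risks, privacy, and sensitivity.
{ethics_notes}
"""

# Example usage: fill in the skeleton for a specific annotation project.
print(GUIDELINE_TEMPLATE.format(
    task_name="CoA grading",
    task_definition="...",
    background="...",
    rating_scale="...",
    edge_cases="...",
    ethics_notes="...",
))
```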
By addressing these points, we can enhance the quality and reliability of our AI training data, improve the consistency of our model evaluations, and potentially reduce the time and resources spent on resolving annotation discrepancies.
TR Labs is committed to learning innovative methods from research, incorporating best practices from industry, and collaborating with third-party vendors. The principled approach to building reliable annotation guidelines — drafting with LLMs, using LLMs to automatically identify vulnerabilities, and then further revising by human experts or even incorporating LLM-suggested revisions — is an innovative and easily applicable method. This approach can be adopted by TR Labs and any organization conducting human annotation tasks, offering a promising avenue for improving the development and refinement of annotation guidelines.
References
- Jie Ruan, Wenqing Wang, and Xiaojun Wan. 2024. Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7965–7989, Mexico City, Mexico. Association for Computational Linguistics.