By Jorge Campos
If you have labeled data and several people (or ML systems) have labeled the same subsets of data (e.g. 4 subject-matter experts separately annotate the same subset of legal contracts), you can compare these annotations to gauge their quality. If all your annotators make the same annotations independently (high IAA), your guidelines are clear and your annotations are most likely correct.
Note that high IAA doesn’t strictly mean the annotations are correct. It only indicates that the annotators follow the guidelines with a similar understanding.
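For two annotators, a common way to quantify IAA is Cohen’s kappa, which corrects raw agreement for the agreement expected by chance. Below is a minimal sketch in pure Python; the annotator labels are made up for illustration.

```python
from collections import Counter

def cohens_kappa(ann_a, ann_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(ann_a) == len(ann_b)
    n = len(ann_a)
    # Observed agreement: fraction of items labeled identically.
    po = sum(a == b for a, b in zip(ann_a, ann_b)) / n
    # Chance agreement, from each annotator's label distribution.
    ca, cb = Counter(ann_a), Counter(ann_b)
    labels = set(ann_a) | set(ann_b)
    pe = sum((ca[l] / n) * (cb[l] / n) for l in labels)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Hypothetical labels from two annotators over 10 contract clauses
a = ["obligation", "right", "obligation", "right", "obligation",
     "right", "obligation", "obligation", "right", "obligation"]
b = ["obligation", "right", "obligation", "obligation", "obligation",
     "right", "right", "obligation", "right", "obligation"]
print(round(cohens_kappa(a, b), 3))  # 0.583
```

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance. For more than two annotators, generalizations such as Fleiss’ kappa or Krippendorff’s alpha apply.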
How to prevent poor IAA?
There may be several reasons why your annotators do not agree on the annotation tasks. It is important to mitigate these risks as early as possible by identifying the causes. If you find yourself in such a scenario, we recommend reviewing the following:
- Guidelines are key. If a large group of annotators disagrees on a specific annotation task, your guidelines for that task are not clear enough. Provide representative examples for different scenarios, discuss boundary cases, and remove ambiguity. Where it makes sense (e.g. for parts of a system or schemes), attach pictures to the guidelines.
- Be specific. If annotation tasks are too broadly defined or ambiguous, there is room for different interpretations and, eventually, disagreement. On the other hand, very rich and granular tasks can be difficult to annotate accurately. Depending on the scope of your project, find the best trade-off between highly specific annotations and affordable annotation effort.
- Test reliability. Before annotating large amounts of data, run several assessments on a sample of the data. Once the team members have annotated this sample, check the IAA and improve your guidelines or train your team accordingly.
- Train. Make sure you appropriately train the members joining the annotation project. If you find annotators who disagree with most of the team, investigate the reasons, evolve your guidelines, and train them further.
- Check how heterogeneous your data is. If your data/documents differ greatly from each other in complexity or structure, more effort will be required to stabilize the agreement. In that case, consider splitting the data into homogeneous groups.
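Two of the checks above — testing reliability on a sample and spotting annotators who disagree with the rest of the team — can be sketched with a simple pairwise-agreement table. The annotator names and labels below are made up for illustration.

```python
# Hypothetical labels from four annotators over the same 6 items
annotations = {
    "ann1": ["A", "B", "A", "A", "B", "A"],
    "ann2": ["A", "B", "A", "A", "B", "A"],
    "ann3": ["A", "B", "B", "A", "B", "A"],
    "ann4": ["B", "A", "B", "B", "A", "B"],  # disagrees with everyone
}

def agreement(x, y):
    """Fraction of items two annotators label identically."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Average agreement of each annotator against the rest of the team
avg = {}
for name, labels in annotations.items():
    others = [v for k, v in annotations.items() if k != name]
    avg[name] = sum(agreement(labels, o) for o in others) / len(others)

# Annotators with the lowest scores are candidates for extra training
for name, score in sorted(avg.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.2f}")
```

An annotator whose average agreement sits far below the team’s (here, ann4) is the one to investigate first: they may have misread the guidelines, or the guidelines may be ambiguous for the cases they handled.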
Bear in mind: labeling data is an iterative process. As in Agile: inspect and adapt.
You can find more detailed info on IAA and data quality checking in our documentation: docs.tagtog.net