The adjudication process in collaborative annotation
A common bottleneck for supervised systems is the availability of labeled data. Where can I find suitable datasets? Is there enough data? How biased are they? Is the label selection aligned to my needs?
While organizations often rely on external suppliers to fill this gap, it is also common practice to keep this process in-house, scaling out only when required. There are several scenarios where in-house labeling is the best choice: the business domain is complex or specialized (experts do not come cheap), the data changes frequently, fine-grained accuracy is required, regulatory constraints exist, etc. In all of these scenarios, data quality is the main focus.
Shortcuts to producing datasets usually come with steep hidden costs in model performance, data hunger, and implementation time. Even though every dataset inevitably contains some bias, poorly labeled, incomplete, or inconsistent data (garbage in) will lead to bad results and waste across your whole pipeline (garbage out).
A team of subject-matter experts
The most common setup for preparing the ground for data quality is to build a team of subject-matter experts (SMEs). They should be trained iteratively to adhere to the set of annotation guidelines defined for your project.
To distribute the work, there are different options:
- Each SME annotates a separate data subset (no overlap). Because they are not annotating the same data, their annotations cannot easily be compared. A fallback is to have independent SMEs review a portion of the annotated items for quality assurance.
- Each data item is annotated by at least a given number of SMEs. This overlap makes it possible to compare their annotations.
In the first case, there is not much room for optimization, so let's discuss the latter.
Estimating annotation quality is often difficult: it requires a standard to validate your data against, and this standard might be ambiguous or complex. For example, labeling parts of speech is a well-defined task, whereas classifying question pairs is ambiguous.
Sometimes, the best we can do is to represent this standard with a set of annotation guidelines and a solid annotation schema. In such scenarios, the inter-annotator agreement (IAA) can act as a proxy estimation.
The IAA measures the agreement among your SMEs. A low IAA is usually an indicator of poor data quality, whereas a high IAA means the SMEs agree with each other; in most cases, this is enough to confirm your annotations are accurate.
As you can imagine, the first thing you should do to calculate the IAA is to have your SMEs annotate the same data. Depending on the complexity of your project and the experience of your SMEs, it might take a few iterations to come up with the correct guidelines/schema before your team can start any valid calculation. The general recommendation is to start the bulk of your annotation work with a high IAA.
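A common way to quantify the IAA for categorical labels is Fleiss' kappa, which corrects raw agreement for the agreement expected by chance. Below is a minimal sketch in pure Python; the input layout (one row per item, one count per category) is an assumption for illustration.

```python
# Fleiss' kappa as an IAA estimate (illustrative sketch).
# ratings[i][j] = number of SMEs who assigned item i to category j.

def fleiss_kappa(ratings):
    N = len(ratings)      # number of items
    k = len(ratings[0])   # number of categories
    n = sum(ratings[0])   # annotators per item (assumed constant)

    # Observed agreement per item, averaged over all items.
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings
    ) / N

    # Expected agreement from the marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)

    return (P_bar - P_e) / (1 - P_e)


ratings = [
    [3, 0],   # all three SMEs chose category 0
    [2, 1],   # one SME disagrees
    [0, 3],
    [3, 0],
]
fleiss_kappa(ratings)  # ≈ 0.625
```

Values above roughly 0.6 are conventionally read as substantial agreement, but the threshold that counts as "high enough" depends on how ambiguous your annotation tasks are.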
When multiple SMEs work on the same data, multiple annotation versions result. We define adjudication as the process of resolving the inconsistencies among these versions before one version is promoted to the gold standard. This process can be manual, semi-automatic, or automatic.
The manual case involves the creation of a single version that integrates the annotations of all SMEs, with divergences represented explicitly. Reviewers then resolve these differences. Such a process is recommended especially at the beginning, with the whole team resolving conflicts together to ensure the guidelines are well understood.
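To make that merge step concrete, here is a sketch of building a single review version in which divergences are flagged explicitly. The data layout (one map from item id to label per SME) is a simplifying assumption; real projects may work with span- or document-level annotations instead.

```python
# Merge the annotation versions of several SMEs into one review version.
# Items where all SMEs agree get the agreed label; items where they
# diverge are flagged as conflicts for manual resolution.

def merge_versions(versions):
    """versions: {sme_name: {item_id: label}}
       Returns {item_id: label} for agreements, and
       {item_id: ("CONFLICT", {sme_name: label})} for divergences."""
    merged = {}
    items = set().union(*(v.keys() for v in versions.values()))
    for item in sorted(items):
        labels = {sme: v[item] for sme, v in versions.items() if item in v}
        unique = set(labels.values())
        if len(unique) == 1:
            merged[item] = unique.pop()
        else:
            merged[item] = ("CONFLICT", labels)
    return merged


versions = {
    "vega":  {1: "POS", 2: "NEG"},
    "linda": {1: "POS", 2: "POS"},
}
merge_versions(versions)
# → {1: "POS", 2: ("CONFLICT", {"vega": "NEG", "linda": "POS"})}
```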
If our team is large enough and the degree of ambiguity in the annotation tasks is limited, we can partly or fully automate this step using the IAA.
The simplest approach is to always use the annotations of the SME with the best IAA, under the assumption that the person who agrees with most of the other SMEs is correct. Nevertheless, this is not always the case; for example, the performance of this SME might not be consistent across all annotation tasks.
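Assuming each SME's work is a simple map from item id to label (an illustrative layout), picking the SME with the best mean pairwise agreement could be sketched like this. Plain percent agreement stands in for the IAA here; a chance-corrected metric such as kappa would be a drop-in replacement.

```python
# Pick the SME whose annotations agree most, on average, with every
# other SME, then promote that SME's version to the gold standard.

from itertools import combinations

def pairwise_agreement(a, b):
    """Fraction of shared items on which two SMEs chose the same label."""
    shared = a.keys() & b.keys()
    return sum(a[i] == b[i] for i in shared) / len(shared)

def best_sme(annotations):
    """annotations: {sme_name: {item_id: label}}
       Returns the SME with the highest mean agreement with the others."""
    scores = {sme: 0.0 for sme in annotations}
    for x, y in combinations(annotations, 2):
        agr = pairwise_agreement(annotations[x], annotations[y])
        scores[x] += agr
        scores[y] += agr
    n_others = len(annotations) - 1
    return max(scores, key=lambda s: scores[s] / n_others)


sme_labels = {
    "vega":   {1: "POS", 2: "NEG", 3: "POS"},
    "linda":  {1: "POS", 2: "NEG", 3: "NEG"},
    "gerard": {1: "NEG", 2: "POS", 3: "NEG"},
}
best_sme(sme_labels)  # → "linda"
```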
If you want to partially automate the adjudication, you can also apply further tweaks: remove outlier measures (in Fig. 4, the numbers from Gerard, 31.29%), or consider only the people with the top IAA metrics (in Fig. 4, Vega and Linda). Notice that this only applies to specific environments; each case should first be investigated carefully.
There is further room for automation and optimization: for each annotation task, we take the annotations of the SME with the best IAA for that task. The resulting gold standard may thus combine the output of different SMEs. This is especially advisable in projects where different SMEs specialize in specific annotation tasks.
These are just some examples. Analyze the strengths and weaknesses of your team and data to fine-tune the adjudication process.
In-house labeling projects make sense: you can control quality better, and your data is easier to maintain. Depending on the complexity of the domain, organizing a team of SMEs within your organization can be cost-effective.
To distribute the work and ensure quality, it is recommended to guarantee a certain degree of overlap, thus allowing the calculation of the IAA and the verification of agreement among annotation versions.
The adjudication process to obtain a gold standard heavily depends on your team, the complexity of your annotation tasks and how ambiguous these tasks are. For well-defined annotation tasks and a well-trained team, this process can be partially or fully automated using IAA as an adjudicator.
And you, how does your adjudication process look? Please share your experiences or ideas.
— — — — — — — — —
I work at tagtog, where we are building an annotation platform for NLP. Having witnessed the quality issues commonly experienced in annotation projects, we have implemented some of the mechanisms explained above. Check them out 🍃.