Sentiment Analysis, Part 3 — Data annotation

Jade Moillic
Besedo Engineering Blog
6 min read · Feb 28, 2022
Photo by Angèle Kamp on Unsplash

This blog post is part of a series about Sentiment Analysis. There are already two posts in this series, A friendly guide to Sentiment Analysis and How to choose useful pre-annotated datasets for Sentiment Analysis?. Hopefully, after reading both, you are now up to date on Sentiment Analysis and on how to select a pre-annotated dataset. Here, we would like to present the annotation process we used during our study.

Just a reminder: at Besedo, we are using Sentiment Analysis for content moderation and to provide insights to our clients.

First, what is data annotation? It is the process of labelling data according to the study you are performing. It can be done on any type of data: image, video, text, etc. The better the annotation, the better the model will be.

The tools and tips detailed below can be applied to any annotation type.

We consider three main steps for the annotation task: preliminary steps, manual or automatic annotation using active learning, and post-processing. We will go through each of them in more detail in the following parts of this blog post.

Data annotation

Preliminary steps

Before beginning to annotate texts, you must choose an annotation tool and write guidelines for the annotators. Annotations are often made by several people, so an annotation guide is essential to ensure consistency.

Since annotated data is considered gold, many annotation tools have appeared over the years. At Besedo, we use Prodigy, a paid tool (Prodigy · An Annotation Tool for AI, Machine Learning & NLP, 2017). Here is a non-exhaustive list of some favourite tools:

Choosing the right annotation tool is important, as you need a tool that is easy to understand and easy to use.

While writing the annotation guide, make sure you mention every possible detail:

  • Explain how the annotation tool works
  • Show how the annotators should annotate
  • Define the chosen labels
  • Give at least one or two examples for each label
  • Give a few examples of “edge cases”: examples that are hard to annotate and that need to be exposed to help the annotators make the right choice

To write the best annotation guide possible, it is important to know the subject, and what better way to do that than annotating some data yourself? We suggest spending a little time annotating part of the data to better understand what needs to be specified in the guide.

In the following parts, we will present two of the most used annotation techniques: manual annotation and active learning.

Manual annotation

One solution for annotating your data is to rely on humans. Annotators can be found within your team or company, but some use services like Amazon Mechanical Turk. This is a great solution and should provide excellent annotations. The major drawback of manual annotation is that it takes a lot of resources, time, and money (to pay the annotators). You also need to make sure that everyone annotating understands the task.

To ensure the best annotation possible, it is common to ask several people to annotate the same content. This is called cross-annotation; it allows you to calculate an inter-rater score (how often the annotators agree). Inconsistent annotations can then be escalated, for example to an experienced super annotator. This kind of annotation works well, but as mentioned earlier, it requires many annotators or a lot of time if you want an extensive dataset.

The solution we propose takes into account the need for evaluation and discussion while making the annotation process shorter. Our solution is divided into three parts:

Our annotation solution

The inter-rater scores can be computed using different metrics. Here are the ones we prefer, with a small code sketch after the list:

  • Cohen’s Kappa calculates an inter-rater score between two annotators while taking chance agreement into account. It can be applied to binary or nominal data.
  • Fleiss’ Kappa calculates the score between more than two annotators and also takes chance agreement into account. It can be applied to binary or nominal data.
  • Pearson’s correlation measures the correlation between two annotators and is applied to ordinal data (for example, in Sentiment Analysis, Negative and Neutral sentiments are closer than Negative and Positive sentiments).
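Here is a minimal sketch of how these three scores could be computed, assuming scikit-learn, statsmodels and SciPy are available; the labels and annotator data below are invented for illustration.

```python
# Illustrative inter-rater agreement computations (made-up labels).
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from scipy.stats import pearsonr

# Two annotators labelling the same five texts (nominal labels).
rater_a = ["positive", "negative", "neutral", "positive", "negative"]
rater_b = ["positive", "negative", "positive", "positive", "negative"]
print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))

# More than two annotators: one row per text, one column per annotator.
ratings = [
    ["positive", "positive", "positive"],
    ["negative", "neutral",  "negative"],
    ["neutral",  "neutral",  "neutral"],
    ["positive", "negative", "positive"],
]
table, _ = aggregate_raters(ratings)  # per-text counts for each category
print("Fleiss' kappa:", fleiss_kappa(table))

# Ordinal encoding (negative=0, neutral=1, positive=2) for Pearson's correlation.
ordinal_a = [2, 0, 1, 2, 0]
ordinal_b = [2, 0, 2, 2, 0]
corr, _ = pearsonr(ordinal_a, ordinal_b)
print("Pearson's r:", corr)
```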

Annotation with active learning

A quick definition

You can think of active learning as a machine learning algorithm that needs human intervention to produce its outputs. The goal is to have humans annotate the complex texts, for instance when the machine is not confident enough. As the machine needs to make predictions, it needs a sample of annotated data to start with. In Prodigy, the machine shows a pre-chosen label, and you either accept or reject it. This also makes the task easier for annotators: instead of picking a label and accepting it, they only have to press one button.

The machine then updates itself and shows other texts it is unsure about. When you think it is time, train a model to give predictions and evaluate it on the data annotated during the active learning phase.

→ If you are satisfied with the results, stop here, download the model and predict the annotations on the rest of the data

→ If you are not satisfied, perform another iteration of active learning

To make it more visual, here is how it should work:

Active learning
Steps of active learning
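In code, one iteration of this loop could look roughly like the sketch below. This is not Besedo's exact setup: the TF-IDF + logistic regression model, the pool sizes, the ask_human helper and the "two least confident texts per round" rule are all illustrative assumptions.

```python
# Rough sketch of an uncertainty-based active learning loop (illustrative only).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def ask_human(text):
    # Placeholder for the real annotation interface (e.g. a tool like Prodigy).
    return input(f"Label for {text!r} (positive/negative/neutral): ")

labelled_texts = ["great product, works perfectly", "awful support, very slow"]
labels = ["positive", "negative"]
unlabelled_texts = ["not sure what to think", "really love it", "meh, it is ok"]

for iteration in range(3):  # in practice, stop when the evaluation looks good
    if not unlabelled_texts:
        break

    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(labelled_texts, labels)

    # Confidence = highest predicted probability for each unlabelled text.
    confidence = model.predict_proba(unlabelled_texts).max(axis=1)

    # Send the least confident texts to a human and add them to the training set.
    uncertain_idx = set(np.argsort(confidence)[:2])
    for i in uncertain_idx:
        labelled_texts.append(unlabelled_texts[i])
        labels.append(ask_human(unlabelled_texts[i]))

    unlabelled_texts = [t for i, t in enumerate(unlabelled_texts)
                        if i not in uncertain_idx]
```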

How did we use it?

There are different ways to use the model to annotate data. If the model seems good enough, you can annotate the whole dataset with it, but you will often want to keep only the decisions for which the model is most certain.

For this, we decided to put a threshold on the model’s confidence score. The threshold was chosen by calculating the precision of each of our classes at every candidate threshold (varying from 0 to 1). Once the precision reached our target (for us, 0.99), we kept the associated threshold. Every text with a confidence score below that threshold was then manually annotated.
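The threshold search could be sketched as follows. The pick_threshold helper, the arrays and the class names are hypothetical; only the 0.99 precision target comes from our setup.

```python
# Sketch of the per-class threshold search described above (illustrative data).
import numpy as np

def pick_threshold(y_true, y_pred, confidence, target_class, target_precision=0.99):
    """Return the smallest threshold whose precision on `target_class` reaches the target."""
    for threshold in np.arange(0.0, 1.01, 0.01):
        kept = confidence >= threshold
        predicted_as_class = kept & (y_pred == target_class)
        if predicted_as_class.sum() == 0:
            continue  # no predictions left for this class at this threshold
        precision = (y_true[predicted_as_class] == target_class).mean()
        if precision >= target_precision:
            return threshold
    return None  # the model never reaches the target precision for this class

# Made-up example; in practice these arrays come from a held-out annotated set.
y_true = np.array(["pos", "neg", "pos", "neg", "pos"])
y_pred = np.array(["pos", "neg", "neg", "neg", "pos"])
confidence = np.array([0.95, 0.80, 0.55, 0.99, 0.70])
print(pick_threshold(y_true, y_pred, confidence, target_class="pos"))
```

Texts whose confidence falls below the returned threshold are the ones sent back to the human annotators.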

This is a bit more time-consuming, but the annotation will be better, as models tend to have biases and make mistakes that humans can easily catch.

Post-processing

Once the annotations are done, it is good to check their quality, either by evaluating them as we mentioned in the manual annotation section, or by using another technique.

If each annotation is associated with an annotator id, you can compute metrics per annotator. This allows you to help annotators who may not have understood the task and gives you a clear view of the task’s difficulty.
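A small sketch of this per-annotator reporting is shown below, assuming the annotations are stored with an annotator id and that a reviewed “gold” label exists for the checked sample; the column names and data are illustrative.

```python
# Illustrative per-annotator agreement report (made-up columns and labels).
import pandas as pd
from sklearn.metrics import cohen_kappa_score

annotations = pd.DataFrame({
    "annotator_id": ["a1", "a1", "a1", "a2", "a2", "a2"],
    "label":        ["pos", "neg", "neu", "pos", "pos", "neg"],
    "gold_label":   ["pos", "neg", "neu", "neg", "pos", "neg"],
})

for annotator, group in annotations.groupby("annotator_id"):
    agreement = (group["label"] == group["gold_label"]).mean()
    kappa = cohen_kappa_score(group["label"], group["gold_label"])
    print(f"{annotator}: raw agreement={agreement:.2f}, kappa={kappa:.2f}")
```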

Another option is to re-annotate a sample of the annotated data yourself and calculate an inter-rater score between your annotations and the existing ones. This helps determine whether the annotators understood the task and annotated the data in the intended way.

We hope that this blog post gave you ideas for future annotation tasks!

The following blog post of this series will be about the models we used for our Sentiment Analysis task. To give a quick preview, we worked with Machine Learning and Deep Learning models (including Transformers). If you find that interesting, stay tuned!

References

Prodigy · An Annotation Tool for AI, Machine Learning & NLP. (2017). Explosion. https://prodi.gy