Sentiment Analysis, Part 2 — How to choose pre-annotated datasets for Sentiment Analysis?

Jade Moillic
Besedo Engineering Blog
Feb 23, 2022

This is the second blog post in our series about Sentiment Analysis. If you want to read the first one, Sentiment Analysis, Part 1 — A friendly guide to Sentiment Analysis, you’ll find it here. At Besedo, we use Sentiment Analysis for content moderation purposes and are exploring its usage to provide valuable insights to our clients.

When working on a subject like Sentiment Analysis, we may want to use open-source datasets specifically annotated for this task. It could be tempting to use any dataset you can find on the internet, but it is crucial to consider some details: Does the dataset match our needs? Does it have the correct license? How is the dataset annotated? It is essential to mention that wrongly annotated data may lead to poor model performance. These questions are the same for any machine learning subject, not only Sentiment Analysis.

In this blog post, we would like to share what we learned while searching for datasets for our study. We will present the criteria you need to consider before searching for a dataset and the different types of datasets you can find. Then, we will show you the different ways of annotating a dataset for Sentiment Analysis, with their pros and cons. The goal here is to help you find the datasets best suited to your needs.

What criteria should you pay attention to?

The main criteria you should take into account are the following: the type of analysis, the level of analysis, the labels, and the license. These are things to think about before beginning your search. If you want to know more about the first three criteria, you may want to check our other blog post: Sentiment Analysis, Part 1 — A friendly guide to Sentiment Analysis.

Type of analysis

The dataset you choose needs to be adapted to the type of analysis you want to perform. If you simply want to associate a text with a sentiment, you will need a dataset annotated for that. If you are using ABSA (Aspect-Based Sentiment Analysis), you will need a dataset annotated for its three sub-tasks: Opinion Target Expression (OTE), Aspect Category Detection (ACD), and Sentiment Polarity (SP).
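For illustration, here is what a single ABSA-style annotation could look like. This is a hypothetical record loosely modelled on the SemEval-2016 restaurant reviews format; the field names are our own, not taken from any specific dataset:

```python
# A hypothetical ABSA record covering the three sub-tasks:
# the opinion target (OTE), the aspect category (ACD), and the polarity (SP).
absa_record = {
    "text": "The pizza was delicious but the waiter was rude.",
    "opinions": [
        {"target": "pizza", "category": "FOOD#QUALITY", "polarity": "positive"},
        {"target": "waiter", "category": "SERVICE#GENERAL", "polarity": "negative"},
    ],
}
```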

Level of analysis

As you may know, you can perform Sentiment Analysis at the text level or the sentence level. Using a dataset annotated at the sentence level for a text-level analysis is okay. Still, you cannot use a dataset annotated at the text level for a sentence-level analysis, as the text is not divided into sentences.

  • Text-level: “I love the sun. I really like the rain.”
  • Sentence-level: “I love the sun.” / “I really like the rain.”

The first task of a sentence-level Sentiment Analysis is to cut the text into sentences. If this is not done, you lose a piece of precious information. Here, we need to know that “I love the sun.” and “I really like the rain.” are two separate sentences and which sentiment each of them is associated with.
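As a quick sketch, sentence splitting can be done with an off-the-shelf tokenizer, for example NLTK’s (one option among many):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-off download; recent NLTK versions use "punkt_tab"

text = "I love the sun. I really like the rain."
print(sent_tokenize(text))
# ['I love the sun.', 'I really like the rain.']
```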

Labels

Do you perform a two-label analysis? Three-label? More? Your dataset should reflect that and be annotated (at least) for as many labels as you need. For example, if you want information about Positive, Negative, and Neutral sentiments, you should not use a dataset containing only Positive and Negative sentiments.
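Note that the reverse is fine: a dataset annotated with more labels than you need can always be collapsed. A minimal sketch, assuming string labels:

```python
# Collapsing a three-label dataset into a binary one by dropping neutral texts.
# The reverse (recovering a neutral class from binary data) is not possible.
dataset = [
    ("I love this!", "positive"),
    ("I walked by the lake.", "neutral"),
    ("This is awful.", "negative"),
]

binary_dataset = [(text, label) for text, label in dataset if label != "neutral"]
```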

License

It may seem obvious, but you really have to be careful about the license associated with your dataset. A lot of famous datasets usable for Sentiment Analysis are for research purposes only. If you are not sure about the license, we suggest you contact the dataset’s creator.

Here are some open-source licenses: MIT License, BSD 3-Clause license, Apache License 2.0.
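If the dataset is hosted on the Hugging Face Hub, one way to check its license programmatically is through the dataset metadata. A sketch using the Hugging Face datasets library (the license field is only as reliable as what the uploader filled in, and may be empty):

```python
from datasets import load_dataset_builder

# Inspect the dataset's metadata without downloading the data itself.
builder = load_dataset_builder("imdb")
print(builder.info.license)
print(builder.info.description)
```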

Types of datasets for Sentiment Analysis

There are two main types of datasets you can use for Sentiment Analysis; we recommend using the one that is closer to your needs.

Reviews

“I love this product. Everything about it is great: the colour, the sound, ALL!” — Positive

Reviews are mainly associated with positive or negative sentiment. Neutral sentiment is barely represented, probably because when you write a review, you usually have a positive or negative opinion of the product.

Famous datasets include, for example, the IMDB movie reviews, the Amazon product reviews, and the Yelp reviews datasets.
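Before committing to a dataset, it is worth checking its label distribution yourself. A minimal sketch with the Hugging Face datasets library and IMDB, which is annotated with binary labels (0 = negative, 1 = positive):

```python
from collections import Counter
from datasets import load_dataset

ds = load_dataset("imdb", split="train")
print(Counter(ds["label"]))
# Counter({0: 12500, 1: 12500}) -- balanced, and no neutral class at all
```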

Social Media

“I walked by the lake today. There were a lot of swans.” — Neutral

The most represented sentiment in this type of dataset is neutral, because not everything you say carries a sentiment. For this type of dataset, you can use data from almost any social media platform.

Famous datasets include, for example, Sentiment140 and the SemEval Twitter sentiment datasets.

Pay attention to the annotation

There are two main ways to annotate data: manual or automatic annotation. We are going to present the upsides and downsides of both, and we will also show you the biases that automatic annotation can introduce.

Manual or automatic annotation?

When searching for datasets, the type of annotation is as important as the type of text.

[Table: Pros and cons of manual and automatic annotation]

Bias in automatic annotation

When we generate labels for Sentiment Analysis automatically, we can introduce some bias, and it is important to pay attention to it. Here are some biases we observed during our research.

Datasets annotated via ratings

Datasets made of reviews are almost always annotated using the rating given with the review, as follows:

  • Ternary: 1 and 2 = negative, 3 = neutral, 4 and 5 = positive
  • Binary: 1 and 2 = negative, 4 and 5 = positive
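As a minimal sketch, the ternary mapping could be implemented like this, assuming 1 to 5 star ratings:

```python
def rating_to_label(stars: int) -> str:
    """Ternary auto-annotation from a 1-5 star rating."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

print(rating_to_label(3))  # "neutral"
```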

But this annotation can cause two major issues: the subjectivity of a review and the different targets of a review.

  • First, let’s talk about the subjectivity issue. For a review like “hotel rooms badly insulated”, one person may give a rating of 3/5 and another 2/5 for the same experience, because ratings are really subjective. Following the ternary mapping presented above, the first case yields a neutral sentiment and the second a negative one. It is not about who is right or wrong, but it clearly shows that these datasets may carry some bias from the start, and that bias is really hard to avoid when training a model.
  • The other bias we observed is when the target of the rating (usually the product) is not the same as the target of the review. If you receive a product in two weeks instead of two days, we assume you will not be really pleased. You will probably write a bad review, but that may not stop you from rating the product 4/5. Some websites ask for separate ratings for the product and the delivery, but that is not always the case. This will impact your analysis, as a bad review will be associated with a good rating. A potential solution may be applying Named Entity Recognition (NER) to detect whether the review mentions the product or the delivery, for example, as in the sketch below. The problem is that this adds complexity to the final solution, and it also demands knowledge of how to work with NER models.
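A rough sketch of that idea, using spaCy’s off-the-shelf NER together with a hypothetical keyword list for delivery mentions (the term list is our own illustration, not a production recipe):

```python
import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Hypothetical keyword list; real delivery detection would need more care.
DELIVERY_TERMS = {"delivery", "shipping", "shipped", "courier", "package"}

def review_targets(text: str) -> set:
    """Guess whether a review talks about the product, the delivery, or both."""
    doc = nlp(text)
    targets = set()
    if any(ent.label_ == "PRODUCT" for ent in doc.ents):
        targets.add("product")
    if any(token.lower_ in DELIVERY_TERMS for token in doc):
        targets.add("delivery")
    return targets

print(review_targets("Great headphones, but the delivery took two weeks."))
# Likely {'delivery'} or {'product', 'delivery'}, depending on the NER model.
```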

Annotation via emojis

A lot of datasets in Sentiment Analysis are automatically annotated using the emojis found in the text. That is usually the case for Twitter datasets. The dataset may be annotated as follows:

  • Ternary: :) = positive, :( = negative, and no emoji = neutral
  • Binary: :) = positive and :( = negative
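A minimal sketch of the ternary version (strictly speaking, :) and :( are emoticons rather than emojis, but the principle is the same):

```python
def emoticon_label(text: str) -> str:
    """Ternary auto-annotation from emoticons, as in the mapping above."""
    has_positive = ":)" in text
    has_negative = ":(" in text
    if has_positive and not has_negative:
        return "positive"
    if has_negative and not has_positive:
        return "negative"
    return "neutral"  # no emoticon, or conflicting signals

print(emoticon_label("Okay :)"))  # "positive" -- but is it, really?
```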

This kind of annotation is easy, but it also comes with some issues you have to be aware of. First, it is important to decide how you would annotate the following texts:

Okay — Neutral

Okay :) — Positive? Neutral?

Do you think the second text is Positive just because you have a smiling emoji?

It is also possible that the emoji does not reflect the sentiment of the text. In our previous blog post, we explained that one problem of Sentiment Analysis is irony and sarcasm. Both can be expressed with an emoji that is contrary to the sentiment of the text:

I’m mad so avoid me today :D

But even without talking about irony, it is also common on a lot of social media to use sad or angry emojis for positive sentiment:

I love him soooo much :’(

The goal of this blog post is to help you find the datasets that best suit your needs by using simple criteria. It is also essential to pay attention to how the annotation was done; in the case of automatic annotation, you might encounter some bias.

We hope that this blog post will help you find the perfect dataset for your study!

The third blog post of this series will be about Data annotation and will be posted soon, so keep your eyes open!
