Multi-Emotion Detection in Brazilian Tweets — Machine Learning

Published in Sinch Blog · 4 min read · Aug 5, 2022

Have you ever watched Ultron in the Avengers movies? Imagine a machine capable of understanding its environment and people's actions and responding accordingly. This is the big dream of Natural Language Processing (NLP): 👀 not to destroy the world, but to understand natural language (text, audio).

But how can a machine (or even a person) understand text? Semantics is a first guess. Semantics is the study of meaning: word meaning, sentence meaning, whole-text meaning, and so on. One application area of semantics is sentiment analysis, where we want to identify the author's sentiment from the written text. In some cases, we can detect multiple sentiments in a single sentence. For example, “Oh thanks boss, I’m so happy about this promotion.” can be labeled “positive”, “joy”, or even “surprise”. Now imagine understanding the sentiments of a tweet. In this blog post, we are going to develop a solution to detect multiple sentiments in a tweet.

Problem statement. Given a tweet (a short text), we must decide, for each of nine emotional labels, whether it applies to the tweet.

Multi-Emotion Detection Problem

According to Plutchik’s Wheel of Emotions (1986), emotions can be organized along four basic axes, each a pair of opposing emotions: joy x sadness, anger x fear, trust x disgust, and surprise x anticipation. The dataset adds an extra label, neutral (none of the previous sentiments). We are going to explore the Brazilian Stock Market Tweets with Emotions dataset to identify these emotions in Brazilian stock market tweets. This dataset holds 4,517 tweets labeled with these nine sentiments.

In machine learning, we can treat this problem as a multilabel classification task, where we must assign a set of emotional labels (called classes) to each tweet. There are a few ways to tackle this problem: (i) transform each label vector into a single class, that is, each distinct combination of sentiments becomes one class; (ii) train a binary classifier for each class (so nine binary classifiers), where each classifier predicts the absence or presence of its class; or (iii) specific to this dataset, train four multiclass classifiers, one per sentiment axis, and treat the tweet as neutral if they all return 0. Note that strategies (ii) and (iii) usually perform better, but for simplicity (and because it is generic to any multilabel task) we are going to implement the first strategy.
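
Strategy (i) is often called the label powerset transformation. The sketch below shows one way it could look, assuming each tweet's annotation is a 0/1 vector over the nine labels. The label order and the helper names `to_powerset_class` and `from_powerset_class` are illustrative assumptions, not taken from the dataset or the notebook.

```python
# Assumed label order; the dataset's actual column names may differ.
LABELS = ["joy", "sadness", "anger", "fear", "trust",
          "disgust", "surprise", "anticipation", "neutral"]


def to_powerset_class(label_vector):
    """Collapse a 0/1 label vector into one class name (strategy i)."""
    active = [name for name, flag in zip(LABELS, label_vector) if flag]
    return "|".join(active)


def from_powerset_class(class_name):
    """Expand a class name back into a 0/1 label vector."""
    active = set(class_name.split("|")) if class_name else set()
    return [1 if name in active else 0 for name in LABELS]


vector = [1, 0, 0, 0, 0, 0, 1, 0, 0]   # joy and surprise are present
combined = to_powerset_class(vector)    # "joy|surprise"
restored = from_powerset_class(combined)
```

One consequence of this design is that a classifier trained on such classes can only predict label combinations it has seen during training.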

Algorithm Development

We are going to implement the first multilabel strategy using: (1) preprocessing for text normalization; (2) TF-IDF for feature extraction (transform the normalized texts into vectors); and (3) Decision Tree for sentiment classification.

(1) Preprocessing. We are going to perform two simple normalizations: (i) lower-case the text and (ii) remove the accents. I created a class for this preprocessor:

import pandas as pd
from unidecode import unidecode
from sklearn.base import TransformerMixin, BaseEstimator


class SimpleTextPreprocessor(BaseEstimator, TransformerMixin):
    """
    Text preprocessing includes steps:
        - Lower case
        - Remove accents
    """
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, *_):
        data = pd.Series(X) if not isinstance(X, pd.Series) else X
        data = data.apply(self._preprocess_text)
        return data

    def _preprocess_text(self, text):
        # hand-written normalization steps: lower-case, then strip accents
        pre_text = text.lower()
        pre_text = unidecode(pre_text)
        return pre_text
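
To make the effect of these two steps concrete, here is a standalone sketch of the same normalization on a single string. It is an approximation: it uses the standard-library `unicodedata` module instead of `unidecode`, so results can differ for characters that have no accent decomposition. The function name `normalize_text` and the sample sentence are made up for illustration.

```python
import unicodedata


def normalize_text(text):
    """Lower-case the text and drop accent marks (combining characters)."""
    lowered = text.lower()
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize_text("Ação da PETROBRÁS subiu"))  # acao da petrobras subiu
```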

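(2) TF-IDF (Term Frequency-Inverse Document Frequency) is an approach based on word counts: a tweet is represented as a vector whose entries reflect the importance of the words present in it. A word weighs more the more often it occurs in the tweet and the fewer tweets it occurs in overall.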
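
The weighting can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it assumes whitespace tokenization and single words only, and it follows the smoothed IDF and sublinear TF formulas as I understand them from the scikit-learn documentation. The function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        weights = {}
        for term, count in Counter(tokens).items():
            tf = 1.0 + math.log(count)  # sublinear term frequency
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0  # smoothed
            weights[term] = tf * idf
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Made-up, already-normalized tweets.
docs = ["acao subiu hoje", "acao caiu hoje", "acao subiu muito"]
vectors = tfidf_vectors(docs)
```

In the third document, "muito" receives a higher weight than "acao", because "acao" appears in every document and therefore says little about any one of them.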

(3) A Decision Tree is a supervised learning algorithm for classification or regression tasks. It predicts the labels of a tweet by learning decision rules inferred from the attributes of the data (in our case, the values of the columns of the TF-IDF vectors).
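
To illustrate what a single decision rule looks like, the sketch below searches for the column and threshold that best separate the classes, scored by weighted Gini impurity. It is a one-level simplification under stated assumptions: real tree implementations recurse on each side and choose candidate thresholds differently. The helper names and the feature values are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(rows, labels):
    """Find the (column, threshold) whose split has the lowest weighted Gini."""
    best = (None, None, gini(labels))
    for col in range(len(rows[0])):
        for threshold in sorted({row[col] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[col] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[col] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best[2]:
                best = (col, threshold, score)
    return best


# Made-up TF-IDF weights for two columns, e.g. "raiva" and "hoje".
rows = [[0.8, 0.1], [0.7, 0.0], [0.0, 0.2], [0.0, 0.1]]
labels = ["anger", "anger", "neutral", "neutral"]
column, threshold, impurity = best_split(rows, labels)
```

Here the rule found is "column 0 greater than 0.0", which separates the two toy classes perfectly.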

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('normalize', SimpleTextPreprocessor()), 
    ('features', TfidfVectorizer(
        ngram_range=(1, 2), analyzer='word',
        sublinear_tf=True, max_features=3_000,
        max_df=0.9, preprocessor=None
    )),
    ('classifier', DecisionTreeClassifier(random_state=1))
])
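
A sketch of how such a pipeline could be trained and used with the first strategy follows. It assumes scikit-learn is installed. To stay self-contained it replaces the custom preprocessor with the `lowercase` and `strip_accents` options of `TfidfVectorizer`, and it omits `max_features` and `max_df` because the toy corpus is tiny. The tweets and label sets are made up, not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up tweets and label sets for illustration only.
tweets = [
    "Ação subiu muito hoje, que alegria",
    "Que surpresa boa, lucro recorde na ação",
    "Mercado caiu, que raiva",
    "Que raiva e nojo dessa queda no mercado",
    "Bolsa abre às dez horas",
    "Bolsa fecha às cinco horas",
]
label_sets = [["joy"], ["joy", "surprise"], ["anger"],
              ["anger", "disgust"], ["neutral"], ["neutral"]]

# Strategy (i): each distinct label combination becomes one class.
targets = ["|".join(sorted(labels)) for labels in label_sets]

toy_pipeline = Pipeline(steps=[
    ("features", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True,
                                 lowercase=True, strip_accents="unicode")),
    ("classifier", DecisionTreeClassifier(random_state=1)),
])
toy_pipeline.fit(tweets, targets)

# Predict on the training tweets and decode classes back into label sets.
predicted = toy_pipeline.predict(tweets)
predicted_sets = [cls.split("|") for cls in predicted]
```

Predicting on the training tweets only shows the mechanics; a real evaluation needs a held-out test split.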

Experiments

The overall accuracy of the model is 63.77% using simple preprocessing. However, performance on the sentiment classes (apart from the neutral class) was not good, which suggests that the model focused on the majority class (neutral) and “forgot” the other classes. Still, it was able to detect some feelings other than neutral in the tweets, such as “anger” and “disgust.”
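
The gap between overall accuracy and per-class performance can be illustrated with a small sketch. It assumes accuracy means the share of tweets whose whole label set is predicted correctly, and it compares that with recall per label. The function names and the toy predictions are made up and are not results from the experiment.

```python
def exact_match_accuracy(true_sets, predicted_sets):
    """Share of tweets whose predicted label set equals the true set."""
    hits = sum(set(t) == set(p) for t, p in zip(true_sets, predicted_sets))
    return hits / len(true_sets)


def recall_per_label(true_sets, predicted_sets):
    """For each label, the share of its true occurrences that were predicted."""
    found, total = {}, {}
    for truth, pred in zip(true_sets, predicted_sets):
        for label in truth:
            total[label] = total.get(label, 0) + 1
            found[label] = found.get(label, 0) + (label in pred)
    return {label: found[label] / total[label] for label in total}


# Made-up output of a model that favors the majority class "neutral".
true_sets = [["neutral"], ["neutral"], ["neutral"], ["joy"], ["anger", "disgust"]]
predicted_sets = [["neutral"], ["neutral"], ["neutral"], ["neutral"], ["anger"]]

accuracy = exact_match_accuracy(true_sets, predicted_sets)  # 0.6
recalls = recall_per_label(true_sets, predicted_sets)
```

In this toy case the accuracy looks moderate, while the recall for "joy" and "disgust" is zero, which is the kind of imbalance that a single accuracy number hides.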

Conclusion

In this blog post, we developed a solution for classifying multiple sentiments in a tweet based on multilabel classification, using text preprocessing, TF-IDF and a Decision Tree. With this solution, we reached 63.77% accuracy in predicting the correct set of labels for a tweet.

As future work, we can (i) explore binary classification techniques or different multilabel strategies; (ii) explore symbolic solutions and identify the most important words/expressions for each feeling; or (iii) explore more advanced machine learning techniques, such as BERT. See the complete code in the Jupyter Notebook:

NLP - Multi-Emotion Detection in Tweets (Kaggle Notebook, www.kaggle.com)

References

Fernando J. Vieira da Silva, et al. Stock market tweets annotated with emotions (2020)
Lars Buitinck, et al. Multi-emotion Detection in User-Generated Reviews (2015)
R. Plutchik and H. Kellerman. Emotion: Theory, Research and Experience (1986)