Emotional Paraphrasing

Using transformer-based language models to generate emotions in sentences

Samuel Torche
Empathic Labs
6 min read · Aug 4, 2021


When building chatbots with Natural Language Understanding (NLU), we need many training samples to train intent detection systems. One option is to collect data from users, either by extracting conversations or through crowdsourcing platforms like Amazon Mechanical Turk, but this is tedious and costly (and writing sentences by hand is not a fun job). The other option is to develop a language model capable of paraphrasing sentences. A paraphrase is a restatement of a text in other words that preserves its meaning. This way, only a fraction of the sentences needs to be written manually; tens of new sentences can then be generated from them by paraphrasing. Adding emotions to these generated paraphrases increases their diversity and also lays the foundation for an emotional dialogue architecture.

Photo by Tengyart on Unsplash

Transformer-based language models are becoming extremely powerful, so we will see how we can leverage their power to generate emotional paraphrases.

Concept

Architecture of the system

We propose an architecture composed of three modules: 1) corruption, which removes the emotion present in the input sentence, 2) emotional enhancement, which rewrites the corrupted sentence to convey a specific emotion using a fine-tuned GPT-2 model, and 3) paraphrase generation, which paraphrases the emotionally augmented sentence.

1 — Data Corrupter

The Data Corrupter detects and removes emotional words from the sentence. This is done by comparing every word of the sentence against 3 emotional lexicons: DepecheMood, NRC, and SentiWordNet. We apply this process to a massive emotional dataset built by unifying 7 emotional datasets under Paul Ekman's "6 basic emotions" model: anger, disgust, fear, joy, sadness, and surprise. The 7 datasets we use are GoEmotions, Emotion-Stimulus, CrowdFlower, ISEAR, SMILE, SemEval-2018 Task 1, and TEC. These datasets all use different emotional models, so we map their emotions to Ekman's model using Plutchik's "Wheel of Emotions" and Parrott's tree-structured model.
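To make the corruption step concrete, here is a minimal sketch in Python. The toy lexicon below is a stand-in: the actual system looks words up in the merged DepecheMood, NRC, and SentiWordNet lexicons, and its tokenization may differ.

```python
import re

# Toy stand-in for the merged DepecheMood/NRC/SentiWordNet lexicons.
EMOTIONAL_WORDS = {"furious", "terrified", "delighted", "miserable"}

def corrupt(sentence: str, lexicon: set[str] = EMOTIONAL_WORDS) -> tuple[str, int]:
    """Remove every lexicon word and report how many were removed."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    kept = [t for t in tokens if t.lower() not in lexicon]
    return " ".join(kept), len(tokens) - len(kept)

corrupted, n_removed = corrupt("I was furious when the train left.")
print(corrupted)  # "I was when the train left ."
```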

Emotional mapping to Ekman’s model

Using the Data Corrupter, we can create 6 datasets of pairs of emotional and corrupted sentences, one for each emotion of Ekman's model. If no emotional word is found in a sentence, we discard it.
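Continuing the sketch above (and reusing its corrupt() helper), building the per-emotion datasets then boils down to a filtered loop; the labeled sentences below are toy examples, not actual corpus entries.

```python
from collections import defaultdict

# Toy labeled corpus; the real one unifies the 7 datasets listed above.
labeled = [
    ("I was furious when the train left.", "anger"),
    ("The train left on time.", "joy"),  # no lexicon hit: will be discarded
]

datasets: dict[str, list[tuple[str, str]]] = defaultdict(list)
for sentence, emotion in labeled:
    corrupted, n_removed = corrupt(sentence)  # helper from the sketch above
    if n_removed == 0:  # no emotional word found: ignore the sentence
        continue
    datasets[emotion].append((sentence, corrupted))

# datasets["anger"] -> [("I was furious when the train left.",
#                        "I was when the train left .")]
```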

Number of samples for each dataset

2 — Emotion Enhancer

Using these datasets, we fine-tune 6 GPT-2 models, one for each emotion. The goal of each model is to reconstruct the emotional sentence from the corrupted sentence, thereby learning how to construct a sentence conveying a specific emotion. We call these models Emotion Enhancers.

How to feed pairs of emotional sentences/corrupted sentences to a GPT-2 model
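As a rough illustration, one plausible way to serialize each pair for GPT-2 fine-tuning is a single string of the form "corrupted <SEP> emotional <|endoftext|>"; the <SEP> separator is an assumption here, not necessarily the exact format used in our experiments.

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({"sep_token": "<SEP>"})  # assumed separator token

def to_training_text(corrupted: str, emotional: str) -> str:
    # GPT-2 consumes plain text, so each pair becomes one training string.
    return f"{corrupted} {tokenizer.sep_token} {emotional}{tokenizer.eos_token}"

print(to_training_text("I was when the train left .",
                       "I was furious when the train left."))
```

At inference time, the fine-tuned model is prompted with "corrupted <SEP>" and completes the emotional sentence. Note that after adding a special token, the model's embedding matrix must be resized with model.resize_token_embeddings(len(tokenizer)).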

3 — Paraphrase Generator

Our last module, the Paraphrase Generator, is simply a publicly available GPT-2 model fine-tuned for the paraphrase generation task; we use a model provided by RASA.

Example of sentences produced by the system
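For illustration, a minimal way to call such a model with the Hugging Face transformers library looks as follows; the checkpoint path and the ">>>>" prompt convention are placeholders, not RASA's documented format.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

CKPT = "path/to/paraphrase-gpt2"  # placeholder for the fine-tuned checkpoint
tokenizer = GPT2Tokenizer.from_pretrained(CKPT)
model = GPT2LMHeadModel.from_pretrained(CKPT)

def paraphrase(sentence: str, max_new_tokens: int = 40) -> str:
    # Assumed convention: input sentence, then a separator the model was
    # fine-tuned to continue with a paraphrase.
    prompt = f"{sentence} >>>> "
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        do_sample=True,  # sampling yields more diverse paraphrases
        top_p=0.9,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    full = tokenizer.decode(out[0], skip_special_tokens=True)
    return full[len(prompt):].strip()  # keep only the generated continuation
```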

Evaluation

To evaluate our system, we created 3 testing sets from popular paraphrase datasets: MSCOCO, QQP, and PARANMT. We conducted both a qualitative study with human judges and a quantitative evaluation.

Automatic Evaluation

We use the well-known metrics BLEU, METEOR, and TER to evaluate the paraphrasing aspect of our system. These metrics work well for paraphrase evaluation and correlate well with human judgments (Madnani et al., 2012; Wubben et al., 2010). To evaluate the emotional aspect, we use an emotion classifier and simply compute the percentage of times it predicts that our paraphrase has the expected emotion. We compare our paraphrasing results with the state-of-the-art GAP model from Yang et al. (2019).

Paraphrasing results. Average over MSCOCO, QQP, and PARANMT. BLEU & METEOR: higher is better; TER: lower is better.
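For reference, the sketch below shows how such scores can be computed with the sacrebleu and nltk packages, and how the emotion score can be derived from a classifier; the example sentences, target emotion, and classifier checkpoint are illustrative placeholders, not our actual evaluation setup.

```python
# pip install sacrebleu nltk transformers; METEOR additionally needs
# nltk.download("wordnet") on first use.
from sacrebleu.metrics import BLEU, TER
from nltk.translate.meteor_score import meteor_score
from transformers import pipeline

hyps = ["a man rides a red bicycle"]   # generated paraphrases
refs = ["a man is riding a red bike"]  # aligned reference paraphrases

bleu = BLEU().corpus_score(hyps, [refs]).score  # higher is better
ter = TER().corpus_score(hyps, [refs]).score    # lower is better
meteor = sum(meteor_score([r.split()], h.split())
             for r, h in zip(refs, hyps)) / len(hyps)  # higher is better

# Emotion score: share of outputs the classifier labels with the target emotion.
clf = pipeline("text-classification", model="path/to/emotion-classifier")
emotion_score = sum(clf(h)[0]["label"] == "joy" for h in hyps) / len(hyps)

print(f"BLEU={bleu:.1f} TER={ter:.1f} METEOR={meteor:.2f} emo={emotion_score:.0%}")
```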

Our system falls short on pure paraphrasing metrics. The Paraphrase Generator module alone achieves decent scores, but our full emotional architecture performs considerably worse. There are no significant differences between the models for the various emotions.

Emotional results of the automatic evaluation. Each value is the percentage of times the classifier finds the specific emotion in the sentences generated by the model.

For the emotion evaluation, the emotion score is higher for the paraphrases generated by our system for almost all emotions on every testing set. Note that the surprisingly high values for the surprise emotion on the QQP dataset are attributable to a bias: the classifier apparently takes a text ending with a question mark to suggest surprise, and QQP contains many such texts. Overall, the results suggest that our joy, anger, sadness, and fear models work best.

Human Evaluation

A human evaluation is required to assess the performance of the system qualitatively. Twenty sentences (6 from MSCOCO, 7 from QQP, and 7 from PARANMT) were randomly chosen and processed by our system, which generates 6 sentences for each input, one per emotion. The 20 reference paraphrases of these 20 base sentences were also added. In total, the human evaluation testing set is composed of 140 pairs of base sentences and emotional/reference sentences. Human annotators were asked to rate each pair for content preservation, readability, and diversity compared to the base sentence, on a scale of 1 (worst) to 5 (best). They were also tasked with classifying each sentence by the primary emotion it conveys: anger, disgust, fear, joy, sadness, surprise, or no emotion. Each pair was evaluated by 20 human judges hired on Amazon Mechanical Turk.

How a question is displayed in the Amazon Mechanical Turk survey

Human judges' average scores for the 3 characteristics (content preservation, readability, diversity)

The ground truth performs better in content preservation and readability for almost every emotion. Regarding diversity, some emotions outperform the ground-truth sentences, even if the averages are almost the same.

Changes in emotional votes brought by all emotional models for the 6 emotions, in percent

For the emotion evaluation, we compared how the human judges classified the emotion of the references with the emotion of the various models. For example, the number of samples classified by the judges as containing anger doubled between the references and the sentences from the anger model, thus an increase of 100%. We see that our disgust, fear, and sadness models work better than the others, as they mostly impact the number of votes for their particular emotion. Asking human judges to classify a text among seven emotions is challenging: we reach a Fleiss κ of 0.07, indicating only slight inter-annotator agreement.
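For reference, Fleiss' κ can be computed from per-item vote counts, for instance with statsmodels; the two rows below are toy counts, not our actual annotations.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

# One row per rated sentence: how many of the 20 judges picked each of the
# 7 categories (anger, disgust, fear, joy, sadness, surprise, no emotion).
table = np.array([
    [10, 2, 1, 3, 2, 1, 1],  # rows must sum to the number of judges (20)
    [ 3, 3, 4, 2, 3, 2, 3],
])
print(fleiss_kappa(table))  # values near 0 mean agreement close to chance
```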

Conclusion

We propose a new approach to generating emotional paraphrases by leveraging pre-trained GPT-2 models. The architecture is made of three modules: the first removes the initial emotion markers from the original sentence, the second enhances the sentence with a target emotion using a GPT-2 model fine-tuned on parallel emotional/corrupted sentences, and the third generates a paraphrase of this augmented sentence. We evaluated our system on popular paraphrase datasets and observed, in both the automatic and human evaluations, that our models fail to perform as well as state-of-the-art models on paraphrasing-related metrics. On the other hand, our system is able to generate emotion with some degree of success. Both evaluations agree that the fear and sadness models perform well. Only the quantitative evaluation shows that the joy and anger models are effective, whereas the human judges found that the disgust model works well.

Such technology can greatly facilitate the automatic creation of training phrases for NLU systems, and it can also be integrated into an emotional dialogue architecture to create emotional conversations.

Thanks for reading!

