What can we learn from emojis?
A quick overview of the ideas behind our new paper ‘Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm’ and a code example showing how easy it is to apply our model pretrained on 1.2 billion tweets to new tasks.
Why emotion analysis?
For a long time I’ve felt that there’s been an important opportunity to improve the modeling of emotional content in text. While most computer science research in this field has focused on positive/negative sentiment analysis, the three dominant theories of emotion agree that humans operate with much more nuanced emotion representations. Some recent computer science research has tried to go beyond the 1-dimensional sentiment measure, but we’re still very far from capturing the full richness of human emotions expressed through language. With our work (accepted at EMNLP 2017), we try to learn richer representations of emotional content in text than what has been done previously.
Why is this relevant? Because emotional content is an important part of language. The classic use case is companies wanting to make sense of what their customers are saying about them. But there are many other use cases as well now that NLP is becoming an increasingly important part of consumer products. For instance, all of the chatbot services (Siri, Alexa and many others) might benefit from having a nuanced understanding of emotional content in text.
I personally experienced the limitations of current sentiment analysis when I recently wanted to examine trends in offensive language on social media with colleagues at MIT. We quickly found that even using all publicly available datasets annotated for offensive language would not be enough for a model to gain a decent understanding of the nuances of offensive language. Using pretrained word vectors helped, but these word vectors are trained to predict the surrounding words and thus treat e.g. “happy” and “sad” as quite similar due to them often occurring with similar surrounding words. Perhaps even more importantly, word vector models are often used in a bag-of-words manner, making it difficult for them to capture the impact of negations (e.g. ‘not’), amplifications (e.g. ‘very’) as well as more complex sequential patterns in text.
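To make the bag-of-words problem concrete, here is a tiny sketch with made-up toy vectors (not real word2vec embeddings): averaging word vectors is order-invariant, so ‘not’ cannot change the meaning of the word it negates.

```python
# Toy 3-dimensional "word vectors", made up purely for illustration.
vectors = {
    "happy": [0.9, 0.1, 0.2],
    "sad":   [0.8, 0.2, 0.1],   # deliberately close to "happy"
    "not":   [0.1, 0.5, 0.5],
    "very":  [0.2, 0.4, 0.6],
}

def bag_of_words(sentence):
    """Average the word vectors of a sentence (a common baseline)."""
    words = sentence.split()
    dims = len(next(iter(vectors.values())))
    summed = [0.0] * dims
    for w in words:
        for i, v in enumerate(vectors[w]):
            summed[i] += v
    return [s / len(words) for s in summed]

# Averaging discards word order entirely, so the representation of a
# negated phrase is identical no matter where the negation appears:
assert bag_of_words("not happy") == bag_of_words("happy not")
```

A sequential model such as an LSTM, by contrast, reads the words in order and can learn how ‘not’ and ‘very’ change the meaning of what follows them.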
Our model, however, does not suffer from this shortcoming. For instance, the model can capture slang such as ‘this is the shit’ being a positive statement as well as very varied usage of the word ‘love’ (see predictions from our model below). We also make an online demo available at deepmoji.mit.edu.
As is clear from the figure, our model is based on emojis. The basic idea is that if the model is able to predict which emoji was included with a given sentence, then it has an understanding of the emotional content of that sentence. We thus train our model to predict emojis on a dataset of 1.2B tweets (filtered from 55B tweets). We can then transfer this knowledge to a target task by doing just a little bit of additional training on top with the target dataset. With this approach we beat the state of the art across benchmarks for sentiment, emotion and sarcasm detection. Hopefully YOU can use it for a lot of other interesting purposes as well.
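To make the pretraining setup concrete, here is a hypothetical, heavily simplified sketch of how a tweet can be turned into a training example: the emoji acts as the (noisy) label and the remaining text as the input. The real pipeline described in the paper involves much more filtering (language, duplicates, URLs etc.), and TOP_EMOJIS below stands in for the actual list of 64 emojis.

```python
# Stand-in for the 64 most frequent emojis used as labels;
# only a handful are listed here for illustration.
TOP_EMOJIS = ["😂", "❤", "😍", "😭", "😊"]

def make_example(tweet):
    """Return (text_without_emoji, emoji_index), or None if unusable.

    Simplified rule: only keep tweets containing exactly one *type*
    of top emoji (the paper's actual filtering differs in detail).
    """
    found = {e for e in TOP_EMOJIS if e in tweet}
    if len(found) != 1:
        return None
    emoji = found.pop()
    text = tweet.replace(emoji, "").strip()
    return text, TOP_EMOJIS.index(emoji)

print(make_example("no way this happened 😂😂"))  # ('no way this happened', 0)
print(make_example("mixed feelings 😂 😭"))       # None
```

The resulting (text, label) pairs can then be fed to any standard classifier with a 64-way softmax output.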
Understanding emoji usage
We teach our model an understanding of emotions by finding millions of tweets containing one of the top 64 emojis and asking the model to predict them in context. Just by examining the predictions of our model on the test set, it is clear that the model does have an understanding of how the emojis are related. The model learns to group emojis into overall categories associated with e.g. negativity, positivity or love. Similarly, the model learns to differentiate within these categories, mapping sad emojis to one subcategory of negativity, annoyed ones to another subcategory and angry ones to a third.
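One way to surface such groupings is to correlate the model’s predicted emoji probabilities across the test set: emojis the model treats as near-interchangeable end up with highly correlated prediction columns. The sketch below uses a small synthetic prediction matrix as a stand-in for real model output.

```python
import numpy as np

# Synthetic stand-in for model output: predicted probabilities for
# 4 emojis over 6 test sentences. In practice this matrix would come
# from running the trained model on the test set.
preds = np.array([
    # sad   cry   angry heart
    [0.50, 0.40, 0.05, 0.05],
    [0.40, 0.50, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.80, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.05, 0.10, 0.05, 0.80],
])
emojis = ["sad", "crying", "angry", "heart"]

# Pearson correlation between emoji columns (rows of preds.T).
corr = np.corrcoef(preds.T)

# For each emoji, print its most correlated neighbour (ignoring itself).
for i, name in enumerate(emojis):
    others = [(corr[i, j], emojis[j]) for j in range(len(emojis)) if j != i]
    print(name, "->", max(others)[1])
```

Running hierarchical clustering on such a correlation matrix yields the kind of category and subcategory structure described above.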
It’s actually an old idea to use part of the sentence as a ‘noisy label’ for a pretraining prediction task (it started with emoticons from forum messages), but to our knowledge we’re the first to have such an expanded set of 64 noisy labels, which we show helps substantially. Another important factor might be that we — unlike previous work — do not try to manually categorize which emotional category each noisy label belongs to. Such manual categorization requires an understanding of the emotional content of each expression, which is prone to misinterpretations and may omit important details regarding usage. To get a better understanding of the previous research see the related work section of our paper.
Now we’ll quickly go over some of the more technical details. If this is not of interest to you, you can just skip this next section.
Model architecture and benchmarking (technical details)
One challenge we faced was how to design our model and fine-tuning procedure such that, after this emoji pretraining, the model could be used for a variety of new tasks. We started out with a classic 2-layer long short-term memory (LSTM) model, but quickly identified two issues:
- The features learned by the last LSTM layer might be too complex for the new transfer learning task, which could benefit more from also having access to layers earlier in the network.
- The model may be used for new domains, where the “understanding” of a specific word as given by its embedding in vector space will need to be updated. However, the datasets for new domains may be quite small and thus simply training the entire model with its 22.4M parameters will quickly cause overfitting.
Issue 1 is solved by adding a simple attention mechanism to the LSTM model that takes all prior layers as input, thereby giving the Softmax layer easy access to any previous time step at any layer of the architecture. Issue 2 is solved by our proposed ‘chain-thaw’ fine-tuning procedure, which iteratively unfreezes part of the network and trains it. The procedure starts by training any new layers, then fine-tunes each layer individually from the first to the last, and finally trains the entire model (illustration below). This sounds like a mouthful, but it’s actually not that bad computationally, as each layer only needs to be fine-tuned a little bit.
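The chain-thaw schedule itself is easy to write down. Here is a framework-agnostic sketch (the actual implementation operates on Keras layers; here layers are just names) that produces, for each training phase, the set of layers left unfrozen:

```python
def chain_thaw_schedule(pretrained_layers, new_layers):
    """Return the list of training phases for chain-thaw.

    Each phase is the list of layers that are unfrozen (trainable)
    during that phase; all other layers stay frozen.
    """
    phases = []
    # 1. Train only the newly added layers (e.g. the new softmax).
    phases.append(list(new_layers))
    # 2. Fine-tune each pretrained layer individually, first to last.
    for layer in pretrained_layers:
        phases.append([layer])
    # 3. Finally train the entire model at once.
    phases.append(list(pretrained_layers) + list(new_layers))
    return phases

layers = ["embedding", "lstm_1", "lstm_2", "attention"]
for phase in chain_thaw_schedule(layers, ["softmax_new"]):
    print(phase)
```

Each phase corresponds to a short training run with everything else frozen, which is why the total computational cost stays modest.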
In our paper we show that our model architecture is indeed better for transfer learning. We also show that using such a rich set of emojis is beneficial as compared to the classic techniques of using positive/negative emoticons — this is the case even for positive/negative sentiment analysis!
To benchmark our model we found 8 benchmark datasets on 3 NLP tasks across 5 domains (see the paper for details on why these were selected). Our DeepMoji model outperforms the state of the art across all benchmark datasets, with the new ‘chain-thaw’ approach consistently yielding the highest performance for the transfer learning (see below). The performance difference between the ‘chain-thaw’ and the ‘last’ approach is not that big for these tasks, but it seems to be larger for other tasks, as we’ll see in a moment.
One big issue is the lack of proper emotion analysis benchmark datasets. For instance, the emotion dataset with the highest number of classes has 7 emotional categories. To resolve this issue, we’re trying to create a new emotion benchmark dataset that will hopefully help propel emotion analysis research forward.
Applying the model
We want to make it easy for others to use our model for any imaginable purpose. That’s why we release our code for preprocessing and an easy-to-use pretrained model for use with the Keras framework. It will soon be available on GitHub.
As an example, let’s return to my original need to model offensive language. With only a few lines of code, we can preprocess a benchmark dataset and fine-tune our model on it.
from deepmoji import SentenceTokenizer, finetune_chainthaw, define_deepmoji
vocab_path = '..'
pretrained_path = '..'
maxlen = 100
nb_classes = 2
# Load your dataset into two Python arrays, 'texts' and 'labels'
# Splits the dataset into train/val/test sets, then tokenizes each text into separate words and converts them to our vocabulary.
st = SentenceTokenizer(vocab_path, maxlen)
split_texts, split_labels = st.split_train_val_test(texts, labels)
# Defines the DeepMoji model and loads the pretrained weights
model = define_deepmoji(nb_classes, maxlen, pretrained_path)
# Finetunes the model using our chain-thaw approach and evaluates it
model, acc = finetune_chainthaw(model, split_texts, split_labels)
Of course, a lot of things are going on under the hood here. If you want to play more with the model by extending the vocabulary, tuning the dropout rates or whatever — you can do that. All of our code is documented and there are examples of how to do most things.
To compare with existing approaches we use two state-of-the-art methods from a recent paper that combine an LSTM model (either trained from scratch or with pretrained GloVe embeddings) with a gradient boosted trees (GBT) classifier. Our model obtains an accuracy of 82.1%, whereas the state-of-the-art classifier obtains 75.6% (see table below for details). It’s interesting to see that the chain-thaw method helps performance a lot on this dataset (79.6% -> 82.1%), which is likely because the embeddings have to be retrained slightly for this task as compared to emoji prediction. An in-depth analysis of how well our model works for offensive language detection across multiple benchmark datasets is left for future work. Nevertheless, this shows that our model and chain-thaw method can be of use for tasks beyond the ones considered in the paper.
Future of emotion analysis
This research is only a small step towards more sophisticated emotion analysis. We believe that two major contributions to the field would be:
- A proper benchmark dataset with more nuanced labels than positive/negative. Benchmarks drive ML research so this is critical.
- An analysis of the difference between the emotional content of a text being conveyed to an external observer (e.g. an MTurk worker) and the actual emotions being felt by the author of the text. This would hopefully allow for even more interesting social science research leveraging NLP methods.
We would like to help with these, which is why, in collaboration with social scientists and psychologists, we have set up a small website to gather the needed data, which we will then share with the research community. You can help us improve the field of emotion analysis by telling us how you felt when tweeting. Just click here! Note that you can use the arrow keys to rate the tweets faster.
This research was done in collaboration with Alan Mislove, Anders Søgaard, Iyad Rahwan and Sune Lehmann. The research to further improve our understanding of emotional content in text also involves Nick Obradovich, Holly Shablack and Kristen Lindquist.
The three main theories of emotion are:
- Basic emotion theory, arguing that we have 6–8 emotions that have arisen as part of evolution and that we to some degree share with animals.
- Appraisal theory, arguing that emotions arise from a person’s evaluation of an event against a set of mental checkpoints assessing how the event will affect that person directly or indirectly.
- Conceptual act theory, arguing that emotions are defined purely from the society and culture that each individual is exposed to.
This very brief summary does not do justice to the entire field of emotion theory, so please read the papers and books in this field if you’re interested.
See the ‘related work’ section of our paper. Since we submitted the camera-ready version we’ve also become aware of the paper ‘EmoNet: Fine-Grained Emotion Detection with Gated Recurrent Neural Networks’ by Mageed et al. (ACL 2017), which tries to classify 24 manually specified emotional categories using hashtags. It’s great to see other researchers also interested in emotion analysis. Please contact us if you feel we’re missing any relevant papers!
Word embeddings such as word2vec are trained for generic use by predicting the surrounding words and they thus treat “happy” and “sad” as quite similar seeing as they often occur with similar surrounding words. As part of our benchmark comparison we also tested Sentiment-Specific Word Embedding by Tang et al., but it yielded worse results than our other comparisons.
Data is from https://www.kaggle.com/c/detecting-insults-in-social-commentary/data. 70% of the data is used for the training set, 10% for the validation set and 20% for the test set.
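For reference, such a 70/10/20 split can be written in a few lines; this is a generic sketch, not our exact preprocessing code.

```python
import random

def split_train_val_test(texts, labels, seed=0):
    """Shuffle and split a dataset 70/10/20 into train/val/test."""
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)  # fixed seed for reproducibility
    n = len(pairs)
    # Integer arithmetic avoids floating-point rounding surprises.
    train = pairs[: n * 7 // 10]
    val = pairs[n * 7 // 10 : n * 8 // 10]
    test = pairs[n * 8 // 10 :]
    return train, val, test

train, val, test = split_train_val_test([f"tweet {i}" for i in range(100)], list(range(100)))
print(len(train), len(val), len(test))  # 70 10 20
```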