Using Deep Learning to Identify Sarcasm

Mar 29, 2020

What is sarcasm?

In simple terms, sarcasm is the use of remarks that clearly mean the opposite of what is said, made to mock someone or to point out an irony.
For example, saying ‘That’s just what I need today’ when something bad happens is a sarcastic statement.

Why do we need to identify sarcasm?

Understanding sarcasm helps us understand natural language better. Take the use case of identifying sarcasm in product reviews on an e-commerce website: detecting sarcasm can help us judge the credibility of a review. For example, consider the following product on an e-commerce website:

One of the reviews posted on this product with 5 stars was:

On reading it, we can clearly see that this review is not credible: the reviewer is mocking the product for its high price, and the review is not helpful feedback on the product.
Understanding sarcasm on the internet helps us understand the sentiment of a text and applies to various use cases, such as detecting sarcasm in tweets on Twitter, in comments on Reddit, or in product reviews on e-commerce websites.

Why is it difficult to detect sarcasm in texts?

Sarcasm can depend a lot on background information that is not obvious from the raw text itself. The sentence ‘That’s just what I need today’ is sarcastic only if something bad happened at the time the comment was made. The phrase alone is not sarcastic; the context in which it was said makes it sarcasm. Similarly, the comment ‘Must buy! so cheap!!’ on an e-commerce website is sarcastic only when the product is very expensive. Quoting from a research paper in this domain (https://arxiv.org/abs/1805.06413):

The use of sarcasm relies on context, which involves the presumption
of commonsense and background knowledge of an event.

This makes detecting sarcasm from raw text difficult. We’ll try to address this issue by introducing some context-based information and see whether injecting this additional information improves the results.

Using Deep Learning for the task

Deep learning has proven very successful for natural language tasks such as machine translation, sentiment analysis and text classification. (More can be found here: https://paperswithcode.com/area/natural-language-processing)
In this blog post, we’ll apply deep learning models to the task of identifying sarcastic comments. We’ll formalize the problem and also introduce the dataset used for these experiments.

Dataset: SARC

The Self-Annotated Reddit Corpus (SARC) is a large dataset of comments scraped from the very popular online discussion platform Reddit, created by a team from Princeton University. The dataset contains a total of 533 million comments, each self-annotated as sarcastic or not, 1.33 million of which are sarcastic.
Important links related to the dataset:

Dataset source 1: https://www.kaggle.com/danofer/sarcasm
Dataset source 2: https://nlp.cs.princeton.edu/SARC/2.0/
https://github.com/NLPrinceton/SARC
Link to SARC paper: http://www.lrec-conf.org/proceedings/lrec2018/pdf/160.pdf

I’ve downloaded the dataset from Kaggle and upon loading it into a dataframe we can see the following snapshot of our dataset:

Snapshot of our dataset loaded into pandas dataframe

We have the following columns in our dataset:

comment: Actual comment text
author: Author of the comment
subreddit: Subreddit under which the comment was posted. A subreddit can be thought of as a discussion topic on Reddit
score/ups/downs: Score of this comment, along with upvotes and downvotes
date/created_utc: Date on which the comment was posted
parent_comment: The parent of this comment in the Reddit comment tree
label: The actual label given to this comment. 1 => sarcastic, 0 => not-sarcastic

Exploratory Data Analysis (EDA)

I’ve used a subset of the downloaded dataset containing 1M data points: 500K sarcastic comments and 500K non-sarcastic comments. I’ve divided this data into three sets: train data with 800K data points (400K of each type), cross-validation data with 100K data points (50K of each type), and test data with the remaining 100K data points.
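A split like the one described above can be sketched in plain Python. This is only an illustration on toy data: the function name balanced_split, the (comment, label) row layout and the seed are my own assumptions, not the code used for the experiments.

```python
import random

def balanced_split(rows, n_train_per_class, n_cv_per_class, seed=42):
    """Split rows into train/cv/test with equal class counts in train and cv.

    `rows` is a list of (comment, label) pairs with label in {0, 1}.
    Whatever is left after train and cv goes to the test set.
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for row in rows:
        by_label[row[1]].append(row)

    train, cv, test = [], [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        train += label_rows[:n_train_per_class]
        cv += label_rows[n_train_per_class:n_train_per_class + n_cv_per_class]
        test += label_rows[n_train_per_class + n_cv_per_class:]
    return train, cv, test

# Toy data: 10 sarcastic and 10 non-sarcastic comments.
rows = [(f"comment {i}", i % 2) for i in range(20)]
train, cv, test = balanced_split(rows, n_train_per_class=8, n_cv_per_class=1)
print(len(train), len(cv), len(test))  # 16 2 2
```

With the real data the same call would use n_train_per_class=400_000 and n_cv_per_class=50_000.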

Counting the unique subreddits in our dataset gives ~13,000, and the top subreddits by number of sarcastic comments are:

Apparently, politics is a topic which invites a lot of sarcasm.
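The subreddit counts discussed above can be computed roughly as follows. This is a minimal sketch on made-up rows; the (subreddit, label) layout is an assumption for illustration, and the original analysis was presumably done on the pandas dataframe instead.

```python
from collections import Counter

# Toy rows: (subreddit, label) pairs with label 1 = sarcastic.
rows = [
    ("politics", 1), ("politics", 1), ("politics", 0),
    ("AskReddit", 1), ("AskReddit", 0), ("nba", 0),
]

unique_subreddits = {subreddit for subreddit, _ in rows}
sarcastic_counts = Counter(subreddit for subreddit, label in rows if label == 1)

print(len(unique_subreddits))           # number of distinct subreddits
print(sarcastic_counts.most_common(2))  # subreddits with the most sarcastic comments
```

The same counting works for authors by swapping the subreddit column for the author column.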
Let’s do the same analysis for authors.

We have ~228K unique authors in our dataset, and the authors with the most sarcastic comments are:

We can also find the words that appear most often in sarcastic and non-sarcastic comments, which can help us classify comments into the right category. Building a word cloud from the rows where train_bal['label'] == 0 gives the word cloud for non-sarcastic comments.

Similarly, we can get a word cloud for sarcastic comments by filtering on train_bal['label'] == 1.
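A word cloud is essentially a picture of word frequencies, so the numbers behind it can be sketched with a simple counter. The helper top_words and the toy comments below are hypothetical; drawing the actual cloud would need a plotting library, which is left out here.

```python
from collections import Counter

def top_words(comments, labels, target_label, n=3):
    """Return the n most frequent lowercase words among comments with target_label."""
    counts = Counter()
    for comment, label in zip(comments, labels):
        if label == target_label:
            counts.update(comment.lower().split())
    return counts.most_common(n)

comments = ["yeah right great idea", "great game last night", "oh great another monday"]
labels = [1, 0, 1]
print(top_words(comments, labels, target_label=1, n=1))  # [('great', 2)]
```

Switching target_label between 0 and 1 mirrors the label filter described above.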

What about the length of the comments? Can they help us differentiate between the two categories? Let’s examine this using the following code snippet:

We’ll get the following histogram

We see no clear distinction between the comment lengths of the two categories, so comment length is not a good differentiator.
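As a numeric stand-in for the histogram, we can compare summary statistics of the comment lengths per label. This is a sketch under my own assumptions: length is measured in whitespace-separated words, and the toy comments are invented.

```python
from statistics import mean, median

def length_summary(comments, labels):
    """Return {label: (mean_words, median_words)} for comment lengths in words."""
    lengths = {}
    for comment, label in zip(comments, labels):
        lengths.setdefault(label, []).append(len(comment.split()))
    return {label: (mean(vals), median(vals)) for label, vals in lengths.items()}

comments = ["oh sure", "that was a good match", "totally believable", "see you at noon"]
labels = [1, 0, 1, 0]
print(length_summary(comments, labels))
```

If the two labels show similar means and medians on the real data, that supports the conclusion drawn from the histogram.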

Data Cleaning

As we are dealing with text data, we need to clean our text to make it suitable for deep learning models. Appropriate cleaning can improve our model’s performance a lot. We can use the following code snippet to clean our data:

To remove emoticons, we’ve used an exhaustive list of all emoticons maintained here: https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
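A cleaning step of this kind could look like the sketch below. The exact rules are my assumptions (lowercasing, dropping URLs, keeping only letters), not necessarily the ones used in the experiments; in particular, stripping every non-letter character removes emoticons without needing the emoticon list, but it also removes punctuation that may itself carry sarcasm cues.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")
SPACE_RE = re.compile(r"\s+")

def clean_comment(text):
    """Lowercase, drop URLs, strip non-letter characters, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)   # also removes digits, punctuation, emoji
    return SPACE_RE.sub(" ", text).strip()

print(clean_comment("Must buy!! So CHEAP :) see https://example.com"))
# must buy so cheap see
```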

Deep Learning Models

Now we’ve come to the main part. We’ll prepare a deep learning model that classifies a comment as sarcastic or not. We’ll use our train dataset for training the model, the cross-validation dataset for tuning it and the test dataset for calculating its final performance. But first, we’ll convert our raw text data to a format that our model can understand. Mathematical models don’t understand words; they understand numbers. So we’ll convert our raw text to meaningful numbers that our model can work with.

Preparing data for models

Consider the following code snippet:

Let’s try to understand what we did above.

We’ll use the train dataset to create a dictionary of words. Each unique word in our training data will be in this dictionary, and each word will have a unique number assigned to it. For example, if this dictionary contains 3 words A, B and C, then A can have the number 1 associated with it, B can have 2, and so on.
The function fit_on_texts() creates this dictionary and assigns a unique number to each word. The function texts_to_sequences() converts our sentences from lists of words to lists of the numbers associated with those words. For example, the sentence ‘i love mondays’ becomes [3, 34, 567] if ‘i’ = 3, ‘love’ = 34 and ‘mondays’ = 567 in our dictionary.
Next, sentences can be of varied length. We want to normalize this and make all our sentences the same length, so we’ve set the maximum sentence length to 50 (max_length = 50). Sentences longer than 50 words are trimmed, and sentences shorter than 50 words are padded with zeroes at the end. The function pad_sequences() helps us do this.
Finally, we have a list of numbers representing our sentences and suitable to plug into deep learning models!
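To make the three steps concrete, here is a dependency-free re-implementation of the idea. It is not the Keras code itself: the function fit_vocabulary is a hypothetical stand-in for fit_on_texts(), and the other two helpers only imitate the behaviour described above (unknown words dropped, zeros appended at the end).

```python
from collections import Counter

def fit_vocabulary(texts):
    """Map each word to an integer index, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_sequences(sequences, max_length):
    """Trim long sequences and pad short ones with zeros at the end."""
    return [seq[:max_length] + [0] * (max_length - len(seq[:max_length])) for seq in sequences]

train_texts = ["i love mondays", "i love sarcasm so much"]
word_index = fit_vocabulary(train_texts)
padded = pad_sequences(texts_to_sequences(["i love tuesdays"], word_index), max_length=4)
print(padded)  # [[1, 2, 0, 0]]
```

One detail worth knowing when using the Keras function: pad_sequences() pads and truncates at the start by default, so padding at the end corresponds to passing padding='post' (and truncating='post' to trim from the end as well).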

We’ll also convert our labels (the y’s) to one-hot form so that they are suitable for our models. We are doing this because we will be using ‘categorical_crossentropy’ as our loss function. More on this later.
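Categorical cross-entropy expects one column per class, so each label becomes a two-element vector. The helper below is a plain-Python sketch of that conversion (the name to_one_hot is my own); in Keras the utility to_categorical does the same job.

```python
def to_one_hot(labels, num_classes=2):
    """Turn integer class labels into one-hot rows, e.g. 1 -> [0, 1]."""
    return [[1 if cls == label else 0 for cls in range(num_classes)] for label in labels]

print(to_one_hot([1, 0, 1]))  # [[0, 1], [1, 0], [0, 1]]
```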

Preparing embedding matrix for embedding layer

We’ll use an embedding layer in our Keras model and we’ll use Word2Vec embeddings as pre-trained weights for this embedding layer. More about embedding layer and w2v embeddings can be found here:

In a nutshell, we’ll use a 300-dim numerical representation for each word in our model and feed these representations to our DL model for good results. For each word, we’ll have a 300-dim vector so our embedding matrix will be of shape |V| x 300 where |V| is the size of our vocabulary. We can create this matrix using the following code:
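Building such a matrix might look like the sketch below. The names build_embedding_matrix and word_vectors are hypothetical, and the toy vectors have 3 dimensions instead of 300. I also assume that index 0 is reserved for padding, which makes the matrix one row taller than the vocabulary.

```python
import numpy as np

def build_embedding_matrix(word_index, word_vectors, dim):
    """Row i holds the vector of the word with index i; row 0 (padding) and
    words missing from word_vectors stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        vector = word_vectors.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix

# Toy 3-dim vectors standing in for 300-dim Word2Vec vectors.
word_vectors = {"love": np.array([0.1, 0.2, 0.3]), "mondays": np.array([0.4, 0.5, 0.6])}
word_index = {"i": 1, "love": 2, "mondays": 3}
embedding_matrix = build_embedding_matrix(word_index, word_vectors, dim=3)
print(embedding_matrix.shape)  # (4, 3)
```

Words without a pre-trained vector (here ‘i’) keep an all-zero row, which is one simple choice among several.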

Model 1: Content based input

For the first model, we’re trying a simple sequential model using 1-D CNNs and Maxpooling layers. We’ve used only content-based input for this model. By content based, we mean that we’ve used information from raw comments only as input to this model. The model architecture is as follows:

This model architecture is taken from this paper: https://arxiv.org/abs/1610.08815. The paper explores deep learning for sarcasm detection on Twitter datasets. We’ve taken the useful ideas from it and applied them to the SARC dataset. Following is our model definition, compilation and training code:

Points to note:

We’ve used the embedding_matrix that we created earlier as the weights of the embedding layer
We’re using ‘categorical_crossentropy’ as loss function to train our model
We’ll measure metrics like ‘accuracy’ and ‘F1-score’
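To give a feel for what a 1-D convolution followed by max pooling computes on an embedded comment, here is a small NumPy forward pass. It is not the Keras model from this post: the filter count, the kernel size and the use of global max pooling over the whole sequence are my assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

def conv1d_global_max(embedded, filters, bias):
    """Valid 1-D convolution with ReLU, followed by max pooling over time.

    embedded: (sequence_length, embedding_dim) matrix for one comment
    filters:  (num_filters, kernel_size, embedding_dim)
    bias:     (num_filters,)
    Returns one feature per filter.
    """
    num_filters, kernel_size, _ = filters.shape
    steps = embedded.shape[0] - kernel_size + 1
    activations = np.empty((steps, num_filters))
    for t in range(steps):
        window = embedded[t:t + kernel_size]              # (kernel_size, embedding_dim)
        activations[t] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    activations = np.maximum(activations, 0.0)            # ReLU
    return activations.max(axis=0)                        # global max pooling

rng = np.random.default_rng(0)
embedded = rng.normal(size=(50, 300))     # one padded comment, 50 tokens x 300 dims
filters = rng.normal(size=(8, 3, 300))    # 8 filters spanning 3 consecutive words
features = conv1d_global_max(embedded, filters, bias=np.zeros(8))
print(features.shape)  # (8,)
```

Each filter slides over windows of consecutive words, and the pooling keeps its strongest response, so the comment ends up as a fixed-size feature vector regardless of where a telling phrase appears.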

Results of Model 1

Let’s evaluate our model on test dataset and see the classification report. We’ll use the below code snippet.

We get an F1-score of 0.72 using this model.
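For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it by hand for the sarcastic class on made-up predictions; the numbers are purely illustrative and unrelated to the result above.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(round(f1_score(y_true, y_pred), 3))  # 0.667
```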

Developing context features

As we saw in a previous section, sarcasm detection depends a lot on context. So in this section we’ll develop some features that represent the context in which a comment was made. Although it’s difficult to extract and represent the exact context of each comment, we’ll take some ideas from a brilliant paper on this subject called CASCADE (https://arxiv.org/abs/1805.06413). To quote from this paper:

People possess their own idiolect and authorship styles, which is reflected in their writings. These styles are generally affected by attributes such as gender, diction, syntactic influences, etc. We use this motivation to learn stylometric features of the users by consolidating their online comments into documents.

In a nutshell: For each comment in our data we have an author ID associated with it. So we’ll learn a vector representation for each author in our data. Then along with inputting vectors which represent our comment we’ll also input these user representations to our model. This will act as our context for now. Steps followed for finding user representations:

1. For each author A in the train data, collect all the comments written by A into one long string
2. Using these author-document pairs, train paragraph vectors (also called doc2vec, similar to word2vec; more can be found here and here)
3. Extract the user representation for each author in the train, cv and test datasets from the model trained in step 2

In this way, we get our context information in the form of user stylometric features.
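The first of the steps above, collecting one document per author, can be sketched as follows. The function name and the toy data are hypothetical. Training the paragraph vectors themselves is not shown; a library such as gensim provides a Doc2Vec implementation that could take these documents as input.

```python
from collections import defaultdict

def build_author_documents(authors, comments):
    """Join every comment by the same author into one long string."""
    grouped = defaultdict(list)
    for author, comment in zip(authors, comments):
        grouped[author].append(comment)
    return {author: " ".join(texts) for author, texts in grouped.items()}

authors = ["alice", "bob", "alice"]
comments = ["oh great", "nice catch", "just what i needed"]
docs = build_author_documents(authors, comments)
print(docs["alice"])  # oh great just what i needed
```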

Model 2: Content + Context based input

In addition to the raw content input to our CNN model, this time we’ll also add the context information extracted above and see whether any improvements are observed. Our model looks like this now:

We’ll train this model on content based features + context based features.
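One plausible way to combine the two kinds of features is to concatenate the content features with the user vector and feed the result to a softmax layer. The sketch below shows only that step in NumPy; the vector sizes and the single dense layer are assumptions of mine, and the weights are random rather than trained, so this illustrates the data flow and not the actual model.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def combine_and_classify(content_features, user_vector, weights, bias):
    """Concatenate content and user features, then apply one softmax layer."""
    joint = np.concatenate([content_features, user_vector])
    return softmax(weights @ joint + bias)

rng = np.random.default_rng(0)
content_features = rng.normal(size=8)    # e.g. pooled CNN features of the comment
user_vector = rng.normal(size=5)         # e.g. paragraph vector of the author
weights = rng.normal(size=(2, 13))       # 2 classes, 8 + 5 inputs
probs = combine_and_classify(content_features, user_vector, weights, np.zeros(2))
print(probs.shape)  # (2,)
```

The two outputs are the probabilities for the non-sarcastic and sarcastic classes, matching the one-hot label layout.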

Results of Model 2

Evaluating this model on our test dataset gives us the following results:

We get an F1-score of 0.727 using this model.

Conclusions

We’ve applied DL models to the task of detecting sarcastic comments and achieved an F1-score of 0.727. We’ve also tried to include context information in the form of user embeddings, but they improved our results only marginally (from 0.72 to 0.727). Other ways to introduce context are described in the CASCADE paper: https://arxiv.org/abs/1805.06413.

References

A Deeper Look into Sarcastic Tweets Using Deep Convolutional Neural Networks — https://arxiv.org/abs/1610.08815
CASCADE: Contextual Sarcasm Detection in Online Discussion Forums — https://arxiv.org/abs/1805.06413
SARC: http://www.lrec-conf.org/proceedings/lrec2018/pdf/160.pdf
https://www.kaggle.com/danofer/sarcasm/kernels
Applied AI Course — https://www.appliedaicourse.com/