Is Sarcasm Difficult To Convey Through Text

Yang Lu
INST414: Data Science Techniques
May 12, 2022

This post takes on the age-old question of whether “/s” and other similar notations are necessary to convey sarcasm through text. Framed as a classification machine learning problem, the non-obvious insight I wanted to extract is whether sarcasm is truly difficult to convey without an audible tone. The dataset I used is the Sarcasm on Reddit dataset on Kaggle, which its author gathered from the Self-Annotated Reddit Corpus (SARC).

I split the data into 80/20 train and test sets. Since the data concerns words, specifically which words signal sarcasm, I used term frequency-inverse document frequency, or TF-IDF. I also split each set into comment and label, since I am focusing on how to predict the label from the comment.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# 80/20 train/test split of comments and labels
comment_train, comment_test, label_train, label_test = train_test_split(
    df["comment"].astype(str), df["label"], test_size=0.2, random_state=42)

# TF-IDF over unigrams and bigrams, fed into logistic regression
tfidfReg = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(random_state=42, solver="liblinear")),
])
tfidfReg.fit(comment_train, label_train)

The accuracy is tested using scikit-learn Pipeline’s score method.

print(f"The accuracy on the training set is: {tfidfReg.score(comment_train, label_train)}")
print(f"The accuracy on the test set is: {tfidfReg.score(comment_test, label_test)}")

By combining the test set comments with the true labels and the predicted labels, we can visually see how well the model works.
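A minimal sketch of that combination, assuming the variables from the split above (the column names here are my own choice):

import pandas as pd

# Put the test comments next to their true and predicted labels
# so misclassifications are easy to spot side by side.
results = pd.DataFrame({
    "comment": comment_test,
    "true_label": label_test,
    "predicted_label": tfidfReg.predict(comment_test),
})
print(results.head())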

import eli5

# Inspect which TF-IDF features the logistic regression weights most heavily.
a = eli5.explain_weights(tfidfReg)
print(eli5.format_as_dataframe(a))

For failure analysis, I compared the words in the comments against the common words that indicate sarcasm in the overall dataset. From the explanation data frame, it is evident that some words do convey sarcasm. However, it does not explain wrongly labeled comments such as “First season since 2007 that I didnt go to any CFB games :-(“, which the model labeled as sarcasm. This could be because of the sad face emoticon: most of the non-sarcastic comments in the training set did not include emoticons. Another comment it got wrong was “But they’ll have all those reviews!”, which it did not label as sarcasm. Other failures include “Should have been a lunatic you get to build a pyramid for your cat.” (a false positive), “Hoiberg said “great players”, not Dwight” (a false negative), and “Oh, I never realized it was so easy, why had I, and every other lonely person on earth never thought of that before?” (also a false negative).
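One way to run this comparison, sketched under the assumption that the eli5 explanation above is available; the overlapping_features helper is hypothetical, not from the original post, and the exact column layout of format_as_dataframe can vary, though it includes a feature column:

# Rough check: which of the model's top-weighted features appear
# in a given misclassified comment?
weights = eli5.format_as_dataframe(a)
top_features = set(weights["feature"].head(50))

def overlapping_features(comment):
    # Approximate the TfidfVectorizer tokenization with a simple split.
    words = comment.lower().split()
    unigrams = set(words)
    bigrams = {" ".join(pair) for pair in zip(words, words[1:])}
    return (unigrams | bigrams) & top_features

print(overlapping_features("First season since 2007 that I didnt go to any CFB games :-("))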

These failures suggest that there is another factor besides the words themselves that comes into play when deciding whether sarcasm is intended. I think the words alone do not provide the context in which the comment is used. For example, saying “hot potato” about a freshly baked potato versus “hot potato” about a rather unpleasant-looking potato.

In doing some exploratory analysis on the dataset to explain these failures, I found there are some things the model does not seem to account for, such as punctuation.

Running a more condensed form of the dataset, I found that sarcastic comments tend to contain exclamation marks, which would explain why the model failed on “But they’ll have all those reviews!”. Interestingly, the numbers were similar for sarcastic and non-sarcastic comments in terms of question marks.
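A sketch of that punctuation check, assuming the full data frame df with its comment and label columns (label taken to be 1 for sarcastic, 0 for not, per the dataset):

# Share of comments containing each punctuation mark, by sarcasm label.
comments = df["comment"].astype(str)
for mark in ["!", "?"]:
    rates = comments.str.contains(mark, regex=False).groupby(df["label"]).mean()
    print(f"Share of comments containing '{mark}' by label:\n{rates}\n")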

In predicting whether a text is sarcastic or not, the overall model is not very accurate, at around 70–80% accuracy. There are features besides words that influence whether a text is sarcastic. However, some words have a higher chance of predicting a sarcastic text, such as “obviously”, “yes, because”, or “clearly”. This is because these words obviously, clearly show sarcasm. Or do they?
