Detecting Headline Sarcasm with Machine Learning

Training ML models to recognize sarcasm better than I can


Introduction

Sarcasm can be incredibly difficult to spot over the internet. So, why not train machine learning models to discern it for us?

In this article, I go over how to build machine learning models that can detect sarcasm in news headlines.

I based my data preprocessing and deep learning model on the steps shown in this text classification tutorial by Google Developers:

Data Source

I used version 1 of this dataset for this project:

The sarcastic news headlines are from The Onion, while nonsarcastic ones are from HuffPost.

Each data entry consists of three attributes:

  • is_sarcastic: 1 if the sample is sarcastic, 0 otherwise
  • headline: the headline of the news article
  • article_link: link to the original news article

Setup

First things first, I import the Python libraries that I’ll need for this project:
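
The original code embed isn't reproduced here; a representative set of imports for this walkthrough (pandas and NumPy for data handling, Matplotlib for plots, scikit-learn and TensorFlow/Keras for the models) might look like this:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.svm import LinearSVC
    from sklearn.naive_bayes import ComplementNB
    from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

    import tensorflow as tf
    from tensorflow.keras import layers, models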

Then I load the dataset into a Pandas DataFrame. In the process, I drop the article links as I don’t use them in this project.
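
A minimal sketch of that loading step, assuming the version-1 file is stored locally as Sarcasm_Headlines_Dataset.json (the file name and path are assumptions):

    # The dataset ships as one JSON record per line.
    df = pd.read_json("Sarcasm_Headlines_Dataset.json", lines=True)

    # Drop the article links; only the headline text and label are used.
    df = df.drop(columns=["article_link"])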

Now we can take a look at the full text of the first few headlines:
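
For example, widening the column display in pandas keeps long headlines from being truncated:

    # Show the full headline text instead of truncating long strings.
    pd.set_option("display.max_colwidth", None)
    print(df.head())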

Exploratory Data Analysis

To start, let’s quickly double-check that we have no missing values and that there are only two labels in the is_sarcastic column, 0 for a nonsarcastic headline and 1 for sarcastic:
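
One way to run these checks (a sketch, not the original embed):

    print(df.shape)                     # number of samples and columns
    print(df.isnull().sum())            # missing values per column
    print(df["is_sarcastic"].unique())  # should contain only 0 and 1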

We have 26,709 samples and no missing values in this dataset.

For further analysis, I’ll group our dataset by class:
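
A one-line sketch of that grouping:

    grouped = df.groupby("is_sarcastic")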

Now we can check the number of samples per class:
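
For instance, using the grouped DataFrame from above to count samples and to compute the median headline length in words:

    # Samples per class (0 = nonsarcastic, 1 = sarcastic).
    print(grouped.size())

    # Median number of words per headline, per class.
    print(grouped["headline"].apply(lambda s: s.str.split().str.len().median()))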

We have 11,724 sarcastic samples and 14,985 nonsarcastic samples. There is a slight class imbalance, but it’s small enough that I’ll ignore it for this project. As for the median number of words per headline, it is 10 for both classes.

Here are the plots of the sample length distribution for both sarcastic and nonsarcastic headlines:
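
The plotting code itself isn’t reproduced here; a simple histogram per class, assuming headline length is measured in words, could look like this:

    fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
    for ax, (label, group) in zip(axes, df.groupby("is_sarcastic")):
        lengths = group["headline"].str.split().str.len()
        ax.hist(lengths, bins=range(0, 30))
        ax.set_title("sarcastic" if label == 1 else "nonsarcastic")
        ax.set_xlabel("words per headline")
    axes[0].set_ylabel("number of headlines")
    plt.show()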

Next, I’ll take a look at the frequency distribution of n-grams. Here are the plots of the 50 most common n-grams for both sarcastic and nonsarcastic headlines:
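
One way to produce such plots is to count unigrams and bigrams with CountVectorizer and chart the 50 most frequent ones per class; this is a sketch and may differ from the original code:

    from sklearn.feature_extraction.text import CountVectorizer

    def plot_top_ngrams(headlines, title, top_k=50):
        # Count unigrams and bigrams across the given headlines.
        vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
        counts = vectorizer.fit_transform(headlines).sum(axis=0).A1
        ngrams = vectorizer.get_feature_names_out()

        # Keep only the top_k most frequent n-grams.
        top = np.argsort(counts)[::-1][:top_k]
        plt.figure(figsize=(12, 4))
        plt.bar([ngrams[i] for i in top], counts[top])
        plt.xticks(rotation=90)
        plt.title(title)
        plt.tight_layout()
        plt.show()

    plot_top_ngrams(df[df["is_sarcastic"] == 1]["headline"], "sarcastic")
    plot_top_ngrams(df[df["is_sarcastic"] == 0]["headline"], "nonsarcastic")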

Data Preprocessing

According to Google Developers’ text classification tutorial, the ratio of the number of samples to the median number of words per sample determines whether an n-gram model or a sequence model will perform better on a given dataset. Specifically, n-gram models perform better than, or at least as well as, sequence models when this ratio is smaller than 1,500.

To determine which category our dataset falls into, I’ll calculate the ratio:
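
The calculation itself is just the number of samples divided by the median number of words per sample, computed here per class:

    for label, group in df.groupby("is_sarcastic"):
        median_words = group["headline"].str.split().str.len().median()
        ratio = len(group) / median_words
        print(f"class {label}: {len(group)} samples / {median_words} words = {ratio:.0f}")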

In this dataset, the ratio is less than 1500 for both classes, so we’ll go with n-gram models.

Now I’ll shuffle and split the dataset into training and validation sets in a 4:1 ratio:
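
A sketch of the split using scikit-learn’s train_test_split, which shuffles by default (the random seed is an arbitrary choice):

    train_texts, val_texts, train_labels, val_labels = train_test_split(
        df["headline"], df["is_sarcastic"],
        test_size=0.2,      # 4:1 train/validation ratio
        random_state=42,    # for reproducibility
        shuffle=True,
    )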

Finally, I tokenize and vectorize the headlines in both sets, using the top 20,000 n-grams as features.
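
Following the spirit of the Google Developers tutorial, this can be done with tf-idf-weighted unigrams and bigrams followed by feature selection; the specific vectorizer settings below are assumptions:

    # Tokenize into unigrams and bigrams and apply tf-idf weighting.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    x_train = vectorizer.fit_transform(train_texts)
    x_val = vectorizer.transform(val_texts)

    # Keep the 20,000 features with the highest f_classif scores.
    selector = SelectKBest(f_classif, k=min(20000, x_train.shape[1]))
    x_train = selector.fit_transform(x_train, train_labels)
    x_val = selector.transform(x_val)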

Training the Model

At long last, we’re ready to build and train our models! Multilayer perceptrons are the recommended type of n-gram model in the text classification tutorial, so that’s what I’ll construct in this article. But before creating a multilayer perceptron model, let’s try a few machine learning algorithms from the scikit-learn library. Here’s a helpful cheat sheet for choosing the best estimator for your machine learning task:

1. Linear Support Vector Classification

The trained model provides an accuracy of 85.8% on the validation set. I also plot the confusion matrix, which is another performance metric for classifiers.
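
A sketch of how this model can be trained and evaluated on the vectorized features:

    svc = LinearSVC()
    svc.fit(x_train, train_labels)

    val_preds = svc.predict(x_val)
    print("validation accuracy:", accuracy_score(val_labels, val_preds))

    # Confusion matrix: rows are true classes, columns are predicted classes.
    ConfusionMatrixDisplay.from_predictions(val_labels, val_preds)
    plt.show()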

In a confusion matrix, entry (i, j) is the number of observations that actually belong to group i but were predicted to be in group j.

This provides information about how the model performs on each class, which is vital for imbalanced datasets. With a class imbalance, the model may achieve high validation accuracy by being biased towards the majority class alone. You can read more about confusion matrices here:

2. Complement Naive Bayes

For the complement naive Bayes model, the validation accuracy is 85.3%.
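
The equivalent sketch for complement naive Bayes (which works here because the tf-idf features are non-negative):

    cnb = ComplementNB()
    cnb.fit(x_train, train_labels)
    print("validation accuracy:", cnb.score(x_val, val_labels))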

3. Multilayer Perceptron

Lastly, I build and train a deep learning model using Keras.

I experimented with a variety of hyperparameter values, including batch size, regularization, and dropout. The results shown here are from my best trial run.
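
The exact hyperparameters from that best run aren’t reproduced here; a representative two-layer multilayer perceptron in Keras, with layer sizes, dropout rate, learning rate, and epoch count as assumptions, might look like this:

    model = models.Sequential([
        tf.keras.Input(shape=(x_train.shape[1],)),
        layers.Dropout(0.2),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1, activation="sigmoid"),  # binary output: sarcastic or not
    ])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )

    # Recent versions of tf.keras accept the SciPy sparse matrices produced above;
    # call .toarray() on them first if your version does not.
    history = model.fit(
        x_train.astype("float32"), np.asarray(train_labels),
        validation_data=(x_val.astype("float32"), np.asarray(val_labels)),
        epochs=20,
        batch_size=128,
        callbacks=[tf.keras.callbacks.EarlyStopping(patience=2)],
    )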

After training, I plot the loss curve, accuracy curve, and confusion matrix:
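
The curves come straight from the History object returned by model.fit, for example:

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.plot(history.history["loss"], label="training")
    ax1.plot(history.history["val_loss"], label="validation")
    ax1.set_title("loss")
    ax1.legend()

    ax2.plot(history.history["accuracy"], label="training")
    ax2.plot(history.history["val_accuracy"], label="validation")
    ax2.set_title("accuracy")
    ax2.legend()
    plt.show()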

In the loss plot, both training and validation loss decrease sharply before leveling off. The accuracy plot mirrors this: both training and validation accuracy rise before plateauing.

Conclusion

We’ve successfully trained multiple models to detect sarcasm in news headlines! All of our trained models reached a validation accuracy of 85–86%. But what if we want to do better? Here are some ideas for improving model performance:

  • Tune the hyperparameters of the scikit-learn models by going through the estimators’ documentation
  • Tune the hyperparameters of the multilayer perceptron model using the Keras Tuner as illustrated in this tutorial:

You can also try training a sequence model and comparing its performance to the n-gram models created in this walkthrough!

References

In addition to the ones linked throughout this article, I wouldn’t have been able to complete this project without the help of these awesome examples and tutorials:

[1] Udacity | Intro to TensorFlow for Deep Learning by TensorFlow

[2] TensorFlow | Simple audio recognition: Recognizing keywords

[3] TensorFlow | Overfit and underfit
