Detecting Headline Sarcasm with Machine Learning

Training ML models to recognize sarcasm better than I can


Introduction

Sarcasm can be incredibly difficult to spot over the internet. So, why not train machine learning models to discern it for us?

In this article, I go over how to build machine learning models that can detect sarcasm in news headlines.

I based my data preprocessing and deep learning model on the steps shown in this text classification tutorial by Google Developers:

Data Source

I used version 1 of this dataset for this project:

The sarcastic news headlines are from The Onion, while nonsarcastic ones are from HuffPost.

Each data entry consists of three attributes:

  • is_sarcastic: 1 if the sample is sarcastic, 0 otherwise
  • headline: the headline of the news article
  • article_link: link to the original news article

Setup

First things first, I import the Python libraries that I’ll need for this project:
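
The original code embed isn't reproduced here; a representative set of imports for this walkthrough (pandas and NumPy for data handling, Matplotlib for plots, scikit-learn and TensorFlow/Keras for the models) might look like this:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.svm import LinearSVC
    from sklearn.naive_bayes import ComplementNB
    from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay

    import tensorflow as tf
    from tensorflow.keras import layers, models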

Then I load the dataset into a Pandas DataFrame. In the process, I drop the article links as I don’t use them in this project.
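
A minimal sketch of that loading step, assuming the version-1 file is stored locally as Sarcasm_Headlines_Dataset.json (the file name and path are assumptions):

    # The dataset ships as one JSON record per line.
    df = pd.read_json("Sarcasm_Headlines_Dataset.json", lines=True)

    # Drop the article links; only the headline text and label are used.
    df = df.drop(columns=["article_link"])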

Now we can take a look at the full text of the first few headlines:
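
For example, widening the column display in pandas keeps long headlines from being truncated:

    # Show the full headline text instead of truncating long strings.
    pd.set_option("display.max_colwidth", None)
    print(df.head())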

Exploratory Data Analysis

To start, let’s quickly double-check that we have no missing values and that there are only two labels in the is_sarcastic column, 0 for a nonsarcastic headline and 1 for sarcastic:
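
One way to run these checks (a sketch, not the original embed):

    print(df.shape)                     # number of samples and columns
    print(df.isnull().sum())            # missing values per column
    print(df["is_sarcastic"].unique())  # should contain only 0 and 1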

We have 26,709 samples and no missing values in this dataset.

For further analysis, I’ll group our dataset by class:
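
A one-line sketch of that grouping:

    grouped = df.groupby("is_sarcastic")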

Now we can check the number of samples per class:
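
For instance, using the grouped DataFrame from above to count samples and to compute the median headline length in words:

    # Samples per class (0 = nonsarcastic, 1 = sarcastic).
    print(grouped.size())

    # Median number of words per headline, per class.
    print(grouped["headline"].apply(lambda s: s.str.split().str.len().median()))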

We have 11,724 sarcastic samples and 14,985 nonsarcastic samples. There is a slight class imbalance, but it’s small enough that I’ll ignore it for this project. As for the median number of words per headline, it is 10 for both classes.

Here are the plots of the sample length distribution for both sarcastic and nonsarcastic headlines:
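
The plotting code itself isn’t reproduced here; a simple histogram per class, assuming headline length is measured in words, could look like this:

    fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
    for ax, (label, group) in zip(axes, df.groupby("is_sarcastic")):
        lengths = group["headline"].str.split().str.len()
        ax.hist(lengths, bins=range(0, 30))
        ax.set_title("sarcastic" if label == 1 else "nonsarcastic")
        ax.set_xlabel("words per headline")
    axes[0].set_ylabel("number of headlines")
    plt.show()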

Next, I’ll take a look at the frequency distribution of n-grams. Here are the plots of the 50 most common n-grams for both sarcastic and nonsarcastic headlines:
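
One way to produce such plots is to count unigrams and bigrams with CountVectorizer and chart the 50 most frequent ones per class; this is a sketch and may differ from the original code:

    from sklearn.feature_extraction.text import CountVectorizer

    def plot_top_ngrams(headlines, title, top_k=50):
        # Count unigrams and bigrams across the given headlines.
        vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english")
        counts = vectorizer.fit_transform(headlines).sum(axis=0).A1
        ngrams = vectorizer.get_feature_names_out()

        # Keep only the top_k most frequent n-grams.
        top = np.argsort(counts)[::-1][:top_k]
        plt.figure(figsize=(12, 4))
        plt.bar([ngrams[i] for i in top], counts[top])
        plt.xticks(rotation=90)
        plt.title(title)
        plt.tight_layout()
        plt.show()

    plot_top_ngrams(df[df["is_sarcastic"] == 1]["headline"], "sarcastic")
    plot_top_ngrams(df[df["is_sarcastic"] == 0]["headline"], "nonsarcastic")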

Data Preprocessing

According to Google Developers’ text classification tutorial, the ratio of the number of samples to the median number of words per sample determines whether an n-gram model or a sequence model will perform better on a given dataset. Specifically, n-gram models perform better than, or at least as well as, sequence models when this ratio is smaller than 1,500.

To determine which category our dataset falls into, I’ll calculate the ratio:
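
The calculation itself is just the number of samples divided by the median number of words per sample, computed here per class:

    for label, group in df.groupby("is_sarcastic"):
        median_words = group["headline"].str.split().str.len().median()
        ratio = len(group) / median_words
        print(f"class {label}: {len(group)} samples / {median_words} words = {ratio:.0f}")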

In this dataset, the ratio is less than 1500 for both classes, so we’ll go with n-gram models.

Now I’ll shuffle and split the dataset into training and validation sets in a 4:1 ratio:
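
A sketch of the split using scikit-learn’s train_test_split, which shuffles by default (the random seed is an arbitrary choice):

    train_texts, val_texts, train_labels, val_labels = train_test_split(
        df["headline"], df["is_sarcastic"],
        test_size=0.2,      # 4:1 train/validation ratio
        random_state=42,    # for reproducibility
        shuffle=True,
    )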

Finally, I tokenize and vectorize the headlines in both sets, using the top 20,000 n-grams as features.
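
Following the spirit of the Google Developers tutorial, this can be done with tf-idf-weighted unigrams and bigrams followed by feature selection; the specific vectorizer settings below are assumptions:

    # Tokenize into unigrams and bigrams and apply tf-idf weighting.
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    x_train = vectorizer.fit_transform(train_texts)
    x_val = vectorizer.transform(val_texts)

    # Keep the 20,000 features with the highest f_classif scores.
    selector = SelectKBest(f_classif, k=min(20000, x_train.shape[1]))
    x_train = selector.fit_transform(x_train, train_labels)
    x_val = selector.transform(x_val)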

Training the Model

At long last, we’re ready to build and train our models! Multilayer perceptrons are the recommended type of n-gram model in the text classification tutorial, so that’s what I’ll construct in this article. But before creating a multilayer perceptron model, let’s try a few machine learning algorithms from the scikit-learn library. Here’s a helpful cheat sheet for choosing the best estimator for your machine learning task:

1. Linear Support Vector Classification

The trained model provides an accuracy of 85.8% on the validation set. I also plot the confusion matrix, which is another performance metric for classifiers.
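
A sketch of how this model can be trained and evaluated on the vectorized features:

    svc = LinearSVC()
    svc.fit(x_train, train_labels)

    val_preds = svc.predict(x_val)
    print("validation accuracy:", accuracy_score(val_labels, val_preds))

    # Confusion matrix: rows are true classes, columns are predicted classes.
    ConfusionMatrixDisplay.from_predictions(val_labels, val_preds)
    plt.show()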

In a confusion matrix, entry (i, j) is the number of observations that actually belong to group i but were predicted to be in group j.

This provides information about how the model performs on each class, which is vital for imbalanced datasets. With a class imbalance, the model may achieve high validation accuracy by being biased towards the majority class alone. You can read more about confusion matrices here:

2. Complement Naive Bayes

For the complement naive Bayes model, the validation accuracy is 85.3%.
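
The equivalent sketch for complement naive Bayes (which works here because the tf-idf features are non-negative):

    cnb = ComplementNB()
    cnb.fit(x_train, train_labels)
    print("validation accuracy:", cnb.score(x_val, val_labels))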

3. Multilayer Perceptron

Lastly, I build and train a deep learning model using Keras.

I experimented with a variety of hyperparameter values, including batch size, regularization, and dropout. The results shown here are from my best trial run.
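
The exact hyperparameters from that best run aren’t reproduced here; a representative two-layer multilayer perceptron in Keras, with layer sizes, dropout rate, learning rate, and epoch count as assumptions, might look like this:

    model = models.Sequential([
        tf.keras.Input(shape=(x_train.shape[1],)),
        layers.Dropout(0.2),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(1, activation="sigmoid"),  # binary output: sarcastic or not
    ])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )

    # Recent versions of tf.keras accept the SciPy sparse matrices produced above;
    # call .toarray() on them first if your version does not.
    history = model.fit(
        x_train.astype("float32"), np.asarray(train_labels),
        validation_data=(x_val.astype("float32"), np.asarray(val_labels)),
        epochs=20,
        batch_size=128,
        callbacks=[tf.keras.callbacks.EarlyStopping(patience=2)],
    )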

After training, I plot the loss curve, accuracy curve, and confusion matrix:
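
The curves come straight from the History object returned by model.fit, for example:

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.plot(history.history["loss"], label="training")
    ax1.plot(history.history["val_loss"], label="validation")
    ax1.set_title("loss")
    ax1.legend()

    ax2.plot(history.history["accuracy"], label="training")
    ax2.plot(history.history["val_accuracy"], label="validation")
    ax2.set_title("accuracy")
    ax2.legend()
    plt.show()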

In the loss plot, both training and validation loss decrease sharply before leveling off. The accuracy plot mirrors this: both training and validation accuracy rise before plateauing.

Conclusion

We’ve successfully trained multiple models to detect sarcasm in news headlines! All of our trained models reached a validation accuracy of 85–86%. But what if we want to do better? Here are some ideas for improving model performance:

  • Tune the hyperparameters of the scikit-learn models by going through the estimators’ documentation
  • Tune the hyperparameters of the multilayer perceptron model using the Keras Tuner as illustrated in this tutorial:

You can also try training a sequence model and comparing its performance to the n-gram models created in this walkthrough!

References

In addition to the ones linked throughout this article, I wouldn’t have been able to complete this project without the help of these awesome examples and tutorials:

[1] Udacity | Intro to TensorFlow for Deep Learning by TensorFlow

[2] TensorFlow | Simple audio recognition: Recognizing keywords

[3] TensorFlow | Overfit and underfit
