Generating News Headlines with Machine Learning
Extra! Extra! Read All About It!
Introduction
We can build classifiers to detect sarcasm in news headlines, but what about generating news headlines from scratch?
My goal is to build a machine learning model that can do just that.
To achieve this, I’ll use textgenrnn, a Python library built on Keras/TensorFlow that lets you easily train a text-generating recurrent neural network of any size and complexity on any text dataset. You can read more about textgenrnn on its GitHub page: https://github.com/minimaxir/textgenrnn
Data Source
I’ll be using the “A Million News Headlines” dataset from Kaggle. It contains the publish date and headline text of more than a million headlines published by the Australian Broadcasting Corporation, packaged as a single CSV file.
Setup
First things first, I pip install and import the necessary Python libraries.
import os
import pandas as pd
import numpy as np
from textgenrnn import textgenrnn
import time
Now we’ll load our data into a Pandas DataFrame:
filename = "abcnews-date-text.csv"
with open(os.path.join(".", "data", filename), "rb") as f:
    df = pd.read_csv(f)
Let’s also set up a few constants for our machine learning model:
NUM_EPOCHS = 1
PERCENT_TRAINING = 0.5
Exploratory Data Analysis
To get a better sense of the dataset we’re working with, let’s dive into some data analysis!
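The analysis code isn’t shown here, but a minimal sketch of the kind of checks involved (assuming the df loaded above) might be:
# Column types, row count and null counts
df.info()
# Unique publish dates and headlines, to spot duplicates
print(df.nunique())
# A quick peek at the first few rows
print(df.head())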
Results:
We can see that we have over a million data entries and no null values. Comparing the number of unique headlines to the total row count also tells us there are some duplicate headlines we need to drop.
In addition to dropping the duplicates, we’ll also discard half of our data. We do this because training the model takes quite a few hours, and the more data we keep, the longer training takes.
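A rough sketch of that cleanup, assuming we simply drop duplicate headlines and then keep a random half of what remains (using the PERCENT_TRAINING constant defined earlier):
# Remove duplicate headlines
df = df.drop_duplicates(subset="headline_text")
# Keep a random 50% sample of the remaining rows (random_state is arbitrary)
df = df.sample(frac=PERCENT_TRAINING, random_state=42)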
Training the Model
We’re now ready to train our machine learning model!
We’ll start by converting the headline text into a NumPy array and initializing the model.
arr = df["headline_text"].to_numpy()
model = textgenrnn()
Next, we’ll set the necessary parameters for our model, train it, and save it. Saving the model will allow us to quickly load and reuse the trained weights whenever we need them.
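The exact call isn’t shown here, but a minimal sketch using textgenrnn’s train_on_texts and save methods (the weights filename is just an example) could look like this:
# Train the RNN on the headline array for the configured number of epochs
model.train_on_texts(arr, num_epochs=NUM_EPOCHS)
# Save the trained weights so they can be reloaded later with
# textgenrnn(weights_path="headline_weights.hdf5")
model.save("headline_weights.hdf5")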
Finally, we’ll define a range of temperatures, which control how wacky the generated headlines will be. Temperatures near 0 will produce very mundane headlines based on the most common patterns the neural network sees. The higher the temperature, the more the model will experiment and try to generate something new.
When we specify a prefix, the trained model will build a news headline based on it.
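As a rough sketch (the temperature values below are just examples, and “covid-19” is the prefix used in the results that follow):
# Example temperatures, from conservative to experimental
temperatures = [0.2, 0.5, 1.0]
for temp in temperatures:
    # Headlines with no prefix
    model.generate(n=3, temperature=temp)
    # Headlines that must start with "covid-19"
    model.generate(n=3, temperature=temp, prefix="covid-19")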
Results
Here are some samples of the news headlines generated by our trained model!
With no specified prefix:
With covid-19 as the specified prefix:
From our results, we can clearly see that with lower temperatures there are few differences (if any) between the generated headlines, as the neural network is playing it safe. With higher temperatures, there is much more variety, but the headlines also start to make less sense.
The frequent mentions of Australia and Australian locations also reveal some of the biases present in our dataset, which is expected, since all of our data comes from the Australian Broadcasting Corporation.
Conclusion
We’ve successfully trained a recurrent neural network to generate news headlines!
Here are some ideas for taking this project further:
- Experiment with other machine learning models, such as Markov chains
- Explore other text generation datasets, like movie titles and song names