Generating News Headlines with Machine Learning
Extra! Extra! Read All About It!
Introduction
We can build classifiers to detect sarcasm in news headlines, but what about generating news headlines from scratch?
My goal is to build a machine learning model that can do just that.
To achieve this, I’ll use textgenrnn, a Python library built on Keras/TensorFlow that lets you easily train a text-generating recurrent neural network of any size and complexity on any text dataset. You can read more about textgenrnn on its GitHub page: https://github.com/minimaxir/textgenrnn
Data Source
I’ll be using the “A Million News Headlines” dataset from Kaggle. It contains the publish date and headline text of more than a million headlines published by the Australian Broadcasting Corporation, packaged as a single CSV file.
Setup
First things first, I pip install and import the necessary Python libraries.
import os
import pandas as pd
import numpy as np
from textgenrnn import textgenrnn
import time
Now we’ll load our data into a Pandas DataFrame:
filename = "abcnews-date-text.csv"
with open(os.path.join(".", "data", filename), "rb") as f:
    df = pd.read_csv(f)
Let’s also set up a few constants for our machine learning model:
NUM_EPOCHS = 1
PERCENT_TRAINING = 0.5
Exploratory Data Analysis
To get a better sense of the dataset we’re working with, let’s dive into some data analysis!
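The analysis code isn’t shown here, but a minimal sketch of the kind of checks involved (assuming the df loaded above) might be:
# Column types, row count and null counts
df.info()
# Unique publish dates and headlines, to spot duplicates
print(df.nunique())
# A quick peek at the first few rows
print(df.head())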
Results:
We can see that we have over a million data entries and no null values. Comparing the number of unique headlines to the total row count also tells us there are some duplicate headlines we need to drop.
In addition to dropping the duplicates, we’ll also discard half of our data. We do this because training the model takes quite a few hours, and the more data we keep, the longer training takes.
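A rough sketch of that cleanup, assuming we simply drop duplicate headlines and then keep a random half of what remains (using the PERCENT_TRAINING constant defined earlier):
# Remove duplicate headlines
df = df.drop_duplicates(subset="headline_text")
# Keep a random 50% sample of the remaining rows (random_state is arbitrary)
df = df.sample(frac=PERCENT_TRAINING, random_state=42)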
Training the Model
We’re now ready to train our machine learning model!
We’ll start by converting the headline text into a NumPy array and initializing the model.
arr = df["headline_text"].to_numpy()
model = textgenrnn()
Next, we’ll set the necessary parameters for our model, train it, and save it. Saving the model will allow us to quickly load and reuse the trained weights whenever we need them.
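The exact call isn’t shown here, but a minimal sketch using textgenrnn’s train_on_texts and save methods (the weights filename is just an example) could look like this:
# Train the RNN on the headline array for the configured number of epochs
model.train_on_texts(arr, num_epochs=NUM_EPOCHS)
# Save the trained weights so they can be reloaded later with
# textgenrnn(weights_path="headline_weights.hdf5")
model.save("headline_weights.hdf5")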
Finally, we’ll define a range of temperatures, which control how wacky the generated headlines will be. Temperatures near 0 will produce very mundane headlines based on the most common patterns the neural network sees. The higher the temperature, the more the model will experiment and try to generate something new.
When we specify a prefix, the trained model will build a news headline based on it.
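As a rough sketch (the temperature values below are just examples, and “covid-19” is the prefix used in the results that follow):
# Example temperatures, from conservative to experimental
temperatures = [0.2, 0.5, 1.0]
for temp in temperatures:
    # Headlines with no prefix
    model.generate(n=3, temperature=temp)
    # Headlines that must start with "covid-19"
    model.generate(n=3, temperature=temp, prefix="covid-19")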
Results
Here are some samples of the news headlines generated by our trained model!
With no specified prefix:
With covid-19 as the specified prefix:
From our results, we can clearly see that with lower temperatures there are few differences (if any) between the generated headlines, as the neural network is playing it safe. With higher temperatures, there is much more variety, but the headlines also start to make less sense.
The frequent mentions of Australia and Australian locations also reveal some of the biases present in our dataset, which is expected, since all of our data comes from the Australian Broadcasting Corporation.
Conclusion
We’ve successfully trained a recurrent neural network to generate news headlines!
Here are some ideas for taking this project further:
- Experiment with other machine learning models, such as Markov chains
- Explore other text generation datasets, like movie titles and song names