Generating News Headlines with Machine Learning

Extra! Extra! Read All About It!

Code AI Blogs
CodeAI
4 min read · Aug 23, 2021

Photo by Roman Kraft on Unsplash

Introduction

We can build classifiers to detect sarcasm in news headlines, but what about generating news headlines from scratch?

My goal is to build a machine learning model that can do just that.

To achieve this, I’ll use textgenrnn, a Python module built on top of Keras and TensorFlow that lets you train a text-generating recurrent neural network of any size and complexity on any text dataset. You can read more about textgenrnn here:

Data Source

I’ll be using this dataset from Kaggle:

It contains the publish date and headline text of news headlines from the Australian Broadcasting Corporation as a single CSV file.

Setup

First things first, I install the necessary Python libraries with pip and import them.

Now we’ll load our data into a Pandas DataFrame:
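As a rough sketch of this loading step, here is one way it could look. The column names `publish_date` and `headline_text` are assumptions based on the dataset description above, `load_headlines` is a hypothetical helper, and the two rows are invented stand-ins for the real CSV.

```python
import io

import pandas as pd

# Assumed column names: the dataset is described as holding a publish date
# and the headline text, but the exact names may differ in your copy.
DATE_COL = "publish_date"
TEXT_COL = "headline_text"


def load_headlines(source):
    """Read the headlines CSV from a file path or a file-like object."""
    # Read both columns as strings so dates like 20030219 keep their digits.
    return pd.read_csv(source, dtype={DATE_COL: str, TEXT_COL: str})


# Tiny in-memory stand-in for the real file (invented rows, illustration only).
sample_csv = io.StringIO(
    "publish_date,headline_text\n"
    "20200101,example headline one\n"
    "20200102,example headline two\n"
)

df = load_headlines(sample_csv)
print(df.head())
```

With the real dataset, you would pass the path of the downloaded CSV to `load_headlines` instead of the in-memory sample.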

Let’s also set up a few constants for our machine learning model:

Exploratory Data Analysis

To get a better sense of the dataset we’re working with, let’s dive into some data analysis!

Results:

We can see that we have over a million data entries and no null values. Because the number of unique headlines is lower than the number of rows, we also know that the dataset contains duplicate headlines that we need to drop.

In addition to dropping the duplicates, we’ll also discard half of our data. Training the model takes quite a few hours, and the more data we have, the longer training takes.
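The checks and the clean-up described above could be sketched roughly like this. The helper names, the `headline_text` column name, and the choice to keep a random half (rather than, say, the first half) are all assumptions for illustration; the toy DataFrame stands in for the real data.

```python
import pandas as pd


def summarize(df, text_col="headline_text"):
    """Return the row count, null count, and unique count for the text column."""
    return {
        "rows": len(df),
        "nulls": int(df[text_col].isnull().sum()),
        "unique": int(df[text_col].nunique()),
    }


def dedupe_and_halve(df, text_col="headline_text", seed=42):
    """Drop duplicate headlines, then keep a random half of what remains."""
    deduped = df.drop_duplicates(subset=text_col)
    # random_state makes the sampled half reproducible between runs.
    return deduped.sample(frac=0.5, random_state=seed).reset_index(drop=True)


# Toy data: six rows, two of which are duplicates.
toy = pd.DataFrame({"headline_text": ["a", "b", "b", "c", "d", "d"]})

stats = summarize(toy)
print(stats)

half = dedupe_and_halve(toy)
print(len(half))
```

If `unique` is smaller than `rows`, as it is in the toy data, there are duplicates to drop.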

Training the Model

We’re now ready to train our machine learning model!

We’ll start by converting the headline text into a NumPy array and initializing the model.

Next, we’ll set the necessary parameters for our model:

Now we’ll train and save the model using the specified parameters. Saving the model will allow us to quickly load and use the trained model when needed.
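To make the training steps a bit more concrete, here is a minimal sketch. The hyperparameter values and the weights file name are placeholders, not the settings used for this article, and `train_and_save` is a hypothetical helper. It assumes the model object behaves like a textgenrnn instance, with `train_on_texts` and `save` methods.

```python
# Placeholder model settings (assumed values, passed through to textgenrnn's config).
MODEL_CFG = {
    "rnn_size": 128,
    "rnn_layers": 3,
    "rnn_bidirectional": False,
    "max_length": 30,
    "word_level": False,
}

# Placeholder training settings (assumed values).
TRAIN_CFG = {
    "num_epochs": 5,
    "gen_epochs": 1,
    "batch_size": 1024,
    "train_size": 0.8,
    "dropout": 0.0,
    "validation": True,
}

WEIGHTS_PATH = "headlines_weights.hdf5"  # hypothetical file name


def train_and_save(model, texts, weights_path=WEIGHTS_PATH):
    """Train a fresh model on the headlines, then save its weights."""
    # new_model=True asks textgenrnn to build a new network from MODEL_CFG
    # instead of fine-tuning its bundled pretrained weights.
    model.train_on_texts(list(texts), new_model=True, **TRAIN_CFG, **MODEL_CFG)
    model.save(weights_path)
    return weights_path


# Usage with the real library (left commented out because it needs
# TensorFlow and takes a long time to run):
# from textgenrnn import textgenrnn
# textgen = textgenrnn(name="headlines")
# train_and_save(textgen, df["headline_text"].values)
```

Passing the model in as an argument keeps the helper easy to try out with a small subset of headlines before committing to a multi-hour run.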

Finally, we’ll define a range of temperatures. The temperature controls how wacky the generated headlines will be. Temperatures near 0 will produce very mundane headlines based on the most common patterns the neural network sees. The higher the temperature, the more the model will experiment and try to generate something new.
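To see why temperature has this effect, here is a small standalone sketch of the general idea behind temperature sampling: the model's scores are divided by the temperature before being turned into probabilities. The scores below are made up, and textgenrnn's own sampling code may differ in its details.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next characters.
logits = [2.0, 1.0, 0.0]

for t in (0.2, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperatures almost all of the probability lands on the top-scoring candidate, so the model keeps picking the safest option. At higher temperatures the probabilities even out, so less likely candidates get sampled more often.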

When we specify a prefix, the trained model will build a news headline based on it.
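Generating headlines at several temperatures, with or without a prefix, might look roughly like this. The temperature values and the helper name are illustrative assumptions. The sketch relies on textgenrnn's `generate` method accepting `n`, `prefix`, `temperature` (a list in recent versions), and `return_as_list`.

```python
TEMPERATURES = [0.2, 0.5, 1.0]  # assumed range, for illustration only


def generate_by_temperature(model, temperatures=TEMPERATURES, prefix=None, n=3):
    """Generate n headlines at each temperature and return them keyed by temperature."""
    results = {}
    for t in temperatures:
        # A one-element list applies the same temperature to every character.
        results[t] = model.generate(
            n=n,
            prefix=prefix,
            temperature=[t],
            return_as_list=True,
        )
    return results


# Usage with a trained textgenrnn model (commented out because it needs
# the trained weights from the previous step):
# headlines = generate_by_temperature(textgen, prefix="covid-19")
# for t, samples in headlines.items():
#     print(t, samples)
```

Leaving `prefix` as `None` lets the model start each headline on its own.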

Results

Here are some samples of the news headlines generated by our trained model!

With no specified prefix:

With covid-19 as the specified prefix:

From our results, we can clearly see that with lower temperatures, there are few differences (if any) between generated news headlines, as the neural network is playing it safe. With high temperatures, there is much more variety but the headlines also start to make less sense.

From the frequent mentions of Australia and areas within Australia, we can also see some of the biases present in our dataset, which is expected as all of our data is from the Australian Broadcasting Corporation.

Conclusion

We’ve successfully trained a recurrent neural network to generate news headlines!

Here are some ideas for taking this project further:

  • Experiment with other text generation approaches, such as Markov chains
  • Explore other text generation datasets, like movie titles and song names
