A Simple News Article Generator

In a past article, I trained a neural network model to generate news headlines:

https://medium.com/@andreasstckl/creating-news-headlines-with-ai-2d8c5bb76241

This model was a character-level model and used headlines from Reuters news that I had collected as training data. As a character-level model, the system builds up the headline character by character. You can find a good description of such models at https://towardsdatascience.com/character-level-language-model-1439f5dd87fe

In this article, I want to show how a simple word-level model can be created. It can be used to build up a few paragraphs of a news text word by word. The system is trained on Austrian news texts and therefore generates text in German.

The Data

I collected news articles from Austrian online newspapers to train the neural network on.

The news texts are stored in one text file and loaded into a variable:
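A minimal sketch of this step (the file name news.txt is an assumption for illustration):

```python
# Load the collected news texts from a single file into one string.
with open("news.txt", encoding="utf-8") as f:
    text = f.read()
```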

I use the tokenizer from the NLTK package to split the text into a list of tokens. Case is preserved, and punctuation marks are treated as tokens rather than being filtered out.
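This could look roughly as follows; NLTK's word_tokenize has a German mode, which I assume is used here:

```python
import nltk

nltk.download("punkt")  # tokenizer models, only needed once

# Split the raw text into tokens. Case is preserved, and
# punctuation marks become tokens of their own.
tokens = nltk.word_tokenize(text, language="german")
```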

Preprocessing

The tokenizer from Keras (https://keras.io/) is used to transform this list of tokens into a list of integers. The vocabulary size is limited to 10,000, so only the 10,000 most common words are encoded. This is done to save memory and computation time.
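A sketch of this step; setting filters="" and lower=False so that punctuation tokens and casing survive is my assumption:

```python
from keras.preprocessing.text import Tokenizer

# Keep only the 10,000 most common words; rarer words are dropped.
tokenizer = Tokenizer(num_words=10000, filters="", lower=False)
tokenizer.fit_on_texts(tokens)

# texts_to_sequences returns one (possibly empty) list per token; flatten it.
encoded = [i for seq in tokenizer.texts_to_sequences(tokens) for i in seq]
```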

The model uses the last 5 words to predict the next word, so I prepare the training data by splitting the whole list of integers into sequences of length 6. This gives a total of more than 3 million sequences for training.
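Sliding a window of length 6 over the encoded text could look like this:

```python
import numpy as np

SEQ_LEN = 6  # 5 input words plus 1 target word

# Every position in the text yields one training sequence.
sequences = np.array([encoded[i:i + SEQ_LEN]
                      for i in range(len(encoded) - SEQ_LEN + 1)])
```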

The first 5 numbers of each sequence are used as input features and the last one as the target label. This gives the feature matrix X and the label vector y.
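Splitting the sequences then gives the training arrays:

```python
# The first 5 integers are the features, the 6th is the label.
X = sequences[:, :-1]
y = sequences[:, -1]
```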

Training

For building the model, I use Keras with an embedding layer, LSTM layers, and a dense layer with softmax activation to compute the probability distribution over the vocabulary. An introduction to these topics can be found at https://medium.com/@XiwangLi/nlp-from-zero-to-one-for-classification-and-machine-translation-part-one-23221b8ee2ef
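A sketch of such a network; the embedding and LSTM sizes are illustrative assumptions, as the exact dimensions are not stated here:

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

VOCAB_SIZE = 10000  # matches the tokenizer's num_words
INPUT_LEN = 5       # number of words used to predict the next one

# Embedding -> stacked LSTMs -> softmax over the vocabulary.
model = Sequential([
    Embedding(VOCAB_SIZE, 100, input_length=INPUT_LEN),
    LSTM(128, return_sequences=True),
    LSTM(128),
    Dense(VOCAB_SIZE, activation="softmax"),
])
```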

Before fitting the model, I define a generator function that prepares the batches of training data by selecting random rows of features and labels. The labels are converted to one-hot encodings with the Keras helper function "to_categorical", and the batches are passed to the model's fit function.
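Such a generator might look like this (the batch size is an assumption):

```python
from keras.utils import to_categorical

def batch_generator(X, y, batch_size=256):
    # Endlessly yield random batches; labels are one-hot encoded on the
    # fly to avoid holding a huge dense label matrix in memory.
    while True:
        idx = np.random.randint(0, len(X), batch_size)
        yield X[idx], to_categorical(y[idx], num_classes=VOCAB_SIZE)
```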

The model is compiled with "categorical_crossentropy" as the loss function and "Adam" as the optimizer, and it is trained for 50 epochs.
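Put together, compilation and training could look like this (the value of steps_per_epoch is an assumption):

```python
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])

model.fit_generator(batch_generator(X, y, batch_size=256),
                    steps_per_epoch=1000, epochs=50)
```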

Training is done in a Jupyter notebook on AWS SageMaker (https://aws.amazon.com/de/sagemaker/), a fully managed service that covers the machine learning workflow from training a model to making predictions. As hardware, I use an "ml.p3.2xlarge" instance, which is GPU-accelerated with an NVIDIA Tesla V100 (https://www.nvidia.com/en-us/data-center/tesla-v100/).

Generating Text

The function for generating news takes a model and a seed text as input and generates a number of new words. If the seed text is shorter than the model's input length, the sequence is padded with zeros. The new words are sampled at random according to the probability distribution predicted by the model.
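A sketch of such a function; the helper names and the sampling details are mine:

```python
from keras.preprocessing.sequence import pad_sequences

def generate_text(model, tokenizer, seed_text, n_words=50):
    # Map integer indices back to words for decoding the output.
    index_word = {i: w for w, i in tokenizer.word_index.items()}
    result = seed_text.split()
    for _ in range(n_words):
        # Encode the current text and left-pad with zeros to length 5.
        encoded = tokenizer.texts_to_sequences([" ".join(result)])[0]
        encoded = pad_sequences([encoded[-INPUT_LEN:]], maxlen=INPUT_LEN)
        # Sample the next word from the predicted distribution.
        probs = model.predict(encoded)[0].astype("float64")
        next_id = np.random.choice(len(probs), p=probs / probs.sum())
        result.append(index_word.get(next_id, ""))
    return " ".join(result)

# Hypothetical seed text for illustration.
print(generate_text(model, tokenizer, "Die Regierung hat", n_words=50))
```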

Some examples of generated texts:

The texts are far from being correct, let alone good, news articles. More training, more data, and better models are needed. I will cover this in a future post, where transfer learning will be a key step to improve performance.
