Developing an End-to-End NLP Text Generator Application (Part 1)

Kevin MacIver
7 min read · Apr 30, 2020

In an era of information and technology, written news is a vital means of communication. With a high volume of articles published every day, delivering news fast is crucial.

In this series of stories we’ll go through the steps to develop an end-to-end product to help write news articles by suggesting the next words of the text.

This is an ongoing project. Listed below are the steps envisioned for its development:

Part 1: Train NLP models (recurrent and transformer neural networks) on a collection of tens of thousands of news articles to predict the next word of a text.

Part 2: Create a full-stack application using Flask.

Part 3: Containerize the app using Docker and test it locally.

Part 4: Deploy the app on Google Cloud using Kubernetes.

Part 5: Create a CI/CD architecture to continuously train the model with new data.

Architecture for CI/CD using Google Cloud

Generating the Model

Initial Data

We’ll begin our models with the AG’s News Topic Classification Dataset.

AG is a collection of more than 1 million news articles gathered from more than 2,000 news sources. The dataset also classifies each article into one of four topics: World, Sports, Business, and Sci/Tech.

We’ll start by filtering only business articles to feed into our model.

AG news dataset filtered by Business
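The filtering code itself is not embedded here, but a minimal sketch with pandas could look like the following (the file name train.csv is hypothetical, and the snippet assumes the standard AG News CSV layout with no header row and columns class/title/description, where Business is class 3):

import pandas as pd

# Hypothetical file name; the AG News CSVs ship without a header row,
# with columns (class index, title, description).
df = pd.read_csv("train.csv", header=None,
                 names=["class", "title", "description"])

# AG News labels the topics 1=World, 2=Sports, 3=Business, 4=Sci/Tech.
business = df[df["class"] == 3]
print(f"{len(business)} business articles retained")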

We’ll also apply some cleaning techniques to remove news sources and URL links, then save the cleaned lines to a text file.
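The exact cleaning patterns aren’t shown in the article; the sketch below (continuing from the snippet above) illustrates the idea with a couple of assumed regular expressions:

import re

def clean_line(line):
    # Strip URL links.
    line = re.sub(r"http\S+|www\.\S+", " ", line)
    # Strip news-source tags such as "(Reuters)" or "(AP)".
    line = re.sub(r"\((Reuters|AP|AFP)\)", " ", line)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", line).strip()

# Combine title and description into one line per article
# and save the cleaned lines to a text file.
with open("business_news.txt", "w") as f:
    for text in business["title"] + ". " + business["description"]:
        f.write(clean_line(text) + "\n")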

Generating a Vocabulary

Our communication is based on a vocabulary: a set of words whose meanings we know and can relate to one another. The model likewise needs a vocabulary on which it can focus and learn the patterns of how the known words relate.

In order to achieve this, we begin by tokenizing the text.

text_to_word_sequence function from keras.preprocessing.text
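That code is embedded as a gist in the original story; a minimal reconstruction, reading the cleaned file from the previous step, might look like this:

from keras.preprocessing.text import text_to_word_sequence

# Read the cleaned corpus, one headline per line.
with open("business_news.txt") as f:
    lines = f.read().splitlines()

# text_to_word_sequence lowercases and strips punctuation by default.
tokens = [word for line in lines for word in text_to_word_sequence(line)]
print(len(tokens))  # the article reports 1,186,071 tokens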

The data results in 1,186,071 tokens. Now we need to reduce those tokens to the set of words we want the model to learn (the vocabulary).

To do that we create a dictionary from the tokens and count the number of times each word appeared in the data.

Finally, we need to retain only the words that appear with a certain frequency.

For this data we’ll retain the 20% most frequent words, giving a total of 5,566 words.

Since the model will be trained on the most common words, a wildcard token will be added to the vocabulary to handle words that fall outside it.
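A possible implementation of these three steps (counting, keeping the top 20%, and appending a wildcard token; the token spelling "<unk>" is an assumption, as the article doesn’t show it):

from collections import Counter

# Count how many times each token appears in the data.
word_counts = Counter(tokens)

# Retain only the 20% most frequent words.
n_keep = int(len(word_counts) * 0.20)
vocab = [word for word, _ in word_counts.most_common(n_keep)]

# Add a wildcard token to stand in for any out-of-vocabulary word.
WILDCARD = "<unk>"  # the actual token used isn't shown in the article
vocab.append(WILDCARD)
word_to_id = {word: i for i, word in enumerate(vocab)}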

Word Embedding

Many techniques have been created to help machines better understand and process language, with much research focusing on different problems such as sentiment analysis, text summarization, language translation, and text generation.

A fundamental part of applying these techniques is word embedding. I encourage you to check this story from Shay Palachy, which summarizes most of the techniques used in this field:

For this project we’ll leverage the GloVe project to obtain our vector representations of words.

Loading GloVe Embedding

Although the GloVe model contains 400,000 words, we need to consider that some out-of-vocabulary (OOV) words may still appear. For those words we will simply use a zero vector.

Assigning embedding to vocabulary
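A sketch of both steps, assuming the 100-dimensional glove.6B.100d.txt file; vocabulary words missing from GloVe (including the wildcard) keep an all-zero row:

import numpy as np

EMBED_DIM = 100  # assumes the 100-dimensional glove.6B vectors

# Parse the GloVe text file into a {word: vector} lookup.
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# One embedding row per vocabulary word; OOV words stay all-zero.
embedding_matrix = np.zeros((len(vocab), EMBED_DIM))
for word, i in word_to_id.items():
    if word in glove:
        embedding_matrix[i] = glove[word]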

Text Generation Models

There are many approaches to text generation, applying different techniques, embeddings, and architectures.

For this project we’ll focus on applying recurrent neural networks to calculate the probability of a word appearing given a certain sequence of previous words.

Recurrent Neural Network Approach

Recurrent neural networks are well suited to learning sequential data, which makes them useful for this type of application.

The limitation of this approach is that the model isn’t intrinsically learning the language: it doesn’t learn, for example, that the word “good” in the image above refers to the word “movie”.

Other architectures such as transformers apply a technique called “attention”, through which the relations between words within a text can be learned. OpenAI’s GPT-2 is a famous model that has gathered lots of attention (no pun intended) in this field.

Check out the following links if you want to learn more:

Okay, getting back to our recurrent model. We need to feed it a certain sequence of words as input and train it to predict the following word. We’ll use a sequence of 5 words as input in this project.

We begin by reading the dataset and dividing the inputs as follows:

Generating Input and Labels for the Model

We also need to be careful to generate the inputs for each line of the text individually. Otherwise the inputs and labels can be disconnected in meaning if they span two different sentences or, in this case, headlines.

Before generating the inputs we also split the text lines into training and validation sets.
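A reconstruction of this windowing step, reusing the vocabulary objects from the earlier snippets (the 90/10 split ratio is an assumption):

from keras.preprocessing.text import text_to_word_sequence

SEQ_LEN = 5  # five input words predict the sixth

def make_examples(text_lines):
    # Slide a 5-word window over each line independently, so no
    # input/label pair straddles two different headlines.
    inputs, labels = [], []
    for line in text_lines:
        words = [w if w in word_to_id else WILDCARD
                 for w in text_to_word_sequence(line)]
        for i in range(len(words) - SEQ_LEN):
            inputs.append(words[i:i + SEQ_LEN])
            labels.append(word_to_id[words[i + SEQ_LEN]])
    return inputs, labels

# Split the raw lines *before* windowing, then build each set.
split = int(len(lines) * 0.9)  # 90/10 split is an assumption
train_x, train_y = make_examples(lines[:split])
val_x, val_y = make_examples(lines[split:])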

With a list of input words and labels it’s time to create batch generators for training.

Batch Generator

The generator randomly picks a section of the input list and transforms it into an array of the respective embeddings.

In this example we end up with batches of the following shape for input and labels respectively:

Inputs and Label Batch Shape

The same process is applied to create validation batches for the model.
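A sketch of such a generator, yielding input batches of embeddings alongside integer label batches:

import random
import numpy as np

def batch_generator(inputs, labels, batch_size=32):
    # Endlessly yield random batches of (embeddings, integer labels).
    while True:
        idx = random.sample(range(len(inputs)), batch_size)
        x = np.zeros((batch_size, SEQ_LEN, EMBED_DIM))
        y = np.zeros(batch_size, dtype="int32")
        for b, i in enumerate(idx):
            for t, word in enumerate(inputs[i]):
                x[b, t] = embedding_matrix[word_to_id[word]]
            y[b] = labels[i]
        yield x, y  # shapes: (32, 5, 100) and (32,)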

Model Architecture

The following model architecture was applied:

The model uses a bidirectional LSTM stacked with a unidirectional LSTM.

The concept behind this architecture is that the bidirectional LSTM can better capture dependencies in the input sequence, which are then fed to the unidirectional LSTM and a dense layer to predict the probabilities of the next word.
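In Keras this stacking could be expressed as follows. The layer sizes are assumptions, since the article doesn’t list them; the activation of the bidirectional layer is the hyperparameter varied in the experiments below:

from keras.models import Sequential
from keras.layers import Bidirectional, LSTM, Dense

model = Sequential([
    # Bidirectional LSTM reads the 5-word window in both directions.
    Bidirectional(LSTM(128, activation="tanh", return_sequences=True),
                  input_shape=(SEQ_LEN, EMBED_DIM)),
    # Unidirectional LSTM condenses the sequence to a single vector.
    LSTM(128),
    # Dense softmax outputs a probability for every vocabulary word.
    Dense(len(vocab), activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")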

Model Evaluation

For model training, the loss function used was sparse categorical crossentropy, which works the same way as categorical crossentropy but saves memory and computation by representing each class as a single integer rather than a one-hot encoded vector.
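Concretely, each label is stored as one integer index instead of a one-hot vector the size of the vocabulary:

import numpy as np

# Sparse label: a single integer per example.
y_sparse = np.array([42])          # index of the true next word

# Equivalent one-hot label: a full vocabulary-sized vector.
y_onehot = np.zeros((1, len(vocab)))
y_onehot[0, 42] = 1.0              # ~5,567 floats encoding the same fact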

To evaluate the model, two metrics were defined. The first checks the percentage of times the label was within the top 50 candidates for the next word. The second evaluates the model’s uncertainty by averaging one minus the probability assigned to the label when it falls within the top 50 candidates.

Function for generating top-50 candidates
Metric function for uncertainty
Metric for how often the label is within the top-50 candidates
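The metric code is embedded as gists in the original story; one possible implementation as custom Keras metrics (names and details assumed) is:

import tensorflow as tf

def top50_hit_rate(y_true, y_pred):
    # Fraction of examples whose label is among the 50 most
    # probable next-word candidates.
    return tf.keras.metrics.sparse_top_k_categorical_accuracy(
        y_true, y_pred, k=50)

def top50_uncertainty(y_true, y_pred):
    # Average of (1 - probability assigned to the label) over the
    # examples whose label falls inside the top-50 candidates.
    y_true = tf.reshape(tf.cast(y_true, tf.int32), [-1])
    hits = tf.cast(tf.math.in_top_k(y_true, y_pred, k=50), tf.float32)
    idx = tf.stack([tf.range(tf.shape(y_true)[0]), y_true], axis=1)
    label_prob = tf.gather_nd(y_pred, idx)
    return tf.reduce_sum((1.0 - label_prob) * hits) / (
        tf.reduce_sum(hits) + 1e-8)

Both functions can then be passed to model.compile(metrics=[...]).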

A total of 5 experiments were run for 20 epochs each, with a different activation function for the bidirectional LSTM in each run. The results were the following:

From the results above there seems to be a correlation between the two metrics: as uncertainty increases, so does the percentage of times the label is within the top 50 candidates.

The ideal model would have a total uncertainty of 0% and place the label within the candidates 100% of the time.

Based on the results for the different activation functions, “tanh” appears to offer a good balance between uncertainty and predicting the label within the candidates.

To corroborate the assumption that bidirectional LSTMs could perform better than unidirectional LSTMs, both models were run for a total of 56 epochs:

The results show that the bidirectional model presents lower uncertainty and a higher hit rate than the unidirectional model. So, our assumption seems to be correct 😅.

Conclusion of Part 1

Now that we have our first model trained, we will create an application to use it.

In Part 2 we will go through the steps taken to develop a full-stack application using Flask.

Thanks for reading.


Kevin MacIver

Driven for innovation, waiting for the robots uprising..