Text classifier with Keras+TensorFlow using Recurrent Neural Networks

enrique a. · Coinmonks · Jun 22, 2018

Recurrent Neural Networks (RNN) can be used to analyze sequences of text and assign them a label. For instance, in the Keras examples they are used to classify IMDB movie reviews as positive or negative.

In this example we will use an RNN to train a classifier for a problem closely related to the other stories in the “series” about using LSTM (Long Short-Term Memory) networks to automatically generate music lyrics, learning the “style” from a corpus of a particular genre.

However, I also hope this can be seen as a standalone piece of information.

All the code, some text corpora and more documentation can be found in the GitHub page for the ongoing project.

Background and objective of the classifier

The main objective of the lstm_lyrics project is to train a neural network to “learn” the style of the lyrics of a musical genre and then generate text lines in that style. However, the algorithm always tries to add a certain “variability” to avoid falling into infinite text loops or generating the exact same thing given a certain seed.

Unfortunately, after much experimentation, testing many different network architectures, data representations, etc., this “variability” meant that some of the generated text was, well, almost random.

Given this situation, and the fact that I was already working on text classifiers for my day job, I decided to create another network to distinguish real text from the genre’s corpus from randomly generated lines. The idea is that, once this new network is trained, it will be able to pre-filter the worst lines created by the lyrics generator; that is, it will detect the generated lines that are closest to random noise.

Creation of the training set

I have created a utility script to generate the training set that will be used by the new neural network. It can be executed as follows:

python3 utils/generate_classifier_set.py corpora/corpus_banda.txt banda_subset.txt random_banda.txt

I will not explain the script in detail, but the basic idea is that it creates two files. The first is a subset of the original corpus, ignoring all the lines that contain at least one ignored word (the less common words, filtered according to the MIN_WORD_FREQUENCY parameter).

The second file will contain randomly generated text, following these rules (a sketch of this logic appears after the list):

  • Use the same words as the first file, i.e. the total vocabulary of the corpus minus the ignored (uncommon) words.
  • Contain the same number of lines (each line is a training or test example).
  • Words are chosen with the same probability as in the original corpus.
  • Line lengths are chosen with the same probability as in the corpus. This means that if 30% of the lines in the corpus have 5 words, roughly the same percentage of the randomly generated lines will have 5 words, and so on.
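
The actual script is in the repository; a minimal sketch of the same idea (the function name and the threshold below are illustrative, not the project’s real code) could look like this:

import random
from collections import Counter

MIN_WORD_FREQUENCY = 10  # illustrative; in the real script this is a parameter

def build_classifier_set(corpus_lines):
    # Words that appear fewer than MIN_WORD_FREQUENCY times are ignored
    frequency = Counter(word for line in corpus_lines for word in line.split())
    ignored = {word for word, count in frequency.items() if count < MIN_WORD_FREQUENCY}

    # File 1: subset of the corpus, keeping only lines with no ignored words
    subset = [line for line in corpus_lines
              if not any(word in ignored for word in line.split())]

    # Flattened word pool: sampling from it picks each word with its corpus probability
    word_pool = [word for line in subset for word in line.split()]
    # Pool of line lengths: sampling from it reproduces the corpus length distribution
    length_pool = [len(line.split()) for line in subset]

    # File 2: same number of lines, random words, corpus-like line lengths
    random_lines = [' '.join(random.choices(word_pool, k=random.choice(length_pool)))
                    for _ in subset]
    return subset, random_lines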

Given all these rules, both files will be very similar in size. To give a concrete example:

Both files have exactly the same number of lines (126,665 in this case), and roughly the same number of total words (703,435 vs. 705,279) and characters (3,515,098 vs. 3,523,796):

$ python3 generate_random_lines.py corpora/corpus_banda.txt banda_subset.txt random_banda.txt
$ wc banda_subset.txt random_banda.txt
 126665  703435 3515098 banda_subset.txt
 126665  705279 3523796 random_banda.txt

The script to train the classifier will read these two files to create the training/test examples.

Classifier training

If you have followed the series of stories (and I know you have 😊), you already know that a single training example of the lyrics generator consists of a sentence, word by word, and the goal label is a single value: the next word in the corpus. For instance:

>>> sentences[0]
['put', 'a', 'gun', 'against', 'his']
>>> next_words[0]
'head'
>>> sentences[1]
['a', 'gun', 'against', 'his', 'head']
>>> next_words[1]
'pulled'

Given that input, the output of the network for a single example will be the probability of each one of the possible words. So if your vocabulary has 5,000 different words, the output will be a vector of 5,000 values with the probability of each word, all of them summing to 1. For instance, once the network is trained, if you input:

['put', 'a', 'gun', 'against', 'his']

The highest of the 5,000 probabilities will be the one pointing to the word 'head'.

The case of a binary classifier is similar, but even simpler, because the output will be a single float value between 0 and 1 representing how confident the network is that the given sentence is positive, whatever that means in your context. In this case, positive means real text (as opposed to randomly generated).
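
In code, reading both kinds of output looks roughly like this (the model and input names are illustrative; x is assumed to be an already encoded and padded batch of one sentence, and indices_word is the dictionary we build later):

import numpy as np

# Lyrics generator: one probability per vocabulary word, all summing to 1
probabilities = generator_model.predict(x)[0]        # shape: (5000,)
next_word = indices_word[np.argmax(probabilities)]   # e.g. 'head'

# Binary classifier: a single sigmoid value between 0 and 1
score = classifier_model.predict(x)[0][0]
is_real_text = score > 0.5   # 0.5 is just the usual default threshold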

Now onto the code.

The full script is available here; we will review the most important parts.

First we create the training set, reading from the files containing positive and negative examples (we already covered how to create these files). Since not all the sentences have the same length (in words), we pad them with pad_and_split_sentences. We also create the labels y by concatenating 0’s and 1’s of the correct lengths.

good_ones = process_file(sys.argv[1])   # real lines from the corpus subset
bad_ones = process_file(sys.argv[2])    # randomly generated lines

x = pad_and_split_sentences(good_ones + bad_ones)
y = [1]*len(good_ones) + [0]*len(bad_ones)   # 1 = real text, 0 = random
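
pad_and_split_sentences is defined in the full script. A plausible minimal version, consistent with the one-label-per-line concatenation above, would truncate or pad every line to a fixed length (SEQUENCE_LEN and PAD_WORD are assumed constants here; the real helper may differ):

SEQUENCE_LEN = 10     # assumed fixed sentence length
PAD_WORD = '#pad#'    # assumed padding token

def pad_and_split_sentences(lines):
    # One fixed-length word list per input line, so labels stay aligned with examples
    sentences = []
    for line in lines:
        words = line.split()[:SEQUENCE_LEN]                # truncate long lines
        words += [PAD_WORD] * (SEQUENCE_LEN - len(words))  # pad short lines
        sentences.append(words)
    return sentences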

Then we create the word↔index dictionaries. After this process, the variable words will be a sorted list containing all the different unique words in both files:

print("Reading files and getting unique words")
words = set([PAD_WORD])
for line in x:
words = words.union(set(line))
words = sorted(words)
print('Unique words:', len(words))

word_indices = dict((c, i) for i, c in enumerate(words))
indices_word = dict((i, c) for i, c in enumerate(words))

The next step is pretty standard: we shuffle and split the set into 90% training and 10% test.

sentences, labels, sentences_test, labels_test = shuffle_and_split_training_set(x, y)
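
shuffle_and_split_training_set is another helper from the full script; a minimal sketch under the same 90/10 assumption:

import random

def shuffle_and_split_training_set(sentences_original, labels_original, test_percentage=10):
    # Shuffle examples and labels together so each pair stays matched
    combined = list(zip(sentences_original, labels_original))
    random.shuffle(combined)
    shuffled_sentences, shuffled_labels = zip(*combined)

    # First 90% for training, last 10% for test
    cut = len(combined) * (100 - test_percentage) // 100
    return (list(shuffled_sentences[:cut]), list(shuffled_labels[:cut]),
            list(shuffled_sentences[cut:]), list(shuffled_labels[cut:]))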

Then we build the model. You can read about word embeddings in this story; it is basically a way to translate words into vectors (in this case, with 32 dimensions). Then we pass these vectors to a bidirectional Long Short-Term Memory layer (a kind of RNN) with 64 units.

Adding dropout is a way to avoid over-fitting. Finally, the output is a Dense layer with a single unit and sigmoid activation. As we discussed, the output of the classifier will be a float value between 0 and 1; this is given by the sigmoid.

model = Sequential()
model.add(Embedding(len(words), 32))        # words -> 32-dimensional vectors
model.add(Bidirectional(LSTM(64)))          # bidirectional LSTM with 64 units
model.add(Dropout(dropout))                 # dropout against over-fitting
model.add(Dense(1, activation='sigmoid'))   # single sigmoid output in [0, 1]

We then compile the model with binary_crossentropy loss and the adam optimizer (which can be changed to experiment). To fit the model we use a data generator, which I explained in the first story of the series. The basic idea is that examples and labels are fed to the model in small batches, instead of sending them all in one shot. This is mainly used when your training set does not fit into memory, or when you want to do data augmentation at execution time.

model.compile(loss='binary_crossentropy',
              optimizer="adam",
              metrics=['accuracy'])
print(model.summary())

model.fit_generator(generator(sentences, labels, BATCH_SIZE),
                    steps_per_epoch=int(len(sentences)/BATCH_SIZE) + 1,
                    epochs=10,
                    callbacks=callbacks_list,
                    validation_data=generator(sentences_test, labels_test, BATCH_SIZE),
                    validation_steps=int(len(sentences_test)/BATCH_SIZE) + 1)
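
The generator itself was explained in the first story; a minimal batch generator consistent with the Embedding input (integer word indices), reusing word_indices and the assumed SEQUENCE_LEN, could look like this:

import numpy as np

def generator(sentence_list, labels_list, batch_size):
    # Yield (x, y) batches forever, which is what fit_generator expects
    index = 0
    while True:
        x = np.zeros((batch_size, SEQUENCE_LEN), dtype=np.int32)
        y = np.zeros((batch_size,), dtype=np.int32)
        for i in range(batch_size):
            sentence = sentence_list[index % len(sentence_list)]
            for t, word in enumerate(sentence):
                x[i, t] = word_indices[word]   # each word becomes its integer index
            y[i] = labels_list[index % len(labels_list)]
            index += 1
        yield x, y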

At the very end, we call a function that applies the final weights of the network to the test set and obtains a kind of confusion matrix: the number of True Positives, True Negatives, False Positives and False Negatives. If the last argument is True, it will also print all the False Negative and False Positive examples, so you can visualize what kind of errors the network committed.

confusion_matrix(sentences_test, labels_test, True)
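
This function is part of the full script; a sketch of what it plausibly does (the encoding, the 0.5 threshold and the printing format are assumptions on my part):

def confusion_matrix(sentences_test, labels_test, print_errors=False):
    tp = tn = fp = fn = 0
    for sentence, label in zip(sentences_test, labels_test):
        # Encode the sentence exactly as the generator does
        x = np.zeros((1, SEQUENCE_LEN), dtype=np.int32)
        for t, word in enumerate(sentence):
            x[0, t] = word_indices[word]
        predicted = int(model.predict(x)[0][0] > 0.5)

        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
            if print_errors:
                print('False positive:', ' '.join(sentence))
        else:
            fn += 1
            if print_errors:
                print('False negative:', ' '.join(sentence))

    print('TP: %d, TN: %d, FP: %d, FN: %d' % (tp, tn, fp, fn))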

To start the training, we execute the following command:

$ python3 classifier_train.py <positive_examples_file> <negative_examples_file>

Conclusion

With the proposed case (random text vs. lyrics from a corpus), the accuracy after 10 epochs is around 98% (acc: 0.9841) on the training set and 94.3% (val_acc: 0.9431) on the test (validation) set.

This is very good considering there are cases of ambiguity on both sides: 1. With (bad) luck, several random lines will look a lot like real lyrics. 2. There are cases where a real lyric taken from a corpus looks a lot like random noise, especially in Mexican banda or reggaeton ¯\_(ツ)_/¯.

Extra

Just as an experiment, I tried to run the same classifier training, but with the labels scrambled: using the same examples, but randomly changing their “positive” or “negative” labels. What will happen with the training set? What about the test/validation set every epoch?
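
Reproducing the scramble is a one-line change before the split (assuming the y list created earlier; random.shuffle(y) would have a similar effect):

import random

# Re-assign every label at random, breaking any real signal in the data
y = [random.randint(0, 1) for _ in y]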

The network is able to fit the training set to more than 70% accuracy (acc: 0.7148), but since the labels are random, there is no real relationship between what is marked as “positive” and what is marked as “negative”. We can verify this fitting is indeed pointless because the validation accuracy (val_acc) is never more than 50%. As expected, the “trained” network is not able to generalize to examples never seen during the training phase.
