Word-level LSTM text generator. Creating automatic song lyrics with Neural Networks.

enrique a.
Published in Coinmonks
Jun 4, 2018


I started talking about this project in a non-technical chat about the analysis I made of a corpus of 5,000 lyrics (more than 5 million characters) of Mexican banda music (in Spanish).

I decided to write this story/tutorial in English so it can naturally reach a larger audience.

The complete code and corpus can be found in this GitHub repository.

UPDATE: The second part is available here, where I explain training with word embeddings. “Word embeddings provide a dense representation of words and their relative meanings.”

Quick background

Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). An RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. Source: Wikipedia.

RNNs can be used to make predictions, or to learn from sequential data and generate similar data.

Like many instances of text generation online, I took inspiration from the keras-team example and from this project that generates Trump-like tweets.

The main difference is that my text generator works at the word level instead of the character level, as the previous examples do. Besides, since the training set does not fit into memory (especially if sent to the GPU), I needed to create a data generator for the Keras fit and evaluate functions.

I have to admit that I worked on my version of this project before doing a quick Google search. When I finally did, I found this project, which is also based on the same script and hence is somewhat similar to my version.

Explanation of the algorithm

The idea is to train the RNN with many sequences of words and the target next_word. As a simplified example, if each sentence is a list of five words, then the target is a list of only one element, indicating which is the following word in the original text:

>>> sentences[0]
['put', 'a', 'gun', 'against', 'his']
>>> next_words[0]
'head'
>>> sentences[1]
['a', 'gun', 'against', 'his', 'head']
>>> next_words[1]
'pulled'

We don't actually send the strings themselves, but a vectorized representation of each word against a dictionary of possible words (more on that later). The idea is that after many epochs the RNN will learn "the style" in which the corpus is written, adjusting the weights of the network to predict the next word given the N previous words.
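To make "vectorized representation" concrete, here is a minimal sketch of how a single sequence could be one-hot encoded against the word dictionary. It is illustrative only: the names word_indices and SEQUENCE_LEN match the ones used later in the script, but the actual training code builds whole batches this way inside a generator.

import numpy as np

def vectorize_sequence(sentence, word_indices, sequence_len, vocab_size):
    # One row per position in the sequence, one column per word in the vocabulary
    x = np.zeros((1, sequence_len, vocab_size), dtype=bool)
    for t, w in enumerate(sentence):
        x[0, t, word_indices[w]] = 1  # mark the index of word w at position t
    return x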

The text corpus

As I explained in the other story, the corpus contains more than 5 million characters in more than 1 million words. Getting the text is just the beginning of the problem, because, as in any other machine learning project, it was necessary to analyze, clean and pre-process this data.

I won't go into details now (probably in another story), but to say the least, the data was dirty: thousands of typos, misspellings, slang terms, incorrect punctuation, Spanglish, etc.

Now to the algorithm.

Reading the corpus, split into words

The first step is to read the corpus and split it into words.

import io
import sys

corpus = sys.argv[1]  # first command line argument
with io.open(corpus, encoding='utf-8') as f:
    text = f.read().lower().replace('\n', ' \n ')
print('Corpus length in characters:', len(text))

text_in_words = [w for w in text.split(' ') if w.strip() != '' or w == '\n']
print('Corpus length in words:', len(text_in_words))

Note the call to .replace('\n', ' \n '); this is because we want the newline character to be treated as a word. The idea behind this is that we also leave to the network the decision of when to start a new line (after several words). After this, text_in_words is a big array containing the whole corpus, word by word.

>>> text_in_words[3000:3005]
['ella', 'era', 'como', 'estar', '\n']

Getting word frequencies

In character-level text generators, you may end up with 30–50 different dimensions, one for each of the different characters. In a word-level generator like this one, you will have one dimension for each of the different words, which can turn out to be in the tens of thousands (especially in a corpus as dirty as this one).

To avoid such a large number of dimensions, we calculate the frequency of each word and use this information to filter out uncommon words, reducing the dimensionality and hence the memory and time needed to train the network.

# Calculate word frequency
word_freq = {}
for word in text_in_words:
    word_freq[word] = word_freq.get(word, 0) + 1

ignored_words = set()
for k, v in word_freq.items():
    if word_freq[k] < MIN_WORD_FREQUENCY:
        ignored_words.add(k)

words = set(text_in_words)
print('Unique words before ignoring:', len(words))
print('Ignoring words with frequency <', MIN_WORD_FREQUENCY)
words = sorted(set(words) - ignored_words)
print('Unique words after ignoring:', len(words))

word_indices = dict((c, i) for i, c in enumerate(words))
indices_word = dict((i, c) for i, c in enumerate(words))

The variable MIN_WORD_FREQUENCY is a parameter that indicates the minimum number of appearances a word needs in order to "make the cut" and be in the final dictionary of words.

In other words, if we set MIN_WORD_FREQUENCY to 10, we will only consider words that appear 10 or more times in the corpus. For debugging purposes, we print how big the original dictionary was and how big it is after cutting the uncommon words.

Then we create the dictionaries to translate from word to index and from index to word. This is just like the keras-team example.

Creating and filtering the sequences

Remember that at this point we have text_in_words, an array containing the whole corpus word by word. We need to create sequences of SEQUENCE_LEN words (another parameter that can be picked by hand) and store them in sentences, while storing the next word at the same index in next_words.

But there is a problem: in text_in_words we still have many of the words to be ignored. We cannot just go ahead and remove these words, because we would be breaking the language and leaving incoherent sentences. That's why we need to validate each possible sequence+next_word: it should be ignored if it contains at least one of the ignored words.

# cut the text in semi-redundant sequences of SEQUENCE_LEN words
STEP = 1
sentences = []
next_words = []
ignored = 0
for i in range(0, len(text_in_words) - SEQUENCE_LEN, STEP):
    # Only add sequences where no word is in ignored_words
    if len(set(text_in_words[i: i + SEQUENCE_LEN + 1]).intersection(ignored_words)) == 0:
        sentences.append(text_in_words[i: i + SEQUENCE_LEN])
        next_words.append(text_in_words[i + SEQUENCE_LEN])
    else:
        ignored = ignored + 1
print('Ignored sequences:', ignored)
print('Remaining sequences:', len(sentences))

Shuffle and split training set

The next step is standard: we shuffle the data and split it into a training set and a test set (98%–2% by default).

sentences, next_words, sentences_test, next_words_test = shuffle_and_split_training_set(sentences, next_words)
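The helper shuffle_and_split_training_set is defined in the repository; the following is only a minimal sketch of what it does, assuming the 98%/2% split mentioned above (the exact implementation there may differ):

def shuffle_and_split_training_set(sentences_original, next_original, percentage_test=2):
    # Shuffle the sequences and their next words with the same permutation
    print('Shuffling sentences')
    perm = np.random.permutation(len(sentences_original))
    sentences_shuffled = [sentences_original[i] for i in perm]
    next_shuffled = [next_original[i] for i in perm]
    print('Shuffling finished')

    # Keep the last percentage_test% of the examples as the test set
    cut_index = int(len(sentences_shuffled) * (1 - percentage_test / 100.0))
    x_train, x_test = sentences_shuffled[:cut_index], sentences_shuffled[cut_index:]
    y_train, y_test = next_shuffled[:cut_index], next_shuffled[cut_index:]

    print('Size of training set = %d' % len(x_train))
    print('Size of test set = %d' % len(x_test))
    return x_train, y_train, x_test, y_test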

Building the model

Now we build the RNN model. In this example I used a two-level stacked set of bidirectional LSTM units.

model = Sequential()
model.add(Bidirectional(LSTM(128), input_shape=(SEQUENCE_LEN, len(words))))
if dropout > 0:
    model.add(Dropout(dropout))
model.add(Dense(len(words)))
model.add(Activation('softmax'))

There are several arbitrary decisions in this architecture; I haven't had the time to cross-validate different sizes, types of units, etc.

These are some discussions about the possibilities of the architecture:

  • Bidirectional vs. regular LSTM: here.
  • Dropout, a regularization technique to prevent over-fitting: here.
  • Number of LSTM units (I suspect 256 may be too much): discussion here.

Depending on the boolean parameter SIMPLE_MODEL, we create either a one-layer or a two-layer model. A single layer of regular LSTM units should suffice to give reasonably good results, as in the keras-team example:

model.add(LSTM(128, input_shape=(maxlen, len(chars))))
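A sketch of how the SIMPLE_MODEL switch could select between the two architectures follows; the get_model wrapper, layer sizes and the return_sequences detail are assumptions for illustration, not necessarily the repository's exact code.

from keras.models import Sequential
from keras.layers import LSTM, Bidirectional, Dense, Dropout, Activation

def get_model(dropout=0.2):
    model = Sequential()
    if SIMPLE_MODEL:
        # One regular LSTM layer, as in the keras-team character-level example
        model.add(LSTM(128, input_shape=(SEQUENCE_LEN, len(words))))
    else:
        # Two stacked bidirectional LSTM layers; the first must return sequences
        model.add(Bidirectional(LSTM(128, return_sequences=True),
                                input_shape=(SEQUENCE_LEN, len(words))))
        model.add(Bidirectional(LSTM(128)))
    if dropout > 0:
        model.add(Dropout(dropout))
    model.add(Dense(len(words)))
    model.add(Activation('softmax'))
    return model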

Data generator

Without thinking too much, in my first tries I wanted to use the same model.fit strategy as the character-level examples, that is, sending the whole training set to the model at once. This quickly led me to out-of-memory errors.

After a quick analysis I found the reason. To vectorize the sequences into the training set (x, y), we need these numpy arrays:

x = np.zeros((len(sentences), SEQUENCE_LEN, len(words)), dtype=np.bool)
y = np.zeros((len(sentences), len(words)), dtype=np.bool)

Without word filtering, I had roughly 1 million sentences (len(sentences) = 1000000), SEQUENCE_LEN = 10 and 40,000 different words (len(words) = 40000). With these numbers, x had 400,000,000,000 elements(!). Considering that a numpy bool takes 1 byte, this gives approximately 400 GB of memory(!).
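For concreteness, the back-of-the-envelope arithmetic (assuming 1 byte per numpy bool element) looks like this:

>>> elements = 1000000 * 10 * 40000   # len(sentences) * SEQUENCE_LEN * len(words)
>>> elements
400000000000
>>> elements / 1e9                    # at 1 byte per element, roughly 400 GB
400.0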

Hence the need for data generators. With a data generator, you feed the model chunks of the training set, one per step, instead of feeding everything at once.

def generator(sentence_list, next_word_list, batch_size):
    index = 0
    while True:
        x = np.zeros((batch_size, SEQUENCE_LEN, len(words)), dtype=np.bool)
        y = np.zeros((batch_size, len(words)), dtype=np.bool)
        for i in range(batch_size):
            for t, w in enumerate(sentence_list[index]):
                x[i, t, word_indices[w]] = 1
            y[i, word_indices[next_word_list[index]]] = 1

            index = index + 1
            if index == len(sentence_list):
                index = 0
        yield x, y

The generator function receives the list of sentences, the list of next_words and the batch size. It then yields two numpy arrays with batch_size examples each. We use the index variable to keep track of the examples we have already returned; of course, it needs to be reset to 0 when we reach the end of the lists. The same generator can be used for both training and evaluation (just passing a different sentence_list and next_word_list).
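As a quick sanity check, you can pull a single batch from the generator and inspect its shapes. With SEQUENCE_LEN = 10, the 6,605-word vocabulary of the example run below, and a hypothetical BATCH_SIZE of 32, it would look like this:

>>> x, y = next(generator(sentences, next_words, 32))
>>> x.shape
(32, 10, 6605)
>>> y.shape
(32, 6605)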

Finishing the model

The functions sample and on_epoch_end are basically unchanged from the keras-team example. However, when setting up the training I added several Keras callbacks.
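For reference, the temperature-based sample helper in that example looks roughly like this (reproduced as a sketch of the keras-team code, not necessarily this repository's exact version):

def sample(preds, temperature=1.0):
    # Sample an index from a probability array, re-weighted by temperature
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

The callbacks are set up like this: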

file_path = "./checkpoints/LSTM_LYRICS-epoch{epoch:03d}-words%d-sequence%d-minfreq%d-loss{loss:.4f}-acc{acc:.4f}-val_loss{val_loss:.4f}-val_acc{val_acc:.4f}" % (
len(words),
SEQUENCE_LEN,
MIN_WORD_FREQUENCY
)
checkpoint = ModelCheckpoint(file_path, monitor='val_acc', save_best_only=True)
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
early_stopping = EarlyStopping(monitor='val_acc', patience=5)
callbacks_list = [checkpoint, print_callback, early_stopping]

The ModelCheckpoint saves the weights every time the validation accuracy improves, the LambdaCallback prints example generated text at the end of each epoch, and the EarlyStopping halts the training if there is no gain in validation accuracy for 5 epochs.

Training the model

Finally, we call model.fit_generator (instead of model.fit) with the data generator, the callbacks and the number of epochs. We also pass another generator with the test data, so the model is evaluated at the end of every epoch.

model.fit_generator(generator(sentences, next_words, BATCH_SIZE),
                    steps_per_epoch=int(len(sentences)/BATCH_SIZE) + 1,
                    epochs=100,
                    callbacks=callbacks_list,
                    validation_data=generator(sentences_test, next_words_test, BATCH_SIZE),
                    validation_steps=int(len(sentences_test)/BATCH_SIZE) + 1)

RNNs can be hard to train. Even with a fairly powerful GPU (GeForce GTX 1070 Ti), each epoch takes more than one hour with the stacked LSTM architecture.

Executing the training

To start the training, run the following (naturally, you can run it with your own corpus):

git clone https://github.com/enriqueav/lstm_lyrics.git
cd lstm_lyrics
python3 lstm_train.py corpora/corpus_banda.txt examples.txt

UPDATE: You can also refer to the word embedding version.

After that, the script will print information about the current training set, preprocessing, etc. The example corpus (Mexican “banda” music) contains more than 5 million characters in more than 1 million words.

Corpus length in characters: 5502159
Corpus length in words: 1066242

Originally there are more than 35,000 different words in the corpus. After filtering the words with a frequency less than 10 (MIN_WORD_FREQUENCY = 10), there are only 6,605.

Unique words before ignoring: 36990
Ignoring words with frequency < 10
Unique words after ignoring: 6605

As the STEP is 1, originally there are approximately 1 million different sequences. However, since we already ignored the 30,000 least frequent words, we also need to ignore the sequences that contain at least one of these ignored words. After this cut, we are left with approximately 537,000 valid sequences.

Ignored sequences: 529230
Remaining sequences: 537002
Shuffling sentences
Shuffling finished

Finally, we split these shuffled ~537,000 sequences into a 98% training set and a 2% test set.

Size of training set = 526261
Size of test set = 10741

We then build the model and start the training:

Build model...
Epoch 1/100
...

Monitor the results

Just like in the character-level generators, at the end of each epoch several example sentences are written to the examples text file, with the seed picked randomly from the original corpus.

By epoch 20 the accuracy on the training set will be around 90%; on the test set we will not see numbers this high, which is normal. Remember, we do not aim for human-level accuracy, only to learn "the style" and generate somewhat coherent lyrics.

Examples

Unfortunately, this will make more sense if you know some Spanish, and especially if you are familiar with the Mexican banda style.

Generating with seed: “de mi porque estoy comprometido la que se queda la quiero y la que se”

de mi porque estoy comprometido la que se queda la quiero y la que se va la olvido con el mundo de mi vida siempre tu te marchas el amor con mis caricias fuerza que he sido tan caminar el destino en la mafia la noche y fue el demás me puse a ver a todo el mundo y ahi que te ame decir que

Generating with seed: “tus penas o si alguna vez alguien te ha lastimado si tu corazón por el”

tus penas o si alguna vez alguien te ha lastimado si tu corazón por el momento es libre o ya está ocupado porque el mío creo que a asi de hoy alguien todo me ha robado a nadie mujer no hay alguien me la estoy pero no me estoy perdiendo se tiene de no te lo que a hacer no se te vaya a olvidar

Generating with seed: “mis brazos me muero de ganas por volverte a besar en mis noches despierto gritando”

mis brazos me muero de ganas por volverte a besar en mis noches despierto gritando tu nombre y me muero de miedo al pensar que a otro hombre le estaras en la vida tan bella y lo que alguna quise tener que me hubieras pasado nada era un sueño no se diga estas flores del norte estas palabras y muero por eso es tan cierto

Next steps

For now I have only included the training part of the project. I will expand this (or create another story) to cover the actual use of the trained model to generate lyrics from a seed. I will also upload the weights of already-trained networks.

UPDATE June 15, 2018: Changed to include the use of newline as a separate word, and the sending of validation data on fit_generator.

UPDATE January 21, 2019: Add link to the second part of the story.
