Automatic song lyrics generation with Word Embeddings

enrique a.
3 min read · Jun 19, 2018


This is a continuation of the stories:

Word-level LSTM text generator. Creating automatic song lyrics with Neural Networks.

And (in Spanish):

Auto-perreo. Generando letras de reggaetón con redes neuronales.

I have spent some time exploring ways to improve the word-level text generator, which creates music lyrics by learning the “style” of a corpus from a given musical genre. One of the first improvements I found is the use of an Embedding layer at the beginning of the RNN model.

From Keras documentation, embedding layers:

Turns positive integers (indexes) into dense vectors of fixed size. eg. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]

In other words:

Word embeddings provide a dense representation of words and their relative meanings.

They are an improvement over the sparse representations used in simpler bag-of-words models.

Word embeddings can be learned from text data and reused among projects. They can also be learned as part of fitting a neural network on text data.

A trained embedding layer maps each word in the vocabulary to a vector, in such a way that words with similar usage end up close to each other in the vector space. This means, for instance, that words representing colors (“white”, “red”, “yellow”, etc.) will be close to one another.

Embedding layers can be trained from scratch, or it is possible to load pre-trained vectors, like in this gist.
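For illustration, both options look roughly like this in Keras (vocabulary_size, EMBEDDING_DIM and pretrained_matrix are placeholder names, not taken from the actual script):

from keras.layers import Embedding

# trained from scratch together with the rest of the network
embedding_layer = Embedding(input_dim=vocabulary_size, output_dim=EMBEDDING_DIM)

# initialized with pre-trained vectors (e.g. word2vec or GloVe) and kept frozen
embedding_layer = Embedding(input_dim=vocabulary_size, output_dim=EMBEDDING_DIM,
                            weights=[pretrained_matrix], trainable=False)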

I have put the training script for this version in a separate Python file inside the same repository: https://github.com/enriqueav/lstm_lyrics/blob/master/lstm_train_embedding.py

Changes made for this version

The main difference (aside, naturally, from the model itself) is the way the training examples are represented. The previous version uses a one-hot representation of the sentences (vectors of zeros with a single 1 in the column of the corresponding word). This version uses a vector of integers, where each integer is the index of the word in the dictionary.

This means the examples in the previous version are represented as NumPy arrays of booleans:

x = np.zeros((batch_size, SEQUENCE_LEN, len(words)), dtype=np.bool)
y = np.zeros((batch_size, len(words)), dtype=np.bool)
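For reference, each batch of that one-hot version could be filled roughly like this (a sketch: sentences, next_words and word_indices are assumed to be the list of tokenized sentences, the word that follows each one, and the word-to-index dictionary):

for i, sentence in enumerate(sentences[:batch_size]):
    for t, word in enumerate(sentence):
        x[i, t, word_indices[word]] = 1  # set the column of this word to 1
    y[i, word_indices[next_words[i]]] = 1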

And in the new version, the data type and the dimensions are different:

x = np.zeros((batch_size, SEQUENCE_LEN), dtype=np.int32)        
y = np.zeros((batch_size), dtype=np.int32)
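With the integer representation, the same loop (under the same assumptions as above) simply stores the index itself:

for i, sentence in enumerate(sentences[:batch_size]):
    for t, word in enumerate(sentence):
        x[i, t] = word_indices[word]  # the index replaces the whole one-hot vector
    y[i] = word_indices[next_words[i]]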

Besides that, the model is compiled with the loss set to sparse_categorical_crossentropy instead of categorical_crossentropy. In a nutshell, categorical_crossentropy expects one-hot encoded targets, while sparse_categorical_crossentropy is used when the targets are integer indices.

model.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])
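Putting it all together, the model could look roughly like this (a sketch only: the single LSTM layer and the layer sizes are illustrative, not necessarily the ones used in the repository):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
# maps each word index to a dense vector of EMBEDDING_DIM components
model.add(Embedding(input_dim=len(words), output_dim=EMBEDDING_DIM))
model.add(LSTM(128))
# one output unit per word in the vocabulary
model.add(Dense(len(words), activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer="adam", metrics=['accuracy'])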

Results so far

One advantage of this architecture is the training speed. In my initial tests, each epoch takes roughly half the time of the one-hot version, although this of course depends on the hardware, the output dimension of the Embedding layer, the number of units in the LSTM layers, etc.

These are some of the example lyrics generated by a partially trained network on a reggaeton corpus:

Generating with seed:
“con el palo
pues claro te doy con el”

con el palo
pues claro te doy con el palo
pues claro que el pasado se siente
mientras te busque
vives en mi cama
y contigo se que yo me muero hasta el amanecer
yo sé que tu te estas
y no puedo dormir
que no me puedo aguantar
me siento

Generating with seed:
“pone cerca de ti
siempre a mi lado la”

pone cerca de ti
siempre a mi lado la leyenda
solo te pido
quiero que se repita la ocasión
quiero que tú eres la única que me pone a mí
tú eres la única que me hace morir
lo que no es fácil no te tengo
si te pegas no me me pego

Generating with seed:
“otro nivel de música pasarla bien es el destino”

otro nivel de música pasarla bien es el destino
quiero estar contigo
y no te hagas la santa
yo te doy lo que te encanta
te doy por lo que sea
y por si no me lo tengo
que me desespero por ti
y por ti yo estoy pa mi

So far I haven’t noticed a great improvement in the “accuracy” of the resulting lyrics, but it may take a different network architecture to truly show the potential of word embeddings.
