Using AI to generate lyrics

Ivan Liljeqvist
7 min read · Dec 5, 2016

Can we make AI write music and lyrics for us? Can we train machines to express feelings and thoughts in lyrics? These are some of the thoughts I’ve had during the last few days.

Being a software engineer and a data scientist, I had to try for myself. And although my laptop doesn’t fully communicate its feelings yet, it can generate some sick rhymes 😎💸🎤

A big shout out to my good friend Hannes Leskelä for coming up with some suggestions for this article; he is an AI-God and Machine Learning expert.

The plan

The idea is to build a machine learning model that can take a sequence of words as an input and output a continuation of that sequence.

For example, if we give our model the string “Hi my name is Ivan“ we expect lyrics for a song starting with “Hi my name is Ivan“ as the output.

To achieve this goal we need to build a suitable model and find some training data.

Finding the training data

Because we want our model to generate lyrics, we need a large file containing lots of lyrics. The training data decides what kind of output our model will produce. Imagine we wanted to train a model to generate rap lyrics; we would then need to train it on a dataset consisting mostly of lyrics from rap songs. If we wanted to build an AI poet, we would feed our model some Shakespeare.

We can download a dataset containing lyrics from 50 years of the Billboard Year-End Hot 100 (1965–2015). The dataset is a CSV file with several columns. We want to extract the text from every row of the “Content” column and put everything into a single lyrics file.
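Here is a minimal sketch of that extraction step using pandas (the filename is an assumption; adjust it to match your download):

import pandas as pd

# Assumed filename; use whatever the downloaded dataset is actually called.
df = pd.read_csv('billboard_lyrics_1965-2015.csv')

# Join every row of the "Content" column into one big lyrics corpus.
with open('lyrics.txt', 'w') as f:
    f.write('\n'.join(df['Content'].dropna().astype(str)))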

Building the model

One machine learning model that fits this task well is an Artificial Neural Network (ANN). If you are completely new to ANNs I recommend my previous story about ANNs and deep learning.

Text, articles, lyrics, etc. are sequential data. A regular feed-forward ANN therefore won’t do the job because it doesn’t have any kind of internal memory. We’ll need to use a Recurrent Neural Network (RNN): a neural network with an internal short-term memory.

Structure of an LSTM. It is good to know how it works under the hood; however, the Keras framework will help us construct our LSTM.

However, our model has to have a short-term memory AND a long-term memory, because we want our network to remember parts of the sequence it saw many time steps back. We will therefore need a special kind of RNN called Long Short-Term Memory (LSTM).

Let’s take a look at how we would train our network.

How are we going to train?

Our goal is to train a network that can take a string as the input and output a character that we should append to the string in order to create our lyrics.

Here is an example: we give our network “Sexy and I Kn” and it outputs “o”, which we append to our original string to get “Sexy and I Kno”. We repeat this procedure; next time we get “w” back and append it, leaving us with “Sexy and I Know”. Because our model will be trained on a very large dataset, we’ll hopefully avoid overfitting, and the lyrics we get from the network will be completely unique.

We are going to use supervised learning where the network is going to try to fit an array of sequences to an array of characters. Sequences is an array containing strings such as “Sexy and I Kn” while characters will contain the “next character” for every sequence. The element in characters corresponding to “Sexy and I Kn” would be “o”.

How do we do that in Python?

We divide the corpus (the lyrics) into sequences, where each sequence has length sequence_length and the step between consecutive sequences is sequence_step.

This means that the first sequence runs from character zero to character sequence_length, the second sequence runs from sequence_step to (sequence_step + sequence_length), and so on until we have divided the entire corpus into sequences.

next_chars contains the next character for every sequence in sequences.

sequence_length = 40
sequence_step = 3
sequences, next_chars = create_sequences(corpus, sequence_length, sequence_step)
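create_sequences is a small helper whose body isn’t shown here; a minimal sketch matching the description above (the exact implementation may differ) could look like this:

# A possible implementation of the create_sequences helper described above.
def create_sequences(corpus, sequence_length, sequence_step):
    sequences = []
    next_chars = []
    # Slide a window of sequence_length characters over the corpus,
    # moving sequence_step characters at a time.
    for i in range(0, len(corpus) - sequence_length, sequence_step):
        sequences.append(corpus[i:i + sequence_length])
        next_chars.append(corpus[i + sequence_length])
    return sequences, next_chars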

Now we have an array sequences containing strings and an array next_chars containing characters.

Our network can’t be trained on strings and characters; it needs numbers. We therefore need to vectorize sequences and next_chars and turn them into arrays of ones and zeros.
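The vectorization code relies on chars (the alphabet of the corpus) and the lookup tables char_to_index and indices_char, which aren’t defined in the snippets here; one common way to build them is:

chars = sorted(set(corpus))                          # every distinct character in the corpus
char_to_index = {c: i for i, c in enumerate(chars)}  # character -> column index
indices_char = {i: c for i, c in enumerate(chars)}   # column index -> character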

X = np.zeros((len(sequences), sequence_length, len(chars)), dtype=np.bool)
y = np.zeros((len(sequences), len(chars)), dtype=np.bool)
for i, sequence in enumerate(sequences):
    y[i, char_to_index[next_chars[i]]] = 1
    for t, char in enumerate(sequence):
        X[i, t, char_to_index[char]] = 1

y is next_chars but now in vectorized form. Each character is represented as a one-hot array, where every position in the alphabet is set to zero except the position for the character that the y-element represents, which is set to one.

X is constructed in a similar way: X is an array of sequences where each sequence is an array of one-hot arrays representing the letters in that sequence.

Coding the network

I strongly recommend using Python when doing machine learning because the Python ecosystem is very well developed for numerical computing and machine learning projects. For this task I will be using Keras, a high-level deep-learning framework that lets you use either TensorFlow or Theano as the underlying backend.

The beauty of Keras is that it lets you construct your networks in an object-oriented way. Once you have figured out how you want to build your network Keras will let you turn your ideas into code very swiftly.

As we discussed above, the network will take a vectorized sequence as the input. Such a sequence is a matrix with dimensions sequence_length × (number of characters in the alphabet).

model = Sequential()
model.add(LSTM(128, input_shape=(sequence_length, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Our network will have one hidden layer with 128 LSTM neurons, and the output will be an array with a probability for each letter in the alphabet. The letter with the highest probability is the most natural one to append to the sequence. We will then parse this output, pick a letter based on these probabilities, and append it to our sequence.
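The simplest way to parse the output is greedy decoding with argmax. This is only an illustration reusing names defined later in this article (x is a vectorized input sequence, indices_char maps indices back to characters); the generation loop further down samples via helper.sample instead:

import numpy as np

# Illustrative only: always take the single most probable next letter.
probabilities = model.predict(x, verbose=0)[0]        # one probability per character
next_char = indices_char[int(np.argmax(probabilities))]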

Training the network

Thanks to Keras the interface for training is very straightforward.

model.fit(X, y, batch_size=128, nb_epoch=EPOCHS)

Our neural network will try to fit X to y by iterating over the training data EPOCHS times. In each epoch the network will adjust its internal weights in order to fit the data better.

Generating lyrics

We’re finally set to generate some lyrics! We set sentence to the string we want our lyrics to start with. We vectorize this sentence and put the vectorized version into x.

When we’ve vectorized the sentence we can give it to model.predict to get the probabilities for each letter in the alphabet. By using helper.sample we pick a letter according to these probabilities (the diversity parameter controls how adventurous the sampling is). We repeat this procedure LYRIC_LENGTH times, each time adding a letter to our lyrics.
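helper.sample isn’t shown in this article; a typical implementation (a sketch along the lines of the standard Keras text-generation example) rescales the probabilities with the diversity value and samples from the result:

import numpy as np

# A sketch of a sampling helper. diversity acts as a temperature: low values
# pick safe, likely letters; higher values produce more surprising output.
def sample(predictions, diversity=1.0):
    predictions = np.asarray(predictions).astype('float64')
    predictions = np.log(predictions + 1e-8) / diversity
    exp_preds = np.exp(predictions)
    predictions = exp_preds / np.sum(exp_preds)
    draw = np.random.multinomial(1, predictions, 1)
    return np.argmax(draw)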

sentence = "The grass is green and my car is red lik"
sentence = sentence.lower()
generated = sentence
sys.stdout.write(generated)

for i in range(LYRIC_LENGTH):
    x = np.zeros((1, SEQUENCE_LENGTH, len(chars)))
    for t, char in enumerate(sentence):
        x[0, t, char_to_index[char]] = 1.

    predictions = model.predict(x, verbose=0)[0]
    next_index = helper.sample(predictions, diversity)
    next_char = indices_char[next_index]

    generated += next_char
    sentence = sentence[1:] + next_char

    sys.stdout.write(next_char)
    sys.stdout.flush()

DEMO TIME!

As expected, the network gave strange answers for the first couple of epochs. It wasn’t even able to produce real words:

Given “my name is ivan and i live in stockholm” generated:

my name is ivan and i live in stockholm l s bkasst w sewhenth obhimgan din si of ayn therie we t omrt ai yesi wats e baw aink thitkeyohithe thh dotha y a boin l b g we ms f ryop l ka bouirm lly uinouoc ow …

However, after leaving the network training for a couple of hours we saw some real progress!

After 10 epochs it started forming real words and after 20 it could generate this:

Given “my name is ivan and i live in stockholm” generated:

my name is ivan and i live in stockholm and i can do it and i wanna get closer so call me all alone i said its all i act so good and i wanna know the time i could stay in the world, youre always the only one that i wanna be…

Discussion

It’s so cool to see how a network can learn from experience! From not being able to produce a single word to busting sick rhymes in under 20 epochs.

Feel free to experiment with the structure of the network. You could for example try to add more hidden layers and change the number of nodes in the hidden layers.
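For example, a deeper variant with two stacked LSTM layers might look like this (a sketch with illustrative hyperparameters, not a configuration trained for this article); note that every LSTM layer except the last needs return_sequences=True:

from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation
from keras.optimizers import RMSprop

model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(sequence_length, len(chars))))
model.add(LSTM(128))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=0.01))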

I recommend training on a GPU as it will decrease the training time drastically. Take a look at this AMI for AWS p2 instances, which comes with TensorFlow and Keras pre-installed.

If you are interested in LSTM networks and how they differ from plain RNNs, I recommend this article.

You can find the source code below:

