Using Text Generation to Get the Lyrics for the Next Arctic Monkeys Song

Rajwrita Nath
5 min read · Jul 7, 2020

I am thoroughly obsessed with the Arctic Monkeys, I love machine learning, and I find it amusing how many incredible projects you can build with it. So why not combine the two?

It’s been a hot minute since they released a song, so one strange evening I decided to write some simple code to train a text generation model with Keras and TensorFlow and produce a brand new Arctic Monkeys song! (Terms and conditions apply: you cannot compare it to any of the real ones, sigh.)

This blog takes you through the code. The entire code is on my GitHub; take a look here.

We start with a dataset of almost all the songs from the Arctic Monkeys. You can find the dataset here. Try generating text using your own dataset when you go ahead with this code.


We start by importing NumPy for array manipulation, along with the TensorFlow and Keras modules needed for the deep learning model.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
import tensorflow.keras.utils as ku

Next, import the data:

data = open('AM.txt').read()

Next, we create a tokenizer and fit it on the text. The tokenizer builds a dictionary of all the words in the corpus as key-value pairs: the key is the word, and the value is the token generated for that word.

Simply put, the tokenizer breaks each line into words and assigns a unique integer to each distinct word. This is an essential step in preparing the data for the embedding layer, which is coming up next.

We can find the total number of words in the corpus by taking the length of the tokenizer’s word index. We add one to this because the word index starts at 1 and index 0 is reserved for padding, so the model needs one extra slot.

Here’s the corresponding code:

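tokenizer = Tokenizer()
corpus = data.lower().split("\n")  # one lyric line per entry
tokenizer.fit_on_texts(corpus)  # build the word -> integer index
total_words = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding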
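
To get a feel for what the word index contains, here is a minimal pure-Python sketch of the same idea. It is a simplified stand-in, not the Keras implementation (which also strips punctuation and has more options); the helper name build_word_index and the toy lines are assumptions made for illustration.

```python
from collections import Counter

def build_word_index(lines):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    # Index 0 is left unused so it can later serve as the padding value.
    # most_common() sorts by count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_corpus = ["do i wanna know", "do you wanna dance"]  # toy lines, not read from AM.txt
word_index = build_word_index(toy_corpus)
total_words = len(word_index) + 1  # +1 for the reserved index 0

print(word_index)
print(total_words)
```

The most frequent words get the smallest integers, and no word is ever assigned 0.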

Next, we create input sequences from lists of tokens. Our input sequences are nothing but a Python list. For each line in the corpus, we generate a token list using the tokenizer. This process converts a line of text like

Arabella’s got some interstellar-gator skin boots

into a list of tokens representing these words. This same process happens for every other line in our data set.

Let’s take a look at the code:

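input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    # build every prefix with at least two tokens: [w1, w2], [w1, w2, w3], ...
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)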
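
As a small worked example of the inner loop, the sketch below expands one token list into its prefixes. The token ids and the helper name prefix_sequences are made up for illustration.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

tokens = [4, 2, 66, 8]  # hypothetical token ids for a four-word line
sequences = prefix_sequences(tokens)
for seq in sequences:
    print(seq)
```

A four-word line yields three training sequences, each one word longer than the last.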

Hence, the input sequences are simply the sentences broken down into progressively longer phrases. Next, we need the length of the longest sequence, which we get by iterating over all the sequences and taking the maximum length.

max_sequence_len = max([len(x) for x in input_sequences])

Now we pad all of the sequences so that they are the same length. We pre-pad them with zeros, which makes it easy to extract the label: the last token of every padded sequence is the label.

input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
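
To make the effect of pre-padding concrete, here is a rough pure-Python sketch of the behaviour relied on above. The function pre_pad is a hypothetical simplification, not the Keras pad_sequences implementation.

```python
def pre_pad(sequences, maxlen=None, value=0):
    """Left-pad every sequence with `value` so that all have length maxlen."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = list(s)[-maxlen:]  # if a sequence is too long, keep its last maxlen tokens
        padded.append([value] * (maxlen - len(s)) + s)
    return padded

sequences = [[4, 2], [4, 2, 66], [4, 2, 66, 8]]  # made-up token ids
padded = pre_pad(sequences)
for row in padded:
    print(row)
```

Because the zeros go on the left, the final column always holds a real token, which is what makes the label easy to slice off.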

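After padding, we split the sequences into predictors and labels, that is, our x’s and y’s, using Python slicing. The labels are then one-hot encoded so they can be used with a categorical cross-entropy loss. The code:

predictors, label = input_sequences[:,:-1], input_sequences[:,-1]
label = ku.to_categorical(label, num_classes=total_words)  # one-hot encode the labels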
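
The split can be illustrated on a tiny padded matrix. This is a sketch with made-up token ids and a toy vocabulary size; the one_hot helper is a hypothetical stand-in showing the vector form a categorical cross-entropy loss expects for each label.

```python
padded = [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]  # made-up padded sequences
total_words = 5  # toy vocabulary size, including the reserved index 0

predictors = [row[:-1] for row in padded]  # everything except the last token
labels = [row[-1] for row in padded]       # the last token is the word to predict

def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

one_hot_labels = [one_hot(y, total_words) for y in labels]
print(predictors)
print(labels)
print(one_hot_labels[0])
```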

Now that we have our data as x’s and y’s, we can move on to creating a neural network that classifies what the next word should be, given a set of words.

Start with the embedding layer

The embedding layer is an indispensable part of any deep learning model that needs to make sense of words. It maps each word from a high-dimensional space (one dimension per vocabulary word) to a lower-dimensional vector, and during training, words that are used in similar ways tend to end up with similar vectors. This lets us perform mathematical operations on the vectors. In short, it turns our words into a representation the neural network can work with.

The first parameter is the vocabulary size (total_words), and the second is the number of dimensions used to represent each word as a vector. The final parameter is the length of the input sequences, which is the length of the longest sequence minus 1.

The 1 is subtracted because earlier, we cropped off the last word of each sequence to get the label, so our sequences will be one less than the maximum sequence length.

model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
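
Conceptually, an embedding layer is a lookup table with one row per token id. The sketch below shows only the lookup, with toy sizes and random values that are assumptions for illustration; in the real model the rows are learned during training.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4  # toy sizes; the article uses total_words and 100

# One row of embedding_dim numbers per token id.
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Replace each token id with its row of the embedding matrix."""
    return [embedding_matrix[t] for t in token_ids]

sequence = [0, 1, 2]  # a padded input of length max_sequence_len - 1 = 3
embedded = embed(sequence)
print(len(embedded), len(embedded[0]))  # 3 4
```

Each input sequence of length 3 becomes a 3 x 4 grid of numbers, which is what the next layer consumes.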

Time to add an LSTM

An LSTM maintains a cell state that carries context along the sequence, so it is not just the adjacent word that influences the next word.

Instead of a single-layer LSTM, you could also use a stacked LSTM. A detailed blog on LSTMs here.

With a Bidirectional LSTM, the layer reads the sequence once from beginning to end and once from end to beginning. This can help the neural network understand the text better, and it can also help the network converge a bit faster.

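The return_sequences flag is set to True so that the first LSTM passes its output at every time step to the second LSTM layer, instead of just its final state.

model.add(Bidirectional(LSTM(150, return_sequences=True)))
model.add(LSTM(100))  # second LSTM layer; 100 units is an assumed size, not given in the text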

Next we add dense layers. The first, with a ReLU activation, further transforms the LSTM output; the last converts it into word probabilities. The softmax activation function maps the raw scores, which can range over (-∞, ∞), to probabilities in (0, 1) that sum to 1.

model.add(Dense(total_words//2, activation='relu', kernel_regularizer=regularizers.l2(0.01)))  # // keeps the unit count an integer
model.add(Dense(total_words, activation='softmax'))
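
For intuition about what the softmax does to the final layer’s raw scores, here is a small standalone sketch. The scores are made up, and the function is a plain-Python illustration rather than the Keras implementation.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow without changing the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, -1.0]  # hypothetical raw scores for three words
probs = softmax(scores)
print(probs)
```

The highest score receives the highest probability, and every value lands strictly between 0 and 1.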

Since this is a multi-class classification problem, we set the loss to categorical cross-entropy and use the Adam optimizer.
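
As a rough illustration of what categorical cross-entropy measures, the sketch below computes it by hand for one one-hot label and two made-up predictions. The function name and the numbers are assumptions for illustration only.

```python
import math

def categorical_cross_entropy(one_hot_label, predicted_probs):
    """Negative log of the probability assigned to the correct class."""
    return -sum(
        y * math.log(p)
        for y, p in zip(one_hot_label, predicted_probs)
        if y > 0
    )

label = [0.0, 0.0, 1.0, 0.0]          # the correct next word has id 2
confident = [0.05, 0.05, 0.85, 0.05]  # most of the probability on the right word
unsure = [0.25, 0.25, 0.25, 0.25]     # probability spread evenly

loss_confident = categorical_cross_entropy(label, confident)
loss_unsure = categorical_cross_entropy(label, unsure)
print(loss_confident, loss_unsure)
```

The loss shrinks as the model puts more probability on the correct next word, which is what training pushes towards.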

Epochs, almost there

The dataset is small, so the model needs many passes over it before the predictions become coherent. The snippet below trains for 100 epochs; raising this toward 500 gives the model longer to converge, at the cost of training time.

history = model.fit(predictors, label, epochs=100, verbose=1)

The more words you ask the model to predict, the more gibberish crops up. This is because each predicted word is fed back in to predict the next one, and the one after that, so errors compound and each prediction is less certain than the one before it.
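
The feed-back loop can be sketched as follows. This is a minimal illustration under stated assumptions: predict_probs is a hypothetical stand-in for the trained model’s prediction call, and the vocabulary and max_sequence_len are toy values, so only the shape of the loop reflects the approach described here.

```python
word_index = {"i": 1, "wanna": 2, "be": 3, "yours": 4}  # toy vocabulary
index_word = {i: w for w, i in word_index.items()}
max_sequence_len = 5  # toy value

def predict_probs(padded_ids):
    """Hypothetical stand-in for the model: always favours the id after the last one."""
    nxt = padded_ids[-1] % len(word_index) + 1
    probs = [0.0] * (len(word_index) + 1)
    probs[nxt] = 1.0
    return probs

def generate(seed_text, next_words):
    text = seed_text
    for _ in range(next_words):
        # Tokenize the text so far, keep the most recent tokens, and pre-pad with zeros.
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[-(max_sequence_len - 1):]
        padded = [0] * (max_sequence_len - 1 - len(ids)) + ids
        probs = predict_probs(padded)
        best = max(range(len(probs)), key=probs.__getitem__)  # most likely word id
        text += " " + index_word[best]  # the prediction becomes part of the next input
    return text

print(generate("i wanna", 2))  # i wanna be yours
```

Each new word is appended to the text and fed straight back in, which is why a mistake early on affects everything that follows.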

Here is the final text that the network predicted!

seed_text = "I really like the Arctic Monkeys and "

I really like the Arctic Monkeys and at me just been me so around there’s lover i’ll fall long long time round but you know if you want to tell there slippers of stay with against the wall end action lined up popping up up up and stay with your hands around your head around my neck gets lotion i sit and talk to me on the floor with in the day round but i said i want to tell settling down or giving up in wicker chair and you imagined i could imagined i could want to you found the place on of stay with the (?)

Given a large enough corpus to train on, the neural network can chain its next-word predictions into some fairly sophisticated text.

You could try this out on your own with songs from your favorite artists and have fun! Follow me on GitHub for more such descriptive projects.

Stay tuned for more :)



Rajwrita Nath

Women Techmakers Scholar 2020, DSC NSEC Lead, Moderator at Manning Publications Co.