Music Generation using LSTMs in Keras

Animesh Sharma
Intel Student Ambassadors
6 min read · Dec 28, 2018

Music is food for the soul. It is our constant companion in sad times as well as happy ones, and talented musicians leave us in awe with their performances, making a lasting impression on our lives. This was my motivation to build this project with an A.I. flavor mixed in: I want to generate a beautiful piece of music that leaves people awed by how well a machine can mimic a human artist.


But how can we use music for model training? What kind of preprocessing does music data need before going into a model? We will try to answer these questions in this article and then try to use the power of Deep Learning to generate some cool music.

First of all, what is music data?

Music is an analog signal with two components: amplitude and frequency. We are going to work with the amplitude component, predicting the amplitude of the next sample of a song based on the samples that came before it.

LSTMs are a variant of Recurrent Neural Networks and work well on sequential data. They make predictions based on the current input as well as the previous inputs, instead of treating every beat independently. Since a beat in music depends on the beats that came before it, music is sequential data too, and an LSTM model is well suited to it.

Let’s get started!!

To begin, import the necessary modules and functions.

import numpy as np
import pandas as pd
import pydub
from keras.layers import Dense, LSTM, LeakyReLU
from keras.models import Sequential, load_model
from scipy.io.wavfile import read, write

If you want to train your model on mp3 files, the following lines will do the trick (note that pydub relies on ffmpeg under the hood to decode mp3).

# converting mp3 file to wav file
sound = pydub.AudioSegment.from_mp3("Numb_piano.mp3")
sound.export("Numb.wav", format="wav")
sound = pydub.AudioSegment.from_mp3("Eminem.mp3")
sound.export("Eminem.wav", format="wav")

If you already have your songs in wav format, you can skip the previous lines and load the data directly with scipy's read function.

# loading the wav files
rate, music1 = read('Numb.wav')
rate, music2 = read('Eminem.wav')

Music files amount to huge quantities of data once they are read from wav into discrete samples. Each second of audio corresponds to thousands of samples; the exact number depends on the sampling rate, and a typical wav file is sampled at 44,100 samples per second (the rate returned by scipy's read). We will not train on the full files for now, as we would like to observe how well a model can learn music from only a few seconds of it.
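As a quick sanity check, you can print the sampling rate and the size of the loaded arrays (the exact numbers will depend on your files):

# quick sanity check on the loaded data (numbers depend on your files)
print(rate)                    # sampling rate in Hz, typically 44100
print(music1.shape)            # (total samples, number of channels)
print(music1.shape[0] / rate)  # approximate duration in seconds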

# taking only some part of the songs and converting to a dataframe
music1 = pd.DataFrame(music1[0:400000, :])
music2 = pd.DataFrame(music2[0:400000, :])

Now, let’s take a look at our data.
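For example, inspecting the first few rows:

# peeking at the first few rows of the music dataframe
print(music1.head())
print(music1.dtypes)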

We can clearly see that the music dataframe has two columns, namely the two channels of our song, and that the values are integers. We are going to train one model per channel, make predictions separately for each, and then concatenate the results to generate a piece of music.

Data Pre-processing:

To prepare our data for an LSTM network, we must first decide how we are going to feed the data into the network. Since we are building one model per channel, each model sees one column of the music dataframe. To convert this into a supervised learning problem, or rather a regression problem, we arrange the data so that the first 3 rows are used to predict the value of the 4th row, rows 2 to 4 are used to predict the 5th row, and so on. Let us define a helper function to do this for us.

# function to create training data by shifting the music data
def create_train_dataset(df, look_back, train=True):
    dataX1, dataX2, dataY1, dataY2 = [], [], [], []
    for i in range(len(df) - look_back - 1):
        dataX1.append(df.iloc[i : i + look_back, 0].values)
        dataX2.append(df.iloc[i : i + look_back, 1].values)
        if train:
            dataY1.append(df.iloc[i + look_back, 0])
            dataY2.append(df.iloc[i + look_back, 1])
    if train:
        return np.array(dataX1), np.array(dataX2), np.array(dataY1), np.array(dataY2)
    else:
        return np.array(dataX1), np.array(dataX2)

The function takes three arguments: the dataframe to shift, the number of samples (look_back) used to predict the next sample, and whether to generate training or test data. X1 and X2 hold the features for channels 1 and 2; their shape depends on look_back, so with look_back=3 each is a numpy array of shape (n, 3). y1 and y2 are the target values for channels 1 and 2. We have thus converted our music data into the form most machine learning algorithms operate on: feature vectors (X1 and X2) and labels or dependent variables (y1 and y2).
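To make the shapes concrete, here is a toy run of the helper on a made-up 6-sample, 2-channel dataframe (the values are arbitrary, purely for illustration):

# toy dataframe: 6 samples, 2 channels (made-up values)
toy = pd.DataFrame({'ch1': [1, 2, 3, 4, 5, 6], 'ch2': [10, 20, 30, 40, 50, 60]})
tX1, tX2, ty1, ty2 = create_train_dataset(toy, look_back=3, train=True)
print(tX1)  # [[1 2 3] [2 3 4]] -> shape (2, 3)
print(ty1)  # [4 5] -> the sample that follows each window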

To generate our training data, let’s use some samples from both the songs with look_back of 3. Also, train=True will tell the function to generate training data, i.e. features as well as labels.

X1, X2, y1, y2 = create_train_dataset(pd.concat([music1.iloc[0:160000, :],music2.iloc[0:160000, :]], axis=0), look_back=3, train=True)

We use pd.concat to stack the music1 and music2 dataframes so that the training data covers two different songs.

Now to prepare our test data, we just have to pass train=False to our helper function and we also pass different samples from the dataframes.

test1, test2 = create_train_dataset(pd.concat([music1.iloc[160001 : 400000, :], music2.iloc[160001 : 400000, :]], axis=0), look_back=3, train=False)

Let's reshape our 2-d data into 3-d, as Keras LSTMs expect input of shape (samples, timesteps, features).

X1 = X1.reshape((-1, 1, 3))
X2 = X2.reshape((-1, 1, 3))
test1 = test1.reshape((-1, 1, 3))
test2 = test2.reshape((-1, 1, 3))
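It is worth verifying the shapes at this point; with one timestep of 3 features per sample, each array should come out as (n, 1, 3):

# sanity check: each input is one timestep of 3 features
print(X1.shape)     # (n, 1, 3)
print(test1.shape)  # (m, 1, 3)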

Model Training:

Now that our train and test data are completely ready, let’s use the Keras API to build our LSTM model.

# LSTM Model for channel 1 of the music data
rnn1 = Sequential()
rnn1.add(LSTM(units=100, activation='relu', input_shape=(None, 3)))
rnn1.add(Dense(units=50, activation='relu'))
rnn1.add(Dense(units=25, activation='relu'))
rnn1.add(Dense(units=12, activation='relu'))
rnn1.add(Dense(units=1, activation='relu'))
rnn1.compile(optimizer='adam', loss='mean_squared_error')
rnn1.fit(X1, y1, epochs=20, batch_size=100)

Thus we have an LSTM layer, 3 hidden Dense layers, and an output layer. The Adam optimizer is used with mean squared error as the loss, a standard choice for a regression task. We train the model for 20 epochs with a batch size of 100, using the ReLU activation in every layer. The second model (rnn2) for channel 2 is trained in exactly the same way.
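For completeness, a mirrored sketch of the channel-2 model would look like this (same architecture, trained on X2 and y2):

# LSTM model for channel 2 of the music data (mirrors rnn1)
rnn2 = Sequential()
rnn2.add(LSTM(units=100, activation='relu', input_shape=(None, 3)))
rnn2.add(Dense(units=50, activation='relu'))
rnn2.add(Dense(units=25, activation='relu'))
rnn2.add(Dense(units=12, activation='relu'))
rnn2.add(Dense(units=1, activation='relu'))
rnn2.compile(optimizer='adam', loss='mean_squared_error')
rnn2.fit(X2, y2, epochs=20, batch_size=100)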

We can then make predictions on our test data using the predict method of our models.

# making predictions for channel 1 and channel 2
pred_rnn1 = rnn1.predict(test1)
pred_rnn2 = rnn2.predict(test2)

Let us now save our predictions and our test data in wav format for comparison.

# saving the LSTM predictions in wav format
write('pred_rnn.wav', rate, pd.concat([pd.DataFrame(pred_rnn1.astype('int16')), pd.DataFrame(pred_rnn2.astype('int16'))], axis=1).values)
# saving the original music in wav format
write('original.wav', rate, pd.concat([music1.iloc[160001 : 400000, :], music2.iloc[160001 : 400000, :]], axis=0).values)

You will see that the output captures the music pretty well but comes out distorted. To overcome this, I tried various remedies, such as increasing the number of epochs and adding more layers, but they didn't work. Then I observed that the predictions are always non-negative because of the ReLU activation, while the original music data contains negative values too. So I switched to the LeakyReLU activation, which allows some negative values in the predictions, and it worked like a charm. Let us edit our model-building step to incorporate LeakyReLU.

rnn1 = Sequential()
rnn1.add(LSTM(units=100, activation='linear', input_shape=(None, 3)))
rnn1.add(LeakyReLU())
rnn1.add(Dense(units=50, activation='linear'))
rnn1.add(LeakyReLU())
rnn1.add(Dense(units=25, activation='linear'))
rnn1.add(LeakyReLU())
rnn1.add(Dense(units=12, activation='linear'))
rnn1.add(LeakyReLU())
rnn1.add(Dense(units=1, activation='linear'))
rnn1.add(LeakyReLU())
rnn1.compile(optimizer='adam', loss='mean_squared_error')
rnn1.fit(X1, y1, epochs=20, batch_size=100)

Making the same changes in rnn2 and then running the predict method gives you the new predictions; save them as shown below and compare them with the original piece of music. The distortion has disappeared, and now it is quite difficult to tell the original and the generated work apart.
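For reference, saving the new predictions is the same write call as before (the file name pred_rnn_leaky.wav is just my suggestion):

# predictions from the LeakyReLU models, saved for comparison
pred_rnn1 = rnn1.predict(test1)
pred_rnn2 = rnn2.predict(test2)
write('pred_rnn_leaky.wav', rate, pd.concat([pd.DataFrame(pred_rnn1.astype('int16')), pd.DataFrame(pred_rnn2.astype('int16'))], axis=1).values)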

Right now, I have generated samples of just two songs together due to hardware restrictions. If a lot of computing power is at your disposal, you can train a model on many more songs, and also increase the look_back value so that more samples are used to learn the next one. You can also ask your friends and family to volunteer for a Turing test of your model for some fun!!

Let me know in the comments if you liked this article and if you have any suggestions or other cool ideas.

The complete code can be found here.
