Who’s the Songster !

Outline :
Speaker recognition is the identification of a person from the characteristics of his or her voice. Recognizing a voice is an important human ability that most of us take for granted in natural human-to-human communication. It is also called voice recognition. There is a difference between speaker recognition (recognizing who is speaking) and speech recognition (recognizing what is being said). These two terms are frequently confused, and “voice recognition” can be used for both.

Goal : Build a deep learning model that can recognize the voice of an artist (also known as a dynamic voice identifier) for a given song with minimal training data.

Github Link : https://github.com/Hina19/Whos-The-Songster-

Overview : To create such a system, the natural tool of choice is an image classifier. This is the high-level view of what I am going to do to build this recognition task.

So, in the first step I collect the audio files of different songsters (artists) in .wav format, then convert each audio file into a spectrogram (an image representation), then extract features from the images using a CNN, and finally apply an ML ensemble model, gradient boosting (XGBoost).

Let's get started!

As there was no publicly available dataset for voice classification in songs, we have to create our own in order to train a model. So our first task is to collect at least 40 songs for each of your favourite artists (try to collect songs for more than 8 artists; these are going to be your classes) in any format, using any music app. I know it is time-consuming, so you can find your own shortcut for this task, but let me tell you mine!

As I am a big fan of Bollywood songs, I downloaded at least 500 solo songs by my 10 favourite artists, such as Arijit, Atif, Armaan, Shreya, Sunidhi and Sonu Nigam (at least 40 each), each in a different folder. For this I used the Gaana music app. If, like me, you don’t have an account there, you can use IDM (Internet Download Manager) to download the songs directly as .ts files. Using IDM you can build your dataset of songs in a very short time!

Now, our next task is to convert all those .ts files into .wav format. For that you can call ffmpeg (a powerful and versatile command-line tool for converting audio and video files) from Python.

To convert .ts into .wav :

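import os
import subprocess
import sys

if __name__ == "__main__":
    sourceDir = sys.argv[1]
    destDir = sys.argv[2]
    for file in os.listdir(sourceDir):
        name = file[:file.rfind(".")]
        # convert <name>.ts in sourceDir to <name>.wav in destDir
        subprocess.run(["ffmpeg", "-i", os.path.join(sourceDir, name + ".ts"),
                        os.path.join(destDir, name + ".wav")])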

Finally, you are done with your collection!

Spectrograms of Audio files :

Once you have your dataset, the next step is to convert the audio files into spectrogram images. For that, you first have to load each audio file.

Why spectrograms? In speaker recognition (or voice recognition), it is common to visualise a given speaker’s voice pattern as a spectrogram and to compare spectrograms as a pattern-recognition problem.
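To make the idea concrete, here is a minimal sketch of what a spectrogram contains: the magnitudes of a short-time Fourier transform, one column per time frame. It uses only NumPy. The function name `simple_spectrogram`, the frame size of 256, the hop of 128 and the 440 Hz test tone are all illustrative assumptions; the project itself plots spectrograms with `plt.specgram`.

```python
import numpy as np

def simple_spectrogram(signal, frame_size=256, hop=128):
    """Return short-time Fourier transform magnitudes.

    Rows are frequency bins, columns are time frames.
    """
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_size] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real signal
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Synthetic one-second 440 Hz tone sampled at 8 kHz (made-up test signal)
rate = 8000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 440 * t)

spec = simple_spectrogram(tone)
freqs = np.fft.rfftfreq(256, d=1.0 / rate)
# Frequency bin with the most energy, averaged over time
peak_hz = freqs[spec.mean(axis=1).argmax()]
print(spec.shape, peak_hz)
```

For a pure tone almost all the energy sits in one frequency row. A singing voice spreads energy over many rows in a pattern that changes over time, and that pattern is what the image classifier learns from.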

For reading audio file:

import scipy.io.wavfile
sample_rate, X = scipy.io.wavfile.read('filepath')
Challenges with the data:
* Your data consists of various songs from different artists, so every song contains noise in the form of background music, and
* For a particular artist, the vocal range may differ in a few songs because of filters.
So, in order to deal with the first challenge, we take several clips of 20 secs each from every song, from parts where there is likely to be voice, and create spectrograms for all the clips.
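A minimal sketch of the clipping step, assuming the audio is already loaded as a NumPy array. The helper name `split_into_clips` and its parameters are hypothetical. Skipping the first 30 seconds is an assumption based on the offsets used in the snippets in this article, and the silent array only stands in for a real song.

```python
import numpy as np

def split_into_clips(samples, rate, clip_seconds=20, start_seconds=0):
    """Cut an audio array into consecutive, non-overlapping clips."""
    clip_len = rate * clip_seconds
    start = rate * start_seconds
    clips = []
    # Stop as soon as a full clip no longer fits
    while start + clip_len <= len(samples):
        clips.append(samples[start:start + clip_len])
        start += clip_len
    return clips

# Stand-in for a 95-second mono recording at 8 kHz (silence, illustration only)
rate = 8000
song = np.zeros(rate * 95)
clips = split_into_clips(song, rate, clip_seconds=20, start_seconds=30)
print(len(clips), [len(c) / rate for c in clips])
```

Each clip can then be turned into its own spectrogram image, so one song contributes several training examples.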

For plotting spectrogram :

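import numpy as np
import matplotlib.pyplot as plt

length = np.shape(X)[0] / float(sample_rate)  # length of the song in seconds
if length > 60:
    X = X[sample_rate*30:sample_rate*60]  # take only a part of the song
if X.ndim > 1:
    X = np.mean(X, axis=1)  # average the stereo channels into one
plt.specgram(X, Fs=sample_rate)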

To save your spectrogram images :

plt.savefig("imgs_new/{}{}.png".format(artist,f.split('\\')[2].split('.')[0]), bbox_inches='tight')

For getting labels for your images:

from glob import glob

data = glob("spectrogram\\*.png")
labels = []
for image in data:
    ...  # derive the artist label from each file name and append it to labels
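One possible way to derive the labels, as a sketch. It assumes that each image file name starts with the artist name, as in the `savefig` call above, and that the list of artists is known in advance. The helper `label_from_path` and the example file names are hypothetical.

```python
# Artist names as they appear in the prediction output in this article
ARTISTS = ["arijit", "armaan", "atif", "kishore_kumar", "kumar_sanu",
           "mohit", "shaan", "shreya", "sonu_nigam", "sunidhi"]

def label_from_path(path, artists=ARTISTS):
    """Return the artist whose name prefixes the file name, or None."""
    # Handle both separators so Windows-style paths work on any OS
    file_name = path.replace("\\", "/").split("/")[-1]
    for artist in artists:
        if file_name.startswith(artist):
            return artist
    return None

# Hypothetical file names, for illustration only
data = ["spectrogram\\arijitsong1.png", "spectrogram\\sonu_nigamsong7.png"]
labels = [label_from_path(p) for p in data]
print(labels)
```

Prefix matching only works while no artist name is a prefix of another; a separator between artist and song name in the file name would make this more robust.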

If everything runs fine, the spectrogram of a single audio file of a particular artist looks like this :

Using CNN Architecture:

I create a CNN by modifying an existing VGG-16 and train it on spectrograms from 10 unique speakers. All convolutional layers use a 3x3 kernel and max pooling uses a 2x2 pool size. I use ReLU activations between layers, a dropout of 0.1, and a softmax activation for the last layer. The loss function is categorical cross-entropy, with Adam as the optimizer.

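from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# (the convolution, pooling, flatten and dropout layers described above go here)
model.add(Dense(1024, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy', metrics=['accuracy'])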

Performance :

Since there is little data, the performance after 100 epochs is not good and the model tends to overfit!

history = model.fit(x_train, y_train,
                    epochs=100,
                    validation_data=(x_test, y_test))

ML Model and Transfer Learning:

  1. Feature Extraction:

I use a pre-trained VGG-16 (a 16-layer convolutional neural network) to extract features from the spectrogram images. This method is known as transfer learning: we “transfer the learning” of the pre-trained model to our specific problem.

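import numpy as np
from keras import applications

# build the VGG16 network
model = applications.VGG16(include_top=False, weights='imagenet')
# train_data_dir is a placeholder name: the folder path is not given here
generator_train = datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode=None,  # yield images only, without labels
    shuffle=False)    # keep the features in file order
# To extract features
bottleneck_features_train = model.predict_generator(generator_train, nb_train_samples // batch_size)
# To store features
np.save(open('data_features.npy', 'wb'), bottleneck_features_train)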

2. Learning Model :

Using the CNN as a feature extractor, we get data in ~2000 dimensions. I use XGBoost, a gradient boosting method, as the learning model for voice classification.
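Where a number like ~2000 can come from is easy to check with a little arithmetic. The sketch below computes the length of the flattened VGG-16 convolutional output for a given image size. The image size is not stated in this article, so 64x64 is only an assumption that happens to be consistent with "~2000"; the function name is hypothetical too.

```python
def vgg16_feature_size(img_width, img_height):
    """Length of the flattened VGG-16 convolutional output (no top layers).

    VGG-16 uses 'same'-padded 3x3 convolutions, which keep width and height,
    and five 2x2 max-pooling layers, each of which halves them (rounding
    down). The last convolutional block has 512 channels.
    """
    w, h = img_width, img_height
    for _ in range(5):
        w, h = w // 2, h // 2
    return w * h * 512

print(vgg16_feature_size(64, 64))    # 2 * 2 * 512 = 2048
print(vgg16_feature_size(224, 224))  # 7 * 7 * 512 = 25088
```

The saved bottleneck features are 4-D (samples, height, width, channels), so they would need to be flattened to one row per image, for example with `features.reshape(len(features), -1)`, before a tree-based model can use them.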

Why XGBoost? XGBoost is an ensemble method: it uses many trees to make a decision, so it gains power by repeatedly adding trees that correct earlier errors. Tree-based approaches are also very robust. They work on a wide variety of problems and can capture dependencies that linear models cannot. Compared with an SVM, you also avoid having to choose a kernel, which can be difficult because the SVM cannot choose the right kernel for you! Boosting, and XGBoost in particular, often improves performance.
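To show what "gaining power by repeating itself" means, here is a toy gradient boosting loop written from scratch. This is not XGBoost and not the model used in this project: it is a simplified sketch for 1-D regression with squared error, where every round fits a one-split tree (a stump) to the current residuals. All names and the sine-wave data are illustrative assumptions.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the threshold split of x that best fits residual (least squares)."""
    best = None
    # Skip the largest value so the right side is never empty
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits the residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction  # negative gradient of the squared error
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction

# Toy data, for illustration only
x = np.linspace(0, 6, 60)
y = np.sin(x)
_, _, pred_5 = boost(x, y, n_rounds=5)
_, _, pred_50 = boost(x, y, n_rounds=50)
mse_5 = np.mean((y - pred_5) ** 2)
mse_50 = np.mean((y - pred_50) ** 2)
print(mse_5, mse_50)
```

The training error after 50 rounds is lower than after 5, because every new stump removes part of what the previous ones got wrong. XGBoost builds on the same idea with deeper trees, regularisation and support for classification losses.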
Now, in order to deal with the second challenge, we calculate accuracies with Top-3 predictions, because the voices of some singers are similar in a few songs.

For a single song you get 10 probabilities, one for each of the 10 classes. We predict the class (artist) with the highest probability, and if it matches the actual label we increase the count (that is a Top-1 prediction). Otherwise we look at the class with the second-highest probability, and if that one is correct we increase the count for Top-2; otherwise we do the same for Top-3. In other words, a Top-k prediction counts as correct when the actual artist is among the k most probable classes. Finally, we calculate the accuracies for Top-1, Top-2 and Top-3.
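The Top-k counting described above can be sketched in a few lines of NumPy. The function name `top_k_accuracy` is hypothetical, and the probabilities are made up for illustration (3 classes instead of 10, to keep the example short).

```python
import numpy as np

def top_k_accuracy(probabilities, true_labels, k):
    """Fraction of samples whose true class is among the k most probable."""
    # argsort is ascending, so the last k columns hold the k best classes
    top_k = np.argsort(probabilities, axis=1)[:, -k:]
    hits = [label in row for label, row in zip(true_labels, top_k)]
    return float(np.mean(hits))

# Made-up probabilities for 4 clips over 3 classes, for illustration only
probabilities = np.array([
    [0.7, 0.2, 0.1],   # true class 0: Top-1 hit
    [0.3, 0.5, 0.2],   # true class 0: Top-2 hit
    [0.1, 0.3, 0.6],   # true class 0: Top-3 hit only
    [0.2, 0.7, 0.1],   # true class 1: Top-1 hit
])
true_labels = np.array([0, 0, 0, 1])

for k in (1, 2, 3):
    print(k, top_k_accuracy(probabilities, true_labels, k))
```

Top-k accuracy can only grow with k, and with k equal to the number of classes it is always 1.0, so it is most informative for small k such as 3 out of 10 artists.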

Applying Model In Real Time :

Steps For Real Time Predictions

Let's do that for one song!

Clips 20 sec each :

rate, X = scipy.io.wavfile.read('path') 
x1 = X[rate*30:rate*50]
x2 = X[rate*50:rate*70]
x3 = X[rate*60:rate*80]
x4 = X[rate*120:rate*140]
x5 = X[rate*140:rate*160]
[kishore_kumar, kumar_sanu , kishore_kumar , kishore_kumar, kumar_sanu]
Majority Vote:
Predicted Singer Percentage : 
[('kishore_kumar', 80.0), ('kumar_sanu', 20.0)]
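The voting step itself can be sketched with `collections.Counter`. The helper name `majority_vote` is hypothetical, and the clip predictions below are illustrative values rather than real model output.

```python
from collections import Counter

def majority_vote(clip_predictions):
    """Return the most common label and each label's share of the votes."""
    counts = Counter(clip_predictions)
    total = len(clip_predictions)
    # most_common() sorts by count, highest first
    percentages = [(label, 100.0 * n / total) for label, n in counts.most_common()]
    return percentages[0][0], percentages

# Illustrative predictions for five clips of one song
clip_predictions = ["kishore_kumar", "kishore_kumar", "kumar_sanu",
                    "kishore_kumar", "kishore_kumar"]
singer, percentages = majority_vote(clip_predictions)
print(singer, percentages)
```

With an odd number of clips a two-way tie is impossible, but with more than two candidate singers ties can still happen; in this sketch a tie goes to the label that was seen first.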

Results :

The results were largely positive. You can see the results of a few live tests of our model below :

Found 5 images belonging to 1 classes.
Original Singer : atif
Predicted Singer : atif
Predicted Singer Percentage : [('atif', 40.0), ('kishore_kumar', 20.0), ('mohit', 20.0), ('sunidhi', 20.0)]

Found 5 images belonging to 1 classes.
Original Singer : arijit
Predicted Singer : arijit
Predicted Singer Percentage : [('arijit', 100.0)]

Found 5 images belonging to 1 classes.
Original Singer : shreya
Predicted Singer : shreya
Predicted Singer Percentage : [('shreya', 60.0), ('shaan', 40.0)]

Found 5 images belonging to 1 classes.
Original Singer : mohit
Predicted Singer : mohit
Predicted Singer Percentage : [('mohit', 40.0), ('kumar_sanu', 20.0), ('arijit', 20.0), ('atif', 20.0)]

Found 5 images belonging to 1 classes.
Original Singer : sunidhi
Predicted Singer : sunidhi
Predicted Singer Percentage : [('sunidhi', 100.0)]

Errors :

There are also some mispredictions, due to similarity between the voice frequencies of some songsters in a few songs :

Song Name : LE JAA MUJHE
Found 5 images belonging to 1 classes.
Original Singer : armaan
Predicted Singer : sonu_nigam
Predicted Singer Percentage : [('sonu_nigam', 60.0), ('armaan', 20.0), ('arijit', 20.0)]

Found 5 images belonging to 1 classes.
Original Singer : mohit
Predicted Singer : sonu_nigam
Predicted Singer Percentage : [('sonu_nigam', 60.0), ('mohit', 40.0)]

Found 5 images belonging to 1 classes.
Original Singer : sunidhi
Predicted Singer : sonu_nigam
Predicted Singer Percentage : [('sonu_nigam', 40.0), ('arijit', 40.0), ('sunidhi', 20.0)]

Future Work :

Although this model performs rather well, if you want to increase the accuracy you can include more examples for each artist to train your model, and do some feature selection before feeding the features into the model to avoid overfitting.

That's all for the project! I hope you enjoyed this so-called musical tutorial :-)

Thanks for reading!