Deep Learning Based Raga Classification in Carnatic Music

Great Learning Snippets
Jul 16, 2020


Contributors: Srinivas Athreya, Dr Vyzarsu Balasubramanyam, Ravindra Somayajulu, Pathi Mohan Rao, Sridhar M, Amal Augusty, Pradeep R

Abstract

This work is about the classification of Ragas in Carnatic music using machine learning models, limited in scope to the Melakartha (parent) Ragas. Audio files are downloaded from YouTube and converted to the required format. Features such as the mel-spectrogram, Mel frequency cepstral coefficients (MFCCs), spectral bandwidth and chroma are extracted using librosa, a Python library for audio processing. We considered compositions in all 72 Melakartha Ragas; classification was done using different models, and the results were encouraging. We also attempted Raga detection and classification by Swara modelling, swara searching and Raga framing. The team tried different models for classification — LSTM, CNN, RNN, logistic regression, SVM, KNN and DNN — with multiple combinations of feature sets. We observed that the DNN gave good performance, and that the SVM also performed attractively.

This has several interesting applications in digital music mixing, song recommendation and information retrieval from music signals. Using Swara modelling and searching, the machine's musical intelligence can be improved considerably.

Introduction

Since ancient times, music has been at the heart of Indian culture. Indian classical music is one of the oldest music systems in the world, and its significance in Indian culture and tradition cannot be overlooked. Indian classical music is categorised into two major forms: Hindustani and Carnatic. The present work focuses on Carnatic Raga classification. One of the most challenging milestones in the field of Carnatic music analysis is Raga detection. Every composition is structured using Ragas. There are 72 Melakartha (parent) Ragas and an indefinite number of Janya Ragas. The Raga is considered the backbone of Indian classical music: it is the fundamental concept on which the whole melody of a performance is based, and it comprises groups of swaras built from 5–7 basic notes.

Unlike the Western music system, where the frequencies of swaras/notes are fixed, the Indian classical music system allows variations in a few notes. This scheme fixes 'Sa' and 'Pa', gives 'Ma' 2 variants, and gives each of the remaining swaras 3 variants. Swaras and their variations are given in the table below.

The following swarams share positions:
● Ga1 and Ri2
● Ga2 and Ri3
● Ni1 and Dha2
● Ni2 and Dha3

Keeping 7 swarams in every Raga in a fixed sequence, the different combinations of these swaram variations yield a total of 72 Ragas: with Sa and Pa fixed, there are 6 valid Ri–Ga combinations, 2 choices of Ma and 6 valid Dha–Ni combinations, giving 6 × 2 × 6 = 72. These are called Melakartha Ragas. Hence Raga identification/classification is very important for analysing and studying compositions.
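As a quick sanity check on that arithmetic, the count can be enumerated in a few lines of Python. The position rule below encodes the shared-position table above: a Ga variant may only sit at or above the chosen Ri variant, and likewise Ni relative to Dha.

```python
# Enumerating the 72 Melakartha combinations. Sa and Pa are fixed.
# Variant indices run 1-3; a (Ri, Ga) pair is valid only when the Ga
# variant's position is at or above the Ri variant's (g >= r), per the
# shared positions listed above; (Dha, Ni) behaves the same way.
ri_ga = [(r, g) for r in (1, 2, 3) for g in (1, 2, 3) if g >= r]   # 6 pairs
dha_ni = [(d, n) for d in (1, 2, 3) for n in (1, 2, 3) if n >= d]  # 6 pairs
ma_variants = (1, 2)                                               # Ma1, Ma2

print(len(ri_ga) * len(ma_variants) * len(dha_ni))  # 6 * 2 * 6 = 72
```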

1) Problem Description

Raga — the melodic framework found in Indian classical music (Carnatic and Hindustani) — is composed of a sequence of swaras depicting a mood and sentiment. Indian music has seven basic swaras, namely Sa, Ri, Ga, Ma, Pa, Da and Ni, in that linear order. There are thousands of Ragas in Carnatic music derived from the 72 Janaka (parent) ragas, also called Melakartha Ragas, which are formed by combinations of 12 swarasthanas with 16 swara variations. In this project we propose to classify the 72 Melakartha ragas; the scope is limited to Melakartha ragas. The problem is hard due to:
a. Absence of a fixed frequency for a note/Swaram
b. The relative scale of notes
c. Oscillations around a note
d. Improvisations

2) Data Processing

2.1) Domain Knowledge

When the project started, the team did not have enough knowledge of Carnatic music to begin working on it directly.
1. The team searched the internet to understand the basics of Carnatic music.
2. The team contacted many domain experts; Dr. Vyzarsu Balasubramanyam, CEO of Bhairavi Sangeet Academy, was one amongst them.
3. The team gained knowledge on the topics below.
a. Basics of Swarams/notes and the frequencies at which they are generated.
b. Structure of a Ragam.
c. Maximum length of a Ragam.
d. Theory behind octaves and talam.
e. Differences between Indian classical and Western notes.
f. Alankaras in Carnatic music and how they are an obstacle in the classification of Ragas.

2.2) Data Collection

Data for the 72 ragas of Carnatic music is not readily available in the form this project requires, so we collected data from www.youtube.com.
1. Links were collected and recorded in an Excel document against each Raga.
2. A Python program was written to automatically download the files and convert them into the required format (a sketch is given below). Its functions are:
a. The program accepts the Excel document as input and reads the links in sequence.
b. The program downloads only the audio from YouTube using youtube-dl, a Python package, and stores it on the drive. Downloaded files are in MP3 format.
c. MP3 is a compressed format that does not help in analysing the music file, so the program converts from MP3 to WAV format using ffmpeg, an audio processing tool.
d. These non-uniform-length files are then cut into 30 s files.

Ex: If the original file is 12 min long, it is cut into 30 s files, giving a total of 24 data samples for the respective Raga.
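A minimal sketch of this download-and-convert pipeline is given below. It is not the team's exact script: the spreadsheet name and its "raga"/"link" columns are hypothetical, and youtube-dl and ffmpeg are assumed to be installed.

```python
# Sketch: download audio per link, convert to WAV, split into 30 s clips.
import subprocess
import pandas as pd
import youtube_dl

links = pd.read_excel("raga_links.xlsx")  # hypothetical file/column names

for _, row in links.iterrows():
    # a./b. download only the audio from YouTube, as MP3
    opts = {
        "format": "bestaudio/best",
        "outtmpl": f"{row['raga']}.%(ext)s",
        "postprocessors": [{"key": "FFmpegExtractAudio",
                            "preferredcodec": "mp3"}],
    }
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download([row["link"]])
    # c./d. MP3 -> mono 22050 Hz WAV, segmented into 30 s files
    subprocess.run(
        ["ffmpeg", "-i", f"{row['raga']}.mp3",
         "-ar", "22050", "-ac", "1",
         "-f", "segment", "-segment_time", "30",
         f"{row['raga']}_%03d.wav"],
        check=True,
    )
```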

2.3) Data Cleaning

1. Data is cleaned by removing speech, applause and narration from every file.
2. All the audio files are converted to .wav format with 16-bit PCM encoding.
3. All files are converted to mono format, using a single channel.
4. All the 30 s files are encoded at the same sampling rate of 22050 Hz.
5. The music file length was fixed at 30 s, so file clipping and appending was done to maintain the same length (see the sketch below). This is also done using Python scripts.
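A sketch of this length normalisation, assuming librosa is used for loading (librosa resamples to mono 22050 Hz on load; `fix_length` clips or zero-pads, i.e. appends silence, to the target size):

```python
# Force every clip to exactly 30 s by clipping or zero-padding.
import librosa

SR = 22050             # sampling rate used throughout the project
TARGET_LEN = 30 * SR   # 30 s worth of samples

def load_fixed_length(path):
    y, _ = librosa.load(path, sr=SR, mono=True)          # 16-bit WAV -> float
    return librosa.util.fix_length(y, size=TARGET_LEN)   # clip or pad
```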

2.4) Feature Selection/Extraction

A music file is sampled at a defined rate and segmented into frames. The librosa package is used to analyse the file and extract features from it.
An example frame and sample calculation is given below.
● Length of music file = 10 s
● Sampling rate = 22050 Hz
● Total samples = 10 × 22050 = 220500
● Hop length (number of samples per frame) = 512 (a selectable parameter in librosa)
● Frame duration = hop length / sampling rate = 512/22050 ≈ 23.2 ms, which falls within the standard frame length of 20–40 ms
● Total number of frames in the music file = 220500/512 ≈ 430
The spectral centroid gives the mean frequency of each frame in the music file. Different Ragas are composed of different Swaras and their variations, and a frequency is associated with each variation of a Swaram, so a plot of the spectral centroid tells us how the frequency varies.
The spectral bandwidth gives the frequency bandwidth in every frame of the music file. The chromagram gives information about the pitch classes in each frame. Energy and RMS energy give the energy of the signal in each frame, i.e., information about the loudness of the signal.
MFCCs are a small set of features that concisely describe the overall shape of the spectral envelope; the Mel scale they are built on is derived from human perception of audio frequencies. The CQT is similar to the Fourier transform but, like the Mel scale, uses a logarithmically spaced frequency axis.
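A sketch of per-file feature extraction with librosa, covering the features described above. The function names are real librosa APIs; averaging over frames is one common way to get a fixed-length vector, not necessarily the team's exact choice.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, hop=512):
    y, _ = librosa.load(path, sr=sr, mono=True)
    feats = [
        librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop),
        librosa.feature.spectral_bandwidth(y=y, sr=sr, hop_length=hop),
        librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop),
        librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop),
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=128, hop_length=hop),
        librosa.feature.rms(y=y, hop_length=hop),
    ]
    # Each feature is (n_bins, n_frames); mean over frames gives one vector
    return np.concatenate([f.mean(axis=1) for f in feats])
```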

2.5) Exploratory Data Analysis

Below are the observations from the raw data analysis.
● Music files were collected for a total of 72 Ragas.
● The data contains only keerthanas, from different famous Carnatic music singers.
● The data does not contain any film songs, as they are a mix of Ragas and are mostly Janya Ragas.
● The audio files contain a mix of vocal and instrumental music.
● The audio files contain both male and female singers.
● The audio files are a mix of different voice qualities.
● Each source file is of a different length, so every file is cut into equal 30 s segments.
● The table below contains the number of samples used for each Raga.

From the above table we can see that the number of samples per Raga is not equal, so bias gets introduced into the model and per-class accuracy is affected.
● It is highly possible that some music files contain only instrumental music for long stretches.
● It is also possible that some music files contain long silence periods, or a large number of shorter silences.
● The above two factors increase the differences in the number of correct samples per Raga.
● This unequal count affects the per-class accuracy.
● It is also possible that an unintended 30 s clip is filed under a Raga.
● A model learning such unintended data as a Raga is wrong intelligence, and this ultimately affects the prediction capability of the model.

3) Approaches in Classifying Raga

Characteristics of Melakartha ragas:
1. Sampurna: Melakartha ragas always contain 7 notes (any 7 out of the 12).
2. Krama: all 7 notes are always in linear order.
3. Ekagunathva: the same 7 note variations are present in the Arohana and the Avarohana.
4. Melakartha ragas always start with the first note Sa, and the fifth note is always Pa.

3.1 Raga Classification by Raga Modelling
3.1.1 Sequential models:

It is clear from the above characteristics that Melakartha ragas have a linear order of notes over time. This is a clue to use sequential neural networks (RNN/LSTM).
Model Structure & Results:
Features considered are given below.
1. Spectral centroid
2. Spectral bandwidth
3. Mel-spectrogram
4. Mel frequency cepstral coefficients.
5. Chroma

The librosa package is used to extract these features from the music files.
The accuracy was low with any of the features used. Below are the observations on why the LSTM could not give the expected results.
● A 30 s file may not contain one full octave of the Raga, so there is a possibility of:
a. Repetition of the Raga, or
b. An incomplete Raga, or
c. The Raga in a different talam, or
d. The Ragam in a different octave, or
e. A mix of the above.
● The sequential approach assumes that every 30 s (or X s) music file contains the complete sequence of the Raga. Getting such a file from a keerthana recording is not possible, so the LSTM could not achieve better accuracy with any number of hidden units.

Model: in this model, samples were taken for 13 Ragas and the behaviour was observed.
Layer1 — Bidirectional LSTM — 1000 hidden units
Layer2 — Dense layer with 64 neurons and Relu activation
Layer3 — Dense layer with 128 neurons and Relu activation
Layer4 — Dense layer with 256 neurons and Relu activation
Layer5 — Dense layer with 512 neurons and Relu activation
Layer6 — Dense layer with 1024 neurons and Relu activation
Layer7 — Dense layer with 13 neurons and Softmax activation
Optimizer = Adam; loss function = categorical_crossentropy; epochs = 10; batch size = 10
Results: train accuracy = 98.4%; validation accuracy = 61.4%
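A hedged Keras sketch of the architecture listed above (the input shape depends on how the chosen features are framed; 13 output classes for the 13 Ragas in this run):

```python
from tensorflow.keras import Sequential, layers

def build_bilstm(timesteps, n_features, n_classes=13):
    model = Sequential([
        layers.Bidirectional(layers.LSTM(1000),
                             input_shape=(timesteps, n_features)),
        layers.Dense(64, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dense(1024, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```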

3.1.2 Deep Neural Network:

Features considered are given below.
1. Mel-spectrogram
2. Mel frequency cepstral coefficients.
3. Chroma
Model Structure & Results:
The following combinations were checked with deep neural networks.
1. MFCC with 128 coefficients alone as the feature set.
2. Started with a one-layer DNN model with relu activation and softmax at the output layer; results are below. L1 neurons = 256; batch size = 32; optimizer = Adam; batch norm = no

3. L1 neurons = 256, relu; L2 neurons = 256, relu; batch size = 32; optimizer = Adam; batch norm = no

4. L1 neurons = 256, relu; L2 neurons = 256, relu; batch size = 32; optimizer = Adam; batch norm = yes

5. L1 neurons = 256, relu; L2 neurons = 256, relu with dropout = 0.5; batch size = 32; optimizer = Adam; batch norm = yes

With the introduction of dropout in layer 2, it is clearly seen that the train and validation losses are closer.

6. L1 neurons = 256, relu with 0.5 dropout; L2 neurons = 256, relu; batch size = 32; optimizer = Adam; batch norm = yes

With the introduction of dropout in layer 1, it is observed that the train loss is higher than the validation loss.
7. L1 neurons = 256, relu with 0.5 dropout; L2 neurons = 256, relu with 0.5 dropout; batch size = 32; optimizer = Adam; batch norm = yes

8. Tested different dropout values and observed that train and validation losses are higher with higher dropout values; losses reduce with more epochs and lower dropout values. So a dropout of 0.5 was chosen, with more epochs during training.
9. After freezing the 0.5 dropout at layer 2, we started reducing the training time by increasing the batch size. The training time reduced from 4 s at batch size 32 to 4 µs at batch size 2048.
10. It was also observed that increasing the batch size improved model performance while reducing the number of epochs needed. Plots of the same are given below.

With Batch size = 512

11. Once the batch size was frozen, we started reducing the test time by optimising the number of neurons used.
12. The model currently has two dense layers. The number of neurons at layer 1 was reduced from 256 towards 16, keeping layer 2 at 256 neurons. In this scenario the performance of the model degraded once layer 1 dropped below 128 neurons. The plot below shows layer 1 = 16, layer 2 = 256.

13. Similarly, the number of neurons at layer 2 was reduced from 256 towards 16, keeping layer 1 at 256 neurons. In this scenario performance degraded below 128 neurons, more than in the previous point. The plot below shows layer 2 = 16, layer 1 = 256.

14. Different combinations of neurons at layer 1 and layer 2 were tried to reduce the total test time. From these combinations it was concluded that, instead of 256 neurons at layer 1 and 256 at layer 2, better performance can be achieved by adding layers with fewer neurons. Frozen configuration: layer 1 = 256, layer 2 = 128, layer 3 = 32, layer 4 = 32; with 0.5 dropout; batch size = 2048; optimizer = Adam; loss function = categorical cross-entropy.
15. We then started changing the activation functions; the results are below.

Sigmoid as activation function at all layers.

16. The final DNN model parameters are given below (a Keras sketch follows at the end of this list).
a. Layer 1 = 256 neurons; activation = ELU; dropout = 0.5
b. Layer 2 = 128 neurons; activation = ELU; no dropout
c. Layer 3 = 32 neurons; activation = ELU; no dropout
d. Layer 4 = 32 neurons; activation = ELU; no dropout
e. Optimizer = Adam
f. Loss function = categorical cross-entropy
g. Output layer = softmax
17. The results were not exciting when mel-spectrogram features were fed as input to the frozen model, even with a high number of epochs.

18. The results were even worse when chroma was used as the input feature.
After 5000 epochs:
Train loss = 1.208
Train accuracy = 0.65
Validation loss = 1.4091
Validation accuracy = 0.638
19. By including MFCC in the feature set, along with others or alone, the train and validation results are exciting.
20. Link for the DNN code:
21. The tables below have the final results on test data, i.e., the predictability of the model.
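As noted in item 16, here is a hedged Keras sketch of the frozen DNN, with `n_features` being the length of the input feature vector (e.g. 128 MFCC coefficients):

```python
from tensorflow.keras import Sequential, layers

def build_dnn(n_features, n_classes=72):
    model = Sequential([
        layers.Dense(256, activation="elu", input_shape=(n_features,)),
        layers.Dropout(0.5),
        layers.Dense(128, activation="elu"),
        layers.Dense(32, activation="elu"),
        layers.Dense(32, activation="elu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# e.g. model = build_dnn(128); model.fit(X, y, batch_size=2048, epochs=...)
```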

3.1.3 Logistic Regression

After the observations from the DNN, the team tried logistic regression; results are below. The input feature used is MFCC. Logistic regression could not perform well even with different hyperparameters, so this model was dropped.

3.1.4 SVM

The team tried an SVM model with different hyperparameters; results are below.
1. Results are very poor, and the model overfits, when scaling is not done.
No scaling: train score = 1 and test score = 0.05
With scaling: train score = 0.961 and test score = 0.92
2. It was also found that the RBF kernel performs better than linear kernels.
The RBF kernel was tested for different C values; results are below. We started with C = 0.1 and tested up to C = 1000. From the results we understood at what C value the model's scores freeze.

The train score is frozen from C = 4, and the test score is frozen from C = 6. For any C value above 6, the train and test scores remained the same.
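A sketch of this sweep with scikit-learn, assuming `X_train`/`X_test` and `y_train`/`y_test` hold the MFCC feature vectors and Raga labels from the earlier train/validation split:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# RBF-kernel SVM with feature scaling; sweep C to find where scores freeze
for C in (0.1, 1, 2, 4, 6, 10, 100, 1000):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C))
    clf.fit(X_train, y_train)
    print(C, clf.score(X_train, y_train), clf.score(X_test, y_test))
```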

3.1.5 KNN

1. The team tried the KNN model with scaled and non-scaled inputs. Results were very good under both scenarios.
2. We tried different numbers of neighbours; the plot below shows the results.
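A sketch of the neighbour sweep (same assumed `X`/`y` splits as in the SVM sketch above):

```python
from sklearn.neighbors import KNeighborsClassifier

# Sweep the number of neighbours and record train/test scores
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
```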

3.1.6 CNNs, CNN & RNN, CNN & LSTM, CNN & GRUs

The team tried CNNs and multiple combinations of CNNs with sequential models. The features used are listed below.

Features used: tempogram, mel-spectrogram, MFCC, chroma_stft, chroma_cqt, chroma_cens, RMS, RMSE, spectral_centroid, spectral_bandwidth, spectral_contrast, spectral_flatness, spectral_rolloff, tonnetz, zero_crossing_rate.

3.1.6.1 CNN on raw data

Layer1 — Conv1D, 16 filters of size 3, Relu activation
Layer2 — Batch Normalization
Layer3 — MaxPooling1D, pool_size 4
Layer4 — Conv1D, 64 filters of size 3, Relu activation
Layer5 — Batch Normalization
Layer6 — MaxPooling1D, pool_size 4
Layer7 — Conv1D, 128 filters of size 3, Relu activation
Layer8 — Batch Normalization
Layer9 — MaxPooling1D, pool_size 4
Layer10 — Flatten
Layer11 — Dense layer with 90 neurons and Relu activation
Layer12 — Dense layer with 30 neurons and Relu activation
Layer13 — Dense layer with 5 neurons and Softmax activation
Optimizer = Adam; loss function = categorical_crossentropy; epochs = 10; batch size = 10
Results: train accuracy = 50%; validation accuracy = 35%.
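A hedged Keras sketch of this Conv1D stack (5 output classes for the Ragas used in this run; raw waveforms are reshaped to (samples, 1)). The hybrid variants in 3.1.6.3–3.1.6.5 follow the same pattern, with a SimpleRNN/LSTM/GRU layer inserted before Flatten.

```python
from tensorflow.keras import Sequential, layers

def build_cnn1d(input_len, n_classes=5):
    model = Sequential([
        layers.Conv1D(16, 3, activation="relu", input_shape=(input_len, 1)),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=4),
        layers.Conv1D(64, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=4),
        layers.Conv1D(128, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=4),
        layers.Flatten(),
        layers.Dense(90, activation="relu"),
        layers.Dense(30, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```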

3.1.6.2 CNN with 15 features

Layer1 — Reshape
Layer2 — Conv2D, 16 filters of size 3×3, Relu activation
Layer3 — Batch Normalization
Layer4 — MaxPooling2D, pool_size 4×4
Layer5 — Conv2D, 64 filters of size 3×3, Relu activation
Layer6 — Batch Normalization
Layer7 — MaxPooling2D, pool_size 4×4
Layer8 — Conv2D, 128 filters of size 3×3, Relu activation
Layer9 — Batch Normalization
Layer10 — MaxPooling2D, pool_size 4×4
Layer11 — Flatten
Layer12 — Dense layer with 90 neurons and Relu activation
Layer13 — Dense layer with 30 neurons and Relu activation
Layer14 — Dense layer with 5 neurons and Softmax activation
Optimizer = Adam; loss function = categorical_crossentropy; epochs = 10; batch size = 10
Results: train accuracy = 86%; train loss = 0.39; validation accuracy = 70%; validation loss = 0.7.

3.1.6.3 CNN and RNN with 15 features

Layer1 — Conv1D, 16 filters of size 3, Relu activation
Layer2 — Batch Normalization
Layer3 — MaxPooling1D, pool_size 4
Layer4 — Conv1D, 64 filters of size 3, Relu activation
Layer5 — Batch Normalization
Layer6 — MaxPooling1D, pool_size 4
Layer7 — Conv1D, 128 filters of size 3, Relu activation
Layer8 — Batch Normalization
Layer9 — MaxPooling1D, pool_size 4
Layer10 — SimpleRNN with 300 neurons
Layer11 — Flatten
Layer12 — Dense layer with 90 neurons and Relu activation
Layer13 — Dense layer with 30 neurons and Relu activation
Layer14 — Dense layer with 5 neurons and Softmax activation
Optimizer = Adam; loss function = categorical_crossentropy; epochs = 10; batch size = 10
Results: train accuracy = 98%; train loss = 0.07; validation accuracy = 65%; validation loss = 1.15.

3.1.6.4 CNN and LSTM with 15 features

Layer1 — Reshape
Layer2 — Conv2D, 16 filters of size 3×3, Relu activation
Layer3 — Batch Normalization
Layer4 — MaxPooling2D, pool_size 4×4
Layer5 — Conv2D, 64 filters of size 3×3, Relu activation
Layer6 — Batch Normalization
Layer7 — MaxPooling2D, pool_size 4×4
Layer8 — Conv2D, 128 filters of size 3×3, Relu activation
Layer9 — Batch Normalization
Layer10 — MaxPooling2D, pool_size 4×4
Layer11 — Reshape
Layer12 — LSTM with 16 neurons
Layer13 — Flatten
Layer14 — Dense layer with 90 neurons and Relu activation
Layer15 — Dense layer with 30 neurons and Relu activation
Layer16 — Dense layer with 5 neurons and Softmax activation
Optimizer = Adam; loss function = categorical_crossentropy; epochs = 10; batch size = 10
Results: train accuracy = 43%; train loss = 1.27; validation accuracy = 48%; validation loss = 1.27.

3.1.6.5 CNN and GRU with 15 features

Layer1 — Conv1D, 16 filters of size 3, Relu activation
Layer2 — Batch Normalization
Layer3 — MaxPooling1D, pool_size 4
Layer4 — Conv1D, 64 filters of size 3, Relu activation
Layer5 — Batch Normalization
Layer6 — MaxPooling1D, pool_size 4
Layer7 — Conv1D, 128 filters of size 3, Relu activation
Layer8 — Batch Normalization
Layer9 — MaxPooling1D, pool_size 4
Layer10 — GRU with 300 neurons
Layer11 — Flatten
Layer12 — Dense layer with 90 neurons and Relu activation
Layer13 — Dense layer with 30 neurons and Relu activation
Layer14 — Dense layer with 5 neurons and Softmax activation
Optimizer = Adam; loss function = categorical_crossentropy; epochs = 10; batch size = 10
Results: train accuracy = 95%; train loss = 0.19; validation accuracy = 50%; validation loss = 1.88.

3.2 Raga Classification by Swaram Modelling

This approach is based on the fact that a Raga is composed of notes of multiple variations.
Requirements for ideal data:
1. Music files should contain only vocals, without any instruments (not even a Shruthi box).
2. Music files should be available for every note with its variations.
3. Music files should be available for every note in different octaves.
4. Music files should be available for every note with a Gamaka mix.
5. Music files should be available for every note in different talams.
Availability of data:
1. Ragams from different online music classes were collected.
2. The note portions were extracted from these files.
Method:
1. A model has to be designed which classifies the 16 Swaram/note variations.
2. A window of defined duration is extracted from the test file and fed to this model to identify the Swaram.
3. By sliding this window, the complete test audio file is fed to the model to identify all the Swarams in it.
4. From this we get every Swaram and its time point in the test audio file.
5. Based on the sequence of the Swarams, the Raga is identified (a sketch of this search is given below).
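A hedged sketch of the sliding-window search from the method above; `swaram_model` (a trained 16-class classifier) and `featurize` (its feature extractor) are hypothetical, as is the 0.5 s window duration.

```python
import numpy as np
import librosa

SR = 22050
WIN = int(0.5 * SR)   # assumed window duration of 0.5 s
HOP = WIN // 2        # assumed 50% overlap between windows

def swaram_sequence(path, swaram_model, featurize):
    y, _ = librosa.load(path, sr=SR, mono=True)
    events = []
    for start in range(0, len(y) - WIN + 1, HOP):
        window = y[start:start + WIN]
        probs = swaram_model.predict(featurize(window)[np.newaxis])[0]
        events.append((start / SR, int(probs.argmax())))  # (time s, swaram id)
    return events  # the swaram order is then matched against Raga scales
```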
Implementation Steps:
1. Amplitude information is extracted from every note music file.
2. The spectral centroid (the mean frequency of each frame, as described in section 2.4) is plotted for each note file; since a frequency is associated with each variation of a Swaram, the plot shows how the frequency varies. Sample plots of swarams are given below.

Figures 1 and 2 above clearly show the difference in frequency variation between swarams. When these music files were modelled, the results were not as expected, because:
1. The required number of music files is not available.
2. The files were manually clipped from a Ragam file, so uniformity across the files is lost.
3. Since the required number of training files is not available for the available note variations, this could not be converted into a model. This method is therefore parked and will be considered in future work.

4) Comparison to the Benchmark

● We used the CompMusic dataset to understand the prediction capability of the model. Even though the model results were very good during training and validation, the model failed to classify when this new dataset was given.
● From this we understood that the model had not learnt multiple variations of the data.
● We also understood that we could not obtain these different variations of the data due to a lack of domain knowledge in understanding the variations.
● But it is very clear to us that the DNN model produced very good results when compared to models used in earlier works.
● It was also observed that the SVM model is good enough to use.

5) Implications & Limitations

● We observed in practice that multiple variations of the data are very important for the model to build intelligence.
● As mentioned in section 4, data with multiple variations was not available to train the model, so the model's predictability is limited to only certain categories of music.

6) Our Learnings

● The team understood the step-by-step tuning of the hyperparameters of different models such as CNN, DNN, LSTM, SVM and KNN, and observed how each parameter contributes to reducing the loss and controlling overfitting.
● The topic was new to the team: the team learnt about data cleaning, feature extraction and feature selection.
● We also observed in practice how different features contribute to the loss and accuracy of the model.
● We understood that sometimes a feature set can't be used directly as-is; we have to derive statistical parameters from a feature to tune accuracy, optimisation and run time.

7) References

7.1) Carnatic Music

http://carnatica.net/origin.htm
http://www.ragasurabhi.com/carnatic-music/ragas.html
https://brilliant.org/wiki/mathematics-of-music/
https://audiogyan.com/2017/06/13/vidyadhar-oke-part1/
https://sunson.livejournal.com/161455.html
http://www.carnaticcorner.com/articles/22_srutis.htm
https://compmusic.upf.edu/iam-tonic-dataset
https://sites.google.com/site/kalpsangeethasabha/ragas/Melakarta-Ragas
http://swaranidhi.org/egl_72melakartharagalu.html
http://www.melakarta.com/index.html
● Book: Facets of Learning of Carnatic Music, compiled by Dr Vyzarsu Balasubrahmanyam.

7.2) Audio Code

https://www.analyticsvidhya.com/blog/2017/08/audio-voice-processing-deep-learning/
https://librosa.github.io/librosa/
http://myinspirationinformation.com/uncategorized/audio-signals-in-python/
https://musicinformationretrieval.com/stft.html
https://towardsdatascience.com/urban-sound-classification-part-1-99137c6335f9
https://towardsdatascience.com/music-genre-classification-with-python-c714d032f0d8

7.3) Audio Science

https://www.youtube.com/watch?v=_FatxGN3vAM
http://practicalcryptography.com/miscellaneous/machine-learning/intuitive-guide-discrete-fourier-transform/
http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/

7.4) Dataset

The Excel document at the link below contains the YouTube links used for each Raga.
https://docs.google.com/spreadsheets/d/1bDwSs98o1kBjsVUsxGTRQcLQ_oaFWl_1ZUudSeMcD8Y/edit#gid=0
