Music Genre Classification

Viren Dhanwani · Published in Analytics Vidhya
3 min read · Mar 3, 2020

Photo by Natalie Cardona on Unsplash

After doing small projects on Sentiment Analysis and Clustering, I wanted to do a project on Classification. Image Classification with a CNN, you might say. But I wanted to do something different: enter Music Genre Classification.

After searching for data sets on Kaggle, I found one containing song features pulled from Spotify, such as Genre, Popularity, Danceability, Valence and Tempo.
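To follow along without the Kaggle file, here is a tiny stand-in frame with the same kinds of columns (the column names are my assumption, based on the Spotify features listed above; the real dataset would be loaded with `pd.read_csv`):

```python
import pandas as pd

# Tiny stand-in for the Kaggle Spotify dataset; column names mirror
# the features mentioned above (names are assumed, not from the source).
df = pd.DataFrame({
    'genre':        ['Pop', 'Rock', 'A Capella'],
    'artist_name':  ['A', 'B', 'C'],
    'track_name':   ['t1', 't2', 't3'],
    'track_id':     ['id1', 'id2', 'id3'],
    'popularity':   [80, 55, 10],
    'danceability': [0.8, 0.5, 0.3],
    'valence':      [0.7, 0.4, 0.2],
    'tempo':        [120.0, 140.0, 90.0],
})

print(df.head())
```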

df.head()

List of all the columns in the dataset

df.columns

Now let's analyze the genre column, the one we want to predict.

We see that the ‘A Capella’ genre has far fewer songs than the other genres, causing class imbalance. We will remove it later so that it doesn’t affect the performance of the classifiers.
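A `value_counts()` call is one quick way to see the imbalance (a toy genre column here; the real dataset has many more genres and songs):

```python
import pandas as pd

# Toy genre column illustrating the imbalance check; in the real dataset
# 'A Capella' has far fewer songs than the other genres.
genres = pd.Series(['Pop'] * 5 + ['Rock'] * 4 + ['A Capella'] * 1, name='genre')
counts = genres.value_counts()
print(counts)

# The rarest class is a small fraction of the most common one.
print(counts.min() / counts.max())  # 0.2 here
```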

Next, we check for missing values and duplicate values. The check for missing values returns 0, which tells us the data set is complete. For duplicates, there was exactly one exception(al) song.
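The checks themselves are one-liners in pandas (sketched here on a toy frame with a single duplicated row, mirroring the one duplicate in the real data):

```python
import pandas as pd

# Toy frame with one duplicated row, mirroring the single duplicate
# found in the real dataset.
df = pd.DataFrame({
    'genre': ['Pop', 'Rock', 'Rock'],
    'tempo': [120.0, 140.0, 140.0],
})

# Missing values per column (all zeros means the data set is complete).
print(df.isnull().sum())

# Number of duplicated rows; drop them before modelling.
print(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
```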

This is another statistic that can have serious implications for the performance of the classifiers.

To make our data ready for our classification we would do the following:

  • Remove columns like artist name, track name and track id, which are in no way related to predicting the genre of a song
unused_col = ['artist_name', 'track_name', 'track_id']
df = df.drop(columns=unused_col).reset_index(drop=True)
  • Remove the ‘A Capella’ genre from the data set
df = df[df['genre'] != 'A Capella']
  • Change categorical values into numerical or boolean values
mode_dict = {'Major': 1, 'Minor': 0}
# Chromatic scale mapped C=1 through B=12
key_dict = {'C': 1, 'C#': 2, 'D': 3, 'D#': 4, 'E': 5, 'F': 6,
            'F#': 7, 'G': 8, 'G#': 9, 'A': 10, 'A#': 11, 'B': 12}

# Keep only the numerator of the time signature, since the denominator is common
df['time_signature'] = df['time_signature'].apply(lambda x: int(x[0]))
df['mode'] = df['mode'].replace(mode_dict)
df['key'] = df['key'].replace(key_dict).astype(int)

The data is now ready for classification. I decided to use three classifiers: Logistic Regression, Random Forest and Decision Tree. For each of these, the data was split into 70% training data and 30% testing data.
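The split follows the standard scikit-learn pattern (synthetic features stand in for the Spotify columns here, so the snippet runs on its own):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and genre labels.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)

# 70% train / 30% test, as in the experiments below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)  # (700, 10) (300, 10)
```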

First, Logistic Regression. This model is best suited to binary classification, but I still decided to go for it, since scikit-learn gives an option for multi-class classification.

lr_model = LogisticRegression(multi_class = 'multinomial', solver='lbfgs', max_iter=500, verbose=1)

Unfortunately this only yielded an accuracy of 11% :(
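Fitting and scoring follow the usual scikit-learn pattern; the sketch below uses synthetic data, so its accuracy won't match the 11% above (note that with the `lbfgs` solver, multinomial handling is the default in recent scikit-learn versions, so `multi_class` can be omitted):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Spotify features and genre labels.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# lbfgs handles multinomial targets by default.
lr_model = LogisticRegression(solver='lbfgs', max_iter=500)
lr_model.fit(X_train, y_train)

print(accuracy_score(y_test, lr_model.predict(X_test)))
```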

Second, the Random Forest Classifier. This can be thought of as an ensemble of Decision Trees. In most cases the Random Forest classifier has better accuracy than a single Decision Tree classifier; let’s see what happens in our case.

rfc_model = RandomForestClassifier(n_estimators=50, random_state=5, verbose=1)

This yields an accuracy of 37%: not good enough, but much better than Logistic Regression.

Lastly, Decision Tree Classifier.

dt_model = DecisionTreeClassifier(max_depth=10, random_state=20)

Apparently, this also yields an accuracy of 37%. So it turns out that Random Forest and Decision Tree perform the same on our data.
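The head-to-head comparison is easy to reproduce in spirit; on synthetic data the two won't necessarily tie, but the pattern is the same:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Spotify features and genre labels.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rfc_model = RandomForestClassifier(n_estimators=50, random_state=5)
dt_model = DecisionTreeClassifier(max_depth=10, random_state=20)

# Fit both on the same split and compare test accuracy.
for name, model in [('random forest', rfc_model), ('decision tree', dt_model)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```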

In the end, this was a small step for me into the world of classifiers. The performance of these classifiers can be improved by tuning their parameters. I will dive into that later and write another post about it, as it’s a whole different thing to step into.
