Classification of Music into different Genres using Keras

We’ll extract the audio features explained in the blog here, and use these features to classify music clips into the genres present in our training set.

[Image: Classification after extracting features]

We’ll use the GTZAN genre collection dataset. If that site doesn’t work, you can get the dataset from here. The dataset consists of 10 genres, and each genre contains 100 music clips, each 30 seconds long.

We’ll be using the Keras package, which uses TensorFlow as its backend.

import librosa
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import os
import csv
# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import keras

Install all the packages required and import them.

Important note:

Check your TensorFlow version: it should be greater than 1.1, otherwise various Keras features will fail.

pip may not fetch the latest TensorFlow version by default, so install TensorFlow using this command:

pip install --ignore-installed --upgrade tensorflow

Creating Dataset

We’ll process the dataset to fit our requirements and create a CSV file containing the features we need.

header = 'filename chroma_stft rmse spectral_centroid spectral_bandwidth rolloff zero_crossing_rate'
for i in range(1, 21):
    header += f' mfcc{i}'
header += ' label'
header = header.split()

Here we are generating headers for our CSV file.

If you have read the feature-extraction blog, you’ll know we get 20 MFCCs for the given sampling rate; each coefficient is calculated per frame, so after averaging over the frames the MFCCs contribute 20 columns.
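Each feature librosa returns is a matrix with one value per frame, and we collapse it to one number per row with np.mean. A minimal sketch of that averaging step, using a random matrix in place of a real MFCC output:

```python
import numpy as np

# Stand-in for librosa.feature.mfcc output: 20 coefficients x ~1293 frames
# (roughly what a 30-second clip produces at librosa's default hop length).
mfcc = np.random.rand(20, 1293)

# Average each coefficient over all frames -> one value per coefficient.
mfcc_means = [np.mean(row) for row in mfcc]
print(len(mfcc_means))  # 20 values, one per MFCC column in the CSV
```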

Now, we’ll calculate all the features.

file = open('data.csv', 'w', newline='')
with file:
    writer = csv.writer(file)
    writer.writerow(header)
genres = 'blues classical country disco hiphop jazz metal pop reggae rock'.split()
for g in genres:
    for filename in os.listdir(f'./genres/{g}'):
        songname = f'./genres/{g}/{filename}'
        y, sr = librosa.load(songname, mono=True, duration=30)
        chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)
        rmse = librosa.feature.rmse(y=y)  # renamed librosa.feature.rms in librosa >= 0.7
        spec_cent = librosa.feature.spectral_centroid(y=y, sr=sr)
        spec_bw = librosa.feature.spectral_bandwidth(y=y, sr=sr)
        rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
        zcr = librosa.feature.zero_crossing_rate(y)
        mfcc = librosa.feature.mfcc(y=y, sr=sr)
        to_append = f'{filename} {np.mean(chroma_stft)} {np.mean(rmse)} {np.mean(spec_cent)} {np.mean(spec_bw)} {np.mean(rolloff)} {np.mean(zcr)}'
        for e in mfcc:
            to_append += f' {np.mean(e)}'
        to_append += f' {g}'
        file = open('data.csv', 'a', newline='')
        with file:
            writer = csv.writer(file)
            writer.writerow(to_append.split())

We’ve calculated all the features using the librosa package, created a dataset in the data.csv file, and written each clip’s feature values under the headers defined above.

Preprocessing Dataset

data = pd.read_csv('data.csv')
# Dropping unnecessary columns
data = data.drop(['filename'],axis=1)

‘Filename’ column is not required.

Now, we’ll encode the genres as integers.

genre_list = data.iloc[:, -1]
encoder = LabelEncoder()
y = encoder.fit_transform(genre_list)
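LabelEncoder assigns integers to classes in sorted (alphabetical) order, so ‘blues’ becomes 0 and ‘rock’ becomes 9. A minimal pure-Python equivalent of the mapping it builds:

```python
genres = 'blues classical country disco hiphop jazz metal pop reggae rock'.split()

# LabelEncoder sorts the unique labels and numbers them from 0.
mapping = {g: i for i, g in enumerate(sorted(set(genres)))}
print(mapping['blues'], mapping['rock'])  # 0 9
```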

Here, we created a mapping between genres and integers; each integer represents a specific genre.

[Image: Music genre label mapping]

scaler = StandardScaler()
X = scaler.fit_transform(np.array(data.iloc[:, :-1], dtype = float))

Here, X is standardized: each feature column has its mean subtracted and is divided by its standard deviation, giving zero mean and unit variance per column.
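The same transform StandardScaler applies can be sketched directly in numpy:

```python
import numpy as np

X_raw = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])

# Standardize per column: subtract the mean, divide by the standard deviation.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
print(X_std.mean(axis=0))  # ~[0. 0.]
print(X_std.std(axis=0))   # [1. 1.]
```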

Splitting the dataset into training and testing dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

We’ve split the dataset into training and testing sets in an 80:20 ratio.
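train_test_split shuffles the rows and carves off 20% for testing. The idea can be sketched with plain numpy, using random stand-in data shaped like ours (GTZAN’s 1000 clips, and 26 features: 6 summary features plus 20 MFCCs):

```python
import numpy as np

X = np.random.rand(1000, 26)          # 1000 clips, 26 features
y = np.random.randint(0, 10, 1000)    # genre labels 0..9

# Shuffle the row indices, then take the first 80% for training.
idx = np.random.permutation(len(X))
split = int(0.8 * len(X))
X_train, X_test = X[idx[:split]], X[idx[split:]]
y_train, y_test = y[idx[:split]], y[idx[split:]]
print(X_train.shape, X_test.shape)  # (800, 26) (200, 26)
```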

Model creation, training and testing

from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_shape=(X_train.shape[1],)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

We’ll be using a Keras Sequential model.

There are 4 layers in our network. The first is the input layer, so its input size has to be given. Then come 2 hidden layers, and the last layer is the output layer. The value passed to Dense is the dimension of the layer’s output space: the 1st layer has 256 neurons, so its output space is 256-dimensional, and the output layer has 10 neurons because we are classifying into 10 genres.
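As a sanity check on the architecture: each Dense layer holds inputs × outputs weights plus one bias per output. Assuming the 26 input features described above (6 summary features plus 20 MFCCs), the network’s parameter count works out as:

```python
# Dense layer parameters = inputs * outputs + outputs (biases).
layer_sizes = [26, 256, 128, 64, 10]  # input features, then the 4 Dense layers

params = [n_in * n_out + n_out
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(params)       # [6912, 32896, 8256, 650]
print(sum(params))  # 48714
```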


Before starting training we need to configure how the model should be trained. The optimizer is the optimization algorithm to be used; we’ll use Adam, which is widely used in deep learning (to know more about the Adam optimizer, read this blog). The loss is the function by which we evaluate the network; the model’s goal is to minimize it. The metric is what we evaluate during training; we’ll track accuracy. The loss function sparse_categorical_crossentropy is similar to categorical_crossentropy, but is used when the class labels are given as integers rather than one-hot vectors.
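Putting that configuration together, the compile step would look something like this (a sketch in the standard Keras form, with the network rebuilt for self-containment and the 26 input features assumed):

```python
from keras import models, layers

# Rebuild the network from the previous section (26 input features assumed).
model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_shape=(26,)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

# Configure training: Adam optimizer, cross-entropy loss on integer labels,
# accuracy reported during training.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```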

history = model.fit(X_train, y_train, epochs=20, batch_size=128)  # epoch count and batch size are typical choices

Using the fit function, we train the model on the given training inputs and outputs.

test_loss, test_acc = model.evaluate(X_test,y_test)
print('test_acc: ',test_acc)

The accuracy tells us how well the model can predict the genre of a given music clip from the extracted features. This model achieved an accuracy of 67%, which is not that good, but we can modify the model to achieve higher accuracy.

predictions = model.predict(X_test)

The predict function gives us, for each clip, the probability that it matches each genre. The genre with the highest probability is our final result, which is obtained with argmax.
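A quick sketch of that last step, using a made-up row of predict output (probabilities over the 10 genres):

```python
import numpy as np

# One row of model.predict output: probabilities over the 10 genres.
probs = np.array([0.02, 0.05, 0.01, 0.03, 0.60, 0.04, 0.05, 0.10, 0.06, 0.04])

predicted_genre = np.argmax(probs)  # index of the highest probability
print(predicted_genre)  # 4
```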

You can find the code here.

