Audio Classification using Librosa and Pytorch

Hasith Sura
6 min read · Jan 20, 2020


Introduction

With the advancement of deep learning research, its applications to audio data have become numerous: Audio Classification, Audio Source Separation, Music Transcription and more. Let’s take a look at how features are extracted from audio and how Audio Classification is performed.

In this article, we will walk through the process of Audio Classification in the following steps:

  1. Understanding Audio Data
  2. Look at the Dataset
  3. Data Preprocessing
  4. Loading data in Pytorch
  5. Building Our Model
  6. Training
  7. Results

Understanding Audio Data

Sound is the term used to describe what is heard when sound waves pass through a medium to the ear. All sounds are made by the vibrations of the molecules of the medium through which the sound travels.

The amplitude of the vibrations determines the loudness of the sound, and the frequency is the number of vibration cycles per second. You can hear how sound waves of different, constant frequencies sound on the following website.

The sounds we normally hear do not consist of sound waves of just one frequency but of multiple frequencies: sound waves of different, constant frequencies interact with each other to produce a new sound wave, which is the sum of all those waves. This is an important concept for Data Preprocessing.
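
A minimal NumPy sketch of this idea, summing two pure tones (example frequencies chosen only for illustration) into one composite wave:

```python
import numpy as np

sr = 44100                                   # samples per second
t = np.linspace(0, 1.0, sr, endpoint=False)  # 1 second of time stamps

# Two pure tones at example frequencies of 440 Hz and 880 Hz
wave_440 = 0.5 * np.sin(2 * np.pi * 440 * t)
wave_880 = 0.3 * np.sin(2 * np.pi * 880 * t)

# The sound we hear is the sum of the individual waves
composite = wave_440 + wave_880
```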

Sound waves are digitally stored by sampling them at discrete intervals. A sample denotes the amplitude of the sound wave at a specific point of time.

A sound wave (blue) represented digitally (red). Each red dot represents the normalized value of a sample. Source: digitalsoundandmusic.com

The number of samples taken per second is called the sampling rate. Audio CDs are sampled at 44.1 kHz, which means 44100 samples are taken per second. The bit depth determines the range of values a sample can take. Typically 16 bits are used, which gives a range of -32768 to +32767. The higher the bit depth, the more precisely each sample is represented. The value of a sample is zero when there is no sound. Positive and negative values represent high and low pressure, i.e., compressions and rarefactions.
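
As a quick illustration (not from the original article), 16-bit integer samples can be mapped onto the normalized -1.0 to +1.0 range, the same range librosa works with later, simply by dividing by 32768:

```python
import numpy as np

# Hypothetical buffer of 16-bit PCM samples
pcm_samples = np.array([0, 16384, -16384, 32767, -32768], dtype=np.int16)

# Divide by 32768 to map the 16-bit range onto [-1.0, +1.0)
normalized = pcm_samples.astype(np.float32) / 32768.0
print(normalized)  # [ 0.   0.5  -0.5   0.99997  -1. ]
```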

Look at the Dataset

For this, we will use the Environmental Sound Classification (ESC-50) dataset. The dataset can be downloaded from here.

The dataset consists of 40 audio files for each of the 50 categories, making a total of 2000 audio files. The 50 categories are shown below.

source : https://github.com/karolpiczak/ESC-50

Each audio file is sampled at 44.1 kHz with a single audio channel.

Let’s take a look at the waveplot of an audio file from the Clock tick category.

The amplitudes are periodically high, which intuitively makes sense since clock ticks are produced periodically. Waveplots of some audio files from other categories are shown below.

Data Preprocessing

Librosa is a Python package for audio and music analysis. We will use librosa to load audio and extract features. By default, librosa’s load function resamples the audio to 22.05 kHz, normalizes the data so that sample values lie between -1.0 and +1.0, and converts stereo (two channels) to mono (one channel). We specify sr=None to keep the native sampling rate of 44.1 kHz.
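
A minimal sketch of loading a file with librosa; the file path is just a placeholder:

```python
import librosa

# Default behaviour: resample to 22.05 kHz, convert to mono, values in [-1.0, +1.0]
samples, sr = librosa.load('audio/example.wav')
print(sr)  # 22050

# sr=None keeps the file's native sampling rate (44.1 kHz for this dataset)
samples, sr = librosa.load('audio/example.wav', sr=None)
print(sr)  # 44100
```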

Extract Features

It is very difficult to classify audio directly from raw samples, so we extract features that make the classification easier.

Each audio clip is a mix of sound waves of multiple frequencies.

A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time.

A spectrogram shows which frequencies are active at a particular time. It is a graph with time on the x-axis and frequency on the y-axis. The intensity of a pixel in a spectrogram image indicates the amplitude of a particular frequency at a particular time.

To construct a spectrogram, the audio data is sliced into overlapping windows of small time frames and the Fourier Transform is applied to each window. The Fourier Transform decomposes a signal into its constituent frequencies.
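
Although the article does not show this step on its own, the underlying short-time Fourier transform can be computed with librosa along these lines (the file path is a placeholder):

```python
import numpy as np
import librosa

samples, sr = librosa.load('audio/example.wav', sr=None)

# Slice the signal into overlapping 2048-sample windows, advancing 512 samples
# each step, and apply the Fourier Transform to every window
stft = librosa.stft(samples, n_fft=2048, hop_length=512)

# Magnitude spectrogram: rows are frequency bins, columns are time frames
spectrogram = np.abs(stft)
```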

A spectrogram shows frequencies on a linear scale, but our ear discriminates between lower frequencies much better than between higher ones. So we transform the spectrogram’s frequency axis from the linear scale to the mel scale, which aims to mimic the non-linear human perception of sound. The resulting spectrogram is called a Mel Spectrogram. The conversion to the mel scale is performed using mel filters: the linear-scale frequency bins are multiplied by the mel filters to obtain the mel-scale bins.

We also perceive loudness on a logarithmic scale, so we convert the amplitudes to the decibel scale.

To understand Mel Spectrogram in-depth and how it is calculated, check the following links.

sr = None specifies that librosa should load the audio at its native sampling rate of 44.1 kHz. Next, the first 5 seconds of the given audio are extracted. A window of 2048 samples (approximately 46 ms) is chosen, with a hop_length of 512 samples, which means the window is advanced by 512 samples to obtain the next time frame. The number of mel filters is 128, which makes the height of the spectrogram image 128. fmin and fmax are the lowest and highest frequencies respectively; the mel filters are calculated in such a way that the frequencies between fmin and fmax are projected onto the mel scale.

While constructing the Mel Spectrogram, librosa squares the magnitude of the spectrogram, producing a power spectrogram. So we use power_to_db to convert the power values to decibels. top_db is used to threshold the output.
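
A sketch of the feature extraction described above; fmin and fmax are given placeholder values here since the article does not state the exact numbers it used:

```python
import numpy as np
import librosa

def extract_melspectrogram(path, duration=5.0):
    # Load at the native 44.1 kHz sampling rate
    samples, sr = librosa.load(path, sr=None)

    # Keep only the first 5 seconds of audio
    samples = samples[:int(sr * duration)]

    # 2048-sample windows (~46 ms at 44.1 kHz), hop of 512 samples, 128 mel filters
    mel_spec = librosa.feature.melspectrogram(
        y=samples, sr=sr, n_fft=2048, hop_length=512,
        n_mels=128, fmin=20, fmax=sr // 2)  # fmin/fmax: assumed example values

    # melspectrogram returns power (squared magnitude); convert it to decibels
    return librosa.power_to_db(mel_spec, ref=np.max, top_db=80)
```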

Spectrogram to Image

Next, we convert the spectrogram into an image.

The spectrogram is normalized using z-score normalization and then scaled with min-max scaling so that its values lie between 0 and 255.
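
A minimal sketch of this step, operating on the decibel-scaled mel spectrogram from above:

```python
import numpy as np

def spectrogram_to_image(spec, eps=1e-6):
    # z-score normalization
    spec_norm = (spec - spec.mean()) / (spec.std() + eps)

    # min-max scaling into the 0-255 range of an 8-bit image
    spec_min, spec_max = spec_norm.min(), spec_norm.max()
    spec_img = 255 * (spec_norm - spec_min) / (spec_max - spec_min + eps)
    return spec_img.astype(np.uint8)
```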

Loading data in Pytorch

Next, we build dataloaders to preprocess and load the data. Download and extract the data from here and change into the dataset directory.
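
A sketch of a PyTorch Dataset and DataLoader for this pipeline; the class name and the way (file, label) pairs are supplied are assumptions, not the article’s exact implementation:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ESC50Dataset(Dataset):
    """Wraps (file_path, label) pairs and returns spectrogram image tensors."""

    def __init__(self, file_label_pairs):
        self.items = file_label_pairs

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        path, label = self.items[idx]
        spec = extract_melspectrogram(path)   # defined earlier
        image = spectrogram_to_image(spec)    # defined earlier
        # Add a channel dimension: (1, n_mels, time_frames)
        return torch.from_numpy(image).float().unsqueeze(0), label

# Example usage with a hypothetical list of (path, label) tuples
# train_loader = DataLoader(ESC50Dataset(train_items), batch_size=16, shuffle=True)
```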

Building Our Model

We can either use a custom model or adapt a pre-trained one. Pre-trained models converge faster and give higher accuracy, so let’s opt for resnet34 with a few changes.

The first conv1 layer of resnet34 accepts 3 channels, so it is changed to accept 1 channel. The final fc layer produces outputs for 1000 categories, so it is changed to output 50 categories.
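
A sketch of these two changes using torchvision’s resnet34:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet34(pretrained=True)

# Spectrogram images have a single channel, so replace the 3-channel input conv
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Replace the 1000-class ImageNet head with a 50-class output layer
model.fc = nn.Linear(model.fc.in_features, 50)
```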

Training

For training, CrossEntropyLoss and the Adam optimizer are used. The model is trained for 50 epochs, with the learning rate reduced to one-tenth of its value every 10 epochs.
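
A minimal training-loop sketch along these lines; StepLR is one way to implement that schedule, and the learning rate, device handling and train_loader are assumptions carried over from the earlier sketches:

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=1e-4)           # assumed learning rate
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)  # lr /= 10 every 10 epochs

for epoch in range(50):
    model.train()
    for inputs, labels in train_loader:                 # train_loader from earlier
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```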

Results

Our model achieves an accuracy of around 80% and generalizes well to new audio data.

Here is the link to my deployed app.

Here is the link to my GitHub repository, which also contains the deployment files.
