Audio Data Augmentation in Python
In this post, I am going to show you how to generate more samples for your dataset by applying data augmentation to audio files.
Let’s get started
Data augmentation is a method for generating synthetic data, i.e. creating new samples by tweaking small factors in the original ones. By altering these small factors we can get a large amount of data from a single sample. This not only increases the size of our dataset but also gives us multiple variations of a single sample, which helps our model avoid overfitting and generalize better.
We are going to use the free-spoken-digit-dataset, a free audio dataset of spoken digits. Think of it as MNIST for audio. It consists of 2,000 recordings by 4 speakers (50 of each digit per speaker).
The librosa, IPython.display.Audio, and matplotlib libraries are used extensively in this post, so it would be good to have some background on them before we continue.
Types of augmentation
A sound wave has the following characteristics: pitch, loudness, and quality. We need to alter our samples along these characteristics in such a way that they differ only by a small factor from the original sample.
I found the following alterations to a sound wave useful: noise addition, time shifting, pitch shifting, and time stretching. We will see how each one affects the original sample by looking at its spectrogram and playing the altered audio file.
Visualizing original sample
We will use librosa to read the .wav file and matplotlib to generate a spectrogram of it. Below is the code for visualization.
Noise Addition
This process adds noise, i.e. white noise, to the sample. White noise is a sequence of random samples drawn from a distribution with a mean of 0 and a standard deviation of 1.
To achieve this we will use numpy's random.normal method to generate the above distribution and add it to our original sample:
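A minimal sketch of this step. The synthesized tone stands in for a loaded recording, and the 0.005 scaling factor is my own choice, not from the original post; tune it for your data:

```python
import numpy as np

# Stand-in for a recording loaded with librosa.load
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
data = 0.5 * np.sin(2 * np.pi * 440 * t)

# White noise: mean 0, standard deviation 1, one value per audio sample
noise = np.random.normal(0, 1, len(data))

# Scale the noise down before adding it so it perturbs the signal
# rather than drowning it out
augmented = data + 0.005 * noise
```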
Time Shifting
Here we shift the wave to the right along the time axis by sample_rate/10 samples, i.e. a tenth of a second.
To achieve this I have used numpy's roll function:
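A minimal sketch, again using a synthesized tone as a stand-in for a loaded recording. Note that np.roll wraps around, so the last tenth of a second reappears at the start of the shifted wave:

```python
import numpy as np

# Stand-in for a recording loaded with librosa.load
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
data = 0.5 * np.sin(2 * np.pi * 440 * t)

# Shift right by sample_rate/10 samples (a tenth of a second);
# the tail of the signal wraps around to the front
shift = int(sample_rate / 10)
augmented = np.roll(data, shift)
```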
Time Stretching
Time stretching changes the speed/duration of a sound without affecting its pitch. This can be achieved using librosa's time_stretch function.
The time_stretch function takes the wave samples and a rate by which to stretch as inputs (a rate below 1 slows the audio down, a rate above 1 speeds it up). I found that a rate of 0.4 works well for our dataset.
Pitch Shifting
Pitch shifting is an implementation of pitch scaling used in musical instruments: it changes the pitch of a sound without affecting its speed.
Again we are going to use librosa, this time its pitch_shift function. It takes the wave samples, the sample rate, and the number of steps by which to shift the pitch. I found that step values between -5 and 5 work well for our dataset.
You can find the complete Jupyter notebook for audio data augmentation here.
Congratulations, you have successfully performed data augmentation on an audio dataset!
If you found this article helpful, hit the clap button and help it reach more people getting started with data augmentation. I appreciate your responses as well. You can also find me on Twitter.