Audio Data Augmentation in Python
In this post, I am going to show you how to generate more samples for your dataset by applying data augmentation to audio files.
Let’s get started
Data augmentation is a method for generating synthetic data, i.e. creating new samples by tweaking small factors in the original ones. By altering these small factors we can get a large amount of data from a single sample. This not only increases the size of our dataset but also gives us multiple variations of a single sample, which helps our model avoid overfitting and generalize better.
We are going to use the free-spoken-digit-dataset, a free audio dataset of spoken digits. Think of it as MNIST for audio. It consists of 2,000 recordings by 4 speakers (50 of each digit per speaker).
The librosa, IPython.display.Audio, and matplotlib libraries are used extensively in this post, so it would be good to have some background on them before we continue.
Types of augmentation
A sound wave has the following characteristics: pitch, loudness, and quality. We need to alter our samples along these characteristics in such a way that they differ only by a small factor from the original sample.
I found the following alterations to a sound wave useful: noise addition, time shifting, pitch shifting, and time stretching. We will see how each one affects the original sample by looking at its spectrogram and playing the altered audio file.
Visualizing original sample
We will use librosa to read the .wav file and matplotlib to generate a spectrogram of it. Below is the code for visualization.
Noise Addition
This process adds noise, i.e. white noise, to the sample. White noise is a sequence of random samples drawn from a distribution with a mean of 0 and a standard deviation of 1.
To achieve this we will use numpy's random.normal method to generate the above distribution and add it to our original sample:
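A minimal sketch of this step. The synthesized tone stands in for a loaded recording, and the 0.005 scaling factor is my own choice, not from the original post; tune it for your data:

```python
import numpy as np

# Stand-in for a recording loaded with librosa.load
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
data = 0.5 * np.sin(2 * np.pi * 440 * t)

# White noise: mean 0, standard deviation 1, one value per audio sample
noise = np.random.normal(0, 1, len(data))

# Scale the noise down before adding it so it perturbs the signal
# rather than drowning it out
augmented = data + 0.005 * noise
```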
Time Shifting
Here we shift the wave to the right along the time axis by sample_rate/10 samples, i.e. a tenth of a second.
To achieve this I have used numpy's roll function:
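A minimal sketch, again using a synthesized tone as a stand-in for a loaded recording. Note that np.roll wraps around, so the last tenth of a second reappears at the start of the shifted wave:

```python
import numpy as np

# Stand-in for a recording loaded with librosa.load
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
data = 0.5 * np.sin(2 * np.pi * 440 * t)

# Shift right by sample_rate/10 samples (a tenth of a second);
# the tail of the signal wraps around to the front
shift = int(sample_rate / 10)
augmented = np.roll(data, shift)
```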
Time Stretching
Time stretching changes the speed/duration of a sound without affecting its pitch. This can be achieved using librosa's time_stretch function.
The time_stretch function takes the wave samples and a rate by which to stretch as inputs (a rate below 1 slows the audio down, a rate above 1 speeds it up). I found that a rate of 0.4 works well for our dataset.
Pitch Shifting
Pitch shifting is an implementation of pitch scaling used in musical instruments: it changes the pitch of a sound without affecting its speed.
Again we are going to use librosa, this time its pitch_shift function. It takes the wave samples, the sample rate, and the number of steps by which to shift the pitch. I found that step values between -5 and 5 work well for our dataset.
You can find the complete Jupyter notebook for audio data augmentation here.
Congratulations, you have successfully performed data augmentation on an audio dataset!
If you found this article helpful, hit the clap button and help it reach more people getting started with data augmentation. I appreciate your responses as well. You can also find me on Twitter.