Audio Data Processing— Feature Extraction — Essential Science & Concepts behind them — Part I

Vasanthkumar Velayudham | Published in Analytics Vidhya | 6 min read | Apr 7, 2020


Audio Signal Processing — src

Note: Part 2 of this series with working code explanation is available here.

There are quite a few useful blogs on the internet that explain the concepts behind processing audio data for feature extraction in various deep learning applications. These blogs are highly informative, but I still learnt new things about the feature extraction process for audio files, and this blog series summarizes that understanding for fellow enthusiasts.

First of all, the concepts around audio signal processing are a bit complex in comparison with image processing, as visualizing audio data is not as easy as visualizing an image. However, the concepts are fairly similar, and there are very powerful Python libraries, such as librosa, that do most of the work for us. The Python notebook used in this article can be found here.

Theory behind Audio

Let's begin with some theory: what is an audio wave? Audio waves are vibrations of air molecules produced whenever a sound is made, and sound travels from the originator to the receiver in the form of a wave.

As such, this wave has 3 properties: amplitude, frequency and time.

→ Amplitude represents the magnitude of the wave signal and it is usually measured in decibels (dB).

→ Time is the time scale, as we all know it.

→ Frequency represents how many complete cycles the wave goes through in one second, and it is measured in hertz (Hz).

3D view of Audio Wave. src

Different species have different hearing ranges. We humans can hear sound waves in the range of 20 Hz to 20,000 Hz. Dogs can hear sound waves up to 45 kHz and dolphins up to 150 kHz.

Hearing range of different animals. src

Many audio libraries (librosa, for example) set the default sampling rate to 22,050 samples per second, which can represent frequencies up to about 11 kHz; CD-quality audio uses 44,100 samples per second to cover the full human hearing range of around 20 kHz. We have the option of increasing or decreasing the sampling rate as needed.

Wait, first of all, what is the sampling rate of audio? Suppose you have an audio recorder, some music is playing and you would like to record it. Note that audio waves are continuous in nature, but most of our processing engines are built to process digital (discrete) signals.

Sampling rate.

When you start recording, the recorder captures the magnitude of the audio signal at a very high rate, on the order of 22K values per second. Say you have recorded 5 seconds of audio: the audio file will then contain (22K * 5) magnitude values of the recorded signal. Thus the sampling rate is the number of times per second at which the audio wave is measured.
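To make the arithmetic concrete, here is a tiny sketch that works out how many values a recording holds. The variable names are my own, and 22,050 is used as the exact value behind "22K".

```python
# Illustrative values: 22,050 Hz matches the default rate mentioned in the text.
sampling_rate = 22050      # samples recorded per second
duration_seconds = 5       # length of the recording

# Total number of magnitude values stored for the recording.
num_samples = sampling_rate * duration_seconds

# Time between two consecutive samples, in seconds.
sample_spacing = 1 / sampling_rate

print(num_samples)  # 110250
```

Doubling the sampling rate doubles the number of stored values for the same duration, which is the trade-off to keep in mind when changing the default.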

Also, note that the sound waves we deal with are usually a combination of thousands of individual wave signals, each at a different frequency. To put it in perspective, imagine you are listening to a concert where guitar, drums and keyboard are played together. When we record this audio, each instrument generates different audio waves, and what we hear is the consolidated audio wave made up of the thousands of individual waves generated by the respective instruments.

While processing audio data, this consolidated audio wave is separated into individual waves at their respective frequencies. This step is critical for viewing the audio wave in 3 dimensions (amplitude, frequency and time), and the Fast Fourier Transform (FFT) is used for this purpose. We will not get into the theory of the Fourier transform, but we will see how powerful it is in the next section.

Enough theory; let's work on some wave data to understand the above concept. This blog contains an excellent explanation of the Fast Fourier Transform, and I will be reproducing it in my own words, along with the code.

Let's generate a simple sine wave with a frequency of 3 Hz and a magnitude of 1 unit, as below:

Python code to generate sine wave with frequency of 3 Hz
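The original listing is an image that is not reproduced in this text, so here is a minimal sketch of how such a wave could be generated with NumPy. The helper name `generate_sine_wave`, the 100 Hz sampling rate and the 2 second duration are my own assumptions for illustration.

```python
import numpy as np

def generate_sine_wave(frequency_hz, amplitude, sampling_rate=100, duration_seconds=2.0):
    """Return (time_axis, samples) for a sine wave of the given frequency."""
    num_samples = int(sampling_rate * duration_seconds)
    # Sample times spaced exactly 1 / sampling_rate seconds apart.
    t = np.arange(num_samples) / sampling_rate
    samples = amplitude * np.sin(2 * np.pi * frequency_hz * t)
    return t, samples

# 3 Hz sine wave with a magnitude of 1 unit, as described above.
t, wave_3hz = generate_sine_wave(frequency_hz=3, amplitude=1)
print(len(wave_3hz))  # 200
```

Plotting `t` against `wave_3hz` (with matplotlib, for example) would show 6 complete cycles over the 2 seconds.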

This will generate the sine wave, as below:

As explained above, FFT is used to obtain the frequency of the given wave along with its magnitude. Let's write a function to extract the frequency from the above wave using SciPy:
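The original listing is not reproduced in this text, so here is a minimal sketch of what the computational core of such a function could look like. It uses NumPy's `np.fft.rfft` (my assumption, not necessarily the routine used in the notebook), and the name `fft_spectrum` is hypothetical; it returns the values rather than plotting them.

```python
import numpy as np

def fft_spectrum(samples, sampling_rate):
    """Return (frequencies_hz, amplitudes) for a real-valued signal."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    # rfft returns only the non-negative frequencies of a real signal.
    spectrum = np.fft.rfft(samples)
    frequencies = np.fft.rfftfreq(n, d=1 / sampling_rate)
    # Scale so that a sine of amplitude A appears as a peak of height ~A
    # (the DC and Nyquist bins would need a factor of 1; ignored in this sketch).
    amplitudes = 2 * np.abs(spectrum) / n
    return frequencies, amplitudes

# 3 Hz sine wave of magnitude 1, sampled at an assumed 100 Hz for 2 seconds.
sampling_rate = 100
t = np.arange(2 * sampling_rate) / sampling_rate
wave_3hz = 1.0 * np.sin(2 * np.pi * 3 * t)

frequencies, amplitudes = fft_spectrum(wave_3hz, sampling_rate)
peak = np.argmax(amplitudes)
print(frequencies[peak], round(amplitudes[peak], 3))  # 3.0 1.0
```

Plotting `frequencies` against `amplitudes` gives a single spike at 3 Hz with a height of about 1, which is the kind of picture a plotting function would draw.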

When we pass the above wave value into this ‘fft_plot’ function, it would identify the frequency of the wave along with the magnitude.

Frequency value identified by ‘fft_plot’ function

You can try the same function with waves of different frequencies and magnitudes and observe that the ‘fft_plot’ function identifies the frequency values correctly.

Now, let's consider a sound wave that is a combination of two waves and see how FFT helps us here.

Fast Fourier Transform

Generate a second sine wave with a frequency of 11 Hz and a magnitude of 2 units. Once generated, the sine wave looks as shown.

Now add the two sine waves together to obtain the consolidated wave, which looks like the one shown below.

Consolidated sine wave of 2 different frequencies

Once we feed in the consolidated wave, the FFT function separates the two waves and finds their respective frequencies, as below:

FFT helps in identifying the frequency of individual waves present in consolidated wave
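The two-wave experiment can be sketched end to end as follows. This is an illustration under my own assumptions (NumPy's FFT, a 100 Hz sampling rate, a 2 second duration and an arbitrary threshold of 0.5 to separate real peaks from numerical noise), not the notebook's exact code.

```python
import numpy as np

sampling_rate = 100
t = np.arange(2 * sampling_rate) / sampling_rate   # 2 seconds of samples

# The two individual waves described in the text.
wave_3hz = 1.0 * np.sin(2 * np.pi * 3 * t)
wave_11hz = 2.0 * np.sin(2 * np.pi * 11 * t)

# The consolidated wave is simply the sample-by-sample sum.
combined = wave_3hz + wave_11hz

n = len(combined)
frequencies = np.fft.rfftfreq(n, d=1 / sampling_rate)
amplitudes = 2 * np.abs(np.fft.rfft(combined)) / n

# Keep only the bins whose amplitude is clearly above numerical noise.
significant = amplitudes > 0.5
found = [(float(f), round(float(a), 3))
         for f, a in zip(frequencies[significant], amplitudes[significant])]
print(found)  # [(3.0, 1.0), (11.0, 2.0)]
```

Even though `combined` looks like a single irregular wave in the time domain, the FFT recovers both the frequencies and the magnitudes of its two ingredients.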

As explained earlier, the real-world sounds that we hear are combinations of thousands of individual waves, and by applying FFT we can separate them into their respective frequencies. Let's try this on a real audio clip.

When we plot the audio as a wave plot, this is what the audio wave looks like:

Wave plot of real-world sample audio

Now, let's pass this audio wave to the FFT function and observe how many individual frequency components it comprises:

FFT output of sample audio wave

As you can observe in the above plot, the FFT function has identified more than 8000 individual frequency components in the given audio, with the waves in the range of 100 to 1000 Hz having higher magnitude.
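One way to put a number on an observation like this is to measure how much of the spectral energy falls inside a frequency band. The sketch below uses a synthetic stand-in for a real clip, since the sample audio is not available here; the helper name `band_energy_fraction` and the test tones are my own assumptions. It also shows that an FFT over N samples yields N/2 + 1 frequency bins, which is why a short clip already produces thousands of components.

```python
import numpy as np

def band_energy_fraction(samples, sampling_rate, low_hz, high_hz):
    """Fraction of spectral energy between low_hz and high_hz (inclusive)."""
    samples = np.asarray(samples, dtype=float)
    frequencies = np.fft.rfftfreq(len(samples), d=1 / sampling_rate)
    power = np.abs(np.fft.rfft(samples)) ** 2
    in_band = (frequencies >= low_hz) & (frequencies <= high_hz)
    return power[in_band].sum() / power.sum()

# Synthetic stand-in for a real clip (assumption): two strong tones inside
# 100-1000 Hz and one weak tone at 5000 Hz, sampled at 22,050 Hz for 1 second.
# With librosa installed, a real clip could instead come from
# librosa.load("clip.wav"), where "clip.wav" is a placeholder path.
sampling_rate = 22050
t = np.arange(sampling_rate) / sampling_rate
clip = (np.sin(2 * np.pi * 200 * t)
        + np.sin(2 * np.pi * 440 * t)
        + 0.2 * np.sin(2 * np.pi * 5000 * t))

num_bins = len(np.fft.rfftfreq(len(clip), d=1 / sampling_rate))
fraction = band_energy_fraction(clip, sampling_rate, 100, 1000)
print(num_bins)            # 11026
print(round(fraction, 3))  # 0.98
```

For this synthetic clip about 98% of the energy sits in the 100 to 1000 Hz band; a real recording would give a less clean figure, but the same measurement applies.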

To summarize:

→ Every sound wave has 3 different dimensions to it: amplitude, frequency and time.

→ The real-world audio that we listen to is a combination of thousands of individual sound waves.

→ To process an audio file, we need to process the audio data in all 3 dimensions.

→ The Fourier transform is used to extract information about the individual waves from the consolidated wave.

Once the frequency information is extracted, we can visualize the sound wave in 3D using a spectrogram, which looks like below:

Sample spectrogram of audio file
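A spectrogram is obtained by running the FFT on short, overlapping slices of the signal, so that frequency content can be tracked over time. Here is a minimal sketch of that idea using only NumPy; the function name `simple_spectrogram`, the frame and hop sizes and the test signal are all my own assumptions, and libraries such as librosa or SciPy provide more complete implementations.

```python
import numpy as np

def simple_spectrogram(samples, sampling_rate, frame_length=256, hop_length=128):
    """Return (times_s, frequencies_hz, magnitudes) using a basic short-time FFT."""
    samples = np.asarray(samples, dtype=float)
    # A Hann window reduces leakage caused by cutting the signal into frames.
    window = np.hanning(frame_length)
    starts = range(0, len(samples) - frame_length + 1, hop_length)
    # One windowed frame per row.
    frames = np.array([samples[s:s + frame_length] * window for s in starts])
    # FFT of every frame; rows are time frames, columns are frequency bins.
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    frequencies = np.fft.rfftfreq(frame_length, d=1 / sampling_rate)
    # Time stamp of each frame, taken at the frame centre.
    times = (np.array(list(starts)) + frame_length / 2) / sampling_rate
    return times, frequencies, magnitudes

# Test signal: 500 Hz for the first second, then 1500 Hz for the next second.
sampling_rate = 8000
t = np.arange(sampling_rate) / sampling_rate
signal = np.concatenate([np.sin(2 * np.pi * 500 * t),
                         np.sin(2 * np.pi * 1500 * t)])

times, frequencies, magnitudes = simple_spectrogram(signal, sampling_rate)
dominant = frequencies[np.argmax(magnitudes, axis=1)]
print(magnitudes.shape)           # (124, 129)
print(dominant[0], dominant[-1])  # 500.0 1500.0
```

Each row of `magnitudes` is one time slice and each column one frequency, so drawing it as an image (time on one axis, frequency on the other, magnitude as colour) gives the three dimensions discussed above.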

This is part 1 of the series. In the next post, we will discuss Mel Frequency Cepstral Coefficients (MFCCs) in detail and how audio data is transformed during the feature extraction process. Please watch out for it. Thanks!