Audio Data Processing — Feature Extraction — Essential Science & Concepts behind them — Part 2

Vasanthkumar Velayudham
Published in Analytics Vidhya · 5 min read · Apr 17, 2020

Note: Part 1 of this series with the concepts explained in detail is available here.

In the previous part, we covered in detail the various theoretical concepts associated with audio data processing. In this post, we will get into the processing aspects directly and see how features are extracted from audio files.

Let's get started directly with loading an audio file; we will use the popular 'librosa' library for this purpose.

As you see, we have imported the library and loaded the audio file. Upon loading, the load function returns two values: x and sr.

Here, 'x' is the array of samples captured from the audio, and 'sr' is the sampling rate (the number of samples captured per second).

Let's look at the values of 'x' and 'sr'.

The number of samples in the audio file is 3104256, and the sampling rate is 22050 Hz.

Let's find out the duration of the audio file:

duration of audio = number of samples / sampling rate = 3104256 / 22050 = 140.78 seconds

As per our calculation, the audio file's duration should be about 140 seconds (2 minutes, 20 seconds), and verifying with librosa confirms it.

Now, let's plot the wave plot of the audio file, which is a plot of amplitude versus time.

It looks good, but in our previous article we studied that audio signals are three-dimensional: besides amplitude and time, there is another important dimension called frequency.

How do we extract the frequencies of the different wave signals from this plot? Do you remember that we discussed the Fast Fourier Transform (FFT) in detail in our previous article?

Let's apply the FFT and extract the frequency-related information. Librosa has a powerful API for this, and it takes no more than a few lines to extract the frequencies, as below:

Wait! We passed the 'x' value (the full array of audio samples) to librosa's 'stft' function, and it returned a 2D matrix with some dimensions. What is happening?!

This is where it turns a bit bumpy. Let me explain what is happening here.

As you observed, this is a 140-second audio clip, and we cannot pass the entire audio to the FFT at once: a single FFT over the whole signal would tell us which frequencies are present overall, but not when they occur. Hence we split the audio into windows of constant length and process them one after another. We call this length the 'window length'.

As it is an audio file, and to preserve continuity while processing, we let subsequent windows overlap with one another; the number of samples we advance from one window to the next is known as the 'hop length'.

As shown here, every window contains 'hop length' + 'overlap' samples; in other words, window length = hop length + overlap.
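A toy illustration of the windowing, with made-up numbers (window length 4, hop length 2, so each pair of consecutive windows shares 2 samples):

```python
import numpy as np

# A tiny "signal" of 10 samples, purely for illustration.
signal = np.arange(10)
win_length, hop_length = 4, 2   # hypothetical values

# Slice overlapping windows: each starts hop_length after the previous.
windows = [signal[s:s + win_length]
           for s in range(0, len(signal) - win_length + 1, hop_length)]
for w in windows:
    print(w)   # [0 1 2 3], [2 3 4 5], [4 5 6 7], [6 7 8 9]
```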

Let's get back to our data and the calculation. We have an audio file with a duration of 140 seconds and 3104256 samples. Once we feed it to the STFT with 'hop_length' of 512 and 'n_fft' of 4096, we obtain a result with dimensions (2049, 6064).

Here, '2049' corresponds to the value of (1 + n_fft/2), which is (1 + (4096/2)).

Similarly, in the result '6064' corresponds to (1 + total number of samples / hop_length), which in our case is 1 + (3104256/512) = 1 + 6063 = 6064.

Mel Frequency Cepstral Coefficients (MFCC)

Those who have dealt with audio processing will have come across MFCC coefficients, but what are they? Before getting there, we need to understand what the 'mel scale' is.

As you know, the human hearing range is only 20 Hz to 20 kHz. Sound waves outside this range are not critical for most audio processing applications, so the waves that fall within it deserve special consideration over other sound waves. Also, humans perceive the difference between a 100 Hz and a 200 Hz tone far more clearly than the difference between a 10100 Hz and a 10200 Hz tone, even though the gap in Hz is the same.

So, the mel scale is a non-linear transformation that maps audio frequencies to a new range of values, on which equal differences sound equally far apart to the listener, irrespective of the absolute values.

Considering this, processing our audio frequencies with mel scale transformations makes more sense, and we can convert our audio to MFCC coefficients with the below code:

Here the output has dimensions (20, 6064). The value 6064 is calculated in the same way as before from the 'hop_length', whereas the 20 rows are the MFCC coefficients, each summarizing a different region of the mel-scaled frequency spectrum.

You can find the code describing the above explanations here.

This concludes our two-part series on audio processing.

Please feel free to comment with your queries, and happy learning!
