Simplifying Audio Data: FFT, STFT & MFCC

Ankur Dhuriya
Published in Analytics Vidhya · Jun 27, 2020


What we should know about sound: sound is produced when an object vibrates, and those vibrations make the surrounding air molecules oscillate, creating alternating regions of high and low air pressure. This alternation of high and low pressure propagates as a wave.

Some key terms in audio processing:

  • Amplitude — perceived as loudness
  • Frequency — perceived as pitch
  • Sample rate — the number of samples taken per second; a sample rate of 22,000 Hz means 22,000 samples are captured every second
  • Bit depth — the number of bits used to record each sample, which determines the quality of the recording, much like bit depth for pixels in an image; 24-bit sound is better quality than 16-bit (see the quick numeric sketch below)
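A quick numeric sketch of those last two terms (the clip duration and the values below are illustrative choices, not figures from this post):

sample_rate = 22050                      # samples per second
duration_s = 3                           # a hypothetical 3-second clip
num_samples = sample_rate * duration_s   # 66,150 samples in total

# bit depth sets how many distinct amplitude levels one sample can encode
levels_16bit = 2 ** 16                   # 65,536 levels
levels_24bit = 2 ** 24                   # 16,777,216 levels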

Here I have used the sound of a piano key from freesound.org.

import librosa
import librosa.display
import matplotlib.pyplot as plt

FIG_SIZE = (15, 10)

# load the audio file (the piano sample), resampling to 22050 Hz
signal, sample_rate = librosa.load(file, sr=22050)

plt.figure(figsize=FIG_SIZE)
librosa.display.waveplot(signal, sr=sample_rate, alpha=0.4)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Waveform")
plt.savefig('waveform.png', dpi=100)
plt.show()

To move a wave from the time domain to the frequency domain, we need to perform a Fast Fourier Transform (FFT) on the data. The Fourier transform decomposes a periodic sound into a sum of sine waves, all oscillating at different frequencies. It is quite incredible: we can describe a very complex sound, as long as it is periodic, as a superposition of a bunch of sine waves at different frequencies.

Below I have shown how two sine waves of different amplitude and frequency are combined into one.
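A minimal sketch of how such a superposition can be generated and plotted (the frequencies, amplitudes, and duration are illustrative choices, not values from the original post):

import numpy as np
import matplotlib.pyplot as plt

sr = 22050                                  # samples per second
t = np.linspace(0, 0.01, int(sr * 0.01))    # 10 ms time axis
wave1 = 1.0 * np.sin(2 * np.pi * 440 * t)   # 440 Hz, amplitude 1.0
wave2 = 0.5 * np.sin(2 * np.pi * 880 * t)   # 880 Hz, amplitude 0.5
combined = wave1 + wave2                    # superposition of the two

for wave, label in [(wave1, "440 Hz"), (wave2, "880 Hz"), (combined, "sum")]:
    plt.plot(t, wave, label=label, alpha=0.6)
plt.legend()
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Two sine waves and their sum")
plt.show()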

import numpy as np

# perform the Fourier transform
fft = np.fft.fft(signal)

# calculate abs values on complex numbers to get magnitude
spectrum = np.abs(fft)

# create frequency variable
f = np.linspace(0, sample_rate, len(spectrum))

# take half of the spectrum and frequency
left_spectrum = spectrum[:int(len(spectrum) / 2)]
left_f = f[:int(len(spectrum) / 2)]

# plot spectrum
plt.figure(figsize=FIG_SIZE)
plt.plot(left_f, left_spectrum, alpha=0.4)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.title("Power spectrum")
plt.savefig('FFT.png')
plt.show()
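A side note on why the code keeps only the left half: the FFT of a real-valued signal is symmetric, so the right half mirrors the left. NumPy's rfft computes just the non-negative-frequency half directly; a short equivalent sketch:

# equivalent to taking the left half of the full FFT
spectrum = np.abs(np.fft.rfft(signal))
f = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)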

By applying the Fourier transform we move into the frequency domain: the x-axis now shows frequency, and the magnitude is a function of the frequency itself. But in doing so we lose all information about time. The power spectrum is a snapshot of all the elements which combine to form the sound; it tells us that different frequencies carry different powers across the whole recording, but not when they occur. That is a problem, because audio data is a time series: things change in time, and we want to know how they change, which the plain Fourier transform cannot tell us.

The solution is the Short-Time Fourier Transform (STFT). The STFT computes several Fourier transforms at different intervals, and in doing so preserves information about time and the way the sound evolves. The interval is set by the frame size: a frame is a fixed number of samples (say 2048, the n_fft used below). We compute the Fourier transform on that frame, then shift along the waveform and repeat on the next frame. The result is a spectrogram, which gives us information about time, frequency, and magnitude together.
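To make the framing idea concrete, here is a minimal hand-rolled version of that process (a sketch only; librosa.stft, used in the next block, additionally applies a Hann window to each frame and handles padding):

import numpy as np

def naive_stft(signal, n_fft=2048, hop_length=512):
    # slide a window of n_fft samples across the signal, hop_length apart,
    # and take the FFT of each frame, keeping only non-negative frequencies
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop_length):
        frame = signal[start:start + n_fft]
        frames.append(np.fft.fft(frame)[:n_fft // 2 + 1])
    # transpose to (frequency_bins, num_frames), matching librosa's layout
    return np.array(frames).T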

# STFT -> spectrogram
hop_length = 512  # in num. of samples
n_fft = 2048      # window in num. of samples

# calculate duration of hop length and window in seconds
hop_length_duration = float(hop_length) / sample_rate
n_fft_duration = float(n_fft) / sample_rate
print("STFT hop length duration is: {}s".format(hop_length_duration))
print("STFT window duration is: {}s".format(n_fft_duration))

# perform STFT
stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)

# calculate abs values on complex numbers to get magnitude
spectrogram = np.abs(stft)

# display spectrogram
plt.figure(figsize=FIG_SIZE)
librosa.display.specshow(spectrogram, sr=sample_rate, hop_length=hop_length)
plt.xlabel("Time")
plt.ylabel("Frequency")
plt.colorbar()
plt.title("Spectrogram")
plt.savefig('spectrogram.png')
plt.show()

# apply logarithm to cast amplitude to decibels
log_spectrogram = librosa.amplitude_to_db(spectrogram)
plt.figure(figsize=FIG_SIZE)
librosa.display.specshow(log_spectrogram, sr=sample_rate, hop_length=hop_length)
plt.xlabel("Time")
plt.ylabel("Frequency")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram (dB)")
plt.savefig('spectrogram_log.png')
plt.show()

We have time on the x-axis and frequency on the y-axis, plus a third axis given by the colour, which tells us how strongly a given frequency is present in the sound at a given time. Here, for example, we can see that low-frequency content dominates most of the audio.

Mel-Frequency Cepstral Coefficients, MFCCs for short, capture many aspects of sound. If, for example, a guitar and a flute play the same melody, the frequencies and the rhythm would be more or less the same (depending on the performance); what changes is the quality of the sound, and MFCCs are capable of capturing that information. To extract MFCCs we first perform a Fourier transform, moving from the time domain into the frequency domain, so MFCCs are basically a frequency-domain feature. Their great advantage over spectrograms is that they approximate the human auditory system: they try to model the way we perceive frequency, which is very important if you then want to do deep learning with data that represents the way we process audio. The result of the extraction is a vector of MFCC coefficients; you can specify how many, and audio and music applications usually use between 13 and 39. These coefficients are calculated at each frame, so you also get a picture of how the MFCCs evolve over time.
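To make that pipeline concrete, here is a sketch of roughly what happens under the hood when librosa computes MFCCs (a simplified reconstruction, not the library's exact code): an STFT passed through a mel-scaled filterbank, a logarithmic compression, and a discrete cosine transform that keeps the first few coefficients.

import librosa
import scipy.fftpack

# 1. mel-scaled power spectrogram (STFT + mel filterbank)
mel_spec = librosa.feature.melspectrogram(y=signal, sr=sample_rate,
                                          n_fft=2048, hop_length=512)
# 2. logarithmic compression, closer to how we perceive loudness
log_mel = librosa.power_to_db(mel_spec)
# 3. discrete cosine transform; keep the first 13 coefficients
mfccs = scipy.fftpack.dct(log_mel, axis=0, norm='ortho')[:13]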

# MFCCs
# extract 13 MFCCs
MFCCs = librosa.feature.mfcc(y=signal, sr=sample_rate, n_fft=n_fft,
                             hop_length=hop_length, n_mfcc=13)

# display MFCCs
plt.figure(figsize=FIG_SIZE)
librosa.display.specshow(MFCCs, sr=sample_rate, hop_length=hop_length)
plt.xlabel("Time")
plt.ylabel("MFCC coefficients")
plt.colorbar()
plt.title("MFCCs")
plt.savefig('mfcc.png')
plt.show()

Here the 13 MFCC coefficients are represented on the y-axis and time on the x-axis; the redder the colour, the higher the value of that coefficient in that time frame.

MFCCs are used in a number of audio applications. They were originally introduced for speech recognition, but they are also used in music recognition, musical instrument classification, and music genre classification.
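As a small illustration of how MFCCs typically feed such applications (the pooling strategy here is a common convention, not something from this post), the per-frame coefficients are often summarized into a fixed-length feature vector for a classifier:

import numpy as np

# MFCCs has shape (13, num_frames); averaging across time gives a
# fixed-length vector that simple classifiers (k-NN, SVM, ...) can use
feature_vector = np.mean(MFCCs, axis=1)   # shape: (13,)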

Link to code:

Code for sine waves

Code for FFT, STFT and MFCCs
