MFCCs Made Easy

Jun 15, 2019

I’ve worked in the field of signal processing for quite a few months now, and I’ve figured out that the step that matters most in the whole process is feature extraction. Over the years, MFCCs (Mel Frequency Cepstral Coefficients) have proven to be a big help in that step.

The major problem I faced while learning about MFCCs was finding a single source that answered all of my questions directly: what exactly do these coefficients mean, how are they computed, and so on.

So the aim of this article is to simplify the concept of feature extraction with MFCCs. Let’s get started!

MFCCs are a compact representation of the spectrum of an audio signal. (The spectrum is what you get when a waveform is written as a sum of a possibly infinite number of sinusoids.)

The first question that comes to mind is: how do we convert an audio signal into coefficients, and what exactly do these coefficients represent?

MFCC coefficients contain information about the rate of change across the different spectrum bands.

Roughly speaking, if a cepstral coefficient has a positive value, the majority of the spectral energy is concentrated in the low-frequency regions. On the other hand, if a cepstral coefficient has a negative value, most of the spectral energy is concentrated at high frequencies.
We usually take the mean of these values over all frames and feed the resulting array into a network to predict audio labels, but that’s a different process.
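Averaging the coefficients over frames can be sketched in a few lines of plain Python. The name `mean_mfcc` and the tiny `mfcc_frames` list are made up for illustration; real input would have one row of coefficients per frame.

```python
def mean_mfcc(mfcc_frames):
    """Average each coefficient over all frames (one row per frame)."""
    n_frames = len(mfcc_frames)
    # zip(*rows) walks the data column by column, i.e. coefficient by coefficient
    return [sum(column) / n_frames for column in zip(*mfcc_frames)]

# Hypothetical data: 2 frames, 3 coefficients each
mfcc_frames = [[1.0, -2.0, 0.5],
               [3.0, 0.0, 1.5]]
averaged = mean_mfcc(mfcc_frames)
print(averaged)  # [2.0, -1.0, 1.0]
```

The result has one value per coefficient no matter how many frames the clip had, which is what makes it convenient as a fixed-size network input.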

If this seems like too much, just keep reading. You’ll get the hang of it as we go through the extraction process, I guarantee!

The MFCC feature extraction process is basically a 6-step process:

1. Frame the signal into short frames:
We need to split the signal into short-time frames. The rationale behind this step is that the frequencies in a signal change over time, so in most cases it doesn’t make sense to take the Fourier transform of the entire signal: we would lose the frequency contours of the signal over time. Frame the signal into 20–40 ms frames; 25 ms is standard. For a 16 kHz signal this means a frame length of 0.025 * 16000 = 400 samples. With a hop length of 160 samples (10 ms), consecutive frames overlap by 240 samples.
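Here is a minimal sketch of the framing step in plain Python, using the numbers above (16 kHz, 25 ms frames, 160-sample hop). The function name `frame_signal` is made up for illustration, and dropping the leftover samples at the end is just one possible choice; padding the last frame with zeros is another.

```python
def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a list of samples into overlapping fixed-length frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000))  # 400 samples at 16 kHz
    hop_len = int(round(sample_rate * hop_ms / 1000))      # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop_len
    return frames  # samples that do not fill a whole final frame are dropped

# One second of silent dummy audio at 16 kHz
signal = [0.0] * 16000
frames = frame_signal(signal)
print(len(frames), len(frames[0]))  # 98 400
```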

2. Windowing: A window function is applied to each frame, mainly to counteract the assumption made by the Fast Fourier Transform that the data is infinite, and to reduce spectral leakage. The window does this by tapering the frame toward zero at its edges.
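As a rough sketch, here is windowing with a Hamming window. The article does not say which window to use, so Hamming is an assumption on my part (it is a very common choice for speech), and the helper names are made up for illustration.

```python
import math

def hamming(n_samples):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def apply_window(frame):
    """Multiply a frame sample by sample with a window of the same length."""
    window = hamming(len(frame))
    return [sample * w for sample, w in zip(frame, window)]

# A constant frame makes the taper easy to see: edges shrink, the middle stays near 1
frame = [1.0] * 400
windowed = apply_window(frame)
print(round(windowed[0], 2), round(max(windowed), 2))  # 0.08 1.0
```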

3. Calculation of the Discrete Fourier Transform:
We can now do an N-point FFT on each frame to calculate the frequency spectrum; doing this frame by frame is also called the Short-Time Fourier Transform (STFT). N is typically 256 or 512 (here we use NFFT = 512). We then compute the power spectrum (periodogram) of each frame.

Periodogram: an estimate of the spectral density of a signal.
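To make the periodogram concrete, here is a small sketch that computes |DFT|^2 / NFFT for one frame. It uses a slow, naive DFT so that it runs with the standard library only; in practice you would reach for an FFT routine such as NumPy's `np.fft.rfft`. The 1 kHz test tone and the function name `power_spectrum` are assumptions for illustration.

```python
import math
import cmath

def power_spectrum(frame, nfft=512):
    """Periodogram of one windowed frame: |DFT|^2 / nfft for bins 0..nfft/2."""
    # Zero-pad (or truncate) the frame to exactly nfft samples.
    padded = list(frame[:nfft]) + [0.0] * max(0, nfft - len(frame))
    spectrum = []
    for k in range(nfft // 2 + 1):  # 257 bins for nfft = 512
        acc = 0j
        for n, x in enumerate(padded):
            acc += x * cmath.exp(-2j * math.pi * k * n / nfft)
        spectrum.append(abs(acc) ** 2 / nfft)
    return spectrum

# Hypothetical test frame: 400 samples of a 1 kHz sine sampled at 16 kHz
sample_rate = 16000
frame = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(400)]
spectrum = power_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k])
print(len(spectrum), peak_bin * sample_rate / 512)  # 257 1000.0
```

Only the first NFFT/2 + 1 bins are kept because the spectrum of a real signal is symmetric, which is where the vector length of 257 comes from.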

4. Applying Filter Banks:
This is the step that most students find difficult to understand.
Stated formally, the Mel-spaced filter bank is a set of 20–40 triangular filters. Two adjacent filters are shown below.
I hope this gives you a clear picture:

Filter Bank on a Mel scale (Highly zoomed in)

Our filterbank comes in the form of 40 vectors of length 257 (assuming the FFT settings from step 3: a 512-point FFT gives 512/2 + 1 = 257 bins). Each vector is mostly zeros, but is non-zero for a certain section of the spectrum. To calculate the filterbank energies we multiply each filter with the power spectrum, then add up the coefficients. Once this is done we are left with 40 numbers that indicate how much energy was in each filter.
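The "multiply, then add up" step is just a weighted sum per filter. Here is a toy sketch with a 5-bin spectrum and 2 filters instead of 257 bins and 40 filters; the numbers and the name `filterbank_energies` are made up so the arithmetic is easy to follow by hand.

```python
def filterbank_energies(power_spectrum, filterbank):
    """One energy per filter: sum of filter weight * power over all bins."""
    return [sum(w * p for w, p in zip(filt, power_spectrum))
            for filt in filterbank]

# Toy data: 5 spectrum bins and 2 overlapping triangular filters
power = [1.0, 2.0, 3.0, 4.0, 5.0]
bank = [[0.5, 1.0, 0.5, 0.0, 0.0],   # covers the low bins
        [0.0, 0.0, 0.5, 1.0, 0.5]]   # covers the high bins
energies = filterbank_energies(power, bank)
print(energies)  # [4.0, 8.0]
```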

To get the filterbank shown in the above figure:
a) First choose a lower and an upper frequency. Good values are 300 Hz for the lower and 8000 Hz for the upper frequency.
b) Convert the lower and upper frequencies to Mels. In our case 300 Hz is 401.25 Mels and 8000 Hz is 2834.99 Mels. The frequency-to-Mel conversion is easy, since a standard formula exists for it: m = 1125 ln(1 + f / 700).
c) Pick points linearly spaced in Mels between these two values. For 40 filters (the number can be anything, according to requirement) we need 42 points: the two end points plus 40 in between.
d) Convert these points back to Hertz.
e) Build the triangular filters from these points: each filter starts at one point, peaks at the next, and falls back to zero at the one after. Plotting them gives the filter bank!
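Steps a) to e) can be sketched as follows. This assumes the conversion m = 1125 ln(1 + f / 700), which reproduces the Mel values quoted above up to rounding, and it maps each frequency to an FFT bin with floor((NFFT + 1) * f / sample_rate), which is one common convention rather than the only one. The function names are made up for illustration.

```python
import math

def hz_to_mel(hz):
    return 1125.0 * math.log(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)

def mel_filterbank(n_filters=40, nfft=512, sample_rate=16000,
                   low_hz=300.0, high_hz=8000.0):
    """Return n_filters triangular filters, each a list of nfft/2 + 1 weights."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    # n_filters + 2 points, linearly spaced on the Mel scale
    step = (high_mel - low_mel) / (n_filters + 1)
    mel_points = [low_mel + i * step for i in range(n_filters + 2)]
    hz_points = [mel_to_hz(m) for m in mel_points]
    # Map each frequency to the nearest lower FFT bin
    bins = [int(math.floor((nfft + 1) * hz / sample_rate)) for hz in hz_points]
    bank = [[0.0] * (nfft // 2 + 1) for _ in range(n_filters)]
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):    # rising edge of the triangle
            bank[m - 1][k] = (k - left) / (center - left)
        for k in range(center, right):   # peak and falling edge
            bank[m - 1][k] = (right - k) / (right - center)
    return bank

bank = mel_filterbank()
print(round(hz_to_mel(300.0), 2), round(hz_to_mel(8000.0), 2))
print(len(bank), len(bank[0]))
```

Because the points are evenly spaced in Mels but not in Hertz, the triangles come out narrow at low frequencies and wide at high frequencies.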

After applying the Filter Banks we are left with the following spectrogram.

5. Take the log: We now take the log of these spectrogram values to get the log filterbank energies.
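A tiny sketch of the log step is below. The small `floor` value is my own addition, not something the article prescribes: it only guards against taking the log of zero when a filter sees no energy at all.

```python
import math

def log_energies(energies, floor=1e-10):
    """Natural log of each filterbank energy, clamped to avoid log(0)."""
    return [math.log(max(e, floor)) for e in energies]

# Hypothetical energies, including an empty filter
log_fb = log_energies([4.0, 8.0, 0.0])
print(log_fb)
```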

6. Apply the DCT: This step is optional and can be skipped, but the results are generally better when it is applied.
The issue with this spectrogram is that the filter bank coefficients are highly correlated, so we need to decorrelate them. This is done by applying the DCT (Discrete Cosine Transform). Note also that the MFCC feature vector describes only the power spectral envelope of a single frame.
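The DCT step can be sketched like this, using an unnormalized DCT-II written out by hand so it needs nothing beyond the standard library (SciPy's `scipy.fft.dct` does the same job much faster). Keeping only the first 13 coefficients is a common convention but an assumption here, since the article does not give a number.

```python
import math

def dct_ii(values, n_coeffs=13):
    """Unnormalized DCT-II: c[k] = sum_i x[i] * cos(pi * k * (i + 0.5) / N)."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n_coeffs)]

# Hypothetical input: 40 identical log filterbank energies
flat = [1.0] * 40
coeffs = dct_ii(flat)
print(len(coeffs), round(coeffs[0], 6))
```

A perfectly flat input puts everything into coefficient 0 and leaves the rest at (numerically) zero, which is a handy sanity check: the higher coefficients only respond to variation across the filters.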

The resulting list of numbers, the coefficients, is what we call the MFCCs, or Mel Frequency Cepstral Coefficients!

I hope I’ve made things clearer for you all!

If it helped you do give me a clap!!

Till then enjoy Signal Processing!!
Cheers!!

References:

Practical Cryptography (practicalcryptography.com)

Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs)… (haythamfayek.com)

Written by Tanveer Singh