Mel-frequency cepstral coefficients (MFCCs) Explained

Muhy Eddin Za'ter
Nov 21, 2022


Feature extraction is one of the most important steps in developing any machine learning or deep learning model. It provides a dense representation of the content by extracting characteristics from the raw data, which forces the model to learn the core information without the noise (if it is done correctly).

Natural Language Processing (NLP) has many definitions and terminologies, but briefly speaking, it is the subfield of computer science that aims to give computers and machines the ability to understand human language in its written and spoken forms. One of the oldest and most important applications of NLP is Automatic Speech Recognition (ASR), which converts spoken language into its corresponding text.

While this task is quite easy for most humans, it is very challenging for a machine, and one of the main reasons is the complicated nature of speech; therefore, feature extraction from speech is a task that has haunted researchers and pioneers for a long time. This article aims to explain one of the most well-known methods to extract features from speech, known as Mel-frequency cepstral coefficients (MFCCs).

First of all, in speech recognition the goal is to use an acoustic model and a linguistic model to determine the word sequence that best matches the input audio.

Our observation X is represented by a sequence of acoustic feature vectors (x1, x2, x3, …) in order to build an acoustic model. This article goes through how these acoustic features are extracted from human speech.

Requirements

Let’s first discuss some of the specifications for feature extraction in ASR. We are utilizing a sliding window that is 25 ms wide to extract audio features from an audio segment as seen in the figure below:

The choice of 25 ms is due to the fact that the signal within this frame should be mostly stationary, while the 25 ms width is still wide enough to acquire adequate data. If we speak 3 words per second, each word with 4 phones and each phone sub-divided into 3 states, then there are 3 × 4 × 3 = 36 states per second, or roughly 28 ms per state. Therefore, the 25 ms timeframe is roughly correct.

In speech, context is crucial. The articulation before and after a phone affects how it is pronounced. To capture the right context, each sliding window is spaced roughly 10 milliseconds apart, which means consecutive 25 ms frames overlap by 15 ms; this lets us capture the dynamics between frames.
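To make this concrete, here is a minimal framing sketch in NumPy, assuming a mono signal already sampled at 16 kHz; the function name and parameters are illustrative, not from the article.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Slice a 1-D signal into 25 ms frames that start every 10 ms (15 ms overlap)."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    num_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(num_frames)])
```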

Each person’s pitch is unique, but this doesn’t really matter for understanding what they said. Pitch is related to the fundamental frequency F0; it shouldn’t be used for speech recognition and should be removed. The formants F1, F2, F3 are more crucial. For those who have trouble following these terms, we suggest reading this article.

Additionally, we expect the extracted features to be unaffected by the speaker’s identity or environmental noise. We also want the extracted features to be independent of one another, just as in any ML problem, and hopefully as compact as possible. With independent features, model development and training are simpler.

The most common feature extraction technique, MFCC, produces 39 features per frame. Because there are so few features, they have to capture the essential information in the audio. Twelve of these parameters relate to the amplitude of the frequencies, which gives us enough frequency channels to analyse the audio.

Below is the flow of extracting the MFCC features.

The key objectives of MFCC are:

- Remove the vocal fold excitation (F0), i.e. the pitch information.
- Make the extracted features independent of one another.
- Adjust to how humans perceive the loudness and frequency of sound.
- Capture the dynamics of phones (the context).

Mel-frequency cepstral coefficients (MFCC) step-by-step explanation

A/D conversion

A/D conversion digitizes the content by sampling the audio segments, turning the analog signal into a discrete one. Sampling frequencies of 8 or 16 kHz are most commonly employed.
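As a rough sketch, reading an already-digitized recording with SciPy might look like this ("speech.wav" is just a placeholder path):

```python
import numpy as np
from scipy.io import wavfile

# Read a PCM WAV file that was sampled at 8 or 16 kHz during A/D conversion.
sample_rate, signal = wavfile.read("speech.wav")

# Convert 16-bit integer samples to floats in [-1, 1] for the later processing steps.
if signal.dtype == np.int16:
    signal = signal.astype(np.float32) / 32768.0
```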

Pre-emphasis

Pre-emphasis increases the amount of energy in the high frequencies. For vowels and other voiced segments, there is more energy at lower frequencies than at higher frequencies. Boosting the high-frequency energy gives the acoustic model better access to the information in the higher formants, which improves the precision of phone detection. When we can no longer hear these high-frequency sounds, we begin to have hearing issues. Noise also tends to be high-frequency, and in engineering, pre-emphasis is a technique used to reduce the system’s sensitivity to noise that is added later in the process.

Pre-emphasis uses a filter to boost higher frequencies.
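A common form of this filter is y[n] = x[n] − αx[n−1], with α around 0.95–0.97. A minimal sketch (the coefficient value is a typical choice, not specified in the article):

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```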

Windowing

Windowing involves the slicing of the audio waveform into sliding frames as in the figure below:

However, we cannot simply cut the signal off at the frame’s edge. An abrupt drop in amplitude produces a lot of noise that shows up at high frequencies. Instead, the amplitude should decrease gradually towards the frame’s boundaries when we slice the audio.

A few alternatives for cropping the signal are the Hamming window and the Hanning window, in which the amplitude tapers off near the edges.

In comparison to a rectangular window, a frame cut out with a Hamming or Hanning window better preserves the original frequency information and introduces less noise.
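Applying a Hamming window to the frames from the earlier framing sketch could look like this (the `frames` array and frame length are assumptions carried over from that sketch):

```python
import numpy as np

frame_len = 400                    # 25 ms at 16 kHz
window = np.hamming(frame_len)     # tapers smoothly towards zero at the edges
windowed_frames = frames * window  # frames: (num_frames, frame_len)
```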

Discrete Fourier Transform (DFT)

Next, we apply DFT to extract information in the frequency domain.
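A minimal sketch with NumPy, assuming the windowed frames from above and a typical 512-point FFT:

```python
import numpy as np

NFFT = 512  # common FFT size for 25 ms frames at 16 kHz
# Keep only the non-redundant half of the spectrum for each frame.
spectrum = np.abs(np.fft.rfft(windowed_frames, n=NFFT))  # shape: (num_frames, NFFT // 2 + 1)
```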

Mel filterbank

Humans hear loudness differently depending on the frequency. Additionally, perceived frequency resolution declines as the frequency rises; for example, people are less sensitive to differences between higher frequencies. The Mel scale maps the recorded frequency to the frequency humans perceive.

Triangular band-pass filters are used in feature extraction to transform frequency information into a form that closely resembles human perception.

We start by squaring the DFT output. We refer to this as the DFT power spectrum, since it shows the speech’s power at each frequency (|X[k]|²). We then convert it to a Mel-scale power spectrum using triangular Mel-scale filter banks. The output of each slot of the Mel-scale power spectrum corresponds to the energy covered by its range of frequency bands.

As we mentioned earlier, because human hearing is less sensitive to high frequencies, the triangular bandpass filters are broader at those frequencies. In particular, they are linearly spaced up to 1000 Hz and spaced logarithmically after that.

All of these initiatives aim to simulate how the basilar membrane in our ear detects sound vibration.
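A sketch of a triangular Mel filterbank applied to the power spectrum, using the common 2595·log10(1 + f/700) mapping; the number of filters (26) and the helper names are illustrative choices, not from the article:

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_filters=26, nfft=512, sample_rate=16000):
    """Triangular filters evenly spaced on the Mel scale (hence wider at high Hz)."""
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2)
    mel_points = np.linspace(low_mel, high_mel, num_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, nfft // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                       # rising edge of the triangle
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                      # falling edge of the triangle
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# Square the DFT output to get the power spectrum, then apply the filterbank.
power_spectrum = (spectrum ** 2) / NFFT                 # spectrum, NFFT from the DFT step
mel_energies = power_spectrum @ mel_filterbank().T      # shape: (num_frames, num_filters)
```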

Log

The output of the Mel filterbank is a power spectrum. Humans are less sensitive to small energy changes at high energy levels than to similar changes at low energy levels; perception is, in effect, logarithmic. So the next step takes the log of the output of the Mel filterbank. This also reduces acoustic variations that are not significant for speech recognition.
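In code, this step is simply a log compression of the filterbank energies, with a small floor to avoid taking the log of zero (the floor value is an arbitrary choice):

```python
import numpy as np

log_mel_energies = np.log(np.maximum(mel_energies, 1e-10))  # mel_energies from the filterbank step
```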

Cepstrum — IDFT

“Cepstrum” is the word “spectrum” with its first four letters reversed. The next step is to compute the cepstrum, which separates the glottal source from the vocal-tract filter. The spectrum is shown in diagram (a), with magnitude along the y-axis. Diagram (b) uses the log of the magnitude. If you look closely, the wave fluctuates roughly 8 times between 1000 and 2000; in fact, it fluctuates about 8 times for every 1000 units, which corresponds to the source vibration of the vocal folds at around 125 Hz.

As can be seen, the log spectrum (shown in the first diagram below) is composed of information about the pitch and information about the phone (the third diagram). The peaks in the second diagram indicate the formants that distinguish phones. But how can we separate them?

Recall that a period in the time or frequency domain is inverted after the transformation.

Remember that the pitch information has short periods (rapid fluctuations) in the frequency domain. To separate the formants from the pitch information, we apply an inverse Fourier Transform. The pitch information then shows up in the middle and on the right side, as seen below; the peak in the middle actually corresponds to F0, while the far left contains the information about the phones.

Therefore, for speech recognition we can simply keep the coefficients on the far left and ignore the others. In fact, MFCC uses only the first 12 cepstral values. These 12 coefficients have another significant property: because the log power spectrum is real and symmetric, its inverse DFT is equivalent to a discrete cosine transform (DCT).
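With SciPy, the cepstral step reduces to a DCT over the log filterbank energies; keeping 12 coefficients as described above (whether to drop the 0th coefficient or replace it with the frame energy varies between implementations):

```python
from scipy.fftpack import dct

# DCT-II of the log Mel energies; the low-order coefficients describe the
# spectral envelope (the vocal-tract filter) rather than the pitch.
mfcc = dct(log_mel_energies, type=2, axis=1, norm='ortho')[:, 1:13]  # 12 coefficients per frame
```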

Dynamic features (delta)

MFCC has 39 features. We have accounted for 12 so far; what are the rest? The 13th parameter is the energy in each frame, which helps us identify phones.

Context and dynamic information are crucial to pronunciation. Articulations such as stop closures and releases can be identified from the transitions in the formants. Describing how the features change over time provides the context for a phone. The delta values d(t) below account for another 13 values: they measure how the features change between the previous and the following frame, i.e. the features’ first-order derivative.

The final 13 parameters represent the dynamic changes of d(t) between the previous and the following frame; they act as the second-order derivative of c(t).

Therefore, the 12 cepstral coefficients and the energy term make up the first 13 MFCC parameters; together with the two further sets of 13 values, the delta and the double delta, they form the 39 MFCC features.
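A sketch of the delta and double-delta computation using a standard regression window (the window size N=2 and the `mfcc_with_energy` array of 13 static features per frame are assumptions for illustration):

```python
import numpy as np

def delta(features, N=2):
    """First-order time derivative of a (num_frames, dim) feature matrix."""
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n ** 2 for n in range(1, N + 1))
    return np.array([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(len(features))
    ])

deltas = delta(mfcc_with_energy)          # mfcc_with_energy: (num_frames, 13)
double_deltas = delta(deltas)
features_39 = np.hstack([mfcc_with_energy, deltas, double_deltas])  # (num_frames, 39)
```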

Cepstral mean and variance normalization

We can then carry out feature normalization: we subtract the mean from each feature and divide by its standard deviation (variance normalization). The mean and variance are computed for each feature value j over all the frames of a single utterance. This lets us adjust the values to account for variations between recordings.

However, this may not be accurate if the audio sample is short. Instead, we might compute the mean and variance per speaker, or even over the full training dataset. Note that this kind of feature normalization effectively undoes the pre-emphasis applied earlier. This is how MFCC features are extracted. Last but not least, MFCC is not very robust against noise.
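A per-utterance version of this normalization might look as follows (dividing by the standard deviation, as most implementations do):

```python
import numpy as np

def cmvn(features, eps=1e-10):
    """Cepstral mean and variance normalization over all frames of one utterance."""
    return (features - features.mean(axis=0)) / (features.std(axis=0) + eps)

normalized = cmvn(features_39)  # features_39 from the previous step
```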

