Computing Mel Frequency Cepstral Coefficients without filter-banks and beyond

Asheesh Sharma
15 min read · Feb 13, 2020


Audio feature engineering is one of the most interesting subjects in speech processing. As one of the most popular features, Mel Frequency Cepstral Coefficients (MFCCs) have been used to achieve state-of-the-art results in many speech processing tasks. In a nutshell, MFCCs are computed using filter-banks designed to mimic the human perception of speech. The filter-banks sample the power spectra computed by a Short-Time Fourier Transform on overlapping frames obtained by slicing the signal. To calculate the MFCCs, a Discrete Cosine Transform (DCT) is applied to the logarithm of the result. Some of the resulting coefficients are retained while the rest are discarded.

From the perspective of compression, the process of computing MFCCs is a lossy one. The cosine transform used to decorrelate the filter-bank coefficients also whitens the spectrum [1]. However, because the transform is linear, it discards the highly non-linear components of the signal.

Back in the days of Gaussian Mixture Models, this was a necessary step because such models were quite susceptible to highly correlated input, even more so in noisy conditions. The efficiency of MFCCs can also be challenged at the filter-bank step: filter-banks almost always take more calculation and processing time than discrete Fourier analysis using the FFT [2]. Another crucial aspect is that phase information is not considered in MFCCs [4]. More about the disadvantages and the details of the traditional method can be found in [2] and [1] respectively.

In some cases, where we want an algorithm to be robust to individual voice intonations, the traditional approach of calculating MFCCs (see Figure 1) uses Vocal Tract Length Normalisation to transform the power spectrum and remove the contribution of individual intonations. Possible disadvantages of this approach include discretization and interpolation errors [3].

Figure 1: Comparison of the traditional MFCC computation (left) with the integrated approach (right) investigated in [3]. (Source: [3])

In this article, I am going to implement a technique (see Figure 1) presented by Molau et al., which skips the calculation of filter-banks and integrates the subsequent steps of the traditional approach into a direct cepstrum transformation [3]. This also gives us more robust control over the contribution of intonation to the MFCCs. Furthermore, I will compare it with the traditional approach in terms of signal reconstruction.

But before I do that, we should briefly discuss some enhancement techniques commonly used in conjunction with audio feature engineering in general.

Signal pre-processing

Essentially, sound is a rapid pressure wave travelling through a medium. Here we are interested in processing such waves in the digital domain, after they have been captured by microphones and sampled at a certain sampling rate.

Framing, Voice activity detection, and Windowing

For our analysis, we are going to assume that, over a small enough number of samples, a speech signal is time-invariant. Intuitively, a spoken sentence consists of phonemes which are stationary in nature (try saying æ). Therefore, important information can be extracted by splitting the signal into sufficiently small segments, or frames, such that each segment contains roughly a single phoneme. This process is called framing.

Another advantage of framing a signal is that it helps in preserving locally important frequency contours. Since the frequencies also change over time, doing a Fourier transform over the entire signal makes little sense.
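As a concrete illustration, here is a minimal framing sketch in NumPy; the function name and the 25 ms / 10 ms values are purely illustrative, not the settings used later in the article.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    # Slice a 1-D signal into overlapping frames of frame_len samples,
    # advancing by hop samples between consecutive frames
    # (assumes len(x) >= frame_len).
    num_frames = 1 + (len(x) - frame_len) // hop
    idx = hop * np.arange(num_frames)[:, None] + np.arange(frame_len)[None, :]
    return x[idx]

# e.g. 25 ms frames with a 10 ms shift at 8 kHz:
# frames = frame_signal(signal, frame_len=200, hop=80)
```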

Voice activity detection

Voice activity detection (VAD), also known as speech activity detection or speech detection, is a technique used in speech processing in which the presence or absence of human speech is detected. The main uses of VAD are in speech coding and speech recognition. It can facilitate speech processing, and can also be used to deactivate some processes during non-speech section of an audio session: it can avoid unnecessary coding/transmission of silence packets in Voice over Internet Protocol applications, saving on computation and on network bandwidth. [5]

VAD is an important step because it reduces the amount of computation required to process a given signal. Discarding frames that do not contain much information also strengthens the assumption that each frame is stationary for the Fourier transform.

Noise Suppression

Noise suppression is a pretty old topic in speech processing, dating back to at least the 70s. [6]

As the name implies, the idea is to take a noisy signal and remove as much noise as possible while causing minimal distortion to the speech of interest. This is out of scope for this study, but it is essential for maintaining a good signal-to-noise ratio.

Here I am using my own framework, written in C++, for framing, noise suppression, and voice activity detection. However, similar techniques and libraries are available all over the internet [6][7].

Windowing

After slicing the signal into frames, we apply a window function, such as the Hamming window, to each frame. A Hamming window has the following form:

𝑤(𝑛) = 0.54 − 0.46 cos(2𝜋𝑛 / (𝑁 − 1)),  0 ≤ 𝑛 ≤ 𝑁 − 1,

where 𝑁 is the window length.

Windowing for processing applications; Overlap-add

Windowing is a classical method in signal processing and it refers to splitting the input signal into temporal segments. The borders of the segments are then visible as discontinuities, which are incongruent with the real-world signal. To reduce the impact of segmenting on the statistical properties of the signal, we apply windowing to the temporal segments. Windowing functions are smooth functions which go to zero at the borders. By multiplying the input signal with a window function, the windowed segment also goes to zero at the borders, so that the discontinuity at the border becomes invisible. Windowing thus does change the signal, but the change is designed such that its effect on the signal statistics is minimised. There are two distinct applications of windowing with different requirements: 1) analysis and 2) processing. In analysis, we only care about extracting information as accurately as possible given computational constraints, while in processing applications we additionally need the ability to recreate the signal from a sequence of windows…
When we intend to modify the windowed signal with some processing, the most common approach is to use a technique known as overlap-add. As seen in the figure on the right, in overlap-add, we extract overlapping windows of the signal, apply some processing, and reconstruct by windowing a second time and then adding overlapping segments together.
An obvious requirement is then that, if the signal is not modified, we should be able to reconstruct the original signal perfectly; this is known as the perfect reconstruction property. It is straightforward to demonstrate that perfect reconstruction is achieved if the overlapping regions of the windowing function add up to unity. Note that here we need to take into account that the windowing is applied twice. That is, we obtain perfect reconstruction if the Princen-Bradley criterion is satisfied [10]:

𝑤²(𝑛) + 𝑤²(𝑛 + 𝑁/2) = 1 for all 𝑛,

where consecutive frames overlap by half the window length 𝑁.

This might come as a surprise, but the default windows in most software packages (MATLAB, Python) do not satisfy this condition out of the box.

Figure 2: As can be seen, the equal end-points sum to form an impulse in each frame of the overlap-add for both the cases of 𝑁 being odd or even. The blue (𝑁=33) and black (𝑁=34) curves indicate the reconstruction (overlap-add) which should be flat at the equal end-points.
Figure 3: Rectification of the impulses.
Gist 1: Function to make numpy.hamming work according to the Princen-Bradley criteria.

When using the windows provided by the NumPy library, the Princen-Bradley condition should therefore be checked carefully. Figure 2, for example, shows that there is a problem with the standard Hamming window. The problem can be solved by understanding how numpy.hamming works under the hood: it outputs a symmetric Hamming window. To account for an even window length 𝑁, one can calculate a Hamming window of length 𝑁+1 and drop the last sample. If 𝑁 is odd, the window can be calculated with 𝑁 samples, after which the first and last samples are divided by two.

An important advantage of this approach is that it also yields the number of overlapping samples between frames (the frame shift) to use for framing.
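Gist 1 is embedded in the original post; since it may not render here, below is a minimal sketch of the same idea under the assumptions described above (50% overlap, window applied once per frame). The function name is mine.

```python
import numpy as np

def pb_hamming(N):
    # Hamming window adjusted, as described above, so that overlap-added
    # copies sum to a constant, together with the corresponding frame shift R.
    if N % 2 == 0:
        # even N: take N + 1 symmetric samples and drop the last one
        w = np.hamming(N + 1)[:-1]
    else:
        # odd N: keep N samples, but halve the equal end-points
        w = np.hamming(N)
        w[0] /= 2.0
        w[-1] /= 2.0
    R = N // 2  # assumed 50% overlap between consecutive frames
    return w, R
```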

Using this correction (Gist 1), the impulses at the equal end-points can be rectified, as shown in Figure 3.

Figure 4: End result of the pre-processing in time domain.
Gist 2: The processing pipeline.

In Gist 2, I am using a 16-bit PCM wav file, OSR_us_000_0010_8k.wav, which has a sampling frequency of 8000 Hz [8]. The file is a clean speech signal comprising a single voice uttering some sentences with pauses in-between. For simplicity, I use the first 3.5 seconds of the signal, which corresponds roughly to the first sentence in the wav file. First, I calculate the Hamming window using Gist 1, which also returns a value R that I use to calculate the ideal window shift based on the duration of every frame in milliseconds. Then, I use a custom voice activity detector that I wrote in C++, which has several other functionalities such as in-built noise suppression and signal slicing. The vadHandle object takes several parameters such as noiseSupression, frameDurationMs, and windowShiftMS (which should have been named frameShiftMS, apologies!).

Finally, I call vadHandle.getVoicedFrames to retrieve the frames along with their timestamps, and apply the Hamming window to the first 3.5 seconds' worth of frames.
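The C++ framework (and therefore the vadHandle object from Gist 2) is not public, so as a rough stand-in here is what the VAD part of the pipeline could look like with the py-webrtcvad package from [7]. The 30 ms non-overlapping frames, the aggressiveness setting, and the omission of noise suppression are simplifying assumptions.

```python
import numpy as np
import webrtcvad  # [7]

def voiced_frames(signal_int16, fs=8000, frame_ms=30, aggressiveness=2):
    # Keep only the frames that the WebRTC VAD classifies as speech.
    # signal_int16: mono 16-bit PCM samples as a NumPy int16 array.
    vad = webrtcvad.Vad(aggressiveness)
    frame_len = int(fs * frame_ms / 1000)
    frames = []
    for start in range(0, len(signal_int16) - frame_len + 1, frame_len):
        frame = signal_int16[start:start + frame_len]
        if vad.is_speech(frame.tobytes(), fs):
            frames.append(frame.astype(np.float64))
    return np.array(frames)
```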

Figure 5: Pre-processing pipeline commonly used for feature extraction. Signal is first sliced into overlapping frames. Followed by noise suppression and subsequent voice activity detection. Finally the frames are windowed.

With all of the steps above, we have the pre-processing pipeline as shown in Figure 5. The result of this processing (as shown in Figure 4) gives us noise-suppressed voiced frames that can be used for a variety of speech processing tasks.

Fourier-Transform and Power Spectrum

We can now take an 𝑁fft-point FFT of each frame to calculate its frequency spectrum; applied frame by frame, this is the Short-Time Fourier Transform (STFT). 𝑁fft is typically 256 or 512; here NFFT = 512. We then compute the power spectrum (periodogram) of each frame using the following equation:

𝑃 = |FFT(𝑥ᵢ)|² / 𝑁fft,

where 𝑥ᵢ is the 𝑖-th frame of the signal.

By windowing and taking the discrete Fourier transform (DFT) of each window, we obtain the short-time Fourier transform (STFT) of the signal. Specifically, for an input signal 𝑥 and window 𝑤, the transform is defined as:

𝑋(ℎ, 𝑘) = Σᵢ 𝑤(𝑖) 𝑥(𝑖 + ℎ𝑅) exp(−𝑗2𝜋𝑖𝑘/𝑁),

where ℎ is the frame index, 𝑘 the frequency bin, 𝑅 the frame shift, and the sum runs over the 𝑁 samples of the window.

The STFT is one of the most frequently used tools in speech analysis and processing. It describes the evolution of frequency components over time. Like the spectrum itself, one of the benefits of STFTs is that its parameters have a physical and intuitive interpretation.

Figure 6: Spectrogram of OSR_us_000_0010_8k.wav
Gist 3: Process of calculating power spectrum

Like the spectrum, the output of the STFT is complex-valued; but where the spectrum is a vector, the STFT output is a matrix. As a consequence, we cannot visualise the complex-valued output directly. Instead, STFTs are usually visualised using their log-spectra, 20 log₁₀ |𝑋(ℎ,𝑘)|. Such two-dimensional log-spectra can then be visualised as a heat-map known as a spectrogram (Figure 6).
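Since Gist 3 is embedded in the original post, here is a minimal sketch of the same step, assuming frames holds the windowed frames from the previous section.

```python
import numpy as np

NFFT = 512

def power_spectrum(frames, nfft=NFFT):
    # Periodogram of each windowed frame: |FFT|^2 / NFFT, keeping the
    # nfft/2 + 1 non-negative frequency bins.
    mag = np.abs(np.fft.rfft(frames, n=nfft, axis=1))
    return (mag ** 2) / nfft

# The spectrogram of Figure 6 is then just the log of the power spectrum, e.g.
# spec_db = 10.0 * np.log10(power_spectrum(frames) + np.finfo(float).eps)
```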

Computing MFCCs Directly On The Power Spectrum

Molau et al. were very comprehensive in explaining the process, so I am going to quote them directly [3].

We have investigated an alternative method to compute Mel-frequency warped cepstral coefficients directly on the power spectrum and thereby avoid possible problems of the standard approach.
Ignoring any spectral warping for a moment, cepstral coefficients cₖ can be derived by Eq. (1):

Depending on whether a filter-bank is used or not, |𝑋(⋅)| stands for either the filter-bank outputs or the power spectrum.

To incorporate warping directly into the cosine transformation, we change the integration variable and use the derivative of the warping function 𝑑𝜔̃ /𝑑𝜔 Eq. (3). The continuous integral is later approximated in the standard way by a discrete sum Eq. (4):

One specific type of frequency warping is the Mel-frequency scaling 𝜇(⋅), which is usually carried out according to Eq. (5) with the sampling frequency fₛ:

For integration into the cosine transformation, the Mel-warping function needs to be normalised in order to meet the 𝜇̃ (𝜋) = 𝜋 constraint, which gives Eq. (6).

Replacing 𝑔(⋅) in Eq. (4) by 𝜇̃ (⋅) leads to a compact implementation of MFCC computation with only a few lines of code. A look-up table for constants like the derivative and the cosine term can be precomputed, all that remains is a matrix multiplication on the logarithm of the power spectrum.

Implementation: Computing 𝜇̃ (𝜔) and 𝜇̃ ′(𝜔)

Let us start by implementing the calculation of 𝜇̃ (𝜔) and 𝜇̃ ′(𝜔) using Eq. (6). Replacing 𝑔(⋅) in Eq. (4) by 𝜇̃ (⋅) gives us the following form:

Given 𝜇̃ (𝜔) from Eq. (6), we can calculate its derivative 𝜇̃ ′(𝜔) either numerically using numpy.diff of 𝜇̃ (𝜔) or directly as follows:

Note that using Eq. (6) and Eq. (7), 𝜇̃ (𝜔) and 𝜇̃ ′(𝜔) can be computed separately for a given value of 𝜔 as shown in Gist 4.

Gist 4: Implementation of Equation 6 and 7.
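Gist 4 is embedded in the original post; as a stand-in, here is a minimal sketch assuming the standard Mel scale Mel(𝑓) = 2595 log₁₀(1 + 𝑓/700) and the normalisation 𝜇̃ (𝜋) = 𝜋 described above. The function names are mine.

```python
import numpy as np

def mel(omega, fs):
    # Standard Mel scale evaluated at angular frequency omega in [0, pi],
    # where f = omega * fs / (2 * pi) is the corresponding frequency in Hz.
    return 2595.0 * np.log10(1.0 + (omega * fs / (2.0 * np.pi)) / 700.0)

def mel_warp(omega, fs):
    # mu_tilde(omega): the Mel warping rescaled so that mu_tilde(pi) = pi (Eq. 6).
    return np.pi * mel(omega, fs) / mel(np.pi, fs)

def mel_warp_deriv(omega, fs):
    # mu_tilde'(omega): analytic derivative of the normalised warping (Eq. 7).
    a = fs / (2.0 * np.pi * 700.0)
    return (np.pi / mel(np.pi, fs)) * (2595.0 / np.log(10.0)) * a / (1.0 + a * omega)
```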

Implementation: Building the transformation matrix 𝐀

From a computational perspective, we can precompute constants like 𝜇̃ (𝜔), 𝜇̃ ′(𝜔), and the cosine term in Eq. (4) into a transformation matrix 𝐀. Firstly, the cosine term in Eq. (4) can be formulated as a matrix 𝐂 as follows:

About 𝑁 and 𝐾

Now, we can calculate 𝜇̃ (𝜔) and 𝜇̃ ′(𝜔) for all values of 𝜔. First, note that, after substitution of 𝑔(⋅) in Eq. (4), 𝜔 → 𝜔̃ = 𝑔(𝜔) = 𝑔(2𝜋𝑛/𝑁); that is, 𝜔 becomes 2𝜋𝑛/𝑁, where 𝑁 is the number of spectral lines of the discrete Fourier spectrum, as mentioned in [9].

In the case of continuous spectra, there may be no upper limit for 𝑁 and 𝐾. We have assumed that the original spectrum can be represented by a finite number of cepstral coefficients, for instance if it has been cepstrally smoothed already. In practice, however, we work with discrete spectra. Hence, 𝑁 and 𝐾 will be finite and equal to the number of spectral lines of the discrete Fourier spectrum. This number can be further reduced for cepstral smoothing.

Gist 5: Calculation of equation 8 and 9

Important: the number of spectral lines is always half of the selected FFT size. In the original paper, the upper limit of 𝑛 in Eq. (4) is 𝑁/2−1, which clashes with the definition of 𝑁 as the number of spectral lines in the paper quoted above; only one of the two can be true. I therefore use 𝑁 to mean the number of spectral lines, which obviously affects Eq. (4). I have also assumed that the summation term, which is divided by 𝑁, is actually supposed to be divided by the FFT size 𝑁fft, because that term is meant to normalise 𝑐ₖ over the entire spectrum, whereas the term 2𝜋𝑛/𝑁 is meant to normalise the Mel-warping function so that it meets the 𝜇̃ (𝜋) = 𝜋 criterion (see Gist 5).

Implementation: Integrating the transformation matrix 𝐀 into Eq. (4)

Gist 6: Calculation of equation 10.
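Gists 5 and 6 are embedded in the original post; below is a minimal sketch of how 𝐂 and 𝐀 could be assembled under the conventions chosen above (𝑁 = 𝑁fft/2 spectral lines, 𝜔ₙ = 2𝜋𝑛/𝑁fft, normalisation by 𝑁fft), reusing the helper functions from the previous sketch. The normalisation details are my assumptions, not necessarily the exact ones used in the original gists.

```python
import numpy as np

def build_transformation_matrix(nfft, fs, K):
    # A maps the log power spectrum of one frame (nfft // 2 spectral lines)
    # to cepstral coefficients c_0 .. c_{K-1}.
    N = nfft // 2                               # number of spectral lines
    omega = 2.0 * np.pi * np.arange(N) / nfft   # 0 .. pi (exclusive)
    mu = mel_warp(omega, fs)                    # warped frequencies
    dmu = mel_warp_deriv(omega, fs)             # warping derivative
    k = np.arange(K)[:, None]
    C = np.cos(k * mu[None, :])                 # cosine matrix C, shape (K, N)
    A = C * dmu[None, :] / nfft                 # fold in the derivative and 1/nfft
    return A
```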

Influence of 𝐾 on the cosine matrix 𝐂
According to the original paper, 𝐾 should also be equal to the number of spectral lines in the Fourier transform. However, it was noted that 𝐾 can be further reduced for cepstral smoothing. It is therefore very similar to the number of filters used during the calculation of filter banks in the traditional MFCC extraction approach (see Figure 1). Each filter in the filter bank is triangular, with a response of 1 at the centre frequency that decreases linearly towards 0 until it reaches the centre frequencies of the two adjacent filters, where the response is 0. The non-linear nature of the cosine matrix 𝐂 gives it a non-triangular response over [−𝜋,𝜋] (the 𝜔 scale), as opposed to the linear, triangular response over [0,1] (the mel scale) of the filter banks.

Figure 7: Response of C. Notice the resemblance with the filter-banks.

Like the filter banks, however, each row of 𝐂 has a maximum response at a centre frequency which decreases non-linearly until it reaches the centre frequencies of the two adjacent responses, as shown in Figure 7.

Gist 7: Computing MFCC using equation 10.

Since we discard the fast-changing coefficients, which do not contribute much to speech recognition, there is no empirical reason to compute with the full 𝐾 = 𝑁: the retained coefficients remain the same. So we simply set 𝐾 equal to the number of filter-banks, NFILT = 40, and compute the MFCCs (see Figure 8) as shown in Gist 7.
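Putting the pieces together, here is a rough stand-in for Gist 7; the parameter values and the number of retained coefficients are my assumptions, and the helper functions come from the earlier sketches.

```python
import numpy as np

NFILT = 40                      # rows of A, analogous to the number of filters
A = build_transformation_matrix(nfft=512, fs=8000, K=NFILT)

pow_frames = power_spectrum(frames)                       # (num_frames, NFFT/2 + 1)
log_spec = np.log(pow_frames[:, :A.shape[1]] + np.finfo(float).eps)
mfcc = log_spec @ A.T                                     # (num_frames, NFILT)
mfcc = mfcc[:, :13]             # keep only the slow-changing coefficients
```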

Figure 8: Resulting MFCCs.

As you can see in Figure 8, the resulting MFCCs from librosa and from the method discussed here are quite similar. In the bottom plot, the higher-order coefficients are shifted a little because of the smooth response (see Figure 7), as opposed to the triangular response of the traditional approach.
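For reference, the librosa MFCCs used in the comparison of Figure 8 can be produced along these lines; the exact parameter values below are my assumptions for matching the setup above.

```python
import librosa

# Load the first 3.5 seconds of the test file and compute reference MFCCs.
y, sr = librosa.load("OSR_us_000_0010_8k.wav", sr=8000, duration=3.5)
mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=512, hop_length=80, n_mels=40)
```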

Reconstruction of audio from MFCCs

I think Gist 8 is pretty self-explanatory. First, we invert the logarithm operation, which gives the power spectrum. Second, we calculate a DCT matrix to invert the filter-bank operation. The dot product of this matrix and the MFCCs gives us an approximate squared-magnitude STFT of the original signal. We then multiply the result with the STFT of white noise to mimic excitation. The result is inverted using the ISTFT and stretched in the time domain to the length of the original signal.

Gist 8: Re-constructing audio from MFCCs
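Gist 8 is embedded in the original post. As an alternative, librosa ships an off-the-shelf inverse for the traditional path; note that it estimates phase with Griffin-Lim rather than the white-noise excitation described above, and the parameter values are my assumptions (mfcc_librosa comes from the earlier snippet).

```python
import librosa
import soundfile as sf

# Invert the librosa MFCCs back to a waveform (requires librosa >= 0.7).
audio_rec = librosa.feature.inverse.mfcc_to_audio(
    mfcc_librosa, n_mels=40, sr=8000, n_fft=512, hop_length=80)
sf.write("reconstructed_librosa.wav", audio_rec, 8000)
```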

Conclusion

Audio 1 and Audio 2 feature two speakers (one female, one male) saying "She had your dark suit in greasy wash water all year", reconstructed using the direct and the traditional method respectively. Notice that Audio 1 carries no perceivable information about gender, whereas Audio 2 still does. This is an exciting result if we want to use the direct method for applications that would otherwise need more data just to account for speaker variability. Furthermore, if we wanted to further reduce or increase that variability (for applications like speaker diarization), we could apply integrated Vocal Tract Length Normalization, which we will see in a different article.

By the way, the direct method is about three times faster than librosa!

100 loops, best of 3: 12.3 ms per loop
100 loops, best of 3: 38.4 ms per loop
Audio 1: Using MFCCs computed on power spectrum directly
Audio 2: Using MFCCs computed using librosa

Phew! We are practically done here. I am planning to continue this article as a series; I have left out any discussion of Vocal Tract Length Normalization and of how we can introduce phase information into MFCCs. If you would like to see that happen, or if you have any suggestions or corrections, please reach out to me in the comments or on LinkedIn. Thank you for your time.

[1] Fayek, H. (2016). Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What’s In-Between. [online] Haytham Fayek. Available at: https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html [Accessed 8 Nov. 2019].

[2] Phon.ucl.ac.uk. (2019). Introduction to Computer Programming with MATLAB. [online] Available at: https://www.phon.ucl.ac.uk/courses/spsci/matlab/lect9.html [Accessed 8 Nov. 2019].

[3] Molau, S., Pitz, M., Schlüter, R. and Ney, H. (2001). Computing Mel-frequency cepstral coefficients on the power spectrum. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2001), vol. 1, pp. 73–76. doi: 10.1109/ICASSP.2001.940770.

[4] Wang, L., Minami, K., Yamamoto, K. and Nakagawa, S. (2010). Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions. IEICE Transactions on Information and Systems, E93-D(9), pp.2397–2406.

[5] En.wikipedia.org. (2019). Voice activity detection. [online] Available at: https://en.wikipedia.org/wiki/Voice_activity_detection [Accessed 8 Nov. 2019].

[6] Valin, J. (2017). RNNoise: Learning Noise Suppression. [online] People.xiph.org. Available at: https://people.xiph.org/~jm/demo/rnnoise/ [Accessed 8 Nov. 2019].

[7] Wiseman, J. (2019). Python interface to the WebRTC Voice Activity Detector. [online] GitHub. Available at: https://github.com/wiseman/py-webrtcvad [Accessed 8 Nov. 2019].

[8] Voiptroubleshooter.com. (2019). Harvard Sentences American English. [online] Available at: http://www.voiptroubleshooter.com/open_speech/american.html [Accessed 8 Nov. 2019].

[9] Pitz, M. and Ney, H. (2005). Vocal tract normalization equals linear transformation in cepstral space. IEEE Transactions on Speech and Audio Processing, 13(5), pp.930–944.

[10] Wiki.aalto.fi. (2019). Windowing — Introduction to Speech Processing — Aalto University Wiki. [online] Available at: https://wiki.aalto.fi/display/ITSP/Windowing [Accessed 8 Nov. 2019].
