Computing MFCC voice recognition features on ARM systems
The LinTO research project is a French PIA “Grands Défis du Numérique” project funded by BPI and supported by the SGPI (Secrétariat Général pour l’Investissement) and the DGE (Direction Générale des Entreprises). It aims at supporting the development of LinTO, an open-source voice assistant for enterprises. LinTO is based on speech recognition, a technology that turns spoken input into commands for a system: you talk to your computer, phone or device, and it uses what you said to trigger some action. This belongs to the broader field of Natural Language Processing (NLP). Kaldi, a well-known open-source toolkit, is currently used at Linagora to implement speech-to-text algorithms.
Communicate with the machine!
It’s easy for humans to communicate through speech when they speak the same language, but it’s really hard for a computer. The only relevant thing in communication is the actual meaning: this is what we call the linguistic content. We need to extract features that isolate it and discard parasitic audio components such as accents, pronunciations, speakers’ emotions, or background noise. Feature extraction is the first step of any automatic speech recognition system.
Humans generate sounds that are filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines what sound comes out. If we can determine the shape, we should be able to get an accurate representation of the phoneme being produced. That’s the job of MFCCs.
MFCCs, What’s this?
MFCC stands for Mel-Frequency Cepstral Coefficients, which are the most widely used features in speech recognition. MFCCs model the spectral energy distribution of the signal in a perceptually meaningful way. Computing them takes five steps.
Step 1: Cut the signal into several overlapping windows. These are called sliding windows, and not the ones in your kitchen...
Step 2: To reduce spectral distortion (leakage at the frame edges), a specific window function is applied to each frame. There are many types of windows, but the best known are Hamming, Hann (also called Hanning) and Blackman.
Step 3: Apply the FFT (Fast Fourier Transform) to each windowed frame and take the magnitude: this gives us the spectrum. The FFT is an algorithm that takes a signal sampled over a period of time (or space) and decomposes it into its frequency components. The frequency spectrum of a signal is the distribution of the amplitudes and phases of each of these components as a function of frequency.
Step 4: Then we move on to the mel scale. This is a psychoacoustic scale of pitch, reflecting how listeners rank sounds from low to high, whose unit is the mel. The mel is related to the hertz (Hz), the SI unit of frequency, by a relationship based on human hearing: it is a frequency scale closer to what the human ear is actually able to discriminate. The conversion formula is rather simple:
m = 2595 · log10(1 + f/700)
To simulate the human ear, the spectrum is passed through a filter bank: a set of filters spread along the mel scale, each with a triangular frequency response.
Step 5: Finally, we work on the cepstrum. We convert the logarithm of the mel-scale spectrum back to a time-like domain using the DCT (Discrete Cosine Transform). A cepstrum is the result of taking the inverse transform of the logarithm of the estimated spectrum of a signal. The name “cepstrum” was derived by reversing the first four letters of “spectrum”.
Design of MFCCs extraction
MFCC extraction can be done in many different ways. I implemented the sequence of mathematical operations used in the Kaldi toolkit.
The particular modification in this algorithm is the use of the Povey window (named after Daniel Povey, Kaldi’s lead author), which is similar to the Hamming window but tapers all the way to zero at the edges. The other steps are the same as those presented in the scientific literature.
The notable point about MFCC computation is that the amount of data required is drastically reduced. Indeed, if we take windows of 25 ms sampled at 16000 Hz, we obtain 400 samples per frame. These samples are “transformed” into MFCCs that represent only 13 (or up to 40) values in most cases. The amount of data to be stored and/or exchanged is thus reduced by at least a factor of ten.
Everything becomes more complicated in an embedded context: we do not have as many hardware resources available as on a classic computer. Since I worked with a Raspberry Pi 3, I targeted the ARM architecture.
There are many computationally heavy mathematical formulas to apply in order to obtain our features. In a real-time context, it is important to verify that this process consumes less CPU time than “real time”. To be explicit: if you capture audio at a sample rate of 16000 Hz, you receive 16000 samples per second, so to keep up in real time you must process those 16000 samples in less than one second.
The aim of this algorithm is to perform these calculations as quickly as possible using as few resources as possible. This is only achievable by exploiting everything we have, including the hardware architecture.
Arm NEON technology is an advanced SIMD (single instruction multiple data) architecture extension for the Arm Cortex-A series and Cortex-R52 processors. NEON technology is intended to improve the multimedia user experience by accelerating audio and video encoding/decoding, user interface, 2D/3D graphics or gaming. NEON can also accelerate signal processing algorithms and functions to speed up applications such as audio and video processing, voice and facial recognition, computer vision and deep learning.
That’s exactly what we need! So we are going to use this extension to speed up our algorithm. Now, we need a library to access it.
Ne10 is a library of common, useful functions that have been heavily optimised for Arm-based CPUs equipped with NEON SIMD capabilities. It provides consistent, well-tested behaviour, allowing for painless integration into a wide variety of applications via static or dynamic linking.
That looks great! Here we go! I used this library to support my code. Speaking of code, I don’t need to remind you that we don’t choose a random language to write embedded code for a target, right? As far as I am concerned, I wrote mine in C++, which gives us the speed and flexibility of C plus the benefits of object-oriented programming. In short, it looks like a good choice.
You can find the NE10 documentation here. The important functions are the real-to-complex FFT, used for the Fast Fourier Transform step, and the matrix-vector multiplication, used to apply the mel filter bank.
This algorithm will be used in LinTO prototype.
As you can see, during a full request the CPU load could rise to nearly 50% of the 4 cores. In the LinTO prototype, this new implementation reduced the CPU load from 45% to 22%! This was made possible by a more suitable language, libraries optimized for the target architecture, and the computing power of the DSP (Digital Signal Processor).
Thank you for reading! I hope you enjoyed it.
Stay tuned! Our finely tuned algorithm will soon be released as open source. Take a look at the official LinTO GitHub; you can already find some voice recognition software there!
Coming next: a dedicated LinTO development board will be made available as an open-hardware component.