How to start with Kaldi and Speech Recognition
The best way to achieve a state-of-the-art speech recognition system
What is Kaldi?
Kaldi is an open source toolkit made for dealing with speech data. It is used in voice-related applications, mostly for speech recognition, but also for other tasks such as speaker recognition and speaker diarisation. The toolkit is already pretty old (around 7 years old) but is still constantly updated and further developed by a fairly large community. Kaldi is widely adopted in both academia (400+ citations in 2015) and industry.
Kaldi is written mainly in C++, but the toolkit is wrapped with Bash and Python scripts. For basic usage, this wrapping spares you the need to dig too deep into the source code. Over the course of the last 5 months I learned about the toolkit and how to use it. The goal of this article is to guide you through that process and point you to the materials that helped me the most. See it as a shortcut.
This article will give you a general understanding of the training process of a speech recognition model in Kaldi, along with some of the theoretical aspects of that process.
This article is not a hands-on tutorial: apart from a few small illustrative sketches, it won’t cover the actual commands for doing these things in practice. For that, you can read the “Kaldi for Dummies” tutorial or other material online.
The three parts of Kaldi
Preprocessing and Feature Extraction
Today most of the models that deal with audio data work on some spectrogram-like representation of that data rather than on the raw waveform. When you extract such a representation, you mostly want features that are good for two things:
- Identifying the sound of human speech
- Discarding any unnecessary noise.
Over the years there have been several attempts at designing such features, and today MFCCs are the ones most widely used in the industry.
MFCC stands for Mel-Frequency Cepstral Coefficients, and it has been almost a standard in the industry since it was invented in the 80s by Davis and Mermelstein. You can get a better theoretical explanation of MFCCs in this amazing, readable article. For basic usage, all you need to know is that MFCCs take into account only the frequencies that our ears hear best.
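To make this concrete, here is a minimal sketch of how MFCC extraction is usually invoked in a Kaldi recipe. The directory names (data/train, exp/make_mfcc, mfcc) are assumptions based on the common recipe layout, so check the run script of your recipe for the exact paths:
# Extract MFCCs for every utterance listed in data/train (standard recipe wrapper).
steps/make_mfcc.sh --nj 4 --cmd "run.pl" data/train exp/make_mfcc/train mfcc
# The same step at the binary level: read wav.scp and write a feature archive.
compute-mfcc-feats --config=conf/mfcc.conf scp:data/train/wav.scp ark:mfcc/raw_mfcc_train.ark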
In Kaldi we use two more features:
- CMVN, which is used for better normalization of the MFCCs (a minimal command sketch follows this list)
- I-Vectors (which deserve an article of their own), which are used for better understanding of the variances inside the domain, for example by creating a speaker-dependent representation. I-Vectors are based on the same ideas as JFA (Joint Factor Analysis), but are better suited to capturing both channel and speaker variances. The math behind I-Vectors is clearly described here and here.
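As a rough sketch of the CMVN step, assuming the standard recipe layout (a data/train directory with utt2spk and spk2utt mappings), per-speaker statistics are computed over the MFCCs and then applied to each utterance:
# Compute per-speaker mean/variance statistics over the MFCCs (standard recipe wrapper).
steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc
# The same step at the binary level: accumulate stats per speaker, then normalize each utterance.
compute-cmvn-stats --spk2utt=ark:data/train/spk2utt scp:data/train/feats.scp ark:mfcc/cmvn_train.ark
apply-cmvn --utt2spk=ark:data/train/utt2spk ark:mfcc/cmvn_train.ark scp:data/train/feats.scp ark:feats_cmvn.ark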
For a basic understanding of these concepts, remember the following things:
- MFCCs and CMVN are used to represent the content of each audio utterance.
- I-Vectors are used to represent the style of each audio utterance or speaker.
The Model
The matrix math behind Kaldi is implemented with BLAS and LAPACK (written in Fortran!), or with an alternative GPU implementation based on CUDA. Because it builds on such low-level packages, Kaldi is highly efficient at these tasks.
Kaldi’s model can be divided into two main components:
The first part is the Acoustic Model, which used to be a GMM but has now widely been replaced by deep neural networks. This model transcribes the audio features that we created into a sequence of context-dependent phonemes (in Kaldi dialect we call them “pdf-ids” and represent them by numbers).
The second part is the Decoding Graph, which takes the phonemes and turns them into lattices. A lattice is a representation of the alternative word sequences that are likely for a particular piece of audio. This is generally the output you want from a speech recognition system. The decoding graph takes into account the grammar of your data, as well as the distribution and probabilities of specific word sequences (n-grams).
The decoding graph is essentially a WFST, and I highly encourage anyone who wants to get serious about Kaldi to learn this subject thoroughly. The easiest way to do that is through those videos and this classic article. After understanding both of those, you will find it much easier to understand how the decoding graph works. In the Kaldi project this composition of different WFSTs is called the “HCLG.fst” file, and it’s based on the OpenFst framework.
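As a rough idea of how this looks in practice, the HCLG graph is usually built by a single recipe script once you have a lang directory and a trained acoustic model, and decoding with it produces the lattices mentioned above. The directory names below are assumptions based on the standard recipe layout:
# Compose H, C, L and G into the final decoding graph (HCLG.fst) under exp/tri1/graph.
utils/mkgraph.sh data/lang_test exp/tri1 exp/tri1/graph
# Decode a test set with that graph; the lattices end up in exp/tri1/decode_test.
steps/decode.sh --nj 4 --cmd "run.pl" exp/tri1/graph data/test exp/tri1/decode_test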
Worth noting: this is a simplification of the way the model works. There is actually a lot of detail about connecting the two models with a decision tree and about the way the phonemes are represented, but this simplification should help you grasp the process.
You can learn about the entire architecture in depth in the original article describing Kaldi, and about the decoding graph specifically in this amazing blog.
The Training Process
In general, this is the trickiest part. In Kaldi you’ll need to arrange your transcribed audio data in a very specific layout that is described in depth in the documentation.
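To give a feel for that layout, here is a hypothetical minimal data directory with a single utterance; the utterance ID, speaker ID, and audio path are made up for illustration:
# data/train/wav.scp: utterance ID followed by the audio path (or a command that produces the audio)
utt001 /path/to/audio/utt001.wav
# data/train/text: utterance ID followed by its transcription
utt001 FOUR FIVE NINE
# data/train/utt2spk: utterance ID followed by its speaker ID
utt001 speaker01
# utils/fix_data_dir.sh sorts these files, and utils/utt2spk_to_spk2utt.pl builds spk2utt from utt2spk.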
After arranging your data, you’ll need a mapping from each word to the phonemes that make it up. This mapping is called the “dictionary” (or lexicon), and it determines the outputs of the acoustic model. Here is an example of such a dictionary:
eight -> ey t
five -> f ay v
four -> f ao r
nine -> n ay n
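In a Kaldi recipe this dictionary usually lives in a file called lexicon.txt (one word per line followed by its phonemes, without the arrows), and it is compiled into the lang directory with a recipe utility. The paths below are assumptions based on the common recipe layout, and a real dict directory also needs a few extra files such as silence_phones.txt and nonsilence_phones.txt:
# data/local/dict/lexicon.txt
eight ey t
five f ay v
four f ao r
nine n ay n
# Compile the dictionary (plus the phone lists) into data/lang; "<UNK>" is the out-of-vocabulary word.
utils/prepare_lang.sh data/local/dict "<UNK>" data/local/lang data/lang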
When you have both of those things at hand, you can start training your model. The different training pipelines are called “recipes” in Kaldi dialect. The most widely used recipe is the WSJ recipe, and you can look at its run bash script for a better understanding of it.
In most of the recipes we start by aligning the phonemes to the audio with a GMM. This basic step (named “alignment”) helps us determine the sequence that we want our DNN to spit out later.
After the alignment we will create the DNN that will form the Acoustic Model, and we will train it to match the alignment output. After creating the acoustic model we can build the decoding graph (the WFST) that transforms the DNN output into the desired lattices.
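Sketching those first training steps with the standard GMM scripts (the numbers of tree leaves and Gaussians, and the directory names, are illustrative assumptions; your recipe will set its own):
# Train a monophone GMM system and align the training data with it.
steps/train_mono.sh --nj 4 --cmd "run.pl" data/train data/lang exp/mono
steps/align_si.sh --nj 4 --cmd "run.pl" data/train data/lang exp/mono exp/mono_ali
# Use those alignments to train a stronger triphone (context-dependent) GMM.
steps/train_deltas.sh 2000 10000 data/train data/lang exp/mono_ali exp/tri1
# A DNN acoustic model is then trained on alignments from the best GMM system
# (see the nnet3/chain scripts in the recipe you are following).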
“Wow, that was cool! What can I do next?”
- Try it out.
- Read more.
How to try
Download this Free Spoken Digit Dataset, and just try to train Kaldi with it! You should probably try to loosely follow this. You can also just start from one of the many recipes that ship with Kaldi.
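If you want a rough starting point, getting and building Kaldi looks roughly like this (the build details vary by platform, so treat it as a sketch and follow the INSTALL files in the repository):
git clone https://github.com/kaldi-asr/kaldi.git
cd kaldi/tools && make            # build the third-party tools (OpenFst, etc.)
cd ../src && ./configure && make  # build Kaldi itself
# Sanity-check your setup on the tiny built-in "yesno" recipe before moving on to the digits data:
cd ../egs/yesno/s5 && ./run.sh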
If you succeed, try to get more data. If you fail, try asking questions in the Kaldi help group.
Where to read more
I cannot emphasize enough how much this 121-slide presentation helped me; it’s based mostly on this series of lectures. Another great source is everything on Josh Meyer’s website. You can find those links and many more in my Github-Kaldi-awesome-list.
Try reading through the forums, try to dig deeper into the code, and try to read more articles. I’m pretty sure you will have some fun. :)
If you have any questions, feel free to ask them here or contact me through my Email, and feel free to follow me on Twitter or on LinkedIn.