Automatic Speech Recognition System using KALDI from scratch

Ravi Pandey
Published in The Startup
Jun 5, 2020

Hello researchers! In this post, we will walk through how to build an automatic speech recognition (ASR) system with Kaldi.


Kaldi is an open-source toolkit for speech recognition, written in C++ and licensed under the Apache License v2.0. We can use it to train speech recognition models and to decode (transcribe) audio files.

Download and Install KALDI

You can skip this section if you have already set up Kaldi.

git clone

Now go to the cloned directory, open the INSTALL file, and compile Kaldi following the instructions given in that file. Kaldi takes a while to install, so use that time to have some dark chocolate coffee. (Did you know that Kaldi was a legendary Ethiopian goatherd who is said to have discovered the coffee plant around 850 AD?)

Let’s Talk about Speech Recognition

A general speech recognition framework works as follows:
1. Process the incoming WAV speech signal.
2. Extract acoustic features (such as MFCCs) from the waveform; the acoustic model maps these features to phones.
3. Link those phones to words through the vocabulary, or lexicon.
4. Use the language model or grammar to define how words can be connected to each other.

Let’s understand the folder structure

The “egs” folder contains example models and scripts for Kaldi. Make a copy of any example folder and rename it. The copy will have the folder structure shown below.

KALDI Default folder structure

conf: this folder contains the configuration files used when computing features and running the Kaldi processing steps.

local, steps and utils: these folders contain the scripts needed to create the language model, along with the other supporting files for training and decoding the ASR system.

Data Preparation

The first task is to organize the data in the format Kaldi expects, which includes the general files wav.scp, utt2spk, spk2utt and text. Create a data folder inside your directory, and inside the data folder create two more directories, test and train. Also put your WAV-format audio files in your base folder.

Make sure your WAV file names follow the naming convention below. (We do this for our own convenience; it is not required.)

The first 2 letters give the language (for example, en for English or sp for Spanish). The next 4 characters give the speaker ID (if we have data from 100 different speakers for training, we can use IDs like 0001). The next character gives the speaker's gender (M or F), and the last 4 characters give the sentence ID for that speaker. So your audio file names should look like en0001M0001 or en0002F0002.
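
As a quick illustration of this convention, here is a minimal shell sketch that splits a file name (without the .wav extension) back into its four fields. The helper name parse_utt_id is made up for this post and is not part of Kaldi.

```shell
#!/bin/sh
# Split an ID such as en0001M0001 into language, speaker number,
# gender and sentence number, following the convention described above.
# parse_utt_id is a hypothetical helper, not a Kaldi script.
parse_utt_id() {
    id=$1
    lang=$(printf '%s' "$id" | cut -c1-2)        # first 2 characters: language
    speaker=$(printf '%s' "$id" | cut -c3-6)     # next 4 characters: speaker number
    gender=$(printf '%s' "$id" | cut -c7)        # next character: M or F
    sentence=$(printf '%s' "$id" | cut -c8-11)   # last 4 characters: sentence number
    printf '%s %s %s %s\n' "$lang" "$speaker" "$gender" "$sentence"
}

parse_utt_id en0001M0001   # prints: en 0001 M 0001
parse_utt_id en0002F0002   # prints: en 0002 F 0002
```

Because every field has a fixed width, the speaker and sentence can always be recovered from the file name alone, which is what makes the data preparation files easy to generate.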

Below are the steps to put the data into Kaldi format.

Create a wav.scp file in your train folder and save it.

wav.scp file format (Pattern: <filename> <full_path_to_audio_file>). The file name, without its extension, serves as the utterance ID.

Create a text file and save it.

text (Pattern: <filename> <text_transcription>)

utt2spk: create this file using the pattern <filename> <speakerID> and save it.

spk2utt: the sentences spoken by each speaker. Group all utterances of the same speaker on one line, using the pattern <speakerID> <utterance 1> <utterance 2> ..., and save it.

spk: create a list of speakers using the pattern <lang_name+speakerid>, for example <en0001M>, and save it.

utt: create a list of the unique utterance IDs and save it.

Repeat the same steps for your test folder.
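
The file-creation steps above can be scripted rather than typed by hand. The following is a minimal sketch, assuming file names that follow the en0001M0001 convention and taking the first 7 characters as the speaker ID. It runs in a temporary folder with empty placeholder files standing in for real recordings, so the folder names and the three sample utterances are illustrative only.

```shell
#!/bin/sh
# Sketch: build wav.scp, utt2spk and spk2utt from a folder of WAV files.
# Demo setup (assumption): a temporary folder with empty placeholder files.
work=$(mktemp -d)
audio_dir=$work/audio
data_dir=$work/data/train
mkdir -p "$audio_dir" "$data_dir"
for name in en0001M0001 en0001M0002 en0002F0001; do
    : > "$audio_dir/$name.wav"
done

: > "$data_dir/wav.scp"
: > "$data_dir/utt2spk"
for wav in "$audio_dir"/*.wav; do
    utt=$(basename "$wav" .wav)              # utterance ID = file name without .wav
    spk=$(printf '%s' "$utt" | cut -c1-7)    # speaker ID, e.g. en0001M
    echo "$utt $wav" >> "$data_dir/wav.scp"
    echo "$utt $spk" >> "$data_dir/utt2spk"
done

# Keep the files sorted; LC_ALL=C gives a plain byte-wise sort order.
LC_ALL=C sort -o "$data_dir/wav.scp" "$data_dir/wav.scp"
LC_ALL=C sort -o "$data_dir/utt2spk" "$data_dir/utt2spk"

# spk2utt: one line per speaker, followed by all of that speaker's utterances.
awk '{ utts[$2] = utts[$2] " " $1 }
     END { for (s in utts) print s utts[s] }' "$data_dir/utt2spk" \
    | LC_ALL=C sort > "$data_dir/spk2utt"

cat "$data_dir/spk2utt"
# prints:
# en0001M en0001M0001 en0001M0002
# en0002F en0002F0001
```

The text file still has to be written from your transcriptions. Kaldi's utils folder also ships a utt2spk_to_spk2utt.pl script for the last step; the awk version is used here only so that the sketch runs without a Kaldi checkout.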

Language Data Preparation

Create a lexicon.txt file inside your data/local/dict/ folder. This file contains every word from your dictionary with its “phonetic transcription”; see the example below.

lexicon.txt (Pattern: <word> <phone 1> <phone 2> ...)

For example, the phonetic transcription of Reason can be R IY Z AH N or R ee Z ahn, depending on the phone set you use.

Below are the other files required for language model preparation.

nonsilence_phones.txt: this file lists the non-silence phones that are present in the corpus, like aa, umm, etc. Create this file and save it.

optional_silence.txt: type sil and save it to data/local/dict/.

silence_phones.txt: lists the phones that carry no speech content but are still present in the audio (silence, noise). Type sil and save it.

Our next step is to create the language model.
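
If your lexicon is already written, the non-silence phone list does not have to be typed by hand. Here is a small sketch that derives it from lexicon.txt. The three lexicon entries, including the !SIL line, are toy assumptions for illustration, and the files are written to a temporary folder rather than your real data/local/dict/.

```shell
#!/bin/sh
# Sketch: derive nonsilence_phones.txt from the phones used in lexicon.txt.
work=$(mktemp -d)
dict_dir=$work/data/local/dict
mkdir -p "$dict_dir"

# Toy lexicon (assumption): <word> <phone 1> <phone 2> ...
cat > "$dict_dir/lexicon.txt" <<'EOF'
!SIL sil
REASON R IY Z AH N
SEASON S IY Z AH N
EOF

echo sil > "$dict_dir/silence_phones.txt"
echo sil > "$dict_dir/optional_silence.txt"

# Collect every phone (fields 2 onward), remove duplicates,
# then drop anything that is listed as a silence phone.
awk '{ for (i = 2; i <= NF; i++) print $i }' "$dict_dir/lexicon.txt" \
    | LC_ALL=C sort -u \
    | grep -v -x -F -f "$dict_dir/silence_phones.txt" \
    > "$dict_dir/nonsilence_phones.txt"

cat "$dict_dir/nonsilence_phones.txt"
# prints one phone per line: AH IY N R S Z
```

Generating the list this way keeps the phone set and the lexicon consistent, so a phone cannot appear in one file and be missing from the other.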

Language Model Preparation

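Here we are working with an N-gram language model. Copy the script below to your folder and replace the paths with your own; the script expects variables such as basepath, kaldi_root_dir, train_lang and train_folder to match your setup. I have set n_gram=2, which means that I am building a bigram language model; you can change it to suit your requirements.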

#set-up for single machine or cluster based execution
. ./
#set the paths to binaries and other executables
[ -f ] && . ./
#Creating input to the LM training
#corpus file contains list of all sentences
cat $basepath/data/train/text | awk '{first = $1; $1 = ""; print $0; }' > $basepath/data/train/trans
while read line
do
    echo "<s> $line </s>" >> $basepath/data/train/lmtrain.txt
done <$basepath/data/train/trans
#*******************************************************************************#
lm_arpa_path=$basepath/data/local/lm
train_dict=dict
n_gram=2 # This specifies bigram or trigram. for bigram set n_gram=2 for tri_gram set n_gram=3
echo " Creating n-gram LM "

rm -rf $basepath/data/local/$train_dict/lexicon_c.txt $basepath/data/local/$train_lang $basepath/data/local/tmp_$train_lang $basepath/data/$train_lang
mkdir $basepath/data/local/tmp_$train_lang
utils/ --num-sil-states 3 data/local/$train_dict '!SIL' data/local/$train_lang data/$train_lang
$kaldi_root_dir/tools/irstlm/bin/ -i $basepath/data/$train_folder/lmtrain.txt -n $n_gram -o $basepath/data/local/tmp_$train_lang/lm_phone_bg.ilm.gz
gunzip -c $basepath/data/local/tmp_$train_lang/lm_phone_bg.ilm.gz | utils/ data/$train_lang/words.txt > data/local/tmp_$train_lang/oov.txt
gunzip -c $basepath/data/local/tmp_$train_lang/lm_phone_bg.ilm.gz | grep -v '<s> <s>' | grep -v '<s> </s>' | grep -v '</s> </s>' | grep -v 'SIL' | $kaldi_root_dir/src/lmbin/arpa2fst - | fstprint | utils/ data/local/tmp_$train_lang/oov.txt | utils/ | utils/ | fstcompile --isymbols=data/$train_lang/words.txt --osymbols=data/$train_lang/words.txt --keep_isymbols=false --keep_osymbols=false | fstrmepsilon > data/$train_lang/G.fst
$kaldi_root_dir/src/fstbin/fstisstochastic data/$train_lang/G.fst
echo "End of Script"

Save the code above as a shell script and run it with sh in your terminal. You will see the output below.

Language Model Process Flow

When you get the success message, enjoy! You have created your first language model. To check, go to your data folder, where you will see two directories, local and langmodel. Open langmodel and you will find the folder structure below.

Compiled Language Model folder structure

G.fst is a word level grammar finite state transducer
L.fst is a pronunciation lexicon finite state transducer
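
To make the first stage of the language model script more concrete, here is a small sketch of what happens to the text file before the n-gram model is estimated: the utterance ID is dropped, each sentence is wrapped in <s> and </s>, and adjacent word pairs are counted. The two sentences are toy data, and the counting step only illustrates what a bigram model is built from; it is not the IRSTLM tool the script calls.

```shell
#!/bin/sh
# Sketch: prepare LM training text and count bigrams on toy data.
work=$(mktemp -d)

# Toy Kaldi-style text file (assumption): <utterance ID> <transcription>
cat > "$work/text" <<'EOF'
en0001M0001 HELLO WORLD
en0001M0002 HELLO KALDI
EOF

# Drop the utterance ID (first field) and add sentence boundary markers.
cut -d' ' -f2- "$work/text" | sed 's/^/<s> /; s/$/ <\/s>/' > "$work/lmtrain.txt"

# Count each pair of adjacent words; n_gram=2 models exactly these pairs.
awk '{ for (i = 1; i < NF; i++) count[$i " " $(i + 1)]++ }
     END { for (b in count) print count[b], b }' "$work/lmtrain.txt" \
    | LC_ALL=C sort -k1,1nr -k2 > "$work/bigrams.txt"

cat "$work/lmtrain.txt"
# prints:
# <s> HELLO WORLD </s>
# <s> HELLO KALDI </s>
cat "$work/bigrams.txt"
# the first line is: 2 <s> HELLO
```

With n_gram=3 the same idea applies to triples of adjacent words, which needs more training text to get reliable counts.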

Feature Extraction

In this step we extract the MFCC features of each utterance (audio file). Open your terminal and run the command below.

steps/ --nj 4 data/train exp/make_mfcc/train mfcc

The script is used for computing the MFCC coefficients.
nj: the number of parallel jobs; you can set it according to your CPU.
data/train: the path of the folder for which you want to compute MFCCs.
exp/make_mfcc/train: the log files are stored in this directory.
mfcc: the name of the directory where the extracted features are stored.

MFCC Feature Extraction

To compute the cepstral mean and variance normalization (CMVN) statistics indexed by speaker, run the command below.

steps/ data/train exp/make_mfcc/train mfcc

CMVN Indexing
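
To show what "statistics indexed by speaker" means, the sketch below computes a mean and variance per speaker on toy numbers. It is a simplified assumption of the idea, not Kaldi's implementation: real MFCC features have many dimensions per frame and are stored in Kaldi's own archive format, while here each utterance has a single made-up value per frame.

```shell
#!/bin/sh
# Sketch: per-speaker mean and variance on toy one-dimensional features.
work=$(mktemp -d)

cat > "$work/utt2spk" <<'EOF'
en0001M0001 en0001M
en0001M0002 en0001M
en0002F0001 en0002F
EOF

# Toy features (assumption): <utterance ID> followed by one value per frame.
cat > "$work/feats.txt" <<'EOF'
en0001M0001 1 3
en0001M0002 5 7
en0002F0001 2 4
EOF

# First file: remember each utterance's speaker.
# Second file: accumulate count, sum and sum of squares per speaker.
awk 'NR == FNR { spk[$1] = $2; next }
     { s = spk[$1]
       for (i = 2; i <= NF; i++) { n[s]++; sum[s] += $i; sumsq[s] += $i * $i } }
     END { for (s in n) {
               m = sum[s] / n[s]
               printf "%s mean=%.2f var=%.2f\n", s, m, sumsq[s] / n[s] - m * m } }' \
    "$work/utt2spk" "$work/feats.txt" | LC_ALL=C sort > "$work/cmvn_stats.txt"

cat "$work/cmvn_stats.txt"
# prints:
# en0001M mean=4.00 var=5.00
# en0002F mean=3.00 var=1.00
```

Normalizing each speaker's features with that speaker's own mean and variance reduces differences caused by the voice or the recording channel, which is why the statistics are grouped by speaker rather than by utterance.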

Acoustic Model Preparation

In this step we train the monophone HMM system, using the command below.

steps/ --nj 4 data/train data/langmodel exp/mono

Training flow of Monophone HMM System

Run the command below to combine the acoustic model and the language model into the final decoding graph.

utils/ --mono data/langmodel exp/mono exp/mono/graph

Yeah! You have finally trained a model for your own ASR system.


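To check how your ASR system performs, use the command below on unseen test data. Run the same MFCC and CMVN steps on data/test first, since decoding needs features for the test utterances as well.

steps/ --nj 4 exp/mono/graph data/test exp/mono/decode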

To see the decoded results, use the command below:

utils/ -f 2- data/langmodel/words.txt exp/mono/decode/scoring/3.tra
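
The .tra file stores each decoded utterance as a list of integer word IDs, and the command above turns those IDs back into words using words.txt. The following sketch shows that mapping with awk on toy data; the word table and the two decoded lines are made up for illustration.

```shell
#!/bin/sh
# Sketch: map integer word IDs in a .tra file back to words.
work=$(mktemp -d)

# Toy symbol table (assumption): <word> <integer ID>
cat > "$work/words.txt" <<'EOF'
<eps> 0
HELLO 1
KALDI 2
WORLD 3
EOF

# Toy decoder output (assumption): <utterance ID> <word ID> <word ID> ...
cat > "$work/3.tra" <<'EOF'
en0001M0001 1 3
en0002F0001 1 2
EOF

# First file: build the ID -> word table.
# Second file: keep field 1, translate fields 2 onward (the "-f 2-" part).
awk 'NR == FNR { sym[$2] = $1; next }
     { out = $1
       for (i = 2; i <= NF; i++) out = out " " sym[$i]
       print out }' "$work/words.txt" "$work/3.tra" > "$work/decoded.txt"

cat "$work/decoded.txt"
# prints:
# en0001M0001 HELLO WORLD
# en0002F0001 HELLO KALDI
```

Comparing these lines with the reference transcriptions in data/test/text gives a first impression of how well the system recognizes unseen speech.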

We have now successfully built an ASR system for a custom language. YAY!

Credits & References

This article is based on the Kaldi documentation and the Kaldi GitHub repository.
KALDI Lectures
Special thanks to Nikhil Sharma and Priyanka for helping me collect the audio data and with the data preparation.

