Decoding an audio file using a pre-trained model with Kaldi

Nithin Rao Koluguri
4 min read · Feb 2, 2019


Many of you may be wondering what to do when you do not have enough resources, such as audio data, transcriptions, and, more importantly, the hardware to train a good model. In that case it is wise to use a pre-trained model that was already trained by researchers who had access to all of those resources, unless you want to beat SOTA results. To explain how one could use a pre-trained model, I am considering here the ASpIRE model available from the Kaldi downloads repository. I chose the ASpIRE model because it is trained on the Fisher English dataset, augmented with impulse responses and noises to create multi-condition training data. Okay… let's first understand what you would need to decode an audio file.

1. An audio file sampled at 8 kHz, since the model was trained on MFCCs generated from an 8 kHz audio dataset. The path to the audio file has to be listed in a file called wav.scp, which Kaldi scripts expect. wav.scp has two columns: the first column contains the utterance or speaker id (as we are only talking about one file here), and the second column the path to the wav file. If your audio is not in wav format, you would need to pipe it through a binary (like sph2pipe) to convert it to wav. If your audio file is not sampled at 8 kHz, as mine wasn't, we can pipe the sox binary's output for further processing, as shown below. I didn't use sph2pipe because my audio file is already in wav format. (Link to my audio file)
$ soxi test_file.wav
Input File : 'test_file.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:03.12 = 50000 samples ~ 234.375 CDDA sectors
File Size : 100k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM

$ sox -t wav test_file.wav -c 1 -r 8000 -t wav - | soxi -
Input File : '-'
Channels : 1
Sample Rate : 8000
Precision : 16-bit
Duration : 00:00:03.12 = 25000 samples ~ 234.375 CDDA sectors
File Size : 0
Bit Rate : 0
Sample Encoding: 16-bit Signed Integer PCM
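If sox is not installed, the declared sample rate can also be read straight out of the RIFF header, which stores it as a little-endian 32-bit integer at byte offset 24. The sketch below is self-contained: it writes a minimal, hypothetical 44-byte PCM header declaring 16000 Hz (standing in for a real recording) and reads the rate back with od.

```shell
# write a minimal 44-byte WAV header declaring 16000 Hz, mono, 16-bit PCM
printf 'RIFF$\000\000\000WAVEfmt \020\000\000\000\001\000\001\000\200\076\000\000\000\175\000\000\002\000\020\000data\000\000\000\000' > tiny.wav

# sample rate lives at bytes 24-27, little-endian
rate=$(od -An -t u4 -j 24 -N 4 tiny.wav | tr -d ' ')
echo "$rate"   # prints 16000
```

This assumes a canonical header layout (fmt chunk immediately after RIFF/WAVE); soxi remains the robust way to inspect real files.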

$ cd ..
ASR_demo$ cat wav.scp
user_123 /usr/local/bin/sox -t wav ~/Desktop/ASR_demo/audio/test_file.wav -c 1 -b 16 -r 8000 -t wav - |

As you guessed, I am trying to decode an audio file called test_file.wav present in the `~/Desktop/ASR_demo/audio` directory.

Create a file called utt2spk in the ~/Desktop/ASR_demo/ directory, with the utterance id (user_123, as shown above) followed by the speaker id (a speaker name; here, me). wav.scp and utt2spk are the only two files we create ourselves; the rest we download from the Kaldi website.
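Putting the two hand-written files together, the data directory can be sketched like this (the speaker label nithin is just an assumed placeholder; any consistent id works):

```shell
# create the Kaldi data directory holding the two hand-written files
mkdir -p ~/Desktop/ASR_demo

# wav.scp: <utterance-id> <command producing an 8 kHz mono wav on stdout> |
printf '%s\n' \
  'user_123 /usr/local/bin/sox -t wav ~/Desktop/ASR_demo/audio/test_file.wav -c 1 -b 16 -r 8000 -t wav - |' \
  > ~/Desktop/ASR_demo/wav.scp

# utt2spk: <utterance-id> <speaker-id>
printf 'user_123 nithin\n' > ~/Desktop/ASR_demo/utt2spk

cat ~/Desktop/ASR_demo/utt2spk   # prints: user_123 nithin
```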

2. Now we require an acoustic model, usually named final.mdl, and a graph FST (HCLG.fst): the decoding graph, which is a composition of the HMM topology, Context, Lexicon, and Grammar. I am not going into the details of these terms for now; I will try to explain them in future blogs.

3. We need to build an FST based on our language model (G.fst), if we need to restrict the vocabulary or change the language entirely.

4. Now download the model files from http://kaldi-asr.org/models/1/0001_aspire_chain_model.tar.gz and untar them in the existing egs/aspire/s5/ path.

5. Once you have extracted the files, we need to prepare our data so that the input feature dimensions match what the model expects.

6. Create an mfcc_hires.conf file, with the configuration shown below, in the s5/conf/ directory. This configuration file is used to generate the features the model requires for decoding.

$ cat conf/mfcc_hires.conf
# config for high-resolution MFCC features, intended for neural network training.
# Note: we keep all cepstra, so it has the same info as filterbank features,
# but MFCC is more easily compressible (because less correlated) which is why
# we prefer this method.
--use-energy=false # use average of log energy, not energy.
--sample-frequency=8000 # Put the sampling frequency of your audio file
--num-mel-bins=40 # similar to Google's setup.
--num-ceps=40 # there is no dimensionality reduction.
--low-freq=40 # low cutoff frequency for mel bins
--high-freq=-200 # high cutoff frequency, relative to Nyquist of 4000 (=3800)
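A negative --high-freq is interpreted as an offset below the Nyquist frequency, so for 8 kHz audio the effective high cutoff works out as:

```shell
# Nyquist frequency for 8 kHz audio is 4000 Hz; a negative --high-freq
# is added to it, giving the effective high cutoff for the mel bins
nyquist=4000
high_freq=-200
echo $((nyquist + high_freq))   # prints 3800
```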

7. Then run the script shown below to prepare for decoding:

$ steps/online/nnet3/prepare_online_decoding.sh --mfcc-config conf/mfcc_hires.conf data/lang_chain exp/nnet3/extractor exp/chain/tdnn_7b exp/tdnn_7b_chain_online

This will generate the i-vector files and the respective configuration files required for decoding. Spare some time to see what files are present in exp/chain/tdnn_7b and exp/tdnn_7b_chain_online.

8. As discussed before, we now need to generate the composed graph HCLG.fst based on our existing G.fst (Grammar) and L.fst (Lexicon) (see the data/lang_pp_test directory). To generate HCLG.fst, run:

utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test exp/tdnn_7b_chain_online exp/tdnn_7b_chain_online/graph

The above command creates the HCLG.fst file in the exp/tdnn_7b_chain_online/graph directory.

9. Now the only step remaining is to decode the file using the mdl file and the graph. The command to generate the decoded lattices is:

steps/online/nnet3/decode.sh --nj 1 --acwt 1.0 --post-decode-acwt 10.0 exp/tdnn_7b_chain_online/graph ~/Desktop/ASR_demo/ exp/tdnn_7b_chain_online/decode_ASR_demo

The above command takes the directory where our wav.scp file is present, and our lattices are generated in the exp/tdnn_7b_chain_online/decode_ASR_demo directory.

10. These lattices contain all probable word sequences, each with a certain probability. So we need a command to pick the best words for us, based on the combined acoustic and language model scores. For that, we use the command

lattice-best-path ark:'gunzip -c exp/tdnn_7b_chain_online/decode_ASR_demo/lat.1.gz |' 'ark,t:| utils/int2sym.pl -f 2- exp/tdnn_7b_chain_online/graph/words.txt > ~/Desktop/ASR_demo/decoded_text.txt'

Here the output is stored in the ~/Desktop/ASR_demo/decoded_text.txt file, as shown below:

$ cat ~/Desktop/ASR_demo/decoded_text.txt
user_123 [noise] i hope you learn something useful
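If you want just the transcript text, the leading utterance id and bracketed tags like [noise] can be stripped with a small sed sketch (the sample line below is the output we just saw):

```shell
# drop the leading utterance id and any [bracketed] non-word tags
echo 'user_123 [noise] i hope you learn something useful' \
  | sed -e 's/^[^ ]* //' -e 's/\[[^][]*\] *//g'
# prints: i hope you learn something useful
```

The same pipeline works on the whole file: `sed -e 's/^[^ ]* //' -e 's/\[[^][]*\] *//g' ~/Desktop/ASR_demo/decoded_text.txt`.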

We can also view the output, along with the commands Kaldi ran through the scripts, in the exp/tdnn_7b_chain_online/decode_ASR_demo/log/decode.1.log file.

That’s it for now. I hope you learned something useful 😄.

This is my first technical blog post. Let me know if you found this useful. Would love to hear some feedback :)
