Acoustic Model (GMM-HMM) Training in Kaldi

Sirigirajumeenakshi
Nov 9, 2022 · 10 min read

Acoustic Model: The acoustic model (AM) is a component of Automatic Speech Recognition (ASR) whose job is to predict which sound, or phoneme, from the phone set is being spoken in each frame of audio. Acoustic models are created by training on acoustic features extracted from labeled data, such as the LibriSpeech, TIMIT, or Fisher corpora, or any other transcribed speech corpus.

To train an AM in Kaldi, we will follow the steps below:

  1. Preparation of Data
  2. Create directories required to train AM
  3. Create files for data/train
  4. Create files for data/local/lang
  5. Create files for data/lang
  6. Set the parallelization wrapper
  7. Create files for conf
  8. Extract MFCC features
  9. Monophone training and alignment
  10. Triphone training and alignment

1. Preparation of Data:

To train an AM in Kaldi, all we need is the audio files and their corresponding transcripts.

2. Create directories required to train AM:

We will create the required directories and files, starting from a new "mycorpus" directory inside kaldi/egs. First, create mycorpus, symlink the standard wsj/s5 scripts, and copy path.sh:
cd kaldi/egs
mkdir mycorpus
cd mycorpus
ln -s ../wsj/s5/steps .
ln -s ../wsj/s5/utils .
ln -s ../../src .

cp ../wsj/s5/path.sh .

Since the mycorpus directory is a level higher than wsj/s5, we need to edit the path.sh file.

vim path.sh

# Change the path line in path.sh from:
export KALDI_ROOT=`pwd`/../../..
# to:
export KALDI_ROOT=`pwd`/../..

Create the following directories in mycorpus: exp, conf, and data. Within data, create train, lang, local, and local/lang.

cd mycorpus
mkdir exp
mkdir conf
mkdir data

cd data
mkdir train
mkdir lang
mkdir local

cd local
mkdir lang

3. Create files for data/train:

This directory contains information about the audio files, transcripts, and speakers.

a) text → transcription of audio files

%text file format: utterance_ID sentence/sequence of words
814-1211-0000 GO DO YOU HEAR
814-1211-0001 ASKED MORREL YES
814-1211-0002 MUST I LEAVE ALONE NO
....
....
814-1211-0215 BUT CAN HE UNDERSTAND YOU YES
814-1211-0216 WHAT DO YOU MEAN SIR

b) segments → contains the start and end time for each utterance in an audio file. (Optional)

%segments file format: utterance_ID file_ID start_time end_time
%start and end time are in seconds
814-1211-0000 814-1211 0.0 3.44
814-1211-0001 814-1211 4.60 8.54
814-1211-0002 814-1211 9.45 12.05
....
....
820-1231-0215 820-1231 128.26 131.24
820-1231-0216 820-1231 148.26 153.24

c) wav.scp → contains the location of each of the audio files (wav format).

%wav.scp file format: file_ID path/filename
814-1211-0000 /Downloads/data/814/1211/814-1211-0000.wav
814-1211-0001 /Downloads/data/814/1211/814-1211-0001.wav
814-1211-0002 /Downloads/data/814/1211/814-1211-0002.wav
....
....
820-1231-0215 /Downloads/data/820/1231/820-1231-0215.wav
820-1231-0216 /Downloads/data/820/1231/820-1231-0216.wav

d) utt2spk → contains the mapping of each utterance to its corresponding speaker. The concept of “speaker” does not have to be related to a person — it can be a room, accent, gender, or anything that could influence the recording. When speaker normalization is performed then, the normalization may actually be removing effects due to the recording quality or particular accent type. This definition of “speaker” then is left up to the modeler.

Use the Python code below to automatically create the utt2spk file:

# Run this from data/train: derive utt2spk from wav.scp.
# The speaker ID is the first field of the utterance ID (before the first "-").
f = open("wav.scp", "r")
lines = f.readlines()
f1 = open("utt2spk", "w+")
for line in lines:
    utt_id = line.split(" ")[0]
    spk_id = utt_id.split("-")[0]
    f1.write(utt_id + " " + spk_id + "\n")
f.close()
f1.close()

Note: Run the code from the data/train directory.

%utt2spk file format: utterance_ID speaker_ID
814-1211-0000 814
814-1211-0001 814
814-1211-0002 814
....
....
820-1231-0215 820
820-1231-0216 820

e) spk2utt → contains the speaker-to-utterance mapping. Run the command below from the mycorpus directory to automatically create the spk2utt file.

utils/fix_data_dir.sh data/train

The created file will have the following format:

%spk2utt file format: speaker_ID utterance_ID1 utterance_ID2 ...
814 814-1211-0000 814-1211-0001 814-1211-0002 814-1211-0003 ...
...
...
820 820-1231-0000 820-1231-0001 820-1231-0002 820-1231-0003 ...
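
Alternatively, since we symlinked the wsj/s5 utilities into mycorpus, spk2utt can also be generated directly from utt2spk with Kaldi's utt2spk_to_spk2utt.pl script (an optional shortcut, not part of the original tutorial):

# Optional: build spk2utt directly from utt2spk (run from mycorpus)
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt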

4. Create files for data/local/lang:

This directory contains language data specific to your own corpus.

a) lexicon.txt → contains words and their pronunciations that are present in the corpus. The pronunciation alphabet must be based on the same phonemes you wish to use for your acoustic models. We must also include lexical entries for each “silence” or “out of vocabulary” phone model we wish to train.

Note: We will collect the words from data/train/text and keep only those words' pronunciations in lexicon.txt.

%run in data/train: collect the unique words that occur in the transcripts
cut -d ' ' -f 2- text | sed 's/ /\n/g' | sort -u > words.txt

% run the Python script filter_lexicon.py from data/train
% (its relative paths assume a full reference lexicon located at data/lexicon)
python filter_lexicon.py

%python code: filter_lexicon.py
# Keep only the lexicon entries for words that actually occur in the corpus.
ref = dict()
with open("../lexicon") as f:
    for line in f:
        line = line.strip()
        columns = line.split(" ", 1)
        word = columns[0]
        pron = columns[1]
        if word not in ref:
            ref[word] = list()
        ref[word].append(pron)

lex = open("../local/lang/lexicon.txt", "w")
with open("../train/words.txt") as f:
    for line in f:
        line = line.strip()
        if line in ref:
            for pron in ref[line]:
                lex.write(line + " " + pron + "\n")
        else:
            print("Word not in lexicon: " + line)
lex.close()

b) nonsilence_phones.txt → contains a list of all the phones that are not silence.

# this should be interpreted as one line of code
cut -d ' ' -f 2- lexicon.txt | sed 's/ /\n/g' | sort -u > nonsilence_phones.txt

c) silence_phones.txt → contains a ‘SIL’ (silence) and an ‘oov’ (out of vocabulary) model.

echo -e 'SIL\noov' > silence_phones.txt

d) optional_silence.txt → contains only the ‘SIL’ model.

echo 'SIL' > optional_silence.txt 

e) extra_questions.txt → A Kaldi script will generate a basic extra_questions.txt file for you, but in data/lang/phones. This file “asks questions” about a phone’s contextual information by dividing the phones into two different sets; an algorithm then determines whether it is at all helpful to model that particular context. The standard extra_questions.txt will contain the most common “questions.” An example would be whether the phone is word-initial versus word-final. If you have extra questions that are not covered by the standard extra_questions.txt file, they would need to be added here.
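
For illustration only, each line of extra_questions.txt lists one set of phones, and each such set acts as one “question” (does the phone in a given context belong to this set?). The phone names below are placeholders, not taken from an actual generated file:

%extra_questions.txt format: one set of phones per line, each set = one "question"
SIL oov
AA AE AH AO
B D G K P T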

5. Create files for data/lang:

Using the data/local/lang files, a Kaldi script will automatically generate all of the files in data/lang.

# Syntax:
utils/prepare_lang.sh <dict-src-dir> <oov-dict-entry> <tmp-dir> <lang-dir>
cd mycorpus
utils/prepare_lang.sh data/local/lang '<OOV>' data/local/ data/lang

Note: The second argument above refers to the lexical entry (word) for a “spoken noise” or “out of vocabulary” phone. We need to make sure this entry and its corresponding phone (oov) are included in lexicon.txt and that the phone is listed in silence_phones.txt.

After running the above script, the following files will be generated in data/lang: L.fst, L_disambig.fst, oov.int, oov.txt, phones.txt, topo, words.txt, and phones. phones is a directory containing many additional files, including the extra_questions.txt file, which helps the model learn more about a phone’s contextual information.
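
As an optional sanity check (utils/validate_lang.pl is one of the standard utilities we symlinked earlier), you can verify that the generated language directory is well formed:

cd mycorpus
utils/validate_lang.pl data/lang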

6. Set the parallelization wrapper:

Training can be computationally expensive; however, if you have multiple processors/cores or even multiple machines, there are ways to speed it up significantly. Both training and alignment can be made more efficient by splitting the dataset into smaller chunks and processing them in parallel. The number of jobs or splits in the dataset will be specified later in the training and alignment steps. Kaldi provides a wrapper to implement this parallelization so that each of the computational steps can take advantage of the multiple processors. Kaldi’s wrapper scripts are run.pl, queue.pl, and slurm.pl, along with a few others we won’t discuss here. The applicable script and parameters will then be specified in a file called cmd.sh located at the top level of your corpus’ training directory.

  • run.pl allows you to run the tasks on a local machine (e.g., your personal computer).
  • queue.pl allows you to allocate jobs on machines using Sun Grid Engine.
  • slurm.pl allows you to allocate jobs on machines using another grid engine software, called SLURM.

Below is an example cmd.sh for running on a local machine:

cd mycorpus  
vim cmd.sh

# Insert the following text in cmd.sh
train_cmd="run.pl"
decode_cmd="run.pl"

Once you’ve quit vim, run the file:

cd mycorpus  
. ./cmd.sh
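
If you are instead running on a Sun Grid Engine cluster, cmd.sh would point to queue.pl; the exact options (the memory request below is only an illustrative assumption) depend on your cluster configuration:

# Example cmd.sh for a Sun Grid Engine cluster (options are illustrative)
train_cmd="queue.pl --mem 4G"
decode_cmd="queue.pl --mem 4G"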

7. Create files for conf:

The directory conf requires one file mfcc.conf, which contains the parameters for MFCC feature extraction. The sampling frequency should be modified to reflect your audio data. This file can be created manually or within the shell with the following code:

# Create mfcc.conf by opening it in a text editor like vim
cd mycorpus/conf
vim mfcc.conf

# Insert the following text in mfcc.conf

--use-energy=false
--sample-frequency=16000

8. Extract MFCC features:

Run the code below to extract the MFCC acoustic features and compute the cepstral mean and variance normalization (CMVN) stats. After each process, it also fixes the data files to ensure that they are still in the correct format. The --nj option sets the number of jobs to be sent out. It is currently set to 16, which means that the data will be divided into 16 sections. Note that Kaldi keeps data from the same speaker together, so you do not want more splits than the number of speakers you have.

cd mycorpus  

mfccdir=mfcc
x=data/train
steps/make_mfcc.sh --cmd "$train_cmd" --nj 16 $x exp/make_mfcc/$x $mfccdir
steps/compute_cmvn_stats.sh $x exp/make_mfcc/$x $mfccdir
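
As an optional check (using the standard utils/validate_data_dir.sh script), you can confirm that data/train now contains feats.scp and cmvn.scp and is still correctly formatted:

utils/validate_data_dir.sh data/train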

9. Monophone training and alignment:

A) Take subset of data for monophone training

The monophone models are the first part of the training procedure. We will train on only a subset of the data, mainly for efficiency. Reasonable monophone models can be obtained with little data, and these models are mainly used to bootstrap training for the later models.

The listed argument options for this script indicate that we will take the first part of the dataset, followed by the location the data currently resides in, followed by the number of utterances we will take (10,000), followed by the destination directory for the training subset.

cd mycorpus  
utils/subset_data_dir.sh --first data/train 10000 data/train_10k

B) Train monophones

Each of the training scripts takes a similar baseline argument structure with optional arguments preceding those. The one exception is the first monophone training pass. Since a model does not yet exist, there is no source directory specifically for the model. The required arguments are always:

- Location of the acoustic data: `data/train` 
- Location of the lexicon: `data/lang`
- Source directory for the model: `exp/lastmodel`
- Destination directory for the model: `exp/currentmodel`

The argument --cmd “$train_cmd” designates which machine should handle the processing. Recall from above that we specified this variable in the file cmd.sh. The argument --nj should be familiar at this point and stands for the number of jobs. Since this is only a subset of the data, we have reduced the number of jobs from 16 to 10. Boost silence is included as standard protocol for this training.

steps/train_mono.sh --boost-silence 1.25 --nj 10 --cmd "$train_cmd" \
data/train_10k data/lang exp/mono_10k

C) Align monophones

Just like the training scripts, the alignment scripts also adhere to the same argument structure. The required arguments are always:

- Location of the acoustic data: `data/train`
- Location of the lexicon: `data/lang`
- Source directory for the model: `exp/currentmodel`
- Destination directory for the alignment: `exp/currentmodel_ali`

steps/align_si.sh --boost-silence 1.25 --nj 16 --cmd "$train_cmd" \
data/train data/lang exp/mono_10k exp/mono_ali || exit 1;

10. Triphone training and alignment:

A) Train delta-based triphones

Training the triphone model includes additional arguments for the number of leaves, or HMM states, on the decision tree and the number of Gaussians. In this command, we specify 2000 HMM states and 10000 Gaussians. As an example of what this means, assume there are 50 phonemes in our lexicon. We could have one HMM state per phoneme, but we know that phonemes will vary considerably depending on if they are at the beginning, middle or end of a word. We would therefore want at least three different HMM states for each phoneme. This brings us to a minimum of 150 HMM states to model just that variation. With 2000 HMM states, the model can decide if it may be better to allocate a unique HMM state to more refined allophones of the original phone. This phoneme splitting is decided by the phonetic questions in questions.txt and extra_questions.txt. The allophones are also referred to as subphones, senones, HMM states, or leaves.

The exact number of leaves and Gaussians is often decided based on heuristics. The numbers will largely depend on the amount of data, number of phonetic questions, and goal of the model. There is also the constraint that the number of Gaussians should always exceed the number of leaves. As you’ll see, these numbers increase as we refine our model with further training algorithms.

steps/train_deltas.sh --boost-silence 1.25 --cmd "$train_cmd" \
2000 10000 data/train data/lang exp/mono_ali exp/tri1 || exit 1;
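
To see how many leaves (pdfs) and Gaussians the trained model actually ended up with, you can optionally inspect it with Kaldi's gmm-info tool:

# Print model summary, including the number of pdfs (leaves) and Gaussians
gmm-info exp/tri1/final.mdl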

B) Align delta-based triphones

steps/align_si.sh --nj 24 --cmd "$train_cmd" \
data/train data/lang exp/tri1 exp/tri1_ali || exit 1;

C) Train delta + delta-delta triphones

Delta+delta-delta training computes delta and double-delta features, or dynamic coefficients, to supplement the MFCC features. Delta and delta-delta features are numerical estimates of the first- and second-order derivatives of the signal (features). As such, the computation is usually performed on a larger window of feature vectors. While a window of two feature vectors would probably work, it would be a very crude approximation (similar to how a delta-difference is a very crude approximation of the derivative). Delta features are computed on the window of the original features; the delta-delta features are then computed on the window of the delta features. Run the script below to train the delta + delta-delta triphones.

steps/train_deltas.sh --cmd "$train_cmd" \
2500 15000 data/train data/lang exp/tri1_ali exp/tri2a || exit 1;
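
To get a feel for what the delta computation does, you can optionally compare feature dimensions with Kaldi's feat-to-dim and add-deltas tools (the training script applies the deltas internally; these commands are purely for inspection and assume the standard 13-dimensional MFCC setup from mfcc.conf):

# Raw MFCCs: typically 13-dimensional
feat-to-dim scp:data/train/feats.scp -
# After appending deltas and delta-deltas: typically 39-dimensional
add-deltas scp:data/train/feats.scp ark:- | feat-to-dim ark:- -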

D) Align delta + delta-delta triphones

steps/align_si.sh  --nj 24 --cmd "$train_cmd" \
--use-graphs true data/train data/lang exp/tri2a exp/tri2a_ali || exit 1;

E) Train LDA-MLLT triphones

LDA-MLLT stands for Linear Discriminant Analysis with Maximum Likelihood Linear Transform. The Linear Discriminant Analysis takes the feature vectors and builds HMM states, but with a reduced feature space for all data. The Maximum Likelihood Linear Transform takes the reduced feature space from the LDA and derives a unique transformation for each speaker. MLLT is therefore a step towards speaker normalization, as it minimizes differences among speakers. Run the script below to train the LDA-MLLT triphones.

steps/train_lda_mllt.sh --cmd "$train_cmd" \
3500 20000 data/train data/lang exp/tri2a_ali exp/tri3a || exit 1;

F) Align LDA-MLLT triphones with FMLLR

steps/align_fmllr.sh --nj 32 --cmd "$train_cmd" \
data/train data/lang exp/tri3a exp/tri3a_ali || exit 1;

G) Train SAT triphones

SAT stands for Speaker Adaptive Training. SAT also performs speaker and noise normalization by adapting to each specific speaker with a particular data transform. This results in more homogeneous or standardized data, allowing the model to spend its parameters on estimating variance due to the phoneme, as opposed to the speaker or recording environment. Run the script below to train the SAT triphones.

steps/train_sat.sh  --cmd "$train_cmd" \
4200 40000 data/train data/lang exp/tri3a_ali exp/tri4a || exit 1;

H) Align SAT triphones with FMLLR

FMLLR stands for Feature-space Maximum Likelihood Linear Regression. After SAT training, the acoustic model is no longer trained on the original features, but on speaker-normalized features. For alignment, we essentially have to remove the speaker identity from the features by estimating it (with the inverse of the fMLLR matrix) and then removing it from the model (by multiplying the inverse matrix with the feature vector). These quasi-speaker-independent acoustic models can then be used in the alignment process. Run the script below to align the SAT triphones with fMLLR.

steps/align_fmllr.sh  --cmd "$train_cmd" \
data/train data/lang exp/tri4a exp/tri4a_ali || exit 1;
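
If you want to eyeball the resulting alignments, Kaldi's show-alignments tool can print them in a readable form (an optional check; it assumes the alignment job wrote ali.1.gz in exp/tri4a_ali):

# Display the alignments from the first alignment archive
show-alignments data/lang/phones.txt exp/tri4a/final.mdl "ark:gunzip -c exp/tri4a_ali/ali.1.gz |" | head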

Note: This blog post is based on my understanding of Eleanor Chodroff's Kaldi tutorial, with changes to a few of the scripts.

Reference:

https://eleanorchodroff.com/tutorial/kaldi/index.html
