Speaker Verification using Pocketsphinx

Pankaj Baranwal
Aug 9, 2017

Pocketsphinx is one of the best lightweight speech recognition engines out there. Every now and then, people with interesting ideas try to implement them using pocketsphinx. But one such domain has been left pretty much untouched until now! The official CMUSphinx docs clearly state:

Although we try to provide important examples, we obviously can’t cover everything. There is no utterance verification or speaker identification example yet, though they could be created later.

So, my mentor and I have taken up the task of facilitating speaker verification using pocketsphinx! We got to work, and the past two weeks have been spent building and testing an initial version of how this could be achieved without using any extra tools. There are many open source toolkits available which allow for speaker identification. These include ALIZE, which provides a set of low-level and high-level frameworks for anybody to develop applications using speaker recognition; the LIUM Speaker Diarization toolkit; SPEAR, a speaker recognition toolkit based on Bob; and SIDEKIT, which provides a whole chain of tools required to perform speaker recognition. But as a start, we have refrained from using external toolkits so as to keep things simple for developers already familiar with pocketsphinx!

After much research and debugging, we have sort of “hacked” our way through Sphinx to come up with two solutions for speaker verification. This blog post will discuss these two approaches, their success rate, and possibilities for improvement. If you want to choose one and test it out, I would recommend the second approach: it is based on adaptation of an acoustic model, which lets a model trained on an existing database adapt to your own voice, accent and recording environment.

The first approach

is based on the article Training Acoustic Model For CMUSphinx. It consists of EM training of three hidden Markov models (HMMs): a silence model (SIL), a universal background model (UBM) and a speaker model. The approach is as follows:

  • First, you will need some recordings of the person whose voice you want to verify. Let’s say it’s yours! Search online for a generic speech database (for example an4, which is quite small but works for our example) and record a similar set of short sentence audio files. At least 30–60 minutes of recordings are needed for this to work, although you can test with a smaller set at the cost of poor accuracy.
  • Download the sample database for training, which is available as the an4 database in NIST’s Sphere audio (.sph) format.
  • Inside the etc/ folder of the an4 dataset, remove all the content from an4.dic and replace it with the following two lines:
speech sp
pankaj pk (replace pankaj with your name and pk with a corresponding phone of your choice)
  • While setting up the training scripts, first replace all words mentioned in the two transcription files (test and train) with speech; one way to script this bulk edit is sketched after this list. The transcription files should look something like this:
<s> speech </s> (an251-fash-b)
<s> speech </s> (an253-fash-b)
<s> speech </s> (an254-fash-b)
<s> speech speech speech speech speech speech </s> (an255-fash-b)
<s> speech speech speech speech speech speech </s> (cen1-fash-b)
...
  • Now add suitable lines to the fileids and transcription files for your own recordings. In the transcription file, add pankaj (or your name) once for each word spoken in the recording. Please ensure that you have placed the recordings inside the wav/ folder as described in the article Training Acoustic Model For CMUSphinx.
  • Modify an4.phone to contain only the two new phones that we are using: sp and pk.
  • Now run the command:
    sphinxtrain -t an4 setup
    In case of errors or warnings, stop here and resolve those first.
  • You also need to change your language model. The default model located in the etc/ folder is in .lm.DMP format. We need to replace it with a grammar (JSGF) or kws model, also stored in the etc/ folder:
#JSGF V1.0
grammar simple;
public <spkverif> = ([speech|pankaj])*;
  • The setup command copies all the required configuration files into the etc/ subfolder of your database folder and prepares the database for training. A new file named sphinx_train.cfg should also have been added to it. Make sure the following variables are set to the appropriate values:
    a) $CFG_WAVFILE_EXTENSION = 'sph';
    Here I am using sph because I have converted my wav files into sph using the following command:
    sox -t wav name_of_your_file.wav -t sph new_file_name.sph

    You can of course use wav files directly; in that case replace sph with wav here and nist with mswav in the next setting.
    b) $CFG_WAVFILE_TYPE = 'nist';
    c) Uncomment $CFG_HMM_TYPE = '.cont.';
    d) $CFG_CI_MGAU = 'yes';
    e) $CFG_CD_TRAIN = 'no';
    f) $DEC_CFG_MODEL_NAME = "$CFG_EXPTNAME.ci_${CFG_DIRLABEL}";
    g) Comment out the line setting $DEC_CFG_LANGUAGEMODEL and uncomment the line setting $DEC_CFG_GRAMMAR. Modify its value with the name of your language model, similar to this:
    $DEC_CFG_GRAMMAR = "$CFG_BASE_DIR/etc/an4.jsgf";
    The grammar file is called an4.jsgf, hence the value above. Change it accordingly.
  • Run the following command to start the training process:
    sphinxtrain run
  • Again, make sure that you haven’t missed any error or warning messages; overlooking one can lead to weird situations later on.
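
For reference, here is one way the bulk edit of the transcription files mentioned above could be scripted. This is only a sketch: it assumes the transcription files are named etc/an4_train.transcription and etc/an4_test.transcription and that every line has the form “<s> WORD … </s> (utterance-id)”; adjust the names to match your copy of the database.

for f in etc/an4_train.transcription etc/an4_test.transcription; do
    awk '{
        # keep <s>, </s> and the trailing (utterance-id); replace every word with "speech"
        printf "<s>";
        for (i = 2; i <= NF - 2; i++) printf " speech";
        print " </s> " $NF
    }' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done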

Once the above procedure has completed successfully, you can use the model present at model_parameters/an4.ci_cont and test it with any of the supported language models, including a JSGF grammar, kws or lm.
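
As a quick sanity check, you could, for instance, decode one of your recordings with pocketsphinx_continuous using the new model and the grammar (test.wav here is just a placeholder for a 16 kHz, 16-bit, mono recording):

pocketsphinx_continuous \
    -hmm model_parameters/an4.ci_cont \
    -jsgf etc/an4.jsgf \
    -dict etc/an4.dic \
    -infile test.wav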

The second approach

is based on the article Adaptation of the acoustic model. The idea is to train two models on the same an4 dataset: one for the UBM (SP) and the other for your voice (PK), unlike the previous approach where PK was trained on your voice directly. You need to follow most of the instructions mentioned in the previous approach, with the modifications listed below, and then use adaptation to tune the PK model to your voice. The differences in the initial training are as follows:

  • You do not need to use your voice anywhere, so take a fresh copy of the an4 database and work on that without copying your recordings into it.
  • Before running sphinxtrain -t an4 setup, make a copy of each line in the fileids and transcription files (both test and train). Then, in the copied transcription lines, replace speech with pankaj (your name). This is done to train two identical models: one for the UBM and the other for PK. It could of course be done with a single model, then copying the means, variances and mixture weights of the generated model; but for simplicity, we simply duplicate the lines in the fileids and transcription files:
<s> speech </s> (an251-fash-b)
<s> pankaj </s> (an251-fash-b)
<s> speech </s> (an253-fash-b)
<s> pankaj </s> (an253-fash-b)
<s> speech </s> (an254-fash-b)
<s> pankaj </s> (an254-fash-b)
<s> speech speech speech speech speech speech </s> (an255-fash-b)
<s> pankaj pankaj pankaj pankaj pankaj pankaj </s> (an255-fash-b)
  • After successfully following the instructions in the previous approach, the real work begins.
  • We are now going to follow the article Adaptation of the acoustic model. Just use your model instead of en-us and you are good to go. Keep following the instructions present there: first generate the acoustic feature files for your recordings (which must be in wav format) using the sphinx_fe command, then collect statistics from the adaptation data using the bw program. You don’t need to follow the MLLR part; instead, move to the “Updating the acoustic model files using MAP” section. A rough sketch of these commands follows this list.
  • Once you are done with these instructions, you will have the required acoustic model adapted to your voice!
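
To give an idea of what that last step looks like in practice, here is a rough sketch of the adaptation commands, assuming the model trained above sits in model_parameters/an4.ci_cont and that your recordings are listed in adapt.fileids with matching labels in adapt.transcription (these file names are placeholders). The flags follow the CMUSphinx adaptation tutorial; double-check them against your SphinxTrain version.

# extract MFCC features for the adaptation recordings (16 kHz, 16-bit, mono wav files)
sphinx_fe -argfile model_parameters/an4.ci_cont/feat.params \
    -samprate 16000 -c adapt.fileids \
    -di . -do . -ei wav -eo mfc -mswav yes

# accumulate statistics from the adaptation data
bw -hmmdir model_parameters/an4.ci_cont \
    -moddeffn model_parameters/an4.ci_cont/mdef \
    -ts2cbfn .cont. \
    -feat 1s_c_d_dd \
    -cmn current -agc none \
    -dictfn etc/an4.dic \
    -ctlfn adapt.fileids \
    -lsnfn adapt.transcription \
    -accumdir .

# copy the model and apply the MAP update into the copy
cp -a model_parameters/an4.ci_cont model_parameters/an4_adapt
map_adapt -moddeffn model_parameters/an4.ci_cont/mdef \
    -ts2cbfn .cont. \
    -meanfn model_parameters/an4.ci_cont/means \
    -varfn model_parameters/an4.ci_cont/variances \
    -mixwfn model_parameters/an4.ci_cont/mixture_weights \
    -tmatfn model_parameters/an4.ci_cont/transition_matrices \
    -accumdir . \
    -mapmeanfn model_parameters/an4_adapt/means \
    -mapvarfn model_parameters/an4_adapt/variances \
    -mapmixwfn model_parameters/an4_adapt/mixture_weights \
    -maptmatfn model_parameters/an4_adapt/transition_matrices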

Testing

Now all that is left to do is to test the two approaches! It would be helpful if you have followed my previous post on using kws mode in pocketsphinx as we are going to use my ROS package to test the newly developed models. In three separate terminals, run the following commands:

  • roscore
  • rostopic echo /kws_data
  • roslaunch pocketsphinx kws.launch dict:=<location_of_your_dict_file> kws:=<location_of_kwlist_file> hmm:=<location_of_new_acoustic_model>

The dict file would again be similar to the one created above, containing the two lines “speech sp” and “pankaj pk”.

The kwlist file would again contain lines like these:

speech /1e-17/
pankaj /1e-16/

In case you are not getting good results, adjust the thresholds in the kwlist file or add more recordings of your voice. Make sure they do not have much noise and are recorded in the appropriate format (16000 Hz sample rate, 16-bit signed, single-channel mono).
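
If your recordings are in some other format, sox can usually convert them; a minimal sketch (input.wav is a placeholder for your file):

# resample to 16 kHz, 16-bit signed, mono
sox input.wav -r 16000 -b 16 -c 1 output.wav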

Conclusion

Of course, the models cannot be assumed to be very accurate, as they are initial versions. Much better approaches exist these days, dominated by i-vectors and neural nets. But as a starting point, the above approaches serve the purpose of speaker verification and are able to differentiate my voice from everyone else’s with decent accuracy. We still need to plan some accuracy metrics to understand what works better and to compare these with the standard approaches used for speaker verification. So, stay tuned in the coming weeks!

Happy Coding!
