Automatic tuning of keyword spotting thresholds
If you have used pocketsphinx before, chances are that you are familiar with keyword spotting mode. If not, you can find out more about it through the official docs available here.
In kws mode, besides the speech input itself, you need to give the speech recognition engine three main inputs:
- The acoustic model: It represents the relationship between an audio signal and the phonemes or other linguistic units that make up speech. The model is learned from a set of audio recordings and their corresponding transcripts. If you want to build your own acoustic model, this article explains how.
- The dictionary: This contains all the words that the user wants the system to detect, along with their phonetic transcriptions.
- Either a single key-phrase with its threshold, or a file containing a keyword list.
The keyword list contains all the key-phrases that you want the engine to detect in the speech, each with its own threshold value. These thresholds need to be set correctly for the engine to give good results. If a threshold is too low, the engine reports too many false alarms for that key-phrase. If it is too high, the engine misses many genuine utterances of it.
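To make the file format concrete, here is a minimal sketch of reading and writing a keyword list in Python. It assumes the usual layout of one key-phrase per line followed by its threshold between slashes. The helper names and the sample phrases are my own placeholders for illustration, not something taken from the script discussed in this post.

```python
import re

# One entry per line: the key-phrase, then its threshold between slashes,
# for example "hello world /1e-20/".
KWLIST_LINE = re.compile(r"^(?P<phrase>.+?)\s*/(?P<threshold>[^/]+)/\s*$")


def parse_kwlist(text):
    """Return a dict mapping each key-phrase to its threshold as a float."""
    thresholds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = KWLIST_LINE.match(line)
        if match is None:
            raise ValueError(f"not a valid kwlist line: {line!r}")
        thresholds[match.group("phrase")] = float(match.group("threshold"))
    return thresholds


def format_kwlist(thresholds):
    """Turn a phrase -> threshold dict back into kwlist text."""
    lines = [f"{phrase} /{value:g}/" for phrase, value in thresholds.items()]
    return "\n".join(lines) + "\n"


# Hypothetical sample entries, only for illustration.
example = "hello world /1e-20/\nstop /1e-5/\n"
parsed = parse_kwlist(example)
roundtrip = parse_kwlist(format_kwlist(parsed))
```

Parsing the formatted text again gives back the same dictionary, which is a handy sanity check before overwriting a kwlist file by hand.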
These threshold values depend on many factors, including the model quality, the recording quality, the keywords themselves and the speaker's accent. Thus there is no “predefined” keyword list which would work out-of-the-box. And if you are just starting to get familiar with pocketsphinx, it can be difficult to find the correct threshold values on your own. It is also not practical to ask users to adjust the thresholds by hand every time they need to test the system on a new speaker with a slightly different accent. This got me thinking:
Can’t this threshold prediction process be automated? In today’s tech-savvy world where robots are supposedly trying to steal our jobs, why are we still performing such redundant tasks!
So, I got to work, and I present to you a Python script which calculates these threshold values automatically. All you need to do is execute the script and follow the instructions. It will update the kwlist file with all the required key-phrases and the threshold values it predicts for them, to the best of its abilities. Simple, isn't it?
You can find this script in the GitHub repository used in the clone command below.
This post will now try to explain in detail how you can use this script and how it actually automates the process. To test it out, simply clone the repository by copying and pasting this command:
git clone https://github.com/Pankaj-Baranwal/pocketsphinx/
After that, go to the misc/ folder within the repo, where you will find a set_kws_threshold.py script. Provide the locations of your dictionary and your default keyword list file as command-line arguments, as follows:
python set_kws_threshold.py <dic location> <kwlist location>
In case you just want to test the script, a default dictionary and kwlist have been provided within the demo/ folder. Open a Linux terminal in the misc/ folder and run the script as shown above, passing the paths of the demo dictionary and kwlist files as the two arguments.
This will start the script and, if everything is set up correctly, the terminal will print several lines of output.
All you have to do now is follow the instructions printed in the terminal window. The script will print something like “SAY THE FOLLOWING OUT LOUD AND PRESS ENTER”. Just say the words listed below this statement and press Enter to continue. The words to be spoken are taken from the dictionary you provided. Every now and then, you will be asked to say a “[RANDOM]” word. This is just to produce some noise and test the script’s accuracy. Once you are finished with this step, the script will analyse the recorded speech and tune the thresholds so as to minimize false alarms and missed detections. Once the processing is complete, you will get a kwlist containing new threshold values, tuned to the best of the script’s abilities.
You can test the output kwlist file in kws mode with a different speech input and see for yourself how accurately the script has predicted the thresholds! In case you still get inaccurate results, please make sure that you are using an acoustic model which is suitable for your accent and that you have configured the decoder properly. If you are still not sure about the process, feel free to contact me or the CMUSphinx organization directly.
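If you want to script that final check, here is a small sketch that assembles a command line for the pocketsphinx_continuous tool. It assumes an older pocketsphinx release that still ships this tool with the -hmm, -dict, -kws and -infile options. All the paths are placeholders, so replace them with your own files.

```python
import shlex


def build_kws_command(hmm_dir, dict_path, kwlist_path, wav_path):
    """Return the argument list for a kws-mode run on a recorded WAV file."""
    return [
        "pocketsphinx_continuous",
        "-hmm", hmm_dir,        # acoustic model directory
        "-dict", dict_path,     # pronunciation dictionary
        "-kws", kwlist_path,    # tuned keyword list with thresholds
        "-infile", wav_path,    # speech input to test against
    ]


# Placeholder paths, only for illustration.
command = build_kws_command(
    "path/to/acoustic_model",
    "path/to/keywords.dic",
    "path/to/tuned.kwlist",
    "path/to/test.wav",
)
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run(command, check=True) once the paths point at real files. Keeping it as a list rather than a single string avoids quoting problems when a path contains spaces.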