“Okay Google” mode in pocketsphinx

Pankaj Baranwal
5 min read · Jul 13, 2017


Up till now, we have covered keyword spotting mode, which I explained along with a video demonstration in my previous article. In the video, I used simple, monotonous commands like “FORWARD”, “BACK”, “LEFT”, “STOP”, “FULL SPEED”, etc.

As a starting point, this is fine. But if we want speech recognition to actually be used in real-world scenarios where human speech is to be decoded into text, we need a more generic approach that works on entire phrases instead of just keywords.

Key phrases can be a good way to activate or “wake up” a machine from its standby state, but after that, it would be more logical if the machine could understand entire sentences and deduce the corresponding information as another human would. This is the main motivation behind the implementation of continuous mode (or, as I refer to it, “Okay Google” mode) in pocketsphinx.

If you have used an Android phone before (it is highly likely that you are currently reading this post on one!), then you must also be familiar with Google’s “Ok Google” service, which can be used as a command to activate continuous listening mode for quick and hands-free Google searches. It always makes me appreciate the appeal of such fast-advancing technologies! ;)

Well, you can have a similar feature implemented offline using our pocketsphinx package in ROS! You can find the source code to this package right here:

Now, if you do not know much about ROS or have landed here because you thought this was a cool article (thanks by the way!), you should visit this link and get yourself familiar with ROS basics because this package is entirely based on ROS (Robot Operating System).

So, there is a launch file with the name continuous.launch within the /launch folder in the repository which uses the following three nodes to implement this “Okay Google”-like feature:

  1. audio_control: This node sends the speech input on a ROS topic.
  2. kws_control: This node is used once at the start to detect the initial “wake up” key phrase from the speech input.
  3. asr_control: This node has the implementation of JSGF grammar (Java Speech Grammar Format) mode and language-model mode, either of which can be used for detecting entire phrases. If you are going to use continuous mode, a predefined grammar or an ARPA language model needs to be provided. Further details regarding this mode can be found here: https://cmusphinx.github.io/wiki/tutoriallm/
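To make the hand-off between those nodes concrete, here is a pure-Python sketch of the switching logic they implement together. The class and method names are mine, purely illustrative — the actual package works over ROS topics, not a single class:

```python
class WakeWordSwitch:
    """Illustrative stand-in for the kws_control -> asr_control hand-off:
    stay in keyword-spotting mode until the wake phrase is heard, then
    pass every later utterance on to the continuous decoder."""

    def __init__(self, wake_phrase):
        self.wake_phrase = wake_phrase.lower()
        self.continuous = False  # start in keyword-spotting mode

    def handle_utterance(self, text):
        if not self.continuous:
            # kws_control's job: watch for the wake phrase only
            if self.wake_phrase in text.lower():
                self.continuous = True  # switch to continuous mode
            return None  # nothing to decode yet
        # asr_control's job: decode the full phrase (here, just pass it on)
        return text


switch = WakeWordSwitch("jarvis")
switch.handle_utterance("hello there")  # ignored: still waiting for the wake phrase
switch.handle_utterance("hey jarvis")   # wake phrase heard: switch modes
print(switch.handle_utterance("go to the kitchen"))  # prints: go to the kitchen
```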

If you just want to see this feature in action, use the following command. It will execute the launch file, and pocketsphinx will start in keyword spotting mode. As soon as it detects a keyword, like “JARVIS” (yeah, I am on Team Iron Man!), it will transition into continuous speech recognition mode, where it will start recognising entire phrases like “Go to the kitchen”, “Bring me coffee”, or “Click a picture”. These are much longer than simple keywords, so keyword spotting mode cannot be used in such cases.

roslaunch pocketsphinx continuous.launch dict:=/home/pankaj/catkin_ws/src/pocketsphinx/demo/asr.dic gram:=/home/pankaj/catkin_ws/src/pocketsphinx/demo/asr rule:=rule kws:=/home/pankaj/catkin_ws/src/pocketsphinx/demo/asr.kwlist grammar:=asr

Now, if you need to modify this mode to suit your needs, you can do it very easily! All you need to do is provide your own key phrase (or keyword list) and dict file in the above command, so that the initial detection of the key phrase succeeds, and then create a new grammar file (with extension .gram) similar to the one available in the /demo folder of the package. If you want to use a language model instead, add that as the argument in place of the grammar; using a language model is generally preferred. Here is a detailed description of the ARPA language model format recommended by CMUSphinx:
https://cmusphinx.github.io/wiki/arpaformat/
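For a feel of the format, a toy ARPA file covering the single phrase “bring me coffee” might look like the fragment below. The log10 probabilities are made-up illustrative values, not a trained model:

```
\data\
ngram 1=5
ngram 2=4

\1-grams:
-0.7782 <s> -0.3010
-0.7782 </s>
-0.7782 BRING -0.3010
-0.7782 ME -0.3010
-0.7782 COFFEE -0.3010

\2-grams:
-0.3010 <s> BRING
-0.3010 BRING ME
-0.3010 ME COFFEE
-0.3010 COFFEE </s>

\end\
```

Each line gives a log10 probability, the n-gram itself, and (for n-grams that can be extended) a backoff weight.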

If you are not able to properly set the thresholds of the key phrases in the kwlist file, you can also use this newly developed Python script, which tries to tune the thresholds as best it can. But please note that it is still in the initial stages of development.

As for developing the .gram file, you can either find detailed information in the W3C Note or just modify the voice_cmd.gram file present in the /demo folder of the package mentioned above.

Contents of the asr.gram file

In the image above, the seventh line gives the name of the grammar, “asr”, which was also used as the value of a command-line argument above. The ninth line contains the actual rule, which is public, i.e., it can be accessed from anywhere. Its value lists the possible combinations of command phrases the user can speak. On the right-hand side of the equals sign, words in <> refer to other rules, like <action>, which is defined later in the file. Parts within [] are optional: the user can choose to speak or omit those words. In line eleven, the | symbol between words means “or”, so only one of those phrases will be matched.
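If the image does not render for you, a minimal JSGF file consistent with that description might look like the sketch below. This is my own reconstruction; the demo’s actual asr.gram may differ in its rules and phrasing:

```
#JSGF V1.0;

grammar asr;

public <rule> = <action>;

<action> = bring me coffee [from the kitchen]
         | wash [and dry] my clothes
         | get me to avengers base;
```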

Hence, some of the possible phrases which the system will recognize include:

  • Bring me coffee from the kitchen
  • Wash and dry my clothes
  • Get me to Avengers’ base!
  • etc.

You can, of course, add more rules and make it as diverse as you want the system to be. And that is it! You have successfully developed your own version of “Ok Google”! But wait! One thing is still missing:

How do I create the language model? You never mentioned anything about that.

Ah! Good question. There are many ways to build a language model from your set of commands. Here is the easiest procedure:

  • Create a corpus.txt file and fill it with the phrases you want the system to recognise, one phrase per line.
  • Go to the LMTool page.
  • Simply click on the “Browse…” button, select the corpus.txt file you created, then click “COMPILE KNOWLEDGE BASE”.
  • You should see a page with some status messages, followed by a page entitled “Sphinx knowledge base”. This page will contain links entitled “Dictionary” and “Language Model”.
  • Voila! You have it ready!
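As a concrete example, a corpus.txt covering the phrases used earlier in this post could be as simple as:

```
bring me coffee from the kitchen
wash and dry my clothes
get me to avengers base
go to the kitchen
click a picture
```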

For in-depth information about language models, follow this link.
The ARPA language model docs are also available here for the inquisitive.

So, we have come this far. Let’s discuss my next goals for this project. Hopefully, you will be able to give some valuable feedback! I have started working on a speaker verification system using Gaussian Mixture Models, which would help determine whether the input speech was spoken by a specific person or not. I am hoping to have a primitive version ready by the end of next week. If it is successfully implemented, it could be an add-on feature of the pocketsphinx package and add a lot more to the many utilities that the current pocketsphinx package already offers! So, stay tuned!
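To show the core idea behind that verification step, here is a hedged, stdlib-only sketch: score feature frames under a target-speaker model and a background model, and accept if the average log-likelihood ratio clears a threshold. A real system would fit multi-component GMMs to MFCC features; here each “model” is a single diagonal Gaussian, and all names and numbers are mine, purely to illustrate the scoring:

```python
import math
import random

def diag_gauss_loglik(frame, means, variances):
    """Log-likelihood of one feature frame under a diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(frame, means, variances))

def verify(frames, target, background, threshold=0.0):
    """True if the frames score better under the target-speaker model."""
    llrs = [diag_gauss_loglik(f, *target) - diag_gauss_loglik(f, *background)
            for f in frames]
    return sum(llrs) / len(llrs) > threshold

# Toy usage: the "target speaker" sits around 1.0, the background around 0.0.
rng = random.Random(0)
target = ([1.0] * 13, [1.0] * 13)      # (means, variances), 13 feature dims
background = ([0.0] * 13, [1.0] * 13)
frames = [[rng.gauss(1.0, 1.0) for _ in range(13)] for _ in range(50)]
print(verify(frames, target, background))  # True: frames match the target
```

A full GMM would just replace each single Gaussian with a weighted log-sum of component likelihoods; the accept/reject logic stays the same.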

Happy Coding!
