DIY Speech Recognition using Raspberry Pi & CMU Sphinx

For some time now I have wanted to build a DIY study aid for children that uses a local speech recognition engine such as CMU PocketSphinx and, for privacy reasons, does not depend on any cloud service for speech recognition or speech synthesis.

I started with a Raspberry Pi 3, which has decent specifications in terms of memory and CPU (though more would always be better).

This post is meant to let someone else reproduce the results. Most of the issues I faced came from detecting and identifying sound cards and from understanding the Linux ALSA configuration and hardware details.

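Begin by listing your sound cards to get familiar with the ALSA basics. The first command lists capture devices (microphones) and the second lists playback devices (speakers).

arecord -l
aplay -l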
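The card and device numbers in that listing are what later go into an ALSA device string such as plughw:0,0. As a minimal sketch, assuming the listing comes from `arecord -l` (the ALSA command that lists capture devices) and using a made-up sample listing, the numbers can be pulled out like this:

```python
import re

# Sample text in the format printed by `arecord -l`. The card and device
# names are made up for illustration; your listing will differ.
SAMPLE_ARECORD_OUTPUT = """\
**** List of CAPTURE Hardware Devices ****
card 1: Device [USB PnP Sound Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
"""

# Matches lines such as "card 1: ..., device 0: ..." and captures both numbers.
CARD_LINE = re.compile(r"^card (\d+): .*?, device (\d+):", re.MULTILINE)


def capture_devices(arecord_output):
    """Return one 'plughw:CARD,DEVICE' string per capture device listed."""
    return [
        f"plughw:{card},{device}"
        for card, device in CARD_LINE.findall(arecord_output)
    ]


if __name__ == "__main__":
    print(capture_devices(SAMPLE_ARECORD_OUTPUT))
```

With the sample above, the USB microphone would be addressed as plughw:1,0 rather than plughw:0,0, which is exactly the kind of mismatch that is easy to miss.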

The Ingredients

  1. Raspberry Pi 3 (with Raspbian installed).
  2. A USB microphone.
  3. An ordinary speaker with a 3.5 mm audio jack (preferably battery operated).
  4. Update the system, install the build dependencies, and reboot.
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install bison libasound2-dev python-dev swig mplayer -y
sudo reboot

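5. Download the CMU Sphinx sources and build them. The commands below build sphinxbase and assume that sphinxbase-5prealpha.tar.gz is already in ~/Downloads. Pocketsphinx, which provides the pocketsphinx_continuous binary used in the later steps, is built the same way from its own archive.

cd ~/Downloads
sudo rm -rf sphinxbase-5prealpha
tar -zxvf ./sphinxbase-5prealpha.tar.gz
cd ./sphinxbase-5prealpha
./configure --enable-fixed && make && sudo make install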

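6. Export the library paths. These exports last only for the current shell session, so add them to ~/.bashrc if you want them to persist.

export LD_LIBRARY_PATH=/usr/local/lib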
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

7. Run a simple test that listens on the microphone with the default settings.

pocketsphinx_continuous -inmic yes

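8. Download the sample corpus (a plain-text file with one phrase per line, as shown below), upload it to an online language model generator such as the CMU Sphinx lmtool, and download the generated .lm and .dic files.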

how are you
what is the weather in pune
time to office
play national anthem
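To get a feel for what ends up in the generated files, the corpus can be reduced to its distinct words. This is only an illustrative sketch: the real .dic file also carries a phoneme sequence for each word, which the generator tool produces and which this snippet does not attempt to reproduce.

```python
# The four phrases from the corpus above, one per line.
CORPUS = """\
how are you
what is the weather in pune
time to office
play national anthem
"""


def phrases(corpus_text):
    """Return the non-empty corpus lines, lower-cased and stripped."""
    return [
        line.strip().lower()
        for line in corpus_text.splitlines()
        if line.strip()
    ]


def vocabulary(corpus_text):
    """Return the sorted list of distinct words used across all phrases."""
    return sorted(
        {word for phrase in phrases(corpus_text) for word in phrase.split()}
    )


if __name__ == "__main__":
    print(len(phrases(CORPUS)), "phrases")
    print(vocabulary(CORPUS))
```

A small vocabulary like this one is what keeps recognition practical on the Pi: the recognizer only has to choose between a handful of words instead of a full English dictionary.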

9. Start speech recognition. Make sure that the names of the .lm and .dic files are correct, that plughw:0,0 points to your microphone, and that -samprate is set to a single rate your microphone supports (for example 8000, 16000, or 48000).

pocketsphinx_continuous -hmm /usr/local/share/pocketsphinx/model/en-us/en-us -lm [FILE_NAME].lm -dict [FILE_NAME].dic -samprate 16000 -adcdev plughw:0,0 -inmic yes -logfn /dev/null
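Since the file names, the device, and the sample rate all vary per setup, it can help to assemble the command in one place. The following is a rough sketch, not the project's actual code: the function name and the file names commands.lm and commands.dic are hypothetical, and only the flags already shown in the command above are used.

```python
import shlex

# Acoustic model path taken from the command above.
MODEL_DIR = "/usr/local/share/pocketsphinx/model/en-us/en-us"

# Sample rates mentioned in this post; your microphone may support others.
SAMPLE_RATES = (8000, 16000, 48000)


def build_command(lm_path, dict_path, device="plughw:0,0", sample_rate=16000):
    """Return the pocketsphinx_continuous argument list for the given setup."""
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unexpected sample rate: {sample_rate}")
    return [
        "pocketsphinx_continuous",
        "-hmm", MODEL_DIR,
        "-lm", lm_path,
        "-dict", dict_path,
        "-samprate", str(sample_rate),
        "-adcdev", device,
        "-inmic", "yes",
        "-logfn", "/dev/null",
    ]


if __name__ == "__main__":
    # "commands.lm" and "commands.dic" are hypothetical file names.
    print(shlex.join(build_command("commands.lm", "commands.dic")))
```

The returned list can be handed to subprocess.Popen with stdout=subprocess.PIPE and text=True to read the recognized lines from Python.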

10. Say any one of the lines and wait for it to print the recognized line.
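Once lines are being printed, the next step for a study aid is to decide which known phrase, if any, was heard. Here is a minimal sketch of that matching step. It assumes the recognizer prints each hypothesis as a plain line of text (some builds may prefix an utterance id, which this sketch does not handle), and the input lines in the demo are made up.

```python
# The phrases from the corpus used to generate the .lm and .dic files.
PHRASES = (
    "how are you",
    "what is the weather in pune",
    "time to office",
    "play national anthem",
)


def match_phrase(output_line, phrases=PHRASES):
    """Return the known phrase equal to the output line, or None."""
    # Lower-case and collapse whitespace so minor formatting differences
    # in the recognizer output do not prevent a match.
    text = " ".join(output_line.lower().split())
    return text if text in phrases else None


def recognized_phrases(lines):
    """Yield the known phrases found in an iterable of output lines."""
    for line in lines:
        phrase = match_phrase(line)
        if phrase is not None:
            yield phrase


if __name__ == "__main__":
    # Made-up lines standing in for the recognizer's stdout.
    fake_output = ["some noise\n", "how are you\n", "play  national anthem\n"]
    for phrase in recognized_phrases(fake_output):
        print("heard:", phrase)
```

In a real run, `lines` would be the stdout of the recognizer process, and each matched phrase would be mapped to a response.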

Future Plans and Conclusion.

Building this was a lot of fun, but it was also hard to get right and took me quite some time to figure out. In the next part I will cover generating TTS (text to speech) output using MaryTTS.

Overall the system is slow, because it forks a lot of processes in the background to recognize the audio, fetch the response, and play it back. In the future I’m planning to speed this up by rewriting it in Rust instead of Python (I love Python, by the way) or by using some other technique.

I’m also planning to build an installer so that people don’t have to figure all this out by themselves if they don’t want to.

You can have a look at the source code for this here

As this is a long-term project of mine with the sole purpose of learning and teaching, I will keep improving it. Comments and suggestions are more than welcome :)