Offline Emotion-Specific Speech-to-Text in Low-End Devices

From virtual assistants that fail to respond appropriately to distressed users, to chat bots that go racist and sexist, there is an increasing urge to embed empathy and emotions in our Artificial Intelligence. Motivated by this, I started this year working on an artificial emotional brain (hardware+software) codenamed “Project Jammin”.

Two components of “Project Jammin” are currently ready, a very basic facial expression detector, and an emotion classifier from text. Since this project is not meant to run on a phone or computer, but instead be a component of any connected hardware (or robot), the big missing part was a speech-to-text interface. In the past few days I have been working to implement this missing part.

After some research, and an attempt to balance performance, speed, and low-resource consumption, I decided to use the popular Pocketsphinx library. This library has been widely used with low cost hardware like the Raspberry Pi (which I am actually using to build my prototype). The installation process was smooth, but once I tested it with the built-in language model, the performance was terrible. The tool could not recognize a single phrase I said correctly.

After some research, I found out how to create my own language models. Since I am interested in a system that can transcribe conversational utterances (as opposed to dictations for instance), I decided to collect a chat log to create my model. After spending some time collecting data I was able to obtain a chat dataset with over 700k sentences. I then trained a trigram language model with a dictionary consisting of the 20k most frequent words. I was very excited, this chat log was big enough to recognize most possible sentences we say in a regular conversation.

After running the code with the new model for the first time, my initial smile faded away quickly. Although the accuracy was much higher than the one with the built-in model, the transcribed text was always a lot different than what I spoke into the mic. After tuning different parameters and testing over and over again, I never attained a descent performance. Out of desperation, I decided to make one last test.

After manually inspecting the dataset, I noticed that there were many sentences that did not really matter in an emotion-detection context. (e.g. “I will see you tomorrow”). With this in mind, I defined a small set of mood-related keywords (happy, afraid, …) as well as some words related to relationship (family, husband, …), and filtered out any sentence not containing the keywords. The result was a smaller dataset of about 5k sentences. Next I trained a new language model. The model was way smaller than the previous one, and only had around 3k unique words, but surprisingly, the recognition rate jumped dramatically.

Although this simple model is unable to detect every single phrase you say, it can recognize in near real-time many, if not most, emotion-loaded key-phrases. Later I will combine this with other input like facial expression recognition (done), and tone of voice detection (future work). The idea is to have different weak detectors working together towards a more robust emotion classification. In a near future, when all the parts are working together, I will let you know if I was right or if I was just dreaming. Meanwhile, check out this video of my speech-to-text + textual emotion classification working together.

For more information about Project Jammin please visit If you enjoyed the post please share (and “like” it).