Getting Alexa to Respond to Sign Language Using Your Webcam and TensorFlow.js

TensorFlow
Aug 8, 2018

Early Research:

The broader pieces of the system I wanted to put together for this experiment were quite clear in my head early on. I knew I needed:

  1. A text-to-speech system to speak the interpreted sign to Alexa
  2. A speech-to-text system to transcribe the response from Alexa for the user (a quick check for both browser speech APIs follows this list)
  3. A device (laptop/tablet) to run this system and an Echo to interact with
  4. An interface that ties this all together
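
Pieces 1 and 2 end up mapping onto the browser's Web Speech API later in the project. A quick feature check along these lines (a rough sketch, not the project's actual code) confirms the browser exposes both:

```javascript
// Quick feature check for pieces 1 and 2. Chrome exposes speech synthesis as
// window.speechSynthesis and speech recognition as webkitSpeechRecognition;
// this sketch assumes nothing beyond those standard Web Speech API globals.
const canSpeak = 'speechSynthesis' in window;
const canTranscribe = 'SpeechRecognition' in window || 'webkitSpeechRecognition' in window;
console.log({ canSpeak, canTranscribe });
```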

Enter TensorFlow.js:

The TensorFlow.js team has been putting out fun little browser-based experiments both to familiarize people with the concepts of machine learning and to encourage their use as building blocks for your own projects. For those unfamiliar with it, TensorFlow.js is an open-source library that allows you to define, train, and run machine learning models directly in the browser using JavaScript. Two demos in particular seemed like interesting starting points — Pacman Webcam Controller and Teachable Machine.
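
For anyone who hasn't tried it, the library can be pulled in with a single script tag and used straight from the page's JavaScript. The snippet below is just a minimal smoke test, not part of this project:

```javascript
// Minimal smoke test: TensorFlow.js running entirely in the browser.
// Assumes the library was loaded first, e.g. via
// <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs"></script>
const xs = tf.tensor1d([1, 2, 3]);
xs.square().print(); // prints [1, 4, 9]
```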

Using TensorFlow.js appealed to me for several reasons:

  1. I could use TensorFlow.js to run models directly in the browser. This is huge from the standpoint of portability, speed of development, and the ability to easily interact with web interfaces. The models also run entirely in the browser, without the need to send data to a server.
  2. Since it would run in the browser, I could interface it well with the Speech-to-Text and Text-to-Speech APIs that modern browsers support and that I would need to use.
  3. It made it quick to test, train, and tweak, which is often a challenge in machine learning.
  4. Since I had no sign language dataset and the training examples would essentially be me performing the signs repeatedly, using the webcam to collect training data was convenient.
The kNN-based approach behind Teachable Machine also had a couple of other implications that worked in my favor:

  1. Since kNNs aren’t really “learning” from examples, they are poor at generalization. Thus the predictions of a model trained entirely on examples from one person will not transfer well to another person. This was a non-issue for me, since I would be both training and testing the model by repeatedly performing the signs myself.
  2. The team had open-sourced a nice stripped-down boilerplate of the project, which served as a helpful starting point (a sketch of the kNN-on-webcam setup follows this list).
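
The general shape of that kNN-on-webcam setup looks something like the sketch below. This is not the original boilerplate; it assumes the @tensorflow-models/mobilenet and @tensorflow-models/knn-classifier packages are loaded alongside TensorFlow.js:

```javascript
// Rough sketch (not the original boilerplate): embed each webcam frame with
// MobileNet and let a k-nearest-neighbours classifier match it against the
// stored training examples.
const classifier = knnClassifier.create();
let net; // MobileNet, loaded once at startup

async function setup() {
  net = await mobilenet.load();
}

// Training: call this repeatedly while performing a sign in front of the webcam.
function addExample(videoElement, label) {
  const activation = net.infer(videoElement, true); // MobileNet embedding of the frame
  classifier.addExample(activation, label);
}

// Prediction: classify the current frame against the stored examples.
async function classifyFrame(videoElement) {
  const activation = net.infer(videoElement, true);
  return classifier.predictClass(activation); // { label, classIndex, confidences }
}
```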

How it works

  1. Once training is complete, you enter predict mode. It now takes the input image from the webcam and runs it through the classifier to find its closest neighbours among the training examples and labels provided in the previous step.
  2. If a certain prediction threshold is crossed, it appends the label on the left-hand side of the screen.
  3. I then use the Web Speech API for speech synthesis to speak out the detected label.
  4. If the spoken word is ‘Alexa’ it causes the nearby Echo to awaken and begin listening for a query. Also worth noting — I created an arbitrary sign (right fist up in the air) to denote the word Alexa, since no sign exists for the word in ASL and spelling out A-L-E-X-A repeatedly would be annoying.
  5. Once the entire phrase of signs is completed, I again use the Web Speech API, this time to transcribe the response from the Echo, which answers the query clueless to the fact that it came from another machine. The transcribed response is displayed on the right side of the screen for the user to read.
  6. Signing the wake word again clears the screen and restarts the process for repeated querying. (Steps 1-4 are sketched in code after this list.)
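
Building on the earlier kNN sketch, steps 1 through 4 boil down to a small amount of glue code. The 0.9 threshold, the appendToTranscript helper, and the 'Alexa' label string below are illustrative assumptions, not values from the actual project:

```javascript
// Steps 1-4 on top of the earlier classifyFrame() sketch. Threshold, helper
// names and the 'Alexa' label string are illustrative, not the original values.
const CONFIDENCE_THRESHOLD = 0.9;

async function onFrame(videoElement) {
  const { label, confidences } = await classifyFrame(videoElement);  // step 1: kNN prediction
  if (confidences[label] < CONFIDENCE_THRESHOLD) return;             // step 2: ignore weak matches

  appendToTranscript(label);                                         // step 2: show label on the left (hypothetical helper)
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(label)); // step 3: speak the label aloud

  // Step 4 needs no extra code here: when the label is 'Alexa', the spoken
  // word itself wakes the nearby Echo.
}
```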
A few techniques helped make detection more reliable:

  1. Adding an entire catch-all category of training examples, which I categorized as ‘other’, for idle states (the empty background, me standing idly with my arms by my side, etc.). This prevents words from being erroneously detected.
  2. Setting a high threshold before accepting an output to reduce prediction errors.
  3. Reducing the rate of prediction. Instead of predicting at the maximum frame rate, capping the number of predictions per second helped reduce erroneous predictions.
  4. Ensuring a word that has already been detected in that phrase is not considered for prediction again.
  5. Since sign language normally omits articles and instead relies on context to convey them, I trained the model with certain words that included the appropriate article or preposition, e.g. ‘the weather’, ‘to the list’, etc. (Several of these checks are sketched in code after this list.)
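
Several of these checks reduce to a few lines of gating logic around the earlier onFrame() sketch. The polling rate, the ‘other’ label, and the variable names are illustrative choices rather than the project's actual values, and videoElement is assumed to be the page's webcam <video> element:

```javascript
// Points 1-4 as gating logic. Poll on a timer instead of at the webcam frame
// rate, then accept a prediction only if it passes every check.
const PREDICTIONS_PER_SECOND = 4;        // point 3: well below the frame rate (illustrative)
const detectedThisPhrase = new Set();

setInterval(() => onFrame(videoElement), 1000 / PREDICTIONS_PER_SECOND);

// Called from onFrame() before a label is displayed or spoken.
function shouldAccept(label, confidence) {
  if (confidence < CONFIDENCE_THRESHOLD) return false; // point 2: high threshold
  if (label === 'other') return false;                 // point 1: catch-all idle class
  if (detectedThisPhrase.has(label)) return false;     // point 4: no repeats within a phrase
  detectedThisPhrase.add(label);
  return true;
}
```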
  1. The second option is to have the user sign a stop-word as a deliberate way of letting the system know they are done with their query. On recognizing this stop-word, the system can trigger transcription. So a user would sign Wakeword > Query > Stopword. This approach runs the risk of the user forgetting to sign the stop-word entirely, in which case the transcription is never triggered. I’ve implemented this approach in a separate GitHub branch where you can use the wake word Alexa as bookends to your query, i.e. “Alexa, what’s the weather in New York (Alexa)?” (a sketch of the bookend logic follows).
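
In that bookend variant, the second occurrence of the wake word marks the end of the query and kicks off transcription of the Echo's reply via the Web Speech API. The sketch below captures the idea; the state handling and the showResponse helper are illustrative, not the branch's actual code:

```javascript
// Bookend sketch: the first 'Alexa' starts a query, the second one ends it and
// begins transcribing the Echo's spoken reply.
let queryInProgress = false;

function onLabelAccepted(label) {
  if (label !== 'Alexa') return;

  if (!queryInProgress) {
    queryInProgress = true;            // first 'Alexa': query begins
    return;
  }
  queryInProgress = false;             // second 'Alexa': query done, listen for the reply

  const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  const recognition = new SpeechRecognition();
  recognition.interimResults = true;
  recognition.onresult = (event) => {
    const transcript = event.results[event.results.length - 1][0].transcript;
    showResponse(transcript);          // hypothetical helper: render text on the right side
  };
  recognition.start();
}
```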
Some ideas for improving the system further:

  1. Using the CNN-based approach (like the Pacman example) might improve accuracy and make the model more robust to translation, i.e. to the position of the hands within the frame. It would also help the model generalize better across different people. One could also include the ability to save a model or load a pre-trained Keras model, which is well documented. This would remove the need to train the system every time you restart the browser (a sketch of saving and loading follows this list).
  2. Some combination of a CNN+RNN or PoseNet+RNN to consider the temporal features might lead to a bump in accuracy.
  3. Use the newer reusable kNN classifier that’s included in TensorFlow.js.
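
On the save/load point from item 1, TensorFlow.js can persist a Layers model in the browser and restore it later, and a converted Keras model can be loaded the same way. The storage key and model.json path below are placeholders, and model stands in for whatever tf.LayersModel the system ends up using:

```javascript
// Sketch of saving and restoring a tf.LayersModel so no retraining is needed
// after a browser refresh. Keys and paths are placeholders.
async function persist(model) {
  await model.save('indexeddb://sign-model');          // store in the browser's IndexedDB
}

async function restore() {
  return tf.loadLayersModel('indexeddb://sign-model'); // or a converted Keras model:
  // return tf.loadLayersModel('model/model.json');
}
```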
