SignKit-Learn: Using Machine Learning to converse with a bot in American Sign Language
This was a 36-hour project at HackPrinceton. Our goal was to bridge the communication gap between hearing and deaf people by creating a tool that makes it easier to practice sign language with a bot. The app takes pictures of the user as they sign, classifies each image using machine learning, converts the recognized gesture into the corresponding word as text, and sends that text to a chatbot built with Microsoft’s BotFramework. The chatbot then responds with an appropriate message to keep the conversation going.
Picking a dataset
We used Microsoft’s CustomVision to train our classifier and handle the API calls for classifying images. Our first plan was to teach the AI the entire ASL alphabet. We found a dataset online with almost 2,000 images, but they were all of the same person in the same setting, which made it very hard to classify our own signs: we were different people photographed in different settings. On top of that, we were limited to 1,000 training images, so across 26 classes we had fewer than 40 pictures per letter. We got an 8% prediction rate with this dataset.
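For context, classifying an image with CustomVision is a single HTTP call from our Python server. Here’s a rough sketch of what that call could look like, assuming the v3.0 prediction REST endpoint; the region, project ID, iteration name, and key below are placeholders rather than our actual values:

```python
import requests

# Placeholders: substitute your own CustomVision project details.
PREDICTION_URL = (
    "https://southcentralus.api.cognitive.microsoft.com/customvision/v3.0/"
    "Prediction/<project-id>/classify/iterations/<iteration-name>/image"
)
PREDICTION_KEY = "<your-prediction-key>"

def classify_image(image_path):
    """Send an image to CustomVision and return the most likely tag."""
    with open(image_path, "rb") as f:
        response = requests.post(
            PREDICTION_URL,
            headers={
                "Prediction-Key": PREDICTION_KEY,
                "Content-Type": "application/octet-stream",
            },
            data=f.read(),
        )
    response.raise_for_status()
    predictions = response.json()["predictions"]
    best = max(predictions, key=lambda p: p["probability"])
    return best["tagName"], best["probability"]

if __name__ == "__main__":
    tag, prob = classify_image("hello.jpg")
    print(f"Predicted sign: {tag} ({prob:.0%} confidence)")
```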
Our second approach was to restrict our domain to only a few words, which gave us more training images per class. We also generated our own dataset using a variety of people (other hackers at the hackathon) in a variety of settings (against a white wall, against a green wall, with people walking in the background, etc.) to build a more robust classifier. Our training set had 1,000 pictures; here are a few of them:
With our new dataset collected and the model trained, we got an 80% prediction rate!
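If you’re curious how we gathered the images, it boiled down to a small webcam script. Here’s a minimal sketch of that idea using OpenCV; the label, output directory, per-class target, and key bindings are illustrative rather than our exact setup:

```python
import os
import cv2  # OpenCV (pip install opencv-python)

LABEL = "hello"          # hypothetical: the sign currently being collected
OUTPUT_DIR = os.path.join("dataset", LABEL)
NUM_IMAGES = 100         # hypothetical per-class target

os.makedirs(OUTPUT_DIR, exist_ok=True)
camera = cv2.VideoCapture(0)  # default webcam

count = 0
while count < NUM_IMAGES:
    ok, frame = camera.read()
    if not ok:
        break
    cv2.imshow("Collecting: SPACE to save, Q to quit", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord(" "):  # save the current frame as a training image
        cv2.imwrite(os.path.join(OUTPUT_DIR, f"{LABEL}_{count:03d}.jpg"), frame)
        count += 1
    elif key == ord("q"):
        break

camera.release()
cv2.destroyAllWindows()
```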
Sign language isn’t static: what about signs that involve movement?
The time constraint didn’t allow us to develop a solution that tracks motion over time, so we had to improvise. Our solution was to select “key frames” of a sign’s movement and train our AI on those.
Here’s how you can say no in ASL:
We chose 3 key frames for this sign:
This let us capture the sign at any point during its motion and still classify it correctly.
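Concretely, that just means giving each key frame its own tag in CustomVision and collapsing those tags back to a single word after classification. A tiny sketch of the idea (the tag names here are made up for illustration):

```python
# Hypothetical mapping from CustomVision tags to words.
# A moving sign like "no" gets one tag per key frame; static signs get one tag.
TAG_TO_WORD = {
    "no_frame1": "no",
    "no_frame2": "no",
    "no_frame3": "no",
    "hello": "hello",
}

def tag_to_word(tag_name):
    """Collapse a key-frame tag back to the word it represents."""
    return TAG_TO_WORD.get(tag_name, tag_name)
```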
Not a perfect solution
ASL (like any language) is far more complex than anything you can capture in the span of 36 hours. Here are a few of the limitations we ran into and would like to address:
- Languages are contextual and the meaning of a sign can change based on what was said earlier
- We had a very limited vocabulary
- People have different “accents” when they sign
- The signs we trained on mostly “looked” different from each other, so we didn’t have to handle the subtle changes in movement that can change a sign’s meaning
- Many signs also involve facial expressions (like the man shaking his head for “no” above), which we didn’t focus on in our training data
Communicating with the bot
We used Microsoft’s BotFramework to develop a conversational bot. After a sign was parsed into text, our Python server would query the bot’s API for a response. The bot was designed to hold a simple conversation with you by answering questions and asking a few of its own.
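Here’s a rough sketch of what that server-to-bot exchange could look like if the bot is exposed through BotFramework’s Direct Line channel; the secret and user id are placeholders, and this is an illustration of the flow rather than our exact code:

```python
import time
import requests

DIRECT_LINE_SECRET = "<your-direct-line-secret>"  # placeholder
BASE_URL = "https://directline.botframework.com/v3/directline"
HEADERS = {"Authorization": f"Bearer {DIRECT_LINE_SECRET}"}

def start_conversation():
    """Open a new Direct Line conversation and return its id."""
    resp = requests.post(f"{BASE_URL}/conversations", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["conversationId"]

def ask_bot(conversation_id, text):
    """Send the recognized word to the bot and return its latest reply."""
    requests.post(
        f"{BASE_URL}/conversations/{conversation_id}/activities",
        headers=HEADERS,
        json={"type": "message", "from": {"id": "signkit-user"}, "text": text},
    ).raise_for_status()
    time.sleep(1)  # crude wait for the bot to answer
    resp = requests.get(
        f"{BASE_URL}/conversations/{conversation_id}/activities",
        headers=HEADERS,
    )
    resp.raise_for_status()
    replies = [a for a in resp.json()["activities"]
               if a["from"]["id"] != "signkit-user"]
    return replies[-1]["text"] if replies else None
```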
A quick walkthrough
Your webcam captures you signing the word “hello”
The image is sent to CustomVision for classification, which returns a tag. That tag is sent to the bot, and the bot responds accordingly.
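Putting it all together, the whole loop is only a handful of calls. Here’s a sketch that reuses the helpers above; again, this illustrates the flow under the same assumptions (OpenCV for capture, Direct Line for the bot) rather than our exact code:

```python
import cv2

# Glue code tying the pieces together: capture -> classify -> chat.
camera = cv2.VideoCapture(0)
conversation_id = start_conversation()

ok, frame = camera.read()
if ok:
    cv2.imwrite("frame.jpg", frame)             # snapshot of the user signing
    tag, prob = classify_image("frame.jpg")     # CustomVision prediction
    word = tag_to_word(tag)                     # collapse key-frame tags
    reply = ask_bot(conversation_id, word)      # forward the word to the bot
    print(f"You signed: {word} ({prob:.0%}) -> Bot: {reply}")

camera.release()
```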
Looking forward
Since no special hardware is required (just a webcam) and all of the heavy lifting is done by Microsoft’s servers, we could easily extend this into a mobile app.
Finally, we would love to extend our dataset and train on a variety of people to improve our classifier. Some variables we would like to consider:
- Varying skill levels (all the images were of beginners)
- Different lighting/camera conditions
- Multiple people signing in one image
Special thanks: Jaison Loodu, Sophia Tao, Rumsha Siddiqui, Aman Adhav