Intelligent Sign Language Translation
We are NEST, short for the No Eat Sleep Team, and we built a product we like to call… SignSpeak. The idea was born at a hackathon, and we’ve been building on it ever since. In this article, I’d like to share our journey with you.
What is SignSpeak?
This idea had been stuck in our heads for a very long time, but we never really thought it was feasible given the technology we had access to. It’s pretty simple on the front end but complicated on the back end, which I will try to simplify as much as possible for you in this article.
SignSpeak is an Intelligent Sign Language Translation System. We essentially wanted to translate Sign Language to plain English, but instead of using fancy gadgetry and electronics, we wanted to rely on the simple (yet not so simple) concepts of Computer Vision, Machine Learning and Neural Networks, using built-in cameras to capture a live stream of images and translate the signs in them.
Where did we see SignSpeak being used?
Our end goal was an overambitious one: to set up SignSpeak in most government offices and schools with just a computer and a camera attached to it, translating Sign Language in real time for those who don’t understand it. It is extremely easy to set up and use, with no overhead costs, and those are two of the things the government looks for most.
SignSpeak was designed to be used anywhere. For the hackathon, we deployed it on a Raspberry Pi and on our own web platform, and we also made an Android app that makes use of the data we collected.
Another use case we had in mind: if a mute person meets someone on the street and wants to communicate, but the other person doesn’t know sign language, the mute person could just whip out their phone, open the app and point the camera at themselves, and the phone would simply read the translation out loud to the other person. It’s really that simple.
And this is all possible because of a pre-trained model. For the hackathon, we selected 8 words whose signs were distinct from each other and started shooting images of ourselves enacting these signs on our laptop webcams, our phones and our DSLR, every source we could find, and we filled the data set with images of all our friends. Using this small data set, we were able to achieve an accuracy of ~84% on our first try.
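To make an accuracy figure like this concrete, here is a minimal sketch of how such a number is usually measured: hold back part of the labelled images, predict on them, and count the matches. This is not the hackathon code. The word labels, file names and the 20% test fraction are placeholders, since the article does not list the eight words or describe the evaluation setup.

```python
import random
from collections import defaultdict


def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (image_id, label) pairs per label so each word reaches the test set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Keep at least one image of every word for testing.
        n_test = max(1, int(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true label."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels and file names, used only for illustration.
WORDS = [f"word_{i}" for i in range(8)]
samples = [(f"img_{word}_{k}.jpg", word) for word in WORDS for k in range(10)]
train, test = stratified_split(samples)
```

The classifier would be trained on `train` only, and `accuracy` would then be called with the true labels of `test` and the model's predictions for those same images.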
All this sounds interesting, but what’s the tech behind it all?
SignSpeak was built entirely with open source software and frameworks available on the internet. We built the whole backend in Python, because it is the go-to language for research and data science.
We used TensorFlow, an open source neural network framework by Google, which allowed us to scaffold an image classification script within minutes.
We built a Convolutional Neural Network with 48 hidden layers, which feeds each image forward and classifies it as one of the eight words we chose.
Our next idea was to use a Recurrent Neural Network, a kind of model widely used in digital assistants (like Siri, Google Assistant or Alexa). With this, we would be able to string together an entire conversation of sign language and translate it in real time. We weren’t able to implement this because of the lack of data we could shoot in such a short time, but the model showed a lot of promise in theory.
And of course, on the mobile front, we made use of Android and the NDK, which supports native C++ development on the device. This allowed us to use the pre-trained model on any mobile device seamlessly.
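For readers who have not seen a convolutional network before, the sketch below shows what "feeding an image forward" means, in plain Python rather than TensorFlow. It is a toy stand-in with a single convolutional layer and random, untrained weights, not the 48-layer network described above. The image size, kernel count and all names are assumptions made for illustration.

```python
import math
import random


def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D grayscale image (no padding, stride 1).

    Like most deep learning frameworks, this computes a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu_mean(feature_map):
    """Apply ReLU, then global-average-pool the map to a single number."""
    values = [max(0.0, v) for row in feature_map for v in row]
    return sum(values) / len(values)


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image, kernels, dense_weights, dense_bias):
    """Conv layer -> ReLU -> global pooling -> dense layer -> softmax."""
    features = [relu_mean(conv2d_valid(image, k)) for k in kernels]
    scores = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(dense_weights, dense_bias)
    ]
    return softmax(scores)


rng = random.Random(0)
NUM_WORDS = 8    # from the article: eight target words
NUM_KERNELS = 4  # assumption, chosen only to keep the sketch small

# Random stand-ins for a 16x16 image and for weights that training would learn.
image = [[rng.random() for _ in range(16)] for _ in range(16)]
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
           for _ in range(NUM_KERNELS)]
dense_weights = [[rng.uniform(-1, 1) for _ in range(NUM_KERNELS)]
                 for _ in range(NUM_WORDS)]
dense_bias = [0.0] * NUM_WORDS

probabilities = classify(image, kernels, dense_weights, dense_bias)
predicted_index = max(range(NUM_WORDS), key=probabilities.__getitem__)
```

A real network stacks many such layers and learns the kernel and dense weights from the training images. The final step is the same, though: the word with the highest probability is the prediction.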
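Since this part was never implemented, the following is only a sketch of the general idea behind a recurrent network, written in plain Python: a hidden state is carried from one video frame to the next, so each output depends on what was signed earlier. It uses the textbook Elman update with hand-picked weights. The per-frame inputs and all names are hypothetical.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One Elman RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_xh[k], x))
            + sum(w * hi for w, hi in zip(w_hh[k], h_prev))
            + b_h[k]
        )
        for k in range(len(b_h))
    ]


def run_rnn(frames, w_xh, w_hh, b_h):
    """Feed per-frame feature vectors through the RNN, one frame at a time."""
    h = [0.0] * len(b_h)  # the state starts empty before the first frame
    states = []
    for x in frames:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        states.append(h)
    return states


# Tiny hand-picked example: one input feature, one hidden unit.
# The second frame carries no signal, yet the state still remembers the first.
frames = [[0.5], [0.0]]
states = run_rnn(frames, w_xh=[[1.0]], w_hh=[[1.0]], b_h=[0.0])
```

In a full system the input for each frame could be the feature vector or word probabilities produced by the image classifier, and a trained output layer would map each hidden state to a word or sentence.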
So, what sets you apart from the existing solutions out there?
We’re not trying to say that we invented fire or reinvented the wheel. Ignorance of Sign Language has been around for ages, and it is growing with each passing generation. A lot of solutions have come up to tackle this issue, some of which we saw right there at our hackathon.
Our first inspiration for this idea, and also the first time we saw this problem being tackled publicly, was this video:
For those who are not able to watch the video, it’s an invention from Lemelson-MIT where two students built gloves that can detect hand gestures and speak out loud the words being signed.
Now, we could argue that this hack was grossly overfitted to its data to make it translate perfectly, but it’s a hackathon, and that’s just normal. One thing we noticed about this and all the other ingenious glove inventions we saw at the hackathon, though, was that they could only tell you about hand gestures, not about facial markers.
One simple example is that in American Sign Language, shaking your head while signing something negates that sign. Another example, from Indian Sign Language, is that pointing at your moustache can indicate man, while pointing at your nose indicates woman. The gloves cannot convey that meaning.
This is where SignSpeak steps in, because we don’t just see the hand gestures, we see everything. The intricate details of your facial expressions all get captured by the camera and are matched against our training data through a series of neural network layers.
I think I’ve taken up enough of your time explaining our really simple idea. The only reason we wanted to share it with the world is that innovation can come from anywhere. We found a small solution that could solve one part of the problem, and we know it’s not perfect, but that’s why we want to share it: so that someone else can stumble upon it and build a much better solution.
We all contributed to the project so that it could be the best version of itself, and we are extremely proud to have been 3rd Runner-Up at the Smart India Hackathon 2017 - Hyderabad with this idea. We only hope that this doesn’t end here. There is still so much that can be done in computer vision and computer science to help humanity. Let’s all realize that as soon as we can.