A few hours ago, I was reading about LipNet. I bet you’re wondering what the heck this “LipNet” is. Is it some sort of ‘Internet on your Lips’? No! 😄 Let me clear the air here. LipNet is a spectacular neural-network lip-reading AI system, designed by folks from the University of Oxford and Google DeepMind. The reason it’s mentioned here is that it can read lips with 95.2% sentence-level accuracy, beating both experienced human lipreaders and the previous word-level state of the art of 86.4%. Now, this might have made you uneasy, because this is yet another task that seemed pretty impossible for AI to take over, but it did! And for the folks who are deep into AI, it might be old news, because the research paper was published back in December 2016. You can read more about LipNet here: https://arxiv.org/abs/1611.01599. Not just LipNet, though: this article talks about using all sorts of similar AI systems for transcribing and translating.
For the folks who don’t know what such a heavy term as “Neural Network Artificial Intelligence System” means, let me spare you some gory details of this computer-science term. Basically, a neural network is a digital representation of the biological neural networks that we humans and animals have in our brains. That’s how we learn to do things, like everything! And Artificial Intelligence is a digital representation of human intelligence, like deciding to do something in order to accomplish something. For example, the way you decide to go for Chocolate over Vanilla ice cream..? Yeah! So, now if we combine both Neural Networks and Artificial Intelligence, what we get is a Digital Human. Seriously, no! Not even close. We are still much, much superior to computers. The Skynet era is coming, but it’s too far from now. 😄 Just kidding.
Alright! Back to our topic. So, what am I trying to convey here? LipNet… AI… Movies… What’s going on in my mischievous mind!?
So, I’ve had this idea in my mind since last year, but I was not really sure how to describe it. Then today, I was reading this article about LipNet and suddenly it was all bright! I read my mind! lol Now, it may sound ridiculous or absurd reading about the idea that I have, but I believe that, given enough time and power to AI, we can achieve this task. (And take away yet another struggling human’s job.) But I feel this is inevitable. If not me, someone else will get this same idea at some point, and eventually it’s going to happen.
Now, here are a couple of questions for you:
What was the last foreign language movie that you watched with subtitles? Did you wish that it was in the language you speak?
You may have guessed my idea, but for those still wondering, here it is: using Artificial Intelligence and Neural Networks to translate and transcribe videos. If you’ve read the part above about LipNet, you know it’s a system you feed a video, and the AI reads the lips to determine what the person is saying. LipNet is just one example I’m proposing to use for this purpose; let’s say it’s the hook of the story! In the computer-science world, there are all sorts of AI: AI to read your emotions, AI to read your lips, AI to transcribe videos, AI to do your school homework (if you’re smart enough 😉). But what if we created a combination of AIs, connecting the neural networks of the AI that reads your emotions, the AI that reads your lips, and so on? The resulting AI would take a video as input, read both the facial expressions and the lips of the person in the video, use the closest matching voice model to translate the speech into another language, and also generate subtitles in that language. This would dramatically decrease the cost and time incurred in translating and transcribing a video into a foreign language.
So, the method I propose sounds fairly simple in theory but is quite challenging in practice. Still, I’m very optimistic that it is achievable, perhaps in a couple of years. The folks who are into developing AIs probably don’t need any help in this section, but for the general public, here are my thoughts on how it could be achieved:
The Input Video: Just for the sake of ease, let’s say we have a video of one person talking in English for 3 minutes, and we want to translate it into Japanese with Japanese subtitles.
Speech Recognition AI: This AI system would detect and interpret what the person is speaking, for example, “Hello. How are you?”. It would only be used to grab the words the person is uttering, so it cannot capture the depth and breadth of the voice.
Facial Recognition AI: This system would see the facial expressions of the person and interpret emotions in the speech of the person, for example, it will determine whether a sentence is funny or sad or whatever.
Lip Reading AI: This system would read the lips of the person to determine the depth and breadth of the speech, for example, whether a sentence is spoken at a high pitch and slow speed, a normal pitch and normal speed, or whatever.
Translation Engine: This would be the processing area. It would take the interpreted input from the three AIs, choose the most appropriate and similar voice model, and create sentence-by-sentence speech. Basically, it would convert the English audio into Japanese audio based on the data from the 3 AI systems.
Transcription Engine: This is where the audio would be converted into subtitles. Based on the input from the three AIs, it would transcribe the speech and also add audio cues, such as when the person chuckles or sneezes or gasps or something else. In simple words, it would take the data from the 3 AI systems and create subtitles in the Japanese language.
Output Video: After passing through both engines, the translated audio and the subtitles would be superimposed on the original video to create a transcribed and translated English-to-Japanese video.
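For the programmers among you, the pipeline above can be sketched in code. This is only a toy skeleton under my own assumptions: every function below is a stub standing in for what would really be a full neural network, and the canned Japanese translation stands in for a real translation model.

```python
from dataclasses import dataclass

@dataclass
class SegmentAnalysis:
    text: str     # from the speech-recognition AI
    emotion: str  # from the facial-recognition AI
    pitch: str    # from the lip-reading AI
    speed: str    # from the lip-reading AI

def speech_recognition(segment):
    # Stub: a real model would transcribe the audio track.
    return segment["audio_words"]

def facial_recognition(segment):
    # Stub: a real model would classify expressions frame by frame.
    return segment["face_label"]

def lip_reading(segment):
    # Stub: a real LipNet-style model would infer delivery cues.
    return segment["lip_pitch"], segment["lip_speed"]

def translation_engine(analysis, target="ja"):
    # Stub: a canned lookup instead of a real machine-translation model.
    canned = {"Hello. How are you?": "こんにちは。お元気ですか？"}
    return canned.get(analysis.text, analysis.text)

def transcription_engine(translated_text, analysis):
    # Attach audio/emotion cues as bracketed subtitle notes.
    cue = f"[{analysis.emotion}] " if analysis.emotion != "neutral" else ""
    return cue + translated_text

def process_segment(segment):
    # Run the three AIs, then feed their combined output to both engines.
    text = speech_recognition(segment)
    emotion = facial_recognition(segment)
    pitch, speed = lip_reading(segment)
    analysis = SegmentAnalysis(text, emotion, pitch, speed)
    audio_text = translation_engine(analysis)
    subtitle = transcription_engine(audio_text, analysis)
    return audio_text, subtitle

segment = {
    "audio_words": "Hello. How are you?",
    "face_label": "cheerful",
    "lip_pitch": "normal",
    "lip_speed": "normal",
}
audio, subtitle = process_segment(segment)
print(audio)     # こんにちは。お元気ですか？
print(subtitle)  # [cheerful] こんにちは。お元気ですか？
```

In a real system, the 3-minute video would first be cut into short sentence-level segments, and each segment would flow through something like `process_segment` before the translated audio and subtitles are stitched back onto the video.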
One of the biggest challenges in this process is the translation itself. As you may know, a sentence in English is spoken differently from a sentence in Arabic. The translated sentence can also be shorter or longer, with fewer or more words than the English original. Then there are sentences and words which, if translated literally into another language, mean something totally different from what they are supposed to mean. Also, with the current state of AI, as far as I’m aware, it is quite difficult for an AI to differentiate and understand two or more people speaking simultaneously. However, I believe it’s just a matter of time before AIs learn to differentiate, understand and interpret multiple people’s speech.
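To make the shorter-or-longer problem concrete: if the dubbed audio should stay roughly in sync with the speaker’s lips, the system has to stretch or compress the synthesized speech to fit the original clip. Here is a tiny sketch of that calculation; the function name, the 1.25× clamp, and the example durations are all my own illustrative assumptions, not part of any real system.

```python
def stretch_ratio(original_secs, translated_secs, max_stretch=1.25):
    """Ratio by which to speed up or slow down the translated audio so it
    fits the original clip, clamped so the dubbed voice never sounds
    unnaturally fast or slow."""
    ratio = translated_secs / original_secs
    return min(max(ratio, 1 / max_stretch), max_stretch)

# A Japanese line that runs 1.5x longer than the English original would
# need to be compressed, but the clamp caps the speed-up at 1.25x.
print(stretch_ratio(2.0, 3.0))  # 1.25
```

Anything the clamp can’t absorb would have to be handled upstream, e.g. by asking the translation engine for a shorter phrasing.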
The Use Cases
This type of interconnected system would potentially be helpful in dubbing movies into different languages, given the capabilities of a fully developed AI system. It could also be really helpful in translating any video into any language, for example, translating and transcribing a TED Talk video from English to Spanish. This type of system could also help translate a live speech into any language. A possible scenario could be a United Nations meeting where a representative from, let’s say, India speaks in Hindi, and through the use of this system, the speech gets live-translated into the languages of all the representatives of the different countries.
One key difference of this system from the current live translation systems would be the accuracy of the speech, pitch, depth & breadth of speech, emotions of speech and the live transcription of the speech.
Hope for the future
Seeing the current trends in AI, I really believe that anything is possible, given enough time and training for the neural networks and the AI to develop. I also feel that someone has probably already started working on something similar, especially companies like Google and Amazon, who just amazed us with the different, naturalistic voices in their assistant systems.
I hope you really liked my idea, and if you were looking for some inspiration to begin your own work, perhaps I might have inspired you.
Thanks for reading!