Video Dubbing using AI

Dattu Burle
Mar 16, 2020
Dubbing is an ART
Chathur 3.0: Then I'm a pro artist, I can dub in any language

Today there are about 6,800 different languages spoken across the world, and in an increasingly globalised world nearly every culture interacts with other cultures in some way. That means there is an incalculable number of translation requirements every second of every day across the world.

Many people miss out on good content on social media because of the language barrier. But with the help of our Bot, anyone can convert a video from any language to their desired language just by uploading the video, selecting the original (input) language and then selecting the desired (output) language.

Following are the sections in this blog, so you can jump around and focus on just the one(s) you are interested in.

  • Goals and objectives of the VideoDubbing Bot
  • About Recurrent Neural Networks
  • About LSTM
  • About the Seq2Seq model
  • Attention mechanism in neural networks
  • Flow chart of the entire process
  • Our approach to language translation
  • Running the VideoDubbing Bot
  • Demo of the original video and the dubbed video

Goals and objectives of the VideoDubbing Bot

  • To break the language barrier.
  • Make the system extensible: as of now, we are targeting educational videos and news videos. In the future, we aim to dub any type of video, including movies.
  • Make the system very easy to use: just come to our platform, upload the video, select the original language, select the desired language and finally download the converted video.

Recurrent Neural Network (RNN)

Before getting into what an RNN is, let's start with language models. The main goal of a language model is to predict the next word with the help of the previous words. In the past this was solved using the n-gram model, a window-based approach that considers the n-1 previous words and estimates the probability of the next word. If we use a small window, we miss the context of the sentence; if we use a large window, many of the required word combinations will not be present in our dataset, so the model doesn't work well on sentences outside the dataset. RNNs were introduced to solve these problems.
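To make the window-based idea concrete, here is a tiny bigram (n = 2) model in Python. It is only an illustration for this post, not code from the bot:

    # Count how often each word follows another, then turn counts into probabilities.
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat the cat ate the fish".split()

    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def next_word_prob(prev, nxt):
        total = sum(counts[prev].values())
        return counts[prev][nxt] / total if total else 0.0

    print(next_word_prob("the", "cat"))  # 0.5 -- "cat" follows "the" in 2 of 4 cases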

Recurrent Neural Networks, or simply RNNs, are a special kind of neural network that is capable of dealing with sequential data, like videos (sequences of frames) and, more commonly, text sequences or basically any sequence of symbols. The beauty of it is that the network doesn't need to know what the symbols mean: it infers the meaning of the symbols by looking at the structure of the text and the relative positions of the symbols. There are some amazing articles on RNNs and what they are capable of. To put it simply, an RNN, unlike an MLP or CNN, has an internal state. Think of this state as the memory of the network. As the RNN devours a sequence (sentence) word by word, the essential information about the sentence is maintained in this memory unit (internal state), which is updated at each timestep.
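As a rough sketch of what "updating the memory at each timestep" means, here is a single vanilla-RNN step in NumPy (toy sizes, not the bot's code):

    import numpy as np

    hidden_size, input_size = 4, 3
    Wxh = np.random.randn(hidden_size, input_size) * 0.01   # input  -> hidden
    Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden -> hidden
    bh = np.zeros(hidden_size)

    def rnn_step(x_t, h_prev):
        # The new memory mixes the current word vector with the previous state.
        return np.tanh(Wxh @ x_t + Whh @ h_prev + bh)

    h = np.zeros(hidden_size)
    for x_t in np.random.randn(5, input_size):  # a "sentence" of 5 word vectors
        h = rnn_step(x_t, h)                    # the memory is updated at each timestep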

Long Short Term Memory (LSTM)

The naive version of an RNN is typically called a Vanilla RNN, which is pretty poor at remembering long sequences. There are more complex versions of the RNN, like LSTM (Long Short Term Memory) and GRU (Gated Recurrent Unit) networks. The main difference between a Vanilla RNN and LSTM/GRU networks is the architecture of the memory unit. An LSTM cell consists of multiple gates for remembering useful information, forgetting unnecessary information and carefully exposing information at each time step.
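To show those gates concretely, here is an illustrative single LSTM step in NumPy; the shapes and weights are made up for the example and are not taken from the bot:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # W packs the weights of all four gates; x_t is the current word vector.
        z = W @ np.concatenate([x_t, h_prev]) + b
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
        g = np.tanh(g)                                # candidate information
        c = f * c_prev + i * g                        # forget the old, add the new
        h = o * np.tanh(c)                            # carefully expose the memory
        return h, c

    D, H = 3, 4                                       # toy input and hidden sizes
    W = np.random.randn(4 * H, D + H) * 0.1
    b = np.zeros(4 * H)
    h, c = lstm_step(np.random.randn(D), np.zeros(H), np.zeros(H), W, b)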

Seq2Seq Model

The Sequence to Sequence model (seq2seq) consists of two RNNs: an encoder and a decoder. The encoder reads the input sequence word by word and emits a context (a function of the final hidden state of the encoder), which would ideally capture the essence (a semantic summary) of the input sequence. Based on this context, the decoder generates the output sequence one word at a time, looking at the context and the previous word during each timestep. This is a ridiculous oversimplification, but it gives you an idea of what happens in seq2seq.
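For a concrete picture, here is a minimal encoder-decoder sketch in Keras; the vocabulary sizes and layer widths below are placeholders for illustration:

    import tensorflow as tf

    vocab_in, vocab_out, units = 5000, 5000, 256  # illustrative sizes

    # Encoder: read the source sentence and keep only its final states (the "context").
    enc_inputs = tf.keras.Input(shape=(None,))
    enc_emb = tf.keras.layers.Embedding(vocab_in, units)(enc_inputs)
    _, state_h, state_c = tf.keras.layers.LSTM(units, return_state=True)(enc_emb)

    # Decoder: generate the target sentence conditioned on that context.
    dec_inputs = tf.keras.Input(shape=(None,))
    dec_emb = tf.keras.layers.Embedding(vocab_out, units)(dec_inputs)
    dec_out, _, _ = tf.keras.layers.LSTM(
        units, return_sequences=True, return_state=True
    )(dec_emb, initial_state=[state_h, state_c])
    outputs = tf.keras.layers.Dense(vocab_out, activation="softmax")(dec_out)

    model = tf.keras.Model([enc_inputs, dec_inputs], outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")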

Attention Mechanism in Neural Networks

A recurrent neural network or one of its variants (LSTM/GRU/…) is a means of learning from a sequence using back-propagation through time. In this case, the sequence is generally fed to the network once. Different elements of the network control how much of what information is carried along through long sequences.

Imagine you were given a long paragraph of text and, after reading it, asked some fact-based questions. Once you see a question, you realize that you don't remember the fact being asked about. What do you do? You go back and re-read, this time paying ATTENTION to what the question demands. In no time you spot the answer. The attention mechanism in neural networks works similarly. The input data (text/image) is fed to the network (which may be a [bi-directional] LSTM/GRU). In a question-processing unit, a representation of the question is also fed in. Now, along with the question vector, another unit re-iterates over the input data (text/image sequence), trying to learn an importance coefficient for chunks of this data. In the simplest terms, looping over the input data a second time while trying to learn this importance coefficient is the 'attention mechanism' (you can see why from the long-paragraph analogy).
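In code, the "importance coefficient" idea boils down to scoring every chunk of the input against a query and taking a weighted average. A toy dot-product version, purely as an illustration:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def attention(query, encoder_outputs):
        scores = encoder_outputs @ query      # one score per input chunk
        weights = softmax(scores)             # importance coefficients, they sum to 1
        context = weights @ encoder_outputs   # re-read the input, weighted by importance
        return context, weights

    T, d = 7, 16                              # 7 timesteps, 16-dim states (toy numbers)
    context, weights = attention(np.random.randn(d), np.random.randn(T, d))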

Flow chart of the entire process

Our approach to language translation

A popular approach was to break the source text into segments and then compare them to a bilingual corpus, using statistical evidence and probabilities to choose the most likely translations.

Nowadays the most widely used machine translation system in the world is Google Translate. That's why we are using the Google API.
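As a minimal illustration of calling Google translation from Python, here is a sketch that assumes the unofficial googletrans package (the official Google Cloud Translation client works similarly; the bot's actual calls are in the repo linked below):

    from googletrans import Translator  # unofficial client for Google Translate

    translator = Translator()
    result = translator.translate("Hello, how are you?", src="en", dest="hi")
    print(result.text)  # the Hindi translation of the sentence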

Google uses deep learning to translate from a given language to another with state-of-the-art results. Google published a paper discussing a system they integrated into their translation service, called Neural Machine Translation. It's an encoder-decoder model inspired by similar work in other papers on topics like text summarization.

So whereas Google Translate used to translate from language A to language B by memorizing phrase-to-phrase translations, with this new NMT (Neural Machine Translation) it can translate directly from one language to the other: it doesn't memorize phrase-to-phrase translations, instead it encodes the semantics of the sentence.

This encoding is generalised, so it can even translate between a language pair, like Japanese to Korean, that it hasn't explicitly seen before.

But we can't simply use an LSTM recurrent network to encode a sentence in language A, have the RNN spit out a hidden state that represents the vectorised contents of the sentence, and then feed that to a decoder which generates the translated sentence in language B word by word. The drawback of this architecture is limited memory: the hidden state of the LSTM is where we are trying to cram the whole sentence we want to translate. We could increase the hidden size of the LSTM; after all, it is supposed to remember long-term dependencies. But as the hidden size of the LSTM increases, the training time increases exponentially.

So we brought attention into the mix. This increases the storage of our model without changing the functionality of the LSTM.

We built our model using TensorFlow's built-in embedding_attention_seq2seq function.

Running the VideoDubbing Bot

Download the code from:

https://github.com/DattuBurle/Video-Dubbing-using-AI/blob/master/main.py

https://github.com/DattuBurle/Video-Dubbing-using-AI/blob/master/audio_text.py
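For a rough picture of how the pieces fit together, here is an illustrative end-to-end pipeline. It is not the exact code from main.py or audio_text.py; it assumes the moviepy, SpeechRecognition, googletrans and gTTS packages, and the file names and language codes are placeholders:

    import speech_recognition as sr
    from googletrans import Translator
    from gtts import gTTS
    from moviepy.editor import VideoFileClip, AudioFileClip

    def dub_video(video_path, src_lang, dest_lang, out_path):
        # 1. Pull the audio track out of the uploaded video.
        clip = VideoFileClip(video_path)
        clip.audio.write_audiofile("original_audio.wav")

        # 2. Speech -> text in the original language.
        recognizer = sr.Recognizer()
        with sr.AudioFile("original_audio.wav") as source:
            audio = recognizer.record(source)
        text = recognizer.recognize_google(audio, language=src_lang)

        # 3. Translate the transcript into the desired language.
        translated = Translator().translate(text, src=src_lang, dest=dest_lang).text

        # 4. Text -> speech in the desired language.
        gTTS(translated, lang=dest_lang).save("dubbed_audio.mp3")

        # 5. Replace the original audio track and export the dubbed video.
        dubbed = clip.set_audio(AudioFileClip("dubbed_audio.mp3"))
        dubbed.write_videofile(out_path)

    # Placeholder file names and language codes, purely for illustration.
    dub_video("lecture.mp4", "en", "hi", "lecture_dubbed.mp4")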

Thanks to Vipul Sir and Siraj Raval for sharing knowledge.

🧐!!! Open to all Doubts

Thank you!!!
