How A.I. will disrupt the Dubbing Industry

What I love most about AI is the power of the real-world applications this technology will make possible down the road.

I understand that the ideas presented here aren’t as easy to build as they sound, but I believe they are doable in the mid-term future.


Step 1: Removing the spoken words from the movie’s audio, taking the original audio file and the written script as input.
This would involve an AI locating the scripted words in the audio track, which smart assistants already do, and removing them without ruining the background sounds, which is really the only hard part.
Today’s method for reconstructing the audio without the original words, so that it can be used for dubbing, is tedious and time-consuming. The current approach takes pieces of background noise from silent moments and pastes them in as background wherever the actors talk. An AI could do something similar.
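To make the copy-and-paste idea concrete, here is a minimal sketch in Python. It assumes the hard part is already solved, meaning we already know which sample ranges contain speech, and it works on a plain list of samples rather than a real audio file. The function name patch_with_background and its parameters are my own illustration, not an existing tool. A real system would also need to crossfade and match the background much more carefully.

```python
from typing import List, Tuple


def patch_with_background(
    samples: List[float],
    speech_spans: List[Tuple[int, int]],
    background: List[float],
) -> List[float]:
    """Return a copy of `samples` in which every [start, end) speech span
    is overwritten with the `background` clip, looped to fill the span.

    Assumption: `speech_spans` come from some earlier word-spotting step.
    """
    if not background:
        raise ValueError("background clip must not be empty")

    patched = list(samples)  # leave the original track untouched
    for start, end in speech_spans:
        for i in range(start, end):
            # Loop the background clip so it covers spans of any length.
            patched[i] = background[(i - start) % len(background)]
    return patched


# Toy track: 3 quiet samples, 4 "speech" samples, 3 quiet samples.
track = [0.0, 0.0, 0.0, 9.0, 9.0, 9.0, 9.0, 0.0, 0.0, 0.0]
clean = patch_with_background(track, [(3, 7)], background=[0.1, 0.2])
```

Simply looping a short clip like this would sound repetitive on real audio, which is exactly why this step is a good candidate for a learned model.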

Step 2: Translating the movie script (using AI, why not?).
Today’s translation services are already pretty good, and they keep improving.
I believe machine translation will reach human level in about 5 years.

Step 3: Mimicking the actors’ voices and pronouncing the words of the translated script.
This means training an algorithm on the actors’ voices from the original audio so that it can mimic them, then feeding it the translated script to obtain the translated audio track. At this point, we have the translated sentences, but they are not yet applied to the original audio.
You might think that this step is further away than it really is. The startup Lyrebird can already copy your own voice if you feed it a handful of phrases. It is still a proof of concept, though, and not yet accurate enough for this kind of application.

Step 4: Applying the generated sentences to the original track, making sure that the intonation, accent, speed of speech, etc. match on both tracks.
This is the hardest step in the whole process. It must be built from scratch and may be the bottleneck of this application. One idea is to take the start and end timestamps of every sentence in the original track and use them to fit the corresponding translated sentences into the new track, but this would need a lot of tweaking in order to feel natural.
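The timestamp idea can be sketched in a few lines of Python. The sketch assumes we already have the start and end time of each original sentence and the duration of each generated sentence, and it only computes how much each dubbed sentence would have to be sped up or slowed down to fit its slot. The name playback_rates and the max_rate limit of 1.5 are assumptions of mine, chosen for illustration.

```python
from typing import List, Tuple


def playback_rates(
    original_slots: List[Tuple[float, float]],
    dubbed_durations: List[float],
    max_rate: float = 1.5,
) -> List[float]:
    """For each sentence, return the playback rate that would make the
    dubbed audio fit the original (start, end) slot, in seconds.

    A rate above 1.0 means "speak faster". Rates are clamped to
    [1 / max_rate, max_rate] so the voice never sounds too unnatural.
    """
    if len(original_slots) != len(dubbed_durations):
        raise ValueError("need one dubbed duration per original slot")

    min_rate = 1.0 / max_rate
    rates = []
    for (start, end), dubbed in zip(original_slots, dubbed_durations):
        slot = end - start
        if slot <= 0:
            raise ValueError("slot end must come after slot start")
        ideal = dubbed / slot  # rate that would fit the slot exactly
        rates.append(min(max_rate, max(min_rate, ideal)))
    return rates


# Three sentences: one slightly long, one far too long, one very short.
slots = [(0.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
rates = playback_rates(slots, dubbed_durations=[2.5, 3.0, 1.0])
```

Whenever a rate hits the clamp, the sentence still doesn’t fit its slot, and that leftover mismatch is part of the tweaking mentioned above, for example by rephrasing the translation or by adjusting the video.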

Step 5: Synchronising the video to suit the new audio better.
This includes generating extra frames to accommodate the different length of speech in other languages, and using existing deep-fake video algorithms to sync the actors’ lips to the new track. This is the last layer of fidelity to add to the dubbing, and it means that the translated movie would be indistinguishable from the original to the viewer.
An adversarial network could be deployed to measure the performance of the dubbing process, training a model to distinguish between an original video and an automatically dubbed one.
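For the extra-frames part of this step, a bit of arithmetic shows what would be needed. The Python sketch below assumes a fixed frame rate of 24 frames per second and assumes the new frames are spread evenly across the shot. Both function names are hypothetical, and actually generating the frames is the job of a separate model that isn’t shown here.

```python
from typing import List


def extra_frames_needed(
    original_seconds: float,
    dubbed_seconds: float,
    fps: float = 24.0,
) -> int:
    """Number of frames to add so the shot lasts as long as the dubbed line.

    Returns 0 when the dubbed line is not longer than the original.
    """
    extra_seconds = dubbed_seconds - original_seconds
    return max(0, round(extra_seconds * fps))


def insertion_points(n_frames: int, n_extra: int) -> List[int]:
    """Frame indices, spread evenly over the shot, before which a
    generated frame would be inserted."""
    return [(k + 1) * n_frames // (n_extra + 1) for k in range(n_extra)]


# A 2.0 s line that takes 2.5 s in the dubbed language, at 24 fps.
needed = extra_frames_needed(2.0, 2.5)
where = insertion_points(n_frames=48, n_extra=needed)
```

Spreading the new frames evenly keeps the slowdown uniform, so no single moment of the shot appears to freeze.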

I don’t know how long it will take for such a process to be completed, but I anticipate that within the next 10 years we’ll have all the building blocks ready for this kind of exciting application.

Of course, this could also be used for YouTube videos. For example, you could watch YouTube in your own language, and videos in all other languages would be automatically dubbed. This would require an extra step, which is extracting the script from each video in text form, but there are already algorithms with this capability, including YouTube’s proprietary one, which also uses AI.

If you liked this article, please give it a like and share it with your friends.
I hope you enjoyed reading it as much as I enjoyed writing it.