Amplify Reach with Automated Dubbing: Introducing Neurodub

Anton Dvorkovich
Nov 3, 2022


There has been an explosion of online content over the past few years. In 2020, more than 500 hours of video were being uploaded to YouTube every minute. Add to that the countless articles and hours of audio published, and you get an unfathomable amount of accessible information. But while content is becoming more accessible, some barriers remain. English, the most popular language online, is used by just 25.9 percent of internet users worldwide. The rest speak other languages, from Spanish to Arabic to Tamil. Communicating with such a diverse audience requires impeccable language skills.

For text, there are tools that highlight the important parts and eliminate the rest. With video, it’s more complicated: classical localization methods are usually expensive and often ineffective. AI localization, which uses neural networks to overcome communication barriers, is the only viable solution.

That’s exactly what Neurodub offers: end-to-end smart video localization for 70-plus languages based on human-AI collaboration. In this post, we’ll explain our technology step by step.

How it works

Neurodub’s state-of-the-art AI technology disrupts the traditional localization market by going beyond subtitles and providing voice-overs at low cost. Users can review and edit the resulting video themselves with a simple built-in editor or delegate the review to a professional: Neurodub’s innovative quality assurance method includes a final review of localized videos by human experts.

Step 1. Intelligent transcription

Quality transcription plays a crucial role in content localization. Our neural networks recognize speech, group words into sentences, and add punctuation. Neurodub considers both the context and intonation to ensure accurate voice-overs.

Step 2. Speaker designation and labeling

When a video has multiple speakers, we have to determine who said what to fully convey the meaning.

Neurodub determines the number of speakers and vocal qualities and distinguishes between female and male speakers. This lets us make our voice-overs as close to the original as possible and is essential in creating accurate translations: in languages like Spanish, the same sentence is translated differently depending on whether the speaker is male or female.
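
To illustrate the "who said what" step, here is a minimal Python sketch that attaches a speaker label to each transcribed sentence by picking the speaker turn with the largest time overlap. It is an illustration only, not Neurodub's actual method: the Segment type, the assign_speakers function, and the sample timings are all hypothetical, and the sketch assumes that sentence and speaker-turn timestamps are already available from earlier steps.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the video
    end: float
    label: str    # sentence text, or a speaker id for speaker turns


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the time span shared by two segments."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def assign_speakers(sentences, speaker_turns):
    """Pair each sentence with the speaker whose turn overlaps it the most.

    Returns a list of (sentence text, speaker id) tuples. The speaker id is
    None when no speaker turn overlaps the sentence at all.
    """
    result = []
    for sentence in sentences:
        best = max(speaker_turns, key=lambda t: overlap(sentence, t), default=None)
        if best is None or overlap(sentence, best) == 0.0:
            result.append((sentence.label, None))
        else:
            result.append((sentence.label, best.label))
    return result


# Hypothetical sample data: two sentences and two speaker turns.
sentences = [Segment(0.0, 2.0, "Hello there."), Segment(2.1, 4.0, "Hi!")]
turns = [Segment(0.0, 2.05, "speaker_1"), Segment(2.05, 5.0, "speaker_2")]
print(assign_speakers(sentences, turns))
# [('Hello there.', 'speaker_1'), ('Hi!', 'speaker_2')]
```

With speaker labels attached to sentences, later steps could choose a matching voice and the right grammatical gender for each line.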

Step 3. Context-aware translation

As of 2022, machine translation has become a regular part of life: millions of people use it every day for work, school and entertainment. We take it to the next level.

To create a high-quality translation for subtitles, we need to do more than just look at isolated sentences. Flow and consistency are equally important.

We also have to understand the subject of the sentence, since the same words can be translated differently depending on the context.

Neurodub also allows you to add glossaries to aid the translation. This gives users a real hand in the localization. For instance, you can easily change one incorrect term throughout a video by adding the proper translation to the project’s glossary.
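
The glossary idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how Neurodub applies glossaries: here the glossary maps a term the machine translation got wrong to the preferred term, and it is applied as a whole-word, case-insensitive replacement over every subtitle line. A production system may instead feed the glossary to the translation model itself. The apply_glossary function and the sample lines are hypothetical.

```python
import re


def apply_glossary(lines, glossary):
    """Replace every glossary term in every subtitle line.

    `glossary` maps an unwanted term to its preferred replacement.
    Matching is case-insensitive and limited to whole words.
    """
    if not glossary:
        return list(lines)
    # Try longer terms first so multi-word entries win over shorter ones.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): replacement for term, replacement in glossary.items()}
    return [
        pattern.sub(lambda match: lookup[match.group(0).lower()], line)
        for line in lines
    ]


# Hypothetical subtitle lines with one recurring mistranslated term.
subtitles = [
    "The nervous network learns fast.",
    "A nervous network needs data.",
]
print(apply_glossary(subtitles, {"nervous network": "neural network"}))
# ['The neural network learns fast.', 'A neural network needs data.']
```

One glossary entry fixes the term everywhere it appears, which is the point of the feature: the correction is made once rather than line by line.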

Step 4. Neural voice-over

Neurodub goes beyond subtitles: we’ve developed full-fledged neural dubbing, which delivers studio-grade quality using neural networks.

With voice-overs, the original audio track can be distracting and add excess noise. Our algorithms remove the original voice and allow for a smooth listening experience. We also try to pick the most similar voices for the dubbing track.

Every localization professional knows that the same text in different languages will be a different length. For example, an English text and its Italian translation can differ in length by up to 30 percent. For voice-overs, this means the original speech and the translation may fall significantly out of sync during playback. To avoid this, we have to synchronize the two speech streams.

When we know the length of speech, we can optimize the sound to match the picture. We use timings to generate audio of the desired duration, and by reducing unnecessary pauses between words and phrases, we can achieve synchronization. However, if the voices still don’t sync up, the algorithm will accelerate the speech rate.
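
The two-stage idea described above (trim pauses first, speed up only if needed) can be sketched as follows. This is a minimal illustration under our own assumptions, not Neurodub's actual algorithm: the fit_to_slot function, the proportional way pauses are shortened, and the 0.05-second minimum pause are all hypothetical choices made for the example.

```python
def fit_to_slot(speech_s, pauses_s, slot_s, min_pause_s=0.05):
    """Fit dubbed audio into the time slot of the original speech.

    speech_s:    total duration of the spoken words, in seconds
    pauses_s:    durations of the pauses between words and phrases
    slot_s:      duration of the original segment the dub must fit into
    min_pause_s: shortest pause we allow (an assumed value)

    Returns (new pause durations, speed factor). A speed factor of 1.0
    means no acceleration; 1.15 means the clip is played 15% faster.
    """
    total = speech_s + sum(pauses_s)
    if total <= slot_s:
        return list(pauses_s), 1.0  # already fits, nothing to do

    # Stage 1: shorten pauses proportionally, never below the minimum.
    excess = total - slot_s
    shrinkable = [max(0.0, p - min_pause_s) for p in pauses_s]
    budget = sum(shrinkable)
    ratio = min(1.0, excess / budget) if budget > 0 else 0.0
    new_pauses = [p - s * ratio for p, s in zip(pauses_s, shrinkable)]

    # Stage 2: if the clip is still too long, accelerate the speech rate.
    remaining = speech_s + sum(new_pauses)
    speed = max(1.0, remaining / slot_s)
    return new_pauses, speed


# Hypothetical case: the dub runs 30% longer than a 4-second original line.
pauses, speed = fit_to_slot(speech_s=5.2, pauses_s=[0.3, 0.3], slot_s=4.6)
print(pauses, round(speed, 3))
```

In this sketch, acceleration is a last resort: when trimming pauses alone is enough, the speed factor stays at 1.0 and the voice keeps its natural pace.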

Human touch

To achieve the best possible localization, we use our Human touch model: an expertly curated technology for efficient collaboration between humans and machines.

Technology has advanced, and machine translation allows us to consume a lot of content. But machines aren’t perfect; for example, when a character in an English-language film suddenly switches to Latin, the algorithms can get confused. People can easily spot such inaccuracies. Therefore, we verify each step with the help of professional translators and native speakers.

We’ll explore the Human touch model further in our next post. Stay tuned!