AI Translation with the Human Touch

Olga Suvorova
Dec 13, 2022


In our last post, we teased our Human Touch model. Now it’s time to explore it in depth: why do we need experts in addition to AI and how does Neurodub implement this collaboration?

According to a survey conducted by Tidio, 42 percent of respondents believe AI could replace translators. At Neurodub, we don’t think it should come down to a choice between humans and machines. After all, why choose one when we can have the best of both worlds?

Whether it’s written or spoken, language is alive. It’s constantly evolving, making it hard for machines to capture every nuance. With Human Touch, humans help improve the result by highlighting subtleties in language and new terminology, and by handling different dialects and cultural references.

What is Human Touch

To ensure that our video recognition and translation quality meet customer expectations, we use an expertly tailored technology that facilitates efficient collaboration between humans and machines. When building our pipelines, we rely on the expertise of both professional linguists and ordinary people from different backgrounds. This allows us to scale up while maintaining quality.

Our Human Touch model works well for subtitle translation and generating voice-overs, but let’s focus on subtitles first.

How neural subtitle translation works

We start by sending the audio to the speech recognition and translation models to get an automatically generated subtitle draft. Then, we put together a glossary for the project. Why? First, glossaries help maintain consistent spelling within a project. This can be necessary when a word has multiple accepted spellings, as is the case with gray and grey or analog and analogue.

Second, some names and terms can be difficult to make out phonetically. Others might be newly coined terms or new brand names. When a model encounters an unknown word, it could make a mistake recognizing or translating it. With a glossary, users can easily swap in the right word without wasting time searching for the correct spelling. For example, when translating Korean movies and TV series, names have to be written according to specific rules. The same goes for English: automatic speech recognition (ASR) could fail with certain words and names. A glossary certainly wouldn’t hurt when dealing with porpoise, Worcestershire, or Coosawhatchie River since their pronunciation doesn’t follow standard spelling rules.
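To make the glossary idea concrete, here is a minimal sketch in Python. It is not Neurodub's implementation: the glossary entries, the misheard variant "worcester sure", and the function name apply_glossary are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical project glossary: lowercase variant -> preferred spelling.
GLOSSARY = {
    "grey": "gray",
    "analogue": "analog",
    "worcester sure": "Worcestershire",  # assumed ASR mishearing
}


def apply_glossary(text, glossary):
    """Replace whole-word glossary variants with the preferred spelling."""
    # Try longer variants first so multi-word entries are not shadowed.
    variants = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    # Look up each match by its lowercase form; keys must be lowercase.
    return pattern.sub(lambda m: glossary[m.group(0).lower()], text)


print(apply_glossary("The grey analogue dial", GLOSSARY))
```

The word boundaries keep a word like "greyhound" untouched. Note that this sketch always inserts the glossary spelling as written, so a capitalized variant at the start of a sentence would need extra handling.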

If there are no specific requirements from the client, we rely on Netflix’s guidelines, one of the gold standards for the media industry. Usually, guidelines consist of a myriad of rules, from specific markup to strict adherence to timings. For instance, an ellipsis indicates a long pause or the end of a dialogue.
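Rules about markup and timings lend themselves to automatic checks. The sketch below shows what such a check could look like. The function name check_cue and all the limits in it are illustrative placeholders, not values quoted from Netflix's or any client's style guide.

```python
def check_cue(start, end, text, max_chars=42, max_lines=2,
              min_duration=0.8, max_duration=7.0):
    """Return a list of guideline problems for one subtitle cue.

    start and end are in seconds. All limits are placeholder defaults.
    """
    problems = []
    lines = text.split("\n")
    if len(lines) > max_lines:
        problems.append("too many lines")
    if any(len(line) > max_chars for line in lines):
        problems.append("line too long")
    duration = end - start
    if duration < min_duration:
        problems.append("too short")
    if duration > max_duration:
        problems.append("too long")
    return problems


print(check_cue(0.0, 0.5, "a\nb\nc"))
```

A cue that passes every check returns an empty list, so the output can be used to flag only the cues that need a second look.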

Next, we cut the video file into short segments and pass them along to human experts. Each expert sees a segment and the corresponding AI-made text for the future subtitles. Using their linguistic expertise and common sense, they edit the subtitles following our thorough guidelines. Experts correct spelling and punctuation, split or merge adjacent phrases, check toponyms, and more. To scale the process, we divide the task into separate assignments, which are then given to multiple experts. This lets us process a bigger workload since it’s not just one person reviewing an entire text; it’s a whole team.
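The splitting step can be sketched as follows. This is an assumption-laden illustration rather than our production code: the Cue class, the 30-second cap, and the round-robin distribution are all hypothetical choices made to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


def split_into_assignments(cues, max_duration=30.0):
    """Group consecutive cues into assignments spanning at most max_duration."""
    assignments, current = [], []
    for cue in cues:
        # Close the current assignment if adding this cue would exceed the cap.
        if current and cue.end - current[0].start > max_duration:
            assignments.append(current)
            current = []
        current.append(cue)
    if current:
        assignments.append(current)
    return assignments


def distribute(assignments, experts):
    """Hand out assignments to experts in round-robin order."""
    plan = {expert: [] for expert in experts}
    for i, assignment in enumerate(assignments):
        plan[experts[i % len(experts)]].append(assignment)
    return plan


cues = [Cue(i, i * 10.0, (i + 1) * 10.0, f"line {i}") for i in range(7)]
plan = distribute(split_into_assignments(cues), ["expert_a", "expert_b"])
print({name: len(items) for name, items in plan.items()})
```

Keeping cues in consecutive groups means each expert still sees a little surrounding context, which matters when phrases need to be split or merged.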

After Human Touch, we automatically align the text and restore the timings. In the last step, the finished subtitles are proofread for compliance with the client’s requirements. At all stages, we use spot checks to maintain quality.
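As a rough illustration of restoring timings, the sketch below spreads a segment's original time span over the edited lines in proportion to their length. This is a deliberately simple heuristic chosen for the example, not a description of the alignment we actually run, and the function name restore_timings is hypothetical.

```python
def restore_timings(start, end, lines):
    """Spread the span [start, end] over lines by character count.

    Returns a list of (line_start, line_end, text) tuples in seconds.
    """
    total = sum(len(line) for line in lines)
    if total == 0:
        return []
    timed, cursor = [], start
    for line in lines:
        duration = (end - start) * len(line) / total
        timed.append((round(cursor, 3), round(cursor + duration, 3), line))
        cursor += duration
    return timed


print(restore_timings(0.0, 4.0, ["abcd", "abcdefghijkl"]))
```

A length-based split ignores pauses and speaking rate, so an approach that aligns the text against the audio itself would likely give more accurate results.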

What we get in the end

The Human Touch model has proven itself even in complicated cases. For example, when a character in a fantasy TV series casts a spell in Latin, automatic speech recognition transcribes the original track, while human experts verify the text against Latin sources. Beyond that, humans can tell when someone intentionally distorts words to parody a strong accent. This makes the translation more natural and improves the quality of voice-overs.

At Neurodub, Human Touch plays a dual role. First, human experts help algorithms learn, improving them with every project. This makes our “AI-humanizing linguists” the true co-creators of the final product. Second, the model allows our clients to outsource the dubbing process altogether. All they have to do is send us a video, which we automatically translate and review with our human experts. Together, these two roles allow us to find the optimal solution to any problem.



Olga Suvorova

Head of Crowd Localization Solutions @Neurodub