AI Automated Dubbing: Building an End-to-End YouTube Audio Translation Platform on AWS Serverless Architecture
Every once in a while we face this problem: an interview with our favourite star, or an important documentary, is on YouTube, but in a language we don't understand.
What if I told you that, with the tools in this post, one click can get you a new video with the audio converted to the language of your choice? It also preserves pauses and matches the utterance length of the original language.
Here is a demo (a playlist of 4 videos):
Here we convert an English (US accent) documentary into different languages (Spanish and Russian) and a different accent (Indian English).
1st video: original English audio (American accent)
2nd video: translated English audio (Indian accent)
3rd video: translated Spanish audio
4th video: translated Russian audio
If you execute well, you could even build a startup out of this :)
Be prepared to adapt the code and encounter minor errors, as you will need to host the Lambdas (code provided), create S3 buckets, create credential files, etc. to get it working on your end.
All the necessary code, with deployable Lambdas and a well-commented README, can be found here:
The input to the program is a YouTube video URL in any supported language. It works best for single-speaker videos (e.g. documentaries, monologues). At the time of writing, this pipeline supports only English and Spanish as input languages for Amazon Transcribe, so the input YouTube URL should point to English (US) or Spanish (US) audio.
But if you adapt this to Google Cloud services, you could work with many other languages.
The output is a downloaded video whose audio track has been replaced with the language of your choice, as you saw in the demo above.
Services used: Amazon Transcribe, Amazon Translate, Amazon Polly, and AWS Lambda (three S3-triggered functions).
Step 1: Once a YouTube URL is entered, the video file (.mp4) is downloaded locally and the audio file (.wav) is extracted from it. Only the audio file is uploaded to an S3 bucket.
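A minimal sketch of Step 1, assuming `yt-dlp` and `ffmpeg` are installed and a hypothetical bucket name (`dubbing-input-audio`); the repository's own script may use different tools and names:

```python
import subprocess

INPUT_BUCKET = "dubbing-input-audio"  # hypothetical bucket name, change to yours

def audio_key_for(video_path: str) -> str:
    """Derive the .wav S3 key from the local video filename."""
    stem = video_path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return f"{stem}.wav"

def download_and_upload(url: str) -> str:
    # Download the video locally as video.mp4 (yt-dlp must be on PATH).
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", "video.mp4", url], check=True)
    # Strip the audio track to a mono 16 kHz WAV, a format Transcribe accepts.
    subprocess.run(["ffmpeg", "-y", "-i", "video.mp4",
                    "-vn", "-ac", "1", "-ar", "16000", "audio.wav"], check=True)
    key = audio_key_for("video.mp4")
    import boto3  # imported here so the pure helper above works without AWS deps
    boto3.client("s3").upload_file("audio.wav", INPUT_BUCKET, key)
    return key
```

Uploading only the .wav (not the full .mp4) keeps the S3 payload small, since the downstream Lambdas never need the video frames.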
Step 2: As soon as the audio file (.wav) lands in the S3 bucket, an S3 trigger starts an Amazon Transcribe job from a Lambda function. This is the first Lambda function, shown in the lambda1_s3_trigger_transcription folder of the code repository.
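The shape of that first handler looks roughly like this; the bucket name, job naming scheme, and language code are illustrative (the actual handler is in lambda1_s3_trigger_transcription):

```python
import urllib.parse

def media_uri(bucket: str, key: str) -> str:
    """Build the s3:// URI Transcribe expects (S3 event keys are URL-encoded)."""
    return f"s3://{bucket}/{urllib.parse.unquote_plus(key)}"

def lambda_handler(event, context):
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    import boto3  # deferred so the URI helper above is testable offline
    boto3.client("transcribe").start_transcription_job(
        TranscriptionJobName=key.replace("/", "-").rsplit(".", 1)[0],
        LanguageCode="en-US",          # or "es-US" for Spanish input
        MediaFormat="wav",
        Media={"MediaFileUri": media_uri(bucket, key)},
        OutputBucketName="dubbing-transcripts",  # hypothetical output bucket
    )
```

Writing the transcript to a second bucket (rather than the input bucket) is what lets the next S3 trigger fire cleanly without re-triggering this one.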
Step 3: Once transcription is done, the output JSON file is stored in another S3 bucket. A new S3 trigger starts a Lambda function that converts the JSON to a subtitle (.srt) file and uses Amazon Translate to produce a translated subtitle (.srt) file from it. This is the second Lambda function, shown in the lambda2_json_to_srt folder of the code repository.
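Two pieces of that second Lambda can be sketched independently of the repo's exact code: formatting SRT timestamps, and translating each subtitle line with Amazon Translate. The function names here are my own, not the repository's:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp the SRT format requires."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def translate_lines(lines, source="en", target="es"):
    """Translate each subtitle line; timestamps stay untouched."""
    import boto3  # deferred so srt_timestamp is testable offline
    translate = boto3.client("translate")
    return [translate.translate_text(Text=text,
                                     SourceLanguageCode=source,
                                     TargetLanguageCode=target)["TranslatedText"]
            for text in lines]
```

Keeping the original timestamps while translating only the text is what preserves the pauses mentioned at the start of the post.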
Step 4: Once the translated .srt file is available in the S3 bucket, an S3 trigger starts another Lambda function, which does two jobs.
4.1) First, it uses the translated subtitle file to mute the original audio file (.wav) over the time ranges listed in that subtitle file. This silences the voice in the original audio while preserving any regions with music or background noise.
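The muting step can be sketched like this, assuming pydub for audio editing (the repository may use a different library); merging overlapping subtitle spans first avoids double-processing:

```python
def merge_spans(spans):
    """Merge overlapping (start_ms, end_ms) subtitle spans before muting."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def mute_speech(wav_path, spans, out_path):
    """Replace each spoken span with silence of equal length."""
    from pydub import AudioSegment  # deferred; pydub is an assumption here
    audio = AudioSegment.from_wav(wav_path)
    for start, end in merge_spans(spans):
        audio = audio[:start] + AudioSegment.silent(duration=end - start) + audio[end:]
    audio.export(out_path, format="wav")
```

Because each span is replaced by silence of exactly the same duration, the track length never drifts, so the subtitle timestamps stay valid for the overlay in step 4.2.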
4.2) Using Amazon Polly, a voice is synthesized for every line of the translated subtitle file and overlaid onto the muted audio from step 4.1. The SSML "max duration" parameter ensures that Polly's translated, synthesized speech does not exceed the duration of the corresponding subtitle in the original language. This is the third Lambda function, shown in the lambda3_srt_to_audio folder of the repository.
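The Polly call can be sketched as follows. The `amazon:max-duration` SSML attribute (supported on Polly's standard voices) is what caps each utterance's length; the voice ID and output format are illustrative choices:

```python
def ssml_for(text: str, max_ms: int) -> str:
    """Wrap a subtitle line so Polly compresses speech to fit its time slot."""
    return (f'<speak><prosody amazon:max-duration="{max_ms}ms">'
            f"{text}</prosody></speak>")

def synthesize(text, max_ms, voice="Lupe"):
    """Synthesize one translated subtitle line as raw PCM bytes."""
    import boto3  # deferred so ssml_for is testable offline
    response = boto3.client("polly").synthesize_speech(
        Text=ssml_for(text, max_ms),
        TextType="ssml",
        VoiceId=voice,           # e.g. "Lupe" for US Spanish, "Aditi" for Indian English
        OutputFormat="pcm",      # raw PCM is easy to overlay onto the muted .wav
    )
    return response["AudioStream"].read()
```

Passing the original subtitle's duration as `max_ms` is what keeps the dubbed speech from spilling over into the next line's time slot.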
Step 5: The final audio file (.wav) with the Polly-synthesized voice overlaid is downloaded. The audio in the original video is replaced with this translated audio, and a new translated video (.mp4) is generated locally.
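The final swap can be done with ffmpeg's stream mapping, copying the video stream untouched and taking the audio from the dubbed .wav; this is a sketch, not necessarily how the repository does it:

```python
def ffmpeg_cmd(video_in: str, wav_in: str, video_out: str) -> list:
    """Build the ffmpeg command that swaps the audio track."""
    return ["ffmpeg", "-y", "-i", video_in, "-i", wav_in,
            "-map", "0:v:0", "-map", "1:a:0",  # video from input 0, audio from input 1
            "-c:v", "copy",                    # no re-encode of the video stream
            "-shortest", video_out]

def replace_audio(video_in: str, wav_in: str, video_out: str) -> None:
    import subprocess  # deferred so ffmpeg_cmd is testable without ffmpeg installed
    subprocess.run(ffmpeg_cmd(video_in, wav_in, video_out), check=True)
```

`-c:v copy` avoids re-encoding, so this step is fast and lossless for the video frames.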
Happy coding! If you have any questions, reach out to me at: firstname.lastname@example.org