AI-powered closed caption generator for learning videos at Simplilearn

Aman Jain
Simplilearn Engineering
4 min read · Mar 24, 2023

At Simplilearn, ensuring the reach of our video content is paramount. That’s why we believe in providing high-quality, accessible learning experiences to our students. Subtitling is a crucial aspect of achieving this goal, but creating subtitles manually can be a significant investment of time and resources. To streamline this process and make our content creation more efficient, we turned to Amazon Transcribe. In this blog post, we’ll discuss our experience with the service and how it has helped us automate our subtitle creation process.

Fig 1: Sample screenshot of our video content with subtitles

Subtitles

The beauty of subtitles lies in their ability to render audio as text that learners can easily read. Subtitles appear as text displayed at the bottom of the screen during videos and serve as a valuable tool for individuals who can hear the audio but may struggle to understand the narration. By providing subtitles, we can ensure that our content is accessible to a wider audience and promote a more inclusive learning environment.

When we release a course, we understand the importance of having subtitles available in different languages to cater to our diverse student base. However, generating subtitles manually can be a time-consuming task, and it also requires the content creator to have knowledge of different languages and accents. This can be a challenge, as accents can vary widely even within the same language, making accurate translation a difficult task. As a result, manually creating subtitles can lead to delays in releasing high-quality content. To address these challenges, we turned to automation to streamline our subtitle creation process and make our content more accessible to learners worldwide.

We explored many solutions for subtitle creation. Eventually, we chose Amazon Transcribe because of its high accuracy, its support for multiple languages, and the flexibility to train custom vocabularies and language models.

Amazon Transcribe

Powered by deep learning technologies, Amazon Transcribe is a fully managed and continuously trained automatic speech recognition service that generates time-stamped text transcripts from audio files.

Additionally, it offers other advanced features such as custom vocabulary, custom language models, and vocabulary filtering to provide greater accuracy and flexibility in transcription.

Solution Architecture

Fig 2: Automatic Speech Recognition Workflow using Amazon Transcribe

To initiate the job on Amazon Transcribe, an API call is made that publishes a message to the Kafka cluster. Apache Kafka is a fast, scalable, fault-tolerant, publish-subscribe messaging system used for this purpose. The message contains the details of the video, and when the consumer picks it up, it starts the transcription job on Amazon Transcribe.
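As a rough sketch of this step, the consumer below reads a video message from Kafka and starts a transcription job with subtitle output enabled. The topic name, bucket name, and message fields are illustrative assumptions rather than our exact schema, and the kafka-python client is used just as an example.

```python
import json

import boto3
from kafka import KafkaConsumer  # kafka-python client, used here for illustration

transcribe = boto3.client("transcribe")

# Topic name, bootstrap servers, and message fields below are assumptions.
consumer = KafkaConsumer(
    "video-transcription-requests",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    video = record.value  # e.g. {"video_id": "123", "s3_uri": "s3://bucket/lesson.mp4"}
    transcribe.start_transcription_job(
        TranscriptionJobName=f"subtitles-{video['video_id']}",
        Media={"MediaFileUri": video["s3_uri"]},
        MediaFormat="mp4",
        LanguageCode="en-US",
        OutputBucketName="subtitle-output-bucket",  # assumed bucket name
        Subtitles={"Formats": ["srt", "vtt"]},      # ask Transcribe to emit subtitle files
    )
```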

Once the job is started, a MySQL table is updated by the backend API to maintain the history of the job. In addition, we use Amazon EventBridge to trigger a Lambda function when the job status changes to “COMPLETED” or “FAILED”. EventBridge is a serverless event bus that makes it easy to route events between AWS services and your own applications. By triggering a Lambda function from EventBridge, we can automatically kick off downstream processes based on the status of the transcription job, such as notifying our backend and updating the job record.
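One way such a rule can be wired up with boto3 is sketched below: it matches Transcribe job state-change events for the two terminal states and targets the Lambda function. The rule name, target ID, and function ARN are placeholders, not our actual resource names.

```python
import json

import boto3

events = boto3.client("events")

# Match Amazon Transcribe job state-change events for the two terminal states.
events.put_rule(
    Name="transcribe-job-state-change",  # assumed rule name
    EventPattern=json.dumps({
        "source": ["aws.transcribe"],
        "detail-type": ["Transcribe Job State Change"],
        "detail": {"TranscriptionJobStatus": ["COMPLETED", "FAILED"]},
    }),
    State="ENABLED",
)

# Point the rule at the Lambda function that notifies our backend.
# (The Lambda also needs a resource-based permission allowing events.amazonaws.com
# to invoke it, omitted here for brevity.)
events.put_targets(
    Rule="transcribe-job-state-change",
    Targets=[{
        "Id": "notify-backend-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:transcribe-webhook",  # placeholder ARN
    }],
)
```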

The Lambda function forwards the request to our webhook along with the job details. If the job completed successfully, the backend processes the request and maps the generated subtitles to the video in the database. If the job failed, the message is published back to the Kafka topic for reprocessing.
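A minimal sketch of such a Lambda handler is shown below. The webhook URL and payload shape are assumptions; the event fields come from the standard Transcribe job state-change event.

```python
import json
import os
import urllib.request

# Assumed environment variable holding our backend webhook endpoint.
WEBHOOK_URL = os.environ.get("WEBHOOK_URL", "https://backend.example.com/transcribe/webhook")


def lambda_handler(event, context):
    """Forward Transcribe job state changes to the backend webhook."""
    detail = event.get("detail", {})
    payload = {
        "job_name": detail.get("TranscriptionJobName"),
        "status": detail.get("TranscriptionJobStatus"),  # "COMPLETED" or "FAILED"
    }
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return {"statusCode": resp.status}
```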

Well, that was simple!

After Amazon Transcribe created the subtitles, we noticed it hadn’t transcribed all the words correctly. Wait? What! Isn’t that its only job?

Courtesy: ABC News

Some of the technical jargon and brand names like Jira, Trello, and GitHub were not transcribed correctly; they came out as “Gira”, “Grello”, and “Getup”, respectively.

Custom vocabulary and Custom language models to the rescue

To enhance the accuracy of the transcription, a custom vocabulary and a custom language model (CLM) were used. In our use case, brand names were defined in the custom vocabulary to ensure they were recognized correctly during transcription.
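For example, a custom vocabulary covering the brand names above can be created with a call along these lines; the vocabulary name is an assumption for illustration.

```python
import boto3

transcribe = boto3.client("transcribe")

# Register brand names and technical terms so Transcribe recognizes them verbatim.
transcribe.create_vocabulary(
    VocabularyName="simplilearn-brand-names",  # assumed name
    LanguageCode="en-US",
    Phrases=["Jira", "Trello", "GitHub"],
)

# The vocabulary is then referenced when starting a transcription job, e.g.:
# Settings={"VocabularyName": "simplilearn-brand-names"}
```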

To build domain-specific speech recognition models, we used the subtitles we had previously created as training data. The subtitle files were converted into plain text files, and the domain-specific models were then created with CLM.
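Creating such a custom language model from training text already uploaded to S3 looks roughly like the sketch below; the model name, S3 prefix, and IAM role ARN are placeholders, not our real resources.

```python
import boto3

transcribe = boto3.client("transcribe")

# Train a custom language model on plain-text files built from our existing subtitles.
transcribe.create_language_model(
    LanguageCode="en-US",
    BaseModelName="WideBand",            # suited to high-quality studio audio
    ModelName="simplilearn-course-clm",  # assumed model name
    InputDataConfig={
        "S3Uri": "s3://training-data-bucket/subtitle-text/",  # assumed bucket/prefix
        "DataAccessRoleArn": "arn:aws:iam::123456789012:role/TranscribeCLMRole",  # placeholder role
    },
)

# Once trained, the model is referenced when starting a transcription job, e.g.:
# ModelSettings={"LanguageModelName": "simplilearn-course-clm"}
```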

Outcome

After processing about 40 GB of content amounting to 120 hours, statistics were generated and analyzed. The results showed that the combination of Custom vocabulary and CLM resulted in 97% accuracy of the generated subtitles, successfully transcribing all brand names and technical jargon.

In conclusion, leveraging Amazon Transcribe for automated subtitle generation has been a successful endeavor. By using custom vocabularies and language models, we were able to increase accuracy and reduce the time and effort required for manual transcription. This has allowed us to focus more on creating high-quality content and providing an improved learning experience for our users.

Thank you for reading!


Aman Jain
Simplilearn Engineering

Full stack software developer focused on breaking things. Always experimenting.