A serverless solution for on-demand subtitle generation

Flavio Maggi
Storm Reply
Dec 6, 2023

In today’s fast-paced digital world, reaching a global audience is more crucial than ever before. Imagine a tool that not only breaks language barriers but also ensures your content is accessible to every viewer, regardless of their hearing abilities or language proficiency.

Imagine a company with thousands of videos accumulated over the years: marketing campaigns, training videos, live streams, and so on.
The idea is to improve the accessibility of this material, with the aim of reaching a wider audience and generating profit from the archive footage.

Paying a professional translator (or, in this case, a whole team of translators) to manually transcribe each saved video and create its subtitles would certainly be a very slow process, as well as an extremely expensive one for any client.

How can we automate this process easily, efficiently, and cost-effectively?

The goal of this article is to analyse a simple, cost-effective solution that can be easily integrated into existing systems and that exploits the flexibility of the serverless paradigm and the power of machine learning to automate subtitle generation from existing videos, on demand.
All using technologies provided by AWS.

Technical Introduction: Meet the heroes!

In this section, we will focus on giving a proper introduction to the AWS services on which our solution is based.

The solution exploits some of the core services of the AWS serverless offering.

Here is a brief introduction to our heroes (at the end of the article you will find links for more in-depth information):

  • Amazon Simple Storage Service (Amazon S3) is a cost-effective object storage service that offers industry-leading scalability, data availability, security, and performance.
  • AWS Lambda is a serverless, event-driven compute service that lets you run code for virtually any type of application or backend service without provisioning or managing servers.
  • Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed to run high-performance applications at any scale.
  • Amazon Simple Notification Service (Amazon SNS) sends notifications two ways, A2A (application-to-application) and A2P (application-to-person). A2A provides high-throughput, push-based, many-to-many messaging between distributed systems, microservices, and event-driven serverless applications. A2P functionality lets you send messages to your customers with SMS texts, push notifications, and email.
  • Amazon EventBridge is a serverless event bus service that makes it easy to connect your applications with data from a variety of sources.

Let us instead dwell a moment longer on two more specialised services: AWS Elemental MediaConvert and Amazon Transcribe.

AWS Elemental MediaConvert is a file-based video processing service that provides scalable video processing for content owners and distributors with media libraries of any size. MediaConvert offers advanced features that enable premium content experiences.

In this solution, we will use one of the various features made available by this service, in particular its ability to extract audio from videos.
You can use AWS Elemental MediaConvert to create outputs that contain only audio, without video, with support for various input and output formats.
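To make this concrete, here is a minimal sketch in Python of the job settings such an audio-only output could use; the bucket paths are hypothetical, and a production job would likely tune the codec settings further:

```python
import json

def build_audio_extract_settings(input_s3_uri, output_s3_prefix):
    """MediaConvert job settings for a single audio-only MP4 (AAC) output.

    This is a minimal sketch; paths and codec values are assumptions.
    """
    return {
        "Inputs": [{
            "FileInput": input_s3_uri,
            # Pick up the default audio track from the source video.
            "AudioSelectors": {"Audio Selector 1": {"DefaultSelection": "DEFAULT"}},
        }],
        "OutputGroups": [{
            "Name": "File Group",
            "OutputGroupSettings": {
                "Type": "FILE_GROUP_SETTINGS",
                "FileGroupSettings": {"Destination": output_s3_prefix},
            },
            # No VideoDescription: the output contains audio only.
            "Outputs": [{
                "ContainerSettings": {"Container": "MP4", "Mp4Settings": {}},
                "AudioDescriptions": [{
                    "CodecSettings": {
                        "Codec": "AAC",
                        "AacSettings": {
                            "Bitrate": 96000,
                            "CodingMode": "CODING_MODE_2_0",
                            "SampleRate": 48000,
                        },
                    }
                }],
            }],
        }],
    }

settings = build_audio_extract_settings(
    "s3://my-video-bucket/raw/campaign.mp4",   # hypothetical input
    "s3://my-video-bucket/audio/campaign",     # hypothetical destination prefix
)
print(json.dumps(settings, indent=2))
```

These settings would be passed to the MediaConvert `CreateJob` API together with an IAM role that can read and write the bucket.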

Amazon Transcribe is an automatic speech recognition service that uses machine learning models to convert audio to text. You can use Transcribe as a standalone transcription service or to add speech-to-text capabilities to any application.

Again, we will focus on one of the various features offered by the service.
Amazon Transcribe supports WebVTT (*.vtt) and SubRip (*.srt) output for use as video subtitles. You can select one or both file types when setting up your batch video transcription job. When using the subtitle feature, your selected subtitle file(s) and a regular transcript file (containing additional information) are produced. Subtitle and transcription files are output to the same destination.

Subtitles are displayed at the same time text is spoken, and remain visible until there is a natural pause or the speaker finishes talking.
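As a sketch of how such a batch job could be requested, the snippet below builds the parameters for the `StartTranscriptionJob` API, asking for both subtitle formats; the job name, bucket, keys, and language code are assumptions:

```python
def build_subtitle_job(job_name, audio_uri, out_bucket, out_key):
    """Parameters for a Transcribe batch job that also emits subtitle files.

    All names here are placeholders; adapt them to your environment.
    """
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": audio_uri},
        "MediaFormat": "mp4",
        "LanguageCode": "en-US",
        "OutputBucketName": out_bucket,
        "OutputKey": out_key,
        # Ask for both WebVTT and SubRip; cue numbering starts at 1.
        "Subtitles": {"Formats": ["vtt", "srt"], "OutputStartIndex": 1},
    }

req = build_subtitle_job(
    "campaign-video-subtitles",
    "s3://my-video-bucket/audio/campaign.mp4",
    "my-video-bucket",
    "subtitles/campaign",
)

# To actually start the job (requires AWS credentials):
#   import boto3
#   boto3.client("transcribe").start_transcription_job(**req)
```

Transcribe will then write the `.vtt`/`.srt` files and the regular transcript to the same output destination.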

Technical Deep-Dive: The Architecture one step at a time

OK, now that we know the main actors in our solution, let’s get into the infrastructure and the flow that the requests for the processing of our subtitles will follow.

On-demand Subtitle Generation Architecture

The starting point of the architecture is an Amazon S3 Bucket where the content of interest will be placed.

Depending on their needs, the company's employees will be able to upload videos either through a web application that integrates the Amazon S3 API, or through scripts that use the AWS CLI or one of the AWS SDKs.
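As an illustration of the script route, a minimal upload helper using the AWS SDK for Python (boto3) might look like this; the bucket name and key prefix are assumptions:

```python
import pathlib

INGEST_BUCKET = "my-video-bucket"  # hypothetical bucket name

def ingest_key(local_path, prefix="raw/"):
    """Object key under which the uploaded video will land (prefix is an assumption)."""
    return prefix + pathlib.Path(local_path).name

def upload_video(local_path, bucket=INGEST_BUCKET):
    """Upload a local video file; requires AWS credentials at run time."""
    import boto3  # imported here so the key helper above works offline
    key = ingest_key(local_path)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key

# upload_video("campaign.mp4")  # would place the file under s3://my-video-bucket/raw/
```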

The triggering event of the stream, as anticipated, is the upload to Amazon S3, which will automatically trigger an AWS Lambda function that will have two tasks:

  • Start audio extraction from the newly uploaded video by interacting with the AWS Elemental MediaConvert service.
  • Save the initial status of processing progress within a dedicated DynamoDB table (this small expedient will facilitate monitoring in case of errors/problems in the flow!).
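A minimal sketch of such a trigger function might look like the following; the MediaConvert endpoint, IAM role, table name, and attribute names are all placeholders:

```python
import urllib.parse

MEDIACONVERT_ENDPOINT = "https://example.mediaconvert.eu-west-1.amazonaws.com"  # account-specific (placeholder)
MEDIACONVERT_ROLE = "arn:aws:iam::123456789012:role/MediaConvertRole"           # hypothetical role
STATUS_TABLE = "SubtitleJobStatus"                                              # hypothetical table

def parse_s3_event(event):
    """Extract (bucket, key) from the S3 put event that triggered the function."""
    record = event["Records"][0]["s3"]
    return record["bucket"]["name"], urllib.parse.unquote_plus(record["object"]["key"])

def initial_status_item(video_key):
    """Initial progress row for the tracking table (attribute names are assumptions)."""
    return {"VideoKey": {"S": video_key}, "Status": {"S": "AUDIO_EXTRACTION_STARTED"}}

def audio_only_settings(input_uri, destination):
    """Condensed MediaConvert settings: one AAC/MP4 audio-only output."""
    return {
        "Inputs": [{"FileInput": input_uri,
                    "AudioSelectors": {"Audio Selector 1": {"DefaultSelection": "DEFAULT"}}}],
        "OutputGroups": [{
            "OutputGroupSettings": {"Type": "FILE_GROUP_SETTINGS",
                                    "FileGroupSettings": {"Destination": destination}},
            "Outputs": [{"ContainerSettings": {"Container": "MP4", "Mp4Settings": {}},
                         "AudioDescriptions": [{"CodecSettings": {
                             "Codec": "AAC",
                             "AacSettings": {"Bitrate": 96000,
                                             "CodingMode": "CODING_MODE_2_0",
                                             "SampleRate": 48000}}}]}],
        }],
    }

def handler(event, context):
    """Task 1: start audio extraction; task 2: record the initial status."""
    import boto3  # imported here so the pure helpers above can be tested offline
    bucket, key = parse_s3_event(event)
    mediaconvert = boto3.client("mediaconvert", endpoint_url=MEDIACONVERT_ENDPOINT)
    mediaconvert.create_job(
        Role=MEDIACONVERT_ROLE,
        Settings=audio_only_settings(f"s3://{bucket}/{key}", f"s3://{bucket}/audio/"),
    )
    boto3.client("dynamodb").put_item(TableName=STATUS_TABLE,
                                      Item=initial_status_item(key))
```

Note that object keys arrive URL-encoded in S3 event notifications, which is why the handler decodes them before reuse.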

At this point, the ball passes to one of the main players in the architecture: AWS Elemental MediaConvert, whose powerful functionality will be used to process and separate the audio track from the source video.

Once MediaConvert finishes processing, it will upload the extracted audio to Amazon S3. This will act as a trigger for the next steps in the flow, once again using AWS Lambda as the glue between the various services.

Now that we have our audio file ready, the second player in our architecture enters the field: Amazon Transcribe.
Transcribe will simply retrieve from S3 the audio track just generated by MediaConvert and, leveraging the power of machine learning, will automatically create the necessary subtitles in the chosen format, ready for later use by the system's user.

Well, now we finally have our subtitles! All that remains is to notify our user.

When a job’s state changes from IN_PROGRESS to COMPLETED or FAILED, Amazon Transcribe generates an event. This event will then be captured and processed by Amazon EventBridge, which will route the information in two directions:

  • It will start an AWS Lambda function that will update the status (positive or negative) of the processing performed by Amazon Transcribe into the DynamoDB table.
  • It will send a message to Amazon SNS which will notify (by e-mail) our user.
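The rule that captures these events matches on the `aws.transcribe` event source and the final job status; a sketch of the event pattern (the rule name is hypothetical):

```python
import json

# EventBridge pattern matching Transcribe jobs that finish, successfully or not.
TRANSCRIBE_STATE_CHANGE_PATTERN = {
    "source": ["aws.transcribe"],
    "detail-type": ["Transcribe Job State Change"],
    "detail": {"TranscriptionJobStatus": ["COMPLETED", "FAILED"]},
}

# Registering the rule (requires AWS credentials); rule name is a placeholder:
#   import boto3
#   boto3.client("events").put_rule(
#       Name="transcribe-job-state-change",
#       EventPattern=json.dumps(TRANSCRIBE_STATE_CHANGE_PATTERN),
#   )
print(json.dumps(TRANSCRIBE_STATE_CHANGE_PATTERN))
```

The Lambda function and the SNS topic would then be attached as two targets of this single rule.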

Upon receiving the notification, the user can then decide what to do with the new file uploaded to Amazon S3: integrate it into the video player of a web application, for example, by using the path of the generated text file to link it to the content displayed in the app, or simply download the file and use it locally with third-party software (such as VLC, which lets the user choose the subtitle track to display).

Key benefits

As we have just seen, the proposed solution automates a slow and costly process using a cost-effective infrastructure that can be easily integrated into a variety of systems.

Now a question might arise: why would I need this solution?

Let us look in detail at what the reasons might be:

Accessibility:

  • Inclusivity: Subtitles make your content accessible to a broader audience, including people with hearing impairments or those who speak different languages.
  • Compliance: Some countries have legal requirements for subtitles on online content to ensure accessibility for all citizens.

Content Reach and Engagement:

  • Wider Audience: Subtitles allow you to reach a global audience, breaking language barriers and enabling people from different regions to understand and engage with your content.
  • Increased Engagement: Subtitled videos tend to have higher engagement rates, as viewers can comprehend the content better and are more likely to watch it entirely.

Improved User Experience:

  • Flexibility: Viewers can choose to enable or disable subtitles based on their preferences, enhancing their overall viewing experience.
  • Clarity: Subtitles help in clarifying dialogue, especially in videos with poor audio quality or when characters speak in accents or dialects that might be challenging to understand.

Education and Learning:

  • Educational Content: For educational videos, tutorials, or online courses, subtitles can enhance the learning experience by ensuring that the content is clear and understandable to all learners.
  • Language Learning: Subtitled content aids language learners by providing written context alongside spoken words.

Efficient Content Creation:

  • Time Efficiency: Automated subtitle generators can save a considerable amount of time and effort compared to manual subtitle creation, especially for large volumes of content.
  • Consistency: Automated systems ensure consistent subtitle formatting and timing, enhancing the overall professionalism of your content.

Who Can Benefit?

We have therefore understood how the solution works and what benefits we would derive from its use.

The last question to ask is therefore: who can benefit from this solution?

Here are some examples of customers that could benefit from a system integrating an automatic on-demand subtitle generation functionality:

  • Corporate Entities: Improve internal and external communications, making training videos, presentations, and corporate messages accessible to a multinational workforce.
  • Media Houses: Reach wider audiences with news broadcasts, interviews, and documentaries, breaking language barriers and ensuring your stories are heard globally.
  • Educators: Create inclusive learning environments with subtitles that aid understanding, benefiting students of all abilities and backgrounds.
  • Content Creators: Enhance your videos, vlogs, and podcasts, ensuring they resonate with diverse audiences globally. The same functionality enables social media platforms, streaming services, and online content providers to improve the accessibility of their content and expand their user base.

Conclusion

In conclusion, we saw how, thanks to AWS services, we can build a solution consisting of just a few components that nevertheless solves a real problem for many companies: giving new life to disused video material, or improving and complementing the production of new content, while addressing ever-topical issues such as the accessibility and inclusiveness of their services.

Thank you very much for getting this far, I hope you found this article useful and inspiring!

If you would like to learn more about the topics discussed, I leave you with some useful links:

Happy building!
