Image source: BunnyStudio

Implementing Voice over Slideshow : Fast and Portable

Vikalp Singh
Published in MindTickle
5 min read · May 10, 2020


Context: I will discuss the server-side internals of a platform that generates optimised, basic Voice over Slideshows and delivers them as a video.

In the domain of content management and learning systems, Voice over Slideshow is a crucial medium for delivering information in a streamlined, easy-to-consume way. A 'Voice over Slideshow', as the name says, is a content package where a presenter narrates while moving through a set of slides to convey some information. But what's going on under the hood? How is it managed, and how is it delivered consistently across platforms? If these questions crossed your mind after stumbling upon the title of this article, you're in the right place!

Most CMS platforms support Voice over Slideshow and have their own way of packaging and delivering it. I'll walk you through my way of standardising and delivering it: creating a video from the presentation and the voice. This simplifies delivery across a wide variety of platforms such as mobile phones, tablets and the web, and it is fast and easy to maintain.

Step-1 : Getting the Primitives Together

Images and Audio Files will be your bread and butter in this technique.

But your user probably uploaded a presentation (ppt, pptx etc.), so your first task is getting images out of it. This is best done on the host machine where the presentation was created, using the presentation tool itself to export each slide as an image. This is because the formats and fonts a presentation tool uses may not be standard and can cause rendering issues in a different environment. If you can ask your user to do so, you're in a good place. Taking images as input on your platform in lieu of a presentation will generate better results, at the cost of an inferior UX.

If you can’t ask your user for images, don’t worry. Another way of getting images from a presentation is using services like Box and Filestack. These services offer APIs and SDKs to transform a presentation into images. I have personally achieved good results with Filestack, using Box as a backup.

Yet another way is running OpenOffice in headless mode, but it comes with its own maintenance overhead. You can choose any of the ways mentioned above to get your images.
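A minimal sketch of the headless route, assuming LibreOffice's `soffice` binary and poppler's `pdftoppm` are installed on the host. The file names, the `slide` prefix and the 150 DPI setting are placeholders. The sketch goes through PDF first and then rasterises one page per slide, since that is a common way to get one image for every slide rather than a single export.

```python
import subprocess
from pathlib import Path
from typing import List


def slide_export_commands(deck: str, out_dir: str, dpi: int = 150) -> List[List[str]]:
    """Build the two commands: deck -> PDF, then PDF -> one PNG per page."""
    pdf = Path(out_dir) / (Path(deck).stem + ".pdf")
    to_pdf = ["soffice", "--headless", "--convert-to", "pdf",
              "--outdir", out_dir, deck]
    # pdftoppm writes <prefix>-<page>.png, one file per page.
    to_png = ["pdftoppm", "-png", "-r", str(dpi), str(pdf),
              str(Path(out_dir) / "slide")]
    return [to_pdf, to_png]


def export_slides(deck: str, out_dir: str) -> List[Path]:
    """Run the conversion and return the slide images in page order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in slide_export_commands(deck, out_dir):
        subprocess.run(cmd, check=True)
    # Sort on the numeric page suffix so slide 10 does not land before slide 2.
    return sorted(Path(out_dir).glob("slide-*.png"),
                  key=lambda p: int(p.stem.rsplit("-", 1)[1]))
```

Fonts still have to be installed on the conversion host, which is the maintenance cost mentioned above.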

Coming to audio: the narration for every slide can be recorded in the browser and shipped to the server for storage. Preserving the mapping from audio track to slide number is crucial here, as you will need it to know what goes where.
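One way to keep that mapping explicit is a small manifest stored next to the uploads. The sketch below is only an assumption about how it could look; the class name, field names and storage keys are hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class SlideTrack:
    slide_number: int  # 1-based position in the deck
    image_key: str     # where the exported slide image is stored
    audio_key: str     # where the narration for this slide is stored


def build_manifest(tracks: List[SlideTrack]) -> str:
    """Serialise tracks in slide order, rejecting duplicate or missing slides."""
    ordered = sorted(tracks, key=lambda t: t.slide_number)
    numbers = [t.slide_number for t in ordered]
    if numbers != list(range(1, len(ordered) + 1)):
        raise ValueError(f"slides must be numbered 1..{len(ordered)}, got {numbers}")
    return json.dumps([asdict(t) for t in ordered], indent=2)
```

Validating the numbering at upload time is cheaper than discovering a missing narration halfway through video generation.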

Step-2 : Stitching the Pieces

Here comes the secret sauce: FFmpeg.

FFmpeg is a powerful media processing tool that can stitch an image and an audio track together to form a video snippet. You will use it to generate the final video. Wrappers exist for almost every programming language, and even if there is no good one for yours, you can always fire a plain old shell command from your code. There are two approaches you may take here, depending on the use case and the resources available.

  1. If you have a machine with high compute resources available (possibly with a GPU): FFmpeg is a resource-intensive tool which can also use a GPU for video processing. You can concatenate the audio files using ffmpeg's concat filter. You can use concat again, this time as a demuxer, to stitch the images and create the video: { ffmpeg -f concat -i conf.txt -i audio.mp3 -pix_fmt yuv420p output.mp4 }. conf.txt contains the mapping of images to audio durations, and audio.mp3 stands here for the concatenated audio track. Note that still images have to be encoded, so a plain stream copy is not enough for this step. The time this process takes depends on the resources made available to ffmpeg. Since you have a big machine, this shouldn't be much.
  2. Multiple small machines: You can stitch every image to its audio with ffmpeg, generating small video snippets concurrently. You then stitch the individual videos together with ffmpeg concat to generate the final video. This requires multiple small hosts to do things in parallel and state management to track the status of each process, so a fair level of complexity is involved in tracking these processes and building around them. But I have something to help you with this: AWS Serverless!
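Both approaches can be sketched as plain command builders. Treat this as a minimal sketch: the helper names, the file names and the codec choices (libx264, aac) are illustrative assumptions, and the image list uses ffmpeg's concat demuxer (`-f concat`) with `duration` entries.

```python
from typing import List, Tuple


def concat_image_list(slides: List[Tuple[str, float]]) -> str:
    """Build conf.txt: each image followed by the duration of its audio."""
    lines = []
    for image, seconds in slides:
        lines.append(f"file '{image}'")
        lines.append(f"duration {seconds:.3f}")
    if slides:
        # ffmpeg's slideshow guidance: list the final image once more,
        # without a duration, so that its duration is honoured.
        lines.append(f"file '{slides[-1][0]}'")
    return "".join(line + "\n" for line in lines)


def big_machine_command(conf: str, audio: str, output: str) -> List[str]:
    """Approach 1: timed images from conf.txt plus one concatenated audio track."""
    return ["ffmpeg", "-y",
            "-f", "concat", "-safe", "0", "-i", conf,  # timed still images
            "-i", audio,                               # full narration
            "-c:v", "libx264", "-pix_fmt", "yuv420p",
            "-c:a", "aac", output]


def snippet_command(image: str, audio: str, output: str) -> List[str]:
    """Approach 2, per slide: loop one still image for the length of its audio."""
    return ["ffmpeg", "-y", "-loop", "1", "-i", image, "-i", audio,
            "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
            "-c:a", "aac", "-shortest", output]


def stitch_command(list_file: str, output: str) -> List[str]:
    """Approach 2, final step: join the snippets without re-encoding."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]
```

Stream copy in the final step only works when every snippet shares the same encoder settings and resolution, which is why all snippets should come out of the same command.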

Scale with AWS

AWS offers powerful services like AWS Step Functions and AWS Lambda that can be used to generate videos in the aforementioned fashion.

You will need two Lambda functions for this: one to combine a single image and a single audio file into a video snippet, and a second one to stitch multiple videos into a single video. I used FFmpeg Lambda layers to run ffmpeg on AWS Lambda. Both functions are fairly straightforward ffmpeg commands that you need to invoke.
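A minimal sketch of the first function, assuming the ffmpeg layer exposes the binary at /opt/bin/ffmpeg and that the event carries a bucket plus image, audio and output keys. The event shape, the key names and the encoder flags are hypothetical. The S3 client and the command runner are passed in so the core logic can be exercised without AWS.

```python
import os
import subprocess

FFMPEG = "/opt/bin/ffmpeg"  # assumed path; depends on the layer you use
WORK_DIR = "/tmp"           # Lambda's writable scratch space


def make_snippet(event, s3, run=subprocess.run, work_dir=WORK_DIR):
    """Download one image and one audio file, encode a snippet, upload it."""
    image = os.path.join(work_dir, "slide" + os.path.splitext(event["image_key"])[1])
    audio = os.path.join(work_dir, "audio" + os.path.splitext(event["audio_key"])[1])
    video = os.path.join(work_dir, "snippet.mp4")
    s3.download_file(event["bucket"], event["image_key"], image)
    s3.download_file(event["bucket"], event["audio_key"], audio)
    run([FFMPEG, "-y", "-loop", "1", "-i", image, "-i", audio,
         "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
         "-c:a", "aac", "-shortest", video], check=True)
    s3.upload_file(video, event["bucket"], event["output_key"])
    # Returned to the caller so the stitching step knows what to join.
    return {"bucket": event["bucket"], "video_key": event["output_key"]}


def handler(event, context):
    """Lambda entry point; boto3 ships with the Lambda Python runtime."""
    import boto3  # imported lazily so make_snippet can be tested without it
    return make_snippet(event, boto3.client("s3"))
```

The second function would follow the same download, run, upload shape, with an ffmpeg concat command in the middle.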

To mix it all together, create a Step Functions state machine to orchestrate the two beautiful Lambdas you just created. Use the map-reduce paradigm of Step Functions for this: you map your job into multiple jobs of one slide each, and each of them creates a video snippet. Once all the mapping is done, the reducer (your second Lambda function) triggers and stitches all the snippets together into a single video.
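The orchestration can be written down as an Amazon States Language definition. The sketch below builds one as a Python dict; the state names, the `$.slides` and `$.snippets` paths and the Lambda ARNs passed in are assumptions for illustration.

```python
import json


def slideshow_state_machine(snippet_lambda_arn: str, stitch_lambda_arn: str,
                            max_concurrency: int = 0) -> dict:
    """A Map state fans out one task per slide, then a single task stitches."""
    return {
        "Comment": "Voice over Slideshow: one snippet per slide, then stitch",
        "StartAt": "MakeSnippets",
        "States": {
            "MakeSnippets": {
                "Type": "Map",
                "ItemsPath": "$.slides",            # list of per-slide inputs
                "MaxConcurrency": max_concurrency,  # 0 means no explicit cap
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "INLINE"},
                    "StartAt": "MakeSnippet",
                    "States": {
                        "MakeSnippet": {
                            "Type": "Task",
                            "Resource": snippet_lambda_arn,
                            "End": True,
                        }
                    },
                },
                "ResultPath": "$.snippets",         # results keep the input order
                "Next": "StitchVideo",
            },
            "StitchVideo": {
                "Type": "Task",
                "Resource": stitch_lambda_arn,
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs; print the JSON you would paste into the console.
    print(json.dumps(slideshow_state_machine("arn:snippet", "arn:stitch"), indent=2))
```

Because the Map state returns its results in input order, the stitching function receives the snippets already sorted by slide.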

Since AWS Lambda scales out very well, using AWS infrastructure lets you generate a video in (almost) constant time across a very large range of slide counts. However, this comes with certain caveats. Since Lambda has limited disk space (512 MB as of now), stitching a big video will be an issue. In my test, I found that a 30 min audio snippet (256 kbps) over a 720p image generated ~2 MB of video with my desired settings. This limits you to 125 slides (250 MB of snippets), because equal space is required for the final stitched video.
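The limit is simple arithmetic: every snippet sits on disk once on its own and once more inside the stitched video, so the stitching function needs roughly twice the total snippet size. A small sketch, with a hypothetical helper name and sizes in MB:

```python
def max_slides(scratch_mb: float, snippet_mb: float) -> int:
    """Slides that fit when the snippets and the stitched video share one disk."""
    # Each snippet is stored twice: as its own file and as part of the output.
    return int(scratch_mb // (2 * snippet_mb))


# 125 slides * 2 MB = 250 MB of snippets, plus 250 MB for the final video,
# uses 500 MB of the 512 MB and leaves a little headroom.
assert max_slides(500, 2) == 125
```

Plugging in your own per-slide size gives a quick feasibility check before you commit to Lambda for the stitching step.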

125 slides with 30 min of audio per slide was a fairly decent number for my use case. It can be increased further by streaming the output to S3 directly instead of storing it on disk, which roughly doubles the limit. If even that does not suffice, AWS Lambda offers a good amount of RAM per function, which can be leveraged by mounting a RAM disk and writing to it. I did not try this out, but it should be possible.
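A sketch of the streaming idea, untested in this setup and resting on a few assumptions: boto3 is available, the bucket and key are placeholders, and the snippets are already listed in a concat list file. A regular MP4 needs a seekable output, so the sketch asks ffmpeg for a fragmented MP4, which can be written to a pipe.

```python
import subprocess
from typing import List


def streaming_stitch_command(list_file: str, ffmpeg: str = "ffmpeg") -> List[str]:
    """Join snippets and write the result to stdout instead of local disk."""
    return [ffmpeg, "-f", "concat", "-safe", "0", "-i", list_file,
            "-c", "copy",
            # Fragmented MP4 does not require seeking back in the output.
            "-movflags", "frag_keyframe+empty_moov",
            "-f", "mp4", "pipe:1"]


def stitch_to_s3(list_file: str, bucket: str, key: str) -> None:
    """Pipe ffmpeg's output straight into an S3 upload."""
    import boto3  # assumed to be available, as it is in the Lambda runtime
    proc = subprocess.Popen(streaming_stitch_command(list_file),
                            stdout=subprocess.PIPE)
    # upload_fileobj reads from any file-like object that implements read().
    boto3.client("s3").upload_fileobj(proc.stdout, bucket, key)
    if proc.wait() != 0:
        raise RuntimeError("ffmpeg exited with an error")
```

The snippets still have to be on disk as inputs; only the second copy, the stitched output, is avoided.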

I hope you will try this out and let me know in the comments about any caveats you find. So this was one way of creating a Voice over Slideshow media package in a portable manner, one that can serve well at scale in a production environment, with performance guarantees when coupled with AWS.
