Produce Easy-to-Update Video Courses with Speech Synth

Use Amazon Polly, Google Slides and FFmpeg to create videos that can be updated at any time by anyone

Robin Böhm
6 min read · Nov 29, 2017

Introduction

I’m a co-founder of a company called Workshops Europe. We create courses that help developers get productive in technologies they don’t know well yet. We do this primarily through on-site workshops all over Germany. One of our company goals is to provide the best possible learning experience. For us, one way to do this is through blended learning.

Blended learning is a mix of online and on-site training, which allows each learner to find the right mix of self-learning, teaching and practicing. We decided to start our journey to blended learning by first providing some extra video content for some courses. The idea is that developers can quickly access high-quality content to satisfy their need for input and to start thinking about the topics before the on-site workshop starts.

Video content is not new to us; we have already created some learning videos and webinars. But we realized that both producing the videos and, especially, keeping them up to date come with many challenges. In this article, I would like to share with you how we solved these problems by creating our open-source project, AudioSlides.IO.

This article is about the general concept of the project and offers no in-depth technical content. I will publish separate articles on Elixir Advanced Testing and Continuous Deployment with Elixir + Docker, so just follow me on Twitter @robinboehm if you’re interested in that. If you want to check out the project on your own, it’s open source on GitHub under the MIT license.

Challenges when creating Video Content

Have you ever created a video in which you recorded some spoken text? It takes a lot of courage to record your own voice and publish it online to an unknown audience. Oftentimes you feel like your own voice sounds kind of strange. And so you record your voice a thousand times over in order to try to make it as perfect as possible.

Me at the office recording a webinar

Once I’ve produced a recording that meets my expectations, I always go through a “family and friends review process”. When it comes to English, I like to get my content reviewed by my girlfriend because she’s a native speaker and also really honest. The result of this is that I need to recreate some parts of the audio recording. And recreating parts of a course that may have up to 3 hours of content is really annoying. Let’s take a look at some challenges that come up when I try to fix some content in my audio recording:

  • I need to record it with the same microphone.
    Every microphone sounds different. You can’t combine two different recordings; it just sounds really weird.
  • I need to record it in the same room.
    Every room has its very own acoustics. If I do a recording in my apartment, it sounds very different from one in our office. But if you did the first recording in the office, you have to wait until everybody is gone and it’s quiet enough.
  • I need to hit the same tone.
    You can hear it in a recording if it was made on a different day in a different mood. There is always a gap when you “glue” parts together with audio production software.
  • It has to be the same voice.
    No one but me can fix or update the recorded content. That is really annoying and keeps you from scaling a productive team.

I also thought about booking a professional speaker or an agency for this, because then I wouldn’t have to deal with many of these challenges, such as not being a native speaker. But I still need to change the video whenever the content is updated, for example after a breaking change in a new technology release. How long does it take to do an update? What if the speaker isn’t available anymore? Creating only “evergreen” content isn’t a solution either if you want to help people learn new things. So let’s think about how to improve the process of creating and updating video content.

Solution: Generate voice using Speech Synth

The challenge is that recording audio for a video course is really time-consuming and can fail for many reasons. How can we build an easy, repeatable process for creating a video with spoken text?

If you do a bit of research and read some technical news, you may be impressed by the development of speech-synth tools, which are getting much better very fast. The current state is that these tools aren’t good enough for everything, but they are good enough to try out without being annoyed by the generated voice. My bet is that in a year or two there will be a service or tool that does the job really, really well. So let’s write a prototype today to be ready when that time comes!

The Proof-of-Concept Prototype

For our prototype we decided to give Amazon Polly a try. It has a good and simple HTTP API that lets you convert text to speech really easily. I’m going to share more details about it later.
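To give a feel for this step, here is a minimal sketch in Elixir. It is not the project’s actual client code: instead of signing requests against Polly’s HTTP API directly, it shells out to the AWS CLI (assumed to be installed and configured with credentials), and the voice id “Joanna” is just one of Polly’s available voices.

```elixir
defmodule SpeechSketch do
  @moduledoc """
  Minimal sketch: turn one slide's speaker notes into an MP3 via Amazon Polly.
  For brevity this shells out to the AWS CLI instead of calling the HTTP API
  directly, so `aws` must be installed and configured with credentials.
  """

  # Example: SpeechSketch.synthesize("Hello and welcome!", "slide_1.mp3")
  def synthesize(text, out_path, voice \\ "Joanna") do
    {_output, 0} =
      System.cmd("aws", [
        "polly", "synthesize-speech",
        "--output-format", "mp3",
        "--voice-id", voice,
        "--text", text,
        out_path
      ])

    out_path
  end
end
```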

For the visual layer we just used Google Slides, because it also provides a really good REST API that lets you easily export a PNG of each slide. It’s also possible to get the speaker notes via the same API, and they serve as the input for the Amazon Polly transformation.
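Again as a rough sketch only: the snippet below uses the Slides API’s per-page thumbnail endpoint to download a PNG, assuming an OAuth2 access token is already at hand and the HTTPoison and Jason packages are available. Those package choices are my assumption for the sketch, not necessarily what the project uses internally.

```elixir
defmodule SlidesSketch do
  @moduledoc """
  Sketch of the Google Slides calls used in the pipeline. Assumes a valid
  OAuth2 access token and the HTTPoison + Jason packages.
  """

  @base "https://slides.googleapis.com/v1/presentations"

  # Download a PNG thumbnail of one slide.
  def export_png(token, presentation_id, page_id, out_path) do
    headers = [{"Authorization", "Bearer " <> token}]

    %HTTPoison.Response{body: body} =
      HTTPoison.get!("#{@base}/#{presentation_id}/pages/#{page_id}/thumbnail", headers)

    # The endpoint returns JSON with a short-lived "contentUrl" for the image.
    thumbnail_url = body |> Jason.decode!() |> Map.fetch!("contentUrl")

    File.write!(out_path, HTTPoison.get!(thumbnail_url).body)
    out_path
  end

  # Fetch the whole presentation as JSON; the speaker notes for each slide
  # live on its notes page inside this structure.
  def fetch_presentation(token, presentation_id) do
    headers = [{"Authorization", "Bearer " <> token}]

    %HTTPoison.Response{body: body} = HTTPoison.get!("#{@base}/#{presentation_id}", headers)
    Jason.decode!(body)
  end
end
```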

The last step is to combine the generated voice output with the exported PNG image and produce a small video sequence. For this we just used the handy command-line tool FFmpeg. So the basic processing per slide would look something like this:
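The sketch below shows one common FFmpeg recipe for this “still image plus audio” step, wrapped in Elixir. The exact flags the project uses may differ; these are the ones I would reach for.

```elixir
defmodule SlideVideoSketch do
  # Combine one exported PNG with its generated MP3 into a short MP4.
  def slide_to_mp4(png_path, mp3_path, out_path) do
    {_output, 0} =
      System.cmd("ffmpeg", [
        "-y",                  # overwrite an existing output file
        "-loop", "1",          # repeat the single image as video frames
        "-i", png_path,
        "-i", mp3_path,
        "-c:v", "libx264",
        "-tune", "stillimage",
        "-pix_fmt", "yuv420p", # broad player compatibility
        "-c:a", "aac",
        "-shortest",           # stop when the audio track ends
        out_path
      ])

    out_path
  end
end
```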

So as output we get an image, an MP3 and an MP4 for every slide that’s part of the presentation. After all slides are generated, we gather the videos and use FFmpeg’s concat feature to create the whole video presentation.
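Here is a sketch of that final concatenation, using FFmpeg’s concat demuxer; the file names and the temporary list file are illustrative only.

```elixir
defmodule ConcatSketch do
  # Join the per-slide MP4s into one presentation video without re-encoding.
  def join(slide_video_paths, out_path) do
    list_file = Path.join(System.tmp_dir!(), "slides.txt")

    # The concat demuxer expects one "file '<path>'" line per input.
    File.write!(list_file, Enum.map_join(slide_video_paths, "\n", &"file '#{&1}'"))

    {_output, 0} =
      System.cmd("ffmpeg", [
        "-y", "-f", "concat", "-safe", "0",
        "-i", list_file,
        "-c", "copy",   # stream copy, so this step is very fast
        out_path
      ])

    out_path
  end
end
```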

To update any part of the video, anyone can make the change in the source presentation and regenerate the video. There is no need for a special setup. Last week I updated the example video on a train ride to Berlin. The day after, a friend of mine extended it with some new content.

When we regenerate the video, only the parts that changed are regenerated. This is checked via a hash function that validates the current version of each generated artefact. But even a full generation is already quite fast. We’re using Google Cloud to host our project on a small VM. Generating the whole 6-minute video took the process ~2 minutes. When you update a small part of a video, such as one slide with 10 seconds of spoken text, the process takes ~5 seconds and everything is up to date. So we’re finally able to create easy-to-update video learning content 🎉
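In case you’re wondering what that change check might look like, here is a minimal sketch. The fields I hash and the function names are illustrative assumptions, not the project’s actual schema; the idea is simply to fingerprint a slide’s inputs and only regenerate when the fingerprint differs from the one stored for the existing artefact.

```elixir
defmodule ChangeCheckSketch do
  # Fingerprint a slide's inputs (speaker notes text plus the exported PNG).
  def fingerprint(speaker_notes, png_binary) do
    :crypto.hash(:sha256, [speaker_notes, png_binary])
    |> Base.encode16(case: :lower)
  end

  # Regenerate the MP3/MP4 only when the stored hash no longer matches.
  def needs_regeneration?(stored_hash, speaker_notes, png_binary) do
    fingerprint(speaker_notes, png_binary) != stored_hash
  end
end
```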

Example Output

As shown before, we need a Google Slides presentation as a starting point. My input is a short slide deck about the new release of Angular version 5. I wrote a few words as speaker notes on every slide as input for our pipeline.

I’ve uploaded the example video to YouTube, so feel free to watch it and give me some honest feedback. I’m testing different voices and am not sure which one is best. But I think this is also a moving target, because they are all getting better every day. For now it’s good enough to start with, because all I have to do is regenerate the videos when a new, improved version of the voice is available.

What’s next?

As we’ve written the whole project in Elixir and are also big fans of functional languages, we are going to create a course for Elixir and Phoenix that is generated by the project. If you’re interested in helping us to create this content and/or want to get early access to the material and provide feedback, just reach out to us via info@audioslides.io.

For general updates on the project we also created a basic newsletter on http://audioslides.io/ that will inform you about new features as well as new courses.

Thanks for reading this article ❤

