How to build a video transcription web app using generative AI

Edgar · Published in about ai · 5 min read · Jun 17, 2024

A weekend project that shows how to combine web development and generative AI tools

Generative AI is one of the hottest areas in Data Science, and in Computer Science in general. However, it is also one of the most difficult to stay up to date with. The number of research papers that come out every week with a new concept, a new technique, or a new model, the number of tools that become available, even the number of players in the field, all make any personal attempt to stay on top of it overwhelming and intimidating, and make it difficult to know where and how to start. I believe that this, like any complex task, is better tackled little by little. In this post, I present a small and simple project that shows how to apply one of the most successful generative AI areas, speech-to-text transcription, to a real-world problem using a web app.

1. Video transcription

To transcribe speech in videos into text we will use OpenAI's whisper model, a generative model released in 2022 that can transcribe speech in multiple languages (see diagram).

OpenAI whisper diagram from https://github.com/openai/whisper

Whisper expects a sound file in wav or mp3 format as input and returns a string with the transcription of the speech in the input file. In this web app, whisper will run as a service that the back end calls to transcribe the audio extracted from the video.
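In fact, the model's Python API is small enough to try in a few lines; these are the same calls used in the transcription module later in this post (the audio file name here is just a placeholder):

import whisper

# Load one of the pretrained checkpoints ("base" is a good speed/accuracy trade-off)
model = whisper.load_model("base")

# Transcribe a local sound file ("sample.mp3" is a placeholder)
result = model.transcribe("sample.mp3")
print(result["text"])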

2. System architecture

Since we start with a video file, we first need to extract the audio from it. We also need an interface to interact with the user. For this we will build a web app with a page that receives the video file and displays the resulting text.

Figure 1. System architecture. A webpage (front end) is the interface through which the user provides a video file. The system (back end) must be able to handle the video provided by the user. A sound extraction module is in charge of extracting the audio from the provided video. Once that is done, the speech-to-text transcription model produces the text that the user interface displays to the end user.

The modules in figure 1 show how the system is organized: the user uploads a video file, the system saves it locally, the sound is extracted from the video, and then it is transcribed into text by the AI model. Finally, the transcribed text is displayed for the user to see via the web page.

Tech stack

These are the tools used to implement the modules described above:

  • Front-end (HTML/CSS/JavaScript): To create the user interface.
  • Back-end (Node.js/Express): To handle the video file upload and to orchestrate the sound extraction from the video and the handling of the output text (the npm wiring between the two sides is sketched below).
  • Speech-to-Text Service: the OpenAI whisper model.
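The back end reaches the Python side through an npm script, so the two halves need to be wired together in package.json. A minimal sketch, assuming the transcription module shown later is saved as transcribe.py (the file names here are my assumption, not necessarily those of the repo):

{
  "name": "video-transcription-app",
  "scripts": {
    "start": "node server.js",
    "transcribe": "python transcribe.py"
  },
  "dependencies": {
    "express": "^4.18.2",
    "multer": "^1.4.5-lts.1"
  }
}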

Web server

The back end runs on Node.js and is handled by a simple server.js:

const express = require('express');
const multer = require('multer');
const { exec } = require('child_process');

const app = express();
const upload = multer({ dest: 'uploads/' });

// Serve the front-end from the public/ directory
app.use(express.static('public'));

app.post('/upload', upload.single('video'), (req, res) => {
  const videoPath = req.file.path;
  // --silent keeps npm's own banner out of stdout,
  // so stdout contains only the transcription
  const command = `npm --silent run transcribe -- ${videoPath}`;

  exec(command, (error, stdout, stderr) => {
    if (error) {
      console.error(`Error: ${error.message}`);
      return res.status(500).send('An error occurred while processing the video.');
    }
    // whisper and moviepy write progress messages to stderr,
    // so log them but do not treat them as a failure
    if (stderr) {
      console.error(`Stderr: ${stderr}`);
    }
    const transcription = stdout.trim();
    res.json({ transcription });
  });
});

app.listen(3000, () => {
  console.log('Server started on http://localhost:3000');
});
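The server serves a static front-end from the public/ directory, which is not shown in this post. A minimal sketch of what such a public/index.html could look like, with element ids of my own choosing (note that the form field name 'video' must match upload.single('video') on the server):

<!DOCTYPE html>
<html>
<head>
  <title>Video transcription</title>
</head>
<body>
  <h1>Upload a video to transcribe</h1>
  <input type="file" id="video" accept="video/*" />
  <button id="upload">Upload</button>
  <p id="result"></p>

  <script>
    document.getElementById('upload').addEventListener('click', async () => {
      const file = document.getElementById('video').files[0];
      if (!file) return;

      // The field name 'video' must match upload.single('video') on the server
      const formData = new FormData();
      formData.append('video', file);

      document.getElementById('result').textContent = 'Transcribing...';
      const response = await fetch('/upload', { method: 'POST', body: formData });
      const { transcription } = await response.json();
      document.getElementById('result').textContent = transcription;
    });
  </script>
</body>
</html>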

Audio extraction and transcription modules

The speech-to-text transcription is carried out by the OpenAI whisper model. However, the audio needs to be extracted from the video before it can be processed by whisper.

import os
import sys

import whisper
from moviepy.editor import VideoFileClip


def extract_audio_from_video(video_path, audio_output_path):
    """
    Extract audio from a video file and save it to a new file.

    Args:
        video_path: str, path to the video file
        audio_output_path: str, path to save the audio file
    """
    # Progress messages go to stderr so that stdout carries only the transcription
    print(f"Extracting audio from {video_path}...", file=sys.stderr)

    # Load the video file and grab its audio track
    video_clip = VideoFileClip(video_path)
    audio_clip = video_clip.audio

    # Write the audio to a file; logger=None silences moviepy's progress bar
    audio_clip.write_audiofile(audio_output_path, logger=None)

    audio_clip.close()
    video_clip.close()
    print(f"Audio extracted and saved to {audio_output_path}", file=sys.stderr)


def transcribe(file_path):
    """Transcribe a sound file into text with the whisper 'base' model."""
    model = whisper.load_model("base")

    print("Transcribing...", file=sys.stderr)
    result = model.transcribe(file_path)
    return result["text"]


# Run the code with the video path as argument
if __name__ == "__main__":
    video_path = sys.argv[1]

    # Extract the audio next to the uploaded video, then transcribe it
    audio_path = video_path + ".mp3"
    extract_audio_from_video(video_path, audio_path)

    # The transcription is the only thing written to stdout,
    # so the Node.js server can capture it directly
    print(transcribe(audio_path))
    os.remove(audio_path)  # clean up the intermediate audio file

Putting things together

Once you have installed the requirements to run Node.js and whisper, you can run the app with npm start (or node server.js).
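Assuming the stack described above, the setup boils down to something like the following (whisper and moviepy additionally require ffmpeg to be installed on the system):

# Node.js dependencies
npm install express multer

# Python dependencies (whisper and moviepy also need ffmpeg on the PATH)
pip install openai-whisper moviepy

# Optional: test the transcription module on its own (file names as assumed above)
python transcribe.py sample.mp4

# Start the web server
npm start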

Figure 2. Web app demo. This shows the npm start command and node running the server.js file shown above. The web page shows instructions to upload the video file and the upload button that calls the function that saves the file to disk and runs the audio extraction.

And there you go, the transcription web app is up and running. After uploading a video file (e.g. mp4) and waiting for some time, you should see the transcribed text on the web page. The wait time varies with the size of the video and the machine you run this on. If you have a GPU, it can run significantly faster than on a CPU.

3. Summary

In summary, this post explains how to implement a web app that transcribes videos into text via a web page. The transcription is done by a generative AI model for speech-to-text (OpenAI's whisper). This simple but educational project can be your point of entry into developing web apps that help people solve real problems using generative AI.

4. Future work

There are different ways in which the work presented here could be expanded. They are listed in order of difficulty, in case you want to take the project further in one of these areas:

Output formatting

With this implementation, the output is written directly to the same pane where the file was uploaded. A good extension would be to show the transcription in a scrollable text box with copy and save-to-file buttons. The functionality to store transcriptions and videos in the system could also be a very useful extension.
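For the copy button, for instance, the browser's Clipboard API makes this nearly a one-liner (a sketch, assuming the #result element and a #copy button in the front-end sketch shown earlier):

// Copy the displayed transcription to the clipboard
// (assumes a #copy button and the #result element from the earlier sketch)
document.getElementById('copy').addEventListener('click', () => {
  const text = document.getElementById('result').textContent;
  navigator.clipboard.writeText(text);
});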

Authentication

This web app could include an authentication process so that only registered users can use the service. You will need a database for this.

Deployment

A very useful extension of this work would be to work out the deployment details to serve it in a cloud environment. You can get an account with one of the main cloud providers (AWS, Azure, GCP, etc.), or you can learn how to serve your website using more specialized tools like Vercel. Either way, it can be an excellent way to learn how to scale a real-world web app.

5. Code

You can access this work in this GitHub repo. Feel free to contribute to it by creating an issue or a pull request.

6. References

[1] Whisper GitHub repository: https://github.com/openai/whisper
