How to build a video transcription web app using generative AI
A weekend project that shows how to combine web development and generative AI tools
Generative AI is one of the hottest areas in Data Science, and in Computer Science in general. However, it is also one of the most difficult to stay up to date with. The number of research papers that come out every week with a new concept, a new technique, or a new model, the number of tools that become available, even the number of players in the field, all make any personal attempt to stay on top of it overwhelming and intimidating. All this makes it difficult to know where and how to start. I believe that this, like any complex task, is better tackled little by little. In this post, I present a small and simple project that shows how to apply one of the most successful generative AI areas, speech-to-text transcription, to a real-world problem using a web app.
1. Video transcription
To transcribe speech in videos into text we will use Whisper, a generative speech-recognition model released by OpenAI in 2022 that can transcribe speech in multiple languages.
Whisper expects a sound file in wav or mp3 format as input and returns a string with the transcribed text from the input file. In this web app, whisper will be a service that the back end invokes to transcribe the audio extracted from the uploaded video.
2. System architecture
Since we start with a video file, we first need to extract the audio from it. We also need an interface to interact with the user, so we will build a web app with a page to receive the video file and display the resulting text.
The modules in figure 1 show how the system is organized: the user uploads a video file, the system saves it locally, the audio is extracted from the video, and the AI model transcribes it into text. Finally, the transcribed text is displayed to the user on the web page.
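To make the moving parts concrete, here is one possible project layout. The uploads/ and public/ directories are the ones the server code below relies on; transcribe.py is an assumed name for the Python module, not something fixed by the post:
project/
├── server.js        # Express back end
├── package.json     # defines the start and transcribe scripts
├── transcribe.py    # audio extraction + whisper transcription
├── public/
│   └── index.html   # upload page, served statically
└── uploads/         # multer stores uploaded videos here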
Tech stack
These are the tools used to implement the modules described above:
- Front-end (HTML/CSS/JavaScript): To create the user interface (a minimal sketch follows this list).
- Back-end (Node.js/Express): To handle the video file upload and to orchestrate the audio extraction and the handling of the output text.
- Speech-to-Text Service: The OpenAI whisper model.
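As a starting point, here is a minimal sketch of what public/index.html could look like, written against the server shown below: it posts the file under the field name video to /upload and displays the transcription field of the JSON response. Treat it as a sketch rather than a finished interface.
<!DOCTYPE html>
<html>
<head><title>Video transcription</title></head>
<body>
  <h1>Upload a video to transcribe</h1>
  <input type="file" id="video" accept="video/*">
  <button id="send">Transcribe</button>
  <pre id="result"></pre>
  <script>
    document.getElementById('send').addEventListener('click', async () => {
      const file = document.getElementById('video').files[0];
      if (!file) return alert('Choose a video file first.');
      const data = new FormData();
      // the field name must match upload.single('video') on the server
      data.append('video', file);
      const result = document.getElementById('result');
      result.textContent = 'Transcribing...';
      const response = await fetch('/upload', { method: 'POST', body: data });
      if (!response.ok) {
        result.textContent = await response.text();
        return;
      }
      const { transcription } = await response.json();
      result.textContent = transcription;
    });
  </script>
</body>
</html>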
Web server
The back end runs on Node.js and is handled by a simple server.js:
const express = require('express');
const multer = require('multer');
const { exec } = require('child_process');
const app = express();
const upload = multer({ dest: 'uploads/' });
app.use(express.static('public'));
app.post('/upload', upload.single('video'), (req, res) => {
  if (!req.file) {
    return res.status(400).send('No video file uploaded.');
  }
  const videoPath = req.file.path;
  // --silent keeps npm's own banner off stdout, which we return as the transcription
  const command = `npm run --silent transcribe -- ${videoPath}`;
  exec(command, (error, stdout, stderr) => {
    if (error) {
      console.error(`Error: ${error.message}`);
      return res.status(500).send('An error occurred while processing the video.');
    }
    if (stderr) {
      // whisper and moviepy report progress and warnings on stderr,
      // so log it rather than treating it as a failure
      console.error(`Stderr: ${stderr}`);
    }
    res.json({ transcription: stdout.trim() });
  });
});

app.listen(3000, () => {
  console.log('Server started on http://localhost:3000');
});
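The server shells out to npm run transcribe, so package.json needs a transcribe script. Assuming the Python module from the next section is saved as transcribe.py (the file name is a choice, not something the server enforces), the relevant parts could look like this:
{
  "name": "video-transcription-app",
  "scripts": {
    "start": "node server.js",
    "transcribe": "python transcribe.py"
  },
  "dependencies": {
    "express": "^4.18.0",
    "multer": "^1.4.5-lts.1"
  }
}
The -- in the server's exec command forwards the uploaded file path to the script, where it arrives as sys.argv[1].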
Audio extraction and transcription modules
The speech-to-text transcription is carried out by the OpenAI whisper model. However, the audio needs to be extracted from the video before whisper can process it. The Python module below (saved, for example, as transcribe.py) handles both steps: it extracts the audio with moviepy and then transcribes it with whisper.
import whisper
import sys
from moviepy.editor import VideoFileClip

def extract_audio_from_video(video_path, audio_output_path):
    """
    Extract audio from a video file and save it to a new file.
    Args:
        video_path: str, path to the video file
        audio_output_path: str, path to save the audio file
    Returns:
        None
    """
    print(f"Extracting audio from {video_path}...", file=sys.stderr)
    # Load the video file
    video_clip = VideoFileClip(video_path)
    # Extract the audio track
    audio_clip = video_clip.audio
    # Write the audio to a file; logger=None keeps moviepy's progress bar
    # off stdout, which the web server captures as the transcription
    audio_clip.write_audiofile(audio_output_path, logger=None)
    video_clip.close()
    print(f"Audio extracted and saved to {audio_output_path}", file=sys.stderr)

def transcribe(file_path):
    """Transcribe an audio file to text with the whisper model."""
    model = whisper.load_model("base")  # downloads the weights on first run
    print("Transcribing...", file=sys.stderr)
    result = model.transcribe(file_path)
    return result["text"]

# run the script with the video path as argument
if __name__ == "__main__":
    video_path = sys.argv[1]
    audio_path = video_path + ".mp3"
    extract_audio_from_video(video_path, audio_path)
    # print only the transcription to stdout so the caller can capture it
    print(transcribe(audio_path))
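You can test the module on its own before wiring it into the server. Assuming it is saved as transcribe.py and you have a sample video at hand:
python transcribe.py sample.mp4
Progress messages go to stderr, and only the transcription is printed to stdout, which is exactly what server.js captures.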
Putting things together
Once you have installed the requirements to run Node and whisper, you can start the app with npm start (or node server.js).
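As a rough guide, the setup could look like the following; the package names are the standard ones for these libraries, but adjust them to your environment:
# web server dependencies
npm install express multer
# audio extraction and transcription dependencies
# (whisper is published on PyPI as openai-whisper)
pip install openai-whisper moviepy
# both whisper and moviepy call out to ffmpeg, which must be on your PATH,
# e.g. apt install ffmpeg (Debian/Ubuntu) or brew install ffmpeg (macOS)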
And there you go, the transcription web app is up and running. After uploading a video file (e.g. mp4) and waiting for some time, you should see the transcribed text on the web page. How long you wait depends on the size of the video and the machine you are running the app on; with a GPU the transcription can run significantly faster than on a CPU.
3. Summary
In summary, this post explains how to implement a web app that transcribes videos into text via a web page. The transcription itself is done by a generative AI model for speech-to-text (whisper, by OpenAI). This simple but educative project can be your point of entry into developing web apps that help people solve real problems using generative AI.
4. Future work
There are different ways in which the work presented here could be extended. They are listed in order of difficulty, in case you want to take the project further in one of these areas:
Output formatting
With this implementation, the output is written directly on the same pane where the file was uploaded. A good extension would be to present the transcription in a scrollable text box with copy and save-to-file buttons. The functionality to store transcriptions and videos in the system could also be a very useful extension.
Authentication
The web app could include an authentication process so that only registered users can access the service. You will need a database for this.
Deployment
A very useful extension of this work would be to deploy it to a cloud environment. You can get an account with one of the main cloud providers (AWS, Azure, GCP, etc.), or you can learn to serve your website using more specialized tools like Vercel. Either way, it is an excellent way to learn how to scale a real-world web app.
5. Code
You can access this work in this GitHub repo. Feel free to contribute to it by creating an issue or a pull request.