How to use Bacalhau and OpenAI Whisper to transcribe your video and audio files
Captions, subtitles, and transcripts all help your audio and video content reach a wider audience and encourage more interaction from viewers and listeners.
Over the past week, I experimented with a new approach to video and audio transcription. I used OpenAI Whisper and Bacalhau to transcribe my downloaded YouTube videos and extract the text from them in different formats (a text document, a subtitle file, and a WebVTT file).
I was eager to give this a try, and I was blown away by how well it worked and how simple it was to implement. This process is not exclusive to YouTube videos; you can try it out on any video or audio file of your choice.
If this is something you want to try out, this is a tutorial on how to get started.
Here’s the roadmap for this project:
- Install dependencies
- Create a Whisper Python script
- Create a Dockerfile to containerize your Whisper script
- Run Whisper on Bacalhau
But before you get started, let's dive into why I decided to use Bacalhau and OpenAI Whisper, and what they are all about.
What are Bacalhau and OpenAI Whisper?
Whisper is an open-source, general-purpose speech recognition model developed by OpenAI. It is a multi-task model trained on a large dataset to perform language identification, voice activity detection, transcription, and translation. Trained on 680,000 hours of audio, Whisper supports over 96 languages in addition to English.
Bacalhau (Compute Over Data, or CoD) is a network of open compute resources made available to serve any data processing workload. It processes and transforms large-scale datasets by enabling users to run arbitrary Docker containers and WebAssembly (Wasm) images against data stored in IPFS (the InterPlanetary File System). Bacalhau operates as a peer-to-peer network of nodes, where each node participates in executing and computing jobs submitted to the cluster.
The advantages of using Bacalhau over managed Automatic Speech Recognition services
- You can manage your own containers that can scale to batch process petabytes (quadrillion bytes) of audio and video files.
- Using its sharding feature, you can carry out distributed inference very easily. Typically, distributed inference is carried out on large-scale datasets with millions of records.
- If your data is stored on IPFS, you don't need to move it; Bacalhau can run the computation where the data is located.
- The cost of computing is much lower than with managed services.
Install dependencies
1. Install FFmpeg, an audio- and video-processing library.
# On Linux (Debian/Ubuntu)
sudo apt update && sudo apt install ffmpeg
# On macOS (Homebrew)
brew install ffmpeg
# On Windows (Chocolatey)
choco install ffmpeg
2. Install PyTorch, an open-source machine learning (ML) framework
pip install torch
3. Install Whisper, an open-source speech recognition model
pip install git+https://github.com/openai/whisper.git -q
4. Install Bacalhau, to compute the data processing workload
curl -sL https://get.bacalhau.org/install.sh | bash
Create Whisper Python script
For the Whisper script, you will need to create a file called openai-whisper.py. Below is the Whisper sample script written by the Bacalhau team. Copy and paste the code into your openai-whisper.py file.
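If you don't have that sample handy, here is a minimal, hypothetical sketch of what such a script could look like. This is not the Bacalhau team's exact code: the -p/-o flags mirror the ones used later in this tutorial, and the rest follows the description below (GPU execution is handled by Whisper itself when a GPU is available).

```python
import argparse
import os
import subprocess


def build_parser():
    # -p and -o match the flags used when running the script on Bacalhau
    parser = argparse.ArgumentParser(
        description="Transcribe a video/audio file with OpenAI Whisper")
    parser.add_argument("-p", "--path", required=True,
                        help="input .mp4 (or audio) file path")
    parser.add_argument("-o", "--output", default="outputs",
                        help="directory for the transcript outputs")
    parser.add_argument("--temperature", type=float, default=0.0,
                        help="sampling temperature")
    return parser


def convert_to_wav(path):
    # Convert an .mp4 input to .wav with ffmpeg; pass audio files through
    if not path.endswith(".mp4"):
        return path
    wav_path = path[:-4] + ".wav"
    subprocess.run(["ffmpeg", "-y", "-i", path, wav_path], check=True)
    return wav_path


def main():
    import whisper  # lazy import; requires the openai-whisper package

    args = build_parser().parse_args()
    os.makedirs(args.output, exist_ok=True)

    audio = convert_to_wav(args.path)
    model = whisper.load_model("large")  # uses the GPU when one is available
    result = model.transcribe(audio, temperature=args.temperature)

    # Save the plain-text transcript; the real script also writes
    # subtitle (.srt) and WebVTT (.vtt) versions from result["segments"]
    with open(os.path.join(args.output, "transcript.txt"), "w") as f:
        f.write(result["text"])


if __name__ == "__main__":
    main()
```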
The above script accepts and sets the required parameters, like the input file path, output file path, and temperature. Next, the script is configured to execute on the GPU and to convert .mp4 files to .wav files. The "large" Whisper model is used; you can find more information about the different Whisper models in the Whisper repository. Finally, after loading the model, the script saves the output transcript in various formats.
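Saving the transcript "in various formats" mostly comes down to formatting the segment timestamps that model.transcribe() returns. As an illustration (this is not the script's actual code), a helper for the SubRip (.srt) format could look like:

```python
def srt_timestamp(seconds):
    # Format seconds as an SRT timestamp, e.g. 397.84 -> "00:06:37,840"
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments):
    # segments: list of dicts with "start", "end", and "text" keys,
    # as found in the "segments" field returned by model.transcribe()
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

The WebVTT format is nearly identical, except that it uses a dot instead of a comma in timestamps and starts with a WEBVTT header line.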
Test the Whisper script
To test the script and ensure everything works as expected, run the following commands in your terminal.
- To download the test audio clip
- To run your Whisper script
- To view the output for the test sample audio, in the text document, subtitle, and WebVTT file formats
Create a Dockerfile to containerize your Whisper script
At this stage, you will need to create a Dockerfile to containerize your Python Whisper script. A Dockerfile is a text file that contains the instructions Docker uses to build a container image. You can check the Docker docs to learn more.
To containerize the script
1. Create an empty file called Dockerfile.
2. In the Dockerfile, add the following lines of code. These commands specify how the image will be built and what extra requirements will be included.
3. Right-click on the Dockerfile and click on Build Image (or run docker build from your terminal).
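For reference, here is a hypothetical sketch of what the Dockerfile contents in step 2 could look like. The exact file may differ, and the test audio file name is illustrative:

```dockerfile
# Base image with PyTorch 1.12.1, CUDA 11.3, and cuDNN 8 preinstalled
FROM pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime

# Install the dependencies: ffmpeg for audio conversion, git to fetch Whisper
RUN apt-get update && apt-get install -y ffmpeg git && rm -rf /var/lib/apt/lists/*
RUN pip install git+https://github.com/openai/whisper.git

WORKDIR /

# Add the Whisper script and a test audio file to the container
COPY openai-whisper.py .
COPY test.mp3 .
```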
So what exactly is happening in the Dockerfile?
- The pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime image is used as the base image.
- The dependencies to be installed are added to the container.
- The test audio file and the openai-whisper.py script are also added to the container.
- Finally, docker build is run to check that the container builds successfully.
Running Whisper on Bacalhau
This is the point where you get to transcribe your video. As stated earlier, I'll be using this YouTube video (an 8-minute-long video) as an example to show how this works. I downloaded the video in .mp4 format.
You can use any video of your choice; it doesn't have to be a YouTube video.
Get the CID
After downloading your video, the next step is to upload it to IPFS to get its CID (content identifier). You can use NFTUp to upload the video by following the steps below:
- Create an account on NFTUp
- Get your key on your account page.
- Drag and drop your downloaded video for it to be uploaded
- Copy your CID number
For this example, the CID is: bafybeidwbzzi3hjg54tvdabiesc54lrb3qerzunu4uuahh3o6g3tfmitee
Run the container on Bacalhau
To run the container on Bacalhau, copy and paste the following command into your terminal
bacalhau docker run \
  jsacex/whisper \
  --gpu 1 \
  -v bafybeidwbzzi3hjg54tvdabiesc54lrb3qerzunu4uuahh3o6g3tfmitee:/ytvideo.mp4 \
  -- python openai-whisper.py -p ytvideo.mp4 -o outputs
From the above command:
- The --gpu flag denotes the number of GPUs we are going to use
- The -v flag mounts the file (by its IPFS CID) to a specific location inside the container
- -p provides the input path of our file
- -o provides the output path of the file
When you run the command, Bacalhau prints out the related job id:
At this point, you can run a series of Bacalhau commands to find out more about the submitted job.
To find out the state of your job, run the following command
bacalhau list --id-filter f07d5a18-3c5c-4df7-8269-1695ca61ae86
When the state says Completed, the job is done and you can get the results.
To find out more information about your job, run the following command:
bacalhau describe f07d5a18-3c5c-4df7-8269-1695ca61ae86
Once your job is complete, you will see something like this:
Job successfully submitted. Job ID: f07d5a18-3c5c-4df7-8269-1695ca61ae86
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):
Creating job for submission ... done ✅
Finding node(s) for the job ... done ✅
Node accepted the job ... done ✅
Job finished, verifying results ... done ✅
Results accepted, publishing ... Results CID: QmWPpwPiBtkJtk5tg7FZnEHzWMEZhFUdbz5vWd1dHsTJ6Q
Job Results By Node:
Container Exit Code: 0
Stdout (truncated: last 2000 characters):
] one day this will go back it's not in that comfort zone it's in the discomfort zone
[06:37.840 --> 06:41.520] is where my confidence is getting good that's what's getting good the people
[06:41.520 --> 06:46.160] they want an easier answer there has to be an easier way it's not I'm sorry I
[06:46.160 --> 06:54.000] searched for my entire life we're built for struggle us human beings you know
Note: for the sake of brevity, I removed some parts of the result.
The job's outputs are saved on IPFS after processing is complete. To download your results locally, first create an output directory to save them in.
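Creating the directory is a one-liner:

```shell
# Create a local directory to hold the downloaded job results
mkdir -p results
```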
Use the Bacalhau command below to download the results into your output directory:
bacalhau get f07d5a18-3c5c-4df7-8269-1695ca61ae86 --output-dir results
After the download has finished, you will find your contents in the results directory.
View the Output
In your results folder, you will find sub-folders that contain the different output formats.
You can view your result in any of them.
And that is it! Very accurate and very clean results. You can do more with this: transcribe your movie footage, podcasts, lecture recordings, and so on.
Interesting Applications for Bacalhau
Most likely, this post has piqued your interest in Bacalhau. If so, I should tell you that I've only scratched the surface of its many applications. Beyond speech recognition with Whisper, it can be used for image processing, data conversion, generating realistic images with StyleGAN3, and much more.