Generating Subtitles with OpenAI Whisper

Dmitrii Lukianov
Akvelon
11 min read · Mar 15, 2023

Key Takeaways

  • Overview of OpenAI Whisper and how well it performs in comparison to other Automatic Speech Recognition systems
  • Whisper Commercial API vs a self-hosted Whisper solution: advantages and disadvantages
  • How to leverage OpenAI Whisper with Python
  • Where Whisper stands in terms of translation quality in comparison to different machine translation models
  • How to enhance the audio transcription quality even more

Introduction

When it comes to the auto-generated subtitles on YouTube, many people have found some aspects that could be improved upon, including:

  1. The transcription accuracy is decent for English; however, it declines for other languages
  2. The text is presented in a non-normalized format, without any punctuation
  3. Subtitles appear on the screen word by word as they are spoken, while many viewers find it more convenient to see the whole subtitle line at once, to “look a bit into the future”
  4. YouTube only supports the most common languages
  5. Sometimes, YouTube just “gives up” on a certain video. For example, Bill Wurtz’s “History of the Entire World I Guess” doesn’t have subtitles even though it’s completely audible. When it comes to challenging videos (such as music videos), YouTube rarely has subtitles available and when it does, they are very often far off
  6. YouTube sometimes has problems with language detection, even in obvious cases. For example, this cartoon was identified as speaking Spanish instead of English

Is it possible to generate better subtitles? It actually is!

I have developed an algorithm that generates subtitles that are vastly superior to YouTube’s auto-generated subtitles, and also supports a large number of languages. This algorithm is based on Whisper, an automatic speech recognition neural network from OpenAI.

Here is a demo video of the algorithm’s results.

In this article, I’m going to

  • introduce OpenAI Whisper to you
  • guide you through the process of using Whisper with Python
  • share some technical details about how I have improved the quality of generated subtitles even more, beyond the Whisper baseline

Whisper

OpenAI Whisper is a transformer-based automatic speech recognition system (see this paper for technical details) with open source code. Whisper is free to use, and the model is downloaded to your machine on the first run. However, you also have an option of using the commercial API from OpenAI.

The largest version of Whisper has 1,550 million parameters. However, there are smaller versions available that you can use to save resources.

Whisper can perform multiple tasks, including:

English audio transcription: given an audio file (multiple formats accepted), it generates a series of captions for speech fragments. Speech fragments also have timestamps attached to them.

OpenAI developers claim in their paper that Whisper outperforms other ASR systems (both free and commercial) in terms of transcription accuracy (in most scenarios) and that in some situations, Whisper results are better than human ones.

Non-English audio transcription: Whisper supports 96 languages besides English. The chart below shows the transcription accuracy for different languages on the Fleurs dataset.

X→English audio translation: the chart below shows the Whisper translation performance on the Fleurs dataset for various languages.

Language detection: the largest Whisper model achieves 80.3% accuracy on the subset of the Fleurs dataset that includes only the languages supported by Whisper.

Whisper Commercial API

There are two ways to use Whisper:

  1. Through the commercial API: you register on the OpenAI website, obtain the API token, and then pay as you go for the requests you send to OpenAI servers (a minimal usage sketch is shown right after this list)
  2. Installing Whisper from the repository: the Whisper models are hosted on your machine. Later in this article, I’m going to guide you through this approach because it’s more complicated than the first one
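
For reference, a minimal commercial API call looked roughly like this at the time of writing, assuming the openai Python package with the API key taken from the OPENAI_API_KEY environment variable (the file path is a placeholder):

import openai

# the openai package reads the API key from the OPENAI_API_KEY environment variable
audio_file = open("path/to/your/file.mp3", "rb")

# transcription in the original language using the hosted Whisper model
transcript = openai.Audio.transcribe("whisper-1", audio_file)
print(transcript["text"])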

There are several advantages and disadvantages to each approach:

  • The self-hosted solution is free
  • Unless you have a very powerful GPU server, the commercial API is going to deliver the results much faster
  • The commercial API is easier to use
  • It looks like the commercial API performs some post-processing of the results, which slightly increases their quality
  • The commercial API doesn’t support processing files that are larger than 25MB

Setup

Before setting up Whisper, we need to install the ffmpeg command-line tool, which processes audio files.

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

To install Whisper, we simply run this command:

pip install git+https://github.com/openai/whisper.git

It will install Whisper along with its Python dependencies (PyTorch, Hugging Face transformers, and ffmpeg-python).

Audio Transcription

Whisper is quite easy to use. Here is the code that transcribes an audio:

import whisper
model_name = "large"
model = whisper.load_model(model_name, download_root="models")
res = model.transcribe("path/to/your/file.mp4", language='en')
  1. model_name is the Whisper model that you use.
  2. download_root is the location where models get downloaded to at the first run
  3. language is the language of the video. If you don’t specify the language, it will be detected automatically based on the first portion of the video (up to 30 seconds)

res is a dictionary containing the following fields:

  1. res["text"]: the entire recognized text
  2. res["segments"]: the list of dictionaries for speech segments: res["segments"][i]["start"] and res["segments"][i]["end"] represent the time boundaries (in seconds) of the segment, and res["segments"][i]["text"] contains its text
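
As an illustration, here is a minimal sketch that turns the segment list into simple subtitle lines (the SRT-style timestamp formatting is my own addition, not part of Whisper):

def format_timestamp(seconds: float) -> str:
    # convert seconds to an SRT-style "HH:MM:SS,mmm" timestamp
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    millis = int((seconds - int(seconds)) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

for segment in res["segments"]:
    start = format_timestamp(segment["start"])
    end = format_timestamp(segment["end"])
    print(f"{start} --> {end}  {segment['text'].strip()}")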

Audio Transcription Results

Let’s compare Whisper transcription results to the auto-generated subtitles on YouTube.

We will use this fragment from Toy Story as a test sample because:

  1. It contains speech by two different characters
  2. The voice lines differ significantly in intonation, tempo, and pitch
  3. It also has some background noises that can possibly confuse the models

Audio Translation

Whisper is also capable of performing voice translation: if we set the language to en but the speech in the audio file is not in English, Whisper is going to return us the English translation of what’s being said.

This Whisper “voice translation by language mismatch” approach technically works for other languages as well. However, the translation quality is going to be quite poor when the target language is not English.
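
In code, both translation routes are just arguments to the transcribe method. A minimal sketch (the file path is a placeholder):

import whisper

model = whisper.load_model("large", download_root="models")

# the dedicated X->English translation task
res = model.transcribe("path/to/your/file.mp4", task="translate")

# the "language mismatch" route described above: declare English as the language
# even though the speech in the audio is not English
res_mismatch = model.transcribe("path/to/your/file.mp4", language="en")

print(res["text"])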

Which approach is going to lead to better translation quality: translating the audio directly with Whisper, or first transcribing the audio with Whisper and then translating the results with a separate translation neural network? I’ve conducted an experiment to figure this out.

I have collected 100 sentences from Tatoeba, each of them available in 5 languages: English, Spanish, French, German, and Russian. The mean length of English sentences is 93 characters, with a standard deviation of 25 characters, and a median length of 82 characters.

Using Google’s text-to-speech technology, WaveNet, I generated an audio file for each non-English sentence. These audio files are suitable for use with Whisper for both transcription and translation.
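
A sketch of how such an audio file can be generated with the Google Cloud Text-to-Speech API; the exact WaveNet voice name here is an assumption, and any WaveNet voice for the target language works:

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# the sentence to synthesize and a WaveNet voice for its language
synthesis_input = texttospeech.SynthesisInput(text="Hola, ¿cómo estás?")
voice = texttospeech.VoiceSelectionParams(
    language_code="es-ES",
    name="es-ES-Wavenet-B",  # assumed voice name; pick any WaveNet voice for the language
)
audio_config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.LINEAR16)

response = client.synthesize_speech(input=synthesis_input, voice=voice, audio_config=audio_config)
with open("sentence_es.wav", "wb") as f:
    f.write(response.audio_content)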

I have selected three machine translation solutions:

  1. MarianMT is a set of translation models that are available to use for free (a usage sketch is shown after this list). Here you can find the full list of models. I used the ‘big’ models (they are ~550MB large): Helsinki-NLP/opus-mt-tc-big-cat_oci_spa-en for Spanish (this model also works with Catalan and Occitan); Helsinki-NLP/opus-mt-tc-big-fr-en for French; Helsinki-NLP/opus-mt-tc-big-gmw-gmw for German (it translates between all West Germanic languages: English, German, Dutch, Afrikaans, Low German, Scots, Frisian, Old English and others); Helsinki-NLP/opus-mt-tc-big-zle-en for Russian (it can also handle Ukrainian and Belarusian)
  2. GoogleTranslate, which has a commercial API (costing $20 per one million characters)
  3. DeepL, which also has a commercial API (the price is similar to GoogleTranslate)
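
The MarianMT models listed above can be used through the Hugging Face transformers library. A minimal sketch for the French model (for multilingual-target models such as the gmw-gmw one, the source text additionally needs a target-language prefix like ">>eng<< "):

from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-tc-big-fr-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# tokenize the source sentence, generate the translation, and decode it back to text
batch = tokenizer(["Le chat dort sur le canapé."], return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))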

Then, I measured the translation quality for 7 scenarios:

  1. Whisper: I called Whisper for the translation task (X→English) directly
  2. Whisper + MarianMT: I first transcribed the speech with Whisper and then translated the results with MarianMT
  3. MarianMT: I translated the sentence directly with MarianMT. This scenario is obviously going to have better results than the previous one because we’re not dealing with audio here at all; the difference will be caused by speech recognition mistakes
  4. Whisper + GoogleTranslate: I first transcribed the speech with Whisper and then translated the results with GoogleTranslate
  5. GoogleTranslate: I translated the sentence directly with GoogleTranslate
  6. Whisper + DeepL: I first transcribed the speech with Whisper and then translated the results with DeepL
  7. DeepL: I translated the sentence directly with DeepL

I measured the similarity between the translation of each non-English sentence in each scenario and its English version, using two metrics:

  1. BLEU score, which is calculated based on proportions of correctly translated n-grams (n ranging from 1 to 4). This heuristic measure correlates very well with how humans perceive translation quality
  2. The cosine distance between sentence embeddings produced by the BERT-like all-mpnet-base-v2 model from the sentence-transformers library

For each (scenario, source language) pair, I calculated the weighted average of these two metrics across all sentences, with each sentence weighted by its length.
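
A rough sketch of how these two metrics and the weighted average can be computed; the sacrebleu package here stands in for whichever BLEU implementation was actually used:

import sacrebleu
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-mpnet-base-v2")

def bleu_score(hypothesis: str, reference: str) -> float:
    # sentence-level BLEU over n-grams of size 1 to 4
    return sacrebleu.sentence_bleu(hypothesis, [reference]).score

def embedding_similarity(hypothesis: str, reference: str) -> float:
    # cosine similarity between the two sentence embeddings
    embeddings = embedder.encode([hypothesis, reference], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

def weighted_average(scores, sentences):
    # each sentence's score is weighted by the sentence length
    weights = [len(s) for s in sentences]
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)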

I got the following results:

However, sentences used in this test were pretty short and therefore easy to translate. What will happen when we use longer sentences?

In the second test, I have selected 400 longer sentences from Tatoeba: their median length is 226 characters, and the mean length is 261 characters with a standard deviation of 106 characters. This time around, the sentence lists for different languages were different from each other (because it’s hard to find long sentences, each of which is available in all 5 languages), so we can’t compare the translation quality for different languages directly.

The results turned out to be a bit unexpected: Whisper looks better according to the BERT metric, whereas Whisper + MarianMT is better according to the BLEU metric. In addition, this relation flips if we consider the German language specifically.

What are the main takeaways from these charts?

  1. If you can afford a separate translation model available through the commercial API, it probably will yield better results than using Whisper translation directly
  2. If you can’t, then it’s actually hard to say which is better, as the translation quality depends on which kinds of sentences you use and on the original language of the audio
  3. Whisper transcription errors bring the quality down, but not by much: the quality differences between scenarios 2–3, 4–5, and 6–7 are around the same magnitude as between MarianMT and GoogleTranslate
  4. There are noticeable differences in translation quality between different languages: Spanish is always the easiest to translate, followed by French, German, and then Russian

Language Detection

If you don’t specify the language for the transcribe method, Whisper will detect the language of the audio based on the first 30 seconds of it.

However, what if the first 30 seconds don’t contain any actual speech? In that case, the language detection result is going to be essentially random, which feels unreliable.

I have implemented a more accurate solution. It takes the specified number of samples from the audio at random places and detects the probability for each language at each sample. Then, it returns the language with the highest average probability across all the samples.

import whisper
from collections import defaultdict
import random


def detect_language(audio_file_path: str, whisper_model_name: str, samples_number=5):
    # load audio
    audio = whisper.load_audio(audio_file_path)

    # load the Whisper model
    model = whisper.load_model(whisper_model_name, download_root="models")

    # optimization: if the audio length is <= Whisper chunk length, then we will only take 1 sample
    if len(audio) <= whisper.audio.CHUNK_LENGTH * whisper.audio.SAMPLE_RATE:
        samples_number = 1

    probabilities_map = defaultdict(list)

    for i in range(samples_number):
        # select a random audio fragment
        random_center = random.randint(0, len(audio))
        start = random_center - (whisper.audio.CHUNK_LENGTH // 2) * whisper.audio.SAMPLE_RATE
        end = random_center + (whisper.audio.CHUNK_LENGTH // 2) * whisper.audio.SAMPLE_RATE
        start = max(0, start)
        start = min(start, len(audio) - 1)
        end = max(0, end)
        end = min(end, len(audio) - 1)
        audio_fragment = audio[start:end]

        # pad or trim the audio fragment to match the Whisper chunk length
        audio_fragment = whisper.pad_or_trim(audio_fragment)

        # extract the Mel spectrogram (on the model's device) and detect the language of the fragment
        mel = whisper.log_mel_spectrogram(audio_fragment).to(model.device)
        _, _probs = model.detect_language(mel)
        for lang_key in _probs:
            probabilities_map[lang_key].append(_probs[lang_key])

    # calculate the average probability for each language
    for lang_key in probabilities_map:
        probabilities_map[lang_key] = sum(probabilities_map[lang_key]) / len(probabilities_map[lang_key])

    # return the language with the highest average probability
    detected_lang = max(probabilities_map, key=probabilities_map.get)
    return detected_lang
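
Example usage (the file path and model name are placeholders):

detected = detect_language("path/to/your/file.mp4", "large", samples_number=5)
print(detected)  # e.g. "en"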

Improvements

I have implemented software that generates subtitles for an audio file using Whisper. My solution also includes some improvements to boost the overall quality of generated subtitles.

Voice Activity Detection

Whisper is a combination of acoustic and language neural networks. When speech is mixed up with non-speech, Whisper can sometimes get really confused, which causes neural text degeneration and desynchronization. To prevent this, it’s better to feed Whisper pieces of audio that contain speech only.

To ensure this, a voice activity detection (VAD) algorithm is applied. The result of a VAD algorithm is a sequence of intervals that supposedly contain speech only. I’m currently using the Brouhaha neural network for this.
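
To illustrate the idea, here is a simplified sketch of how the VAD output can be combined with Whisper; the speech intervals are assumed to come from any VAD model, and the real pipeline is more involved:

import whisper

SAMPLE_RATE = whisper.audio.SAMPLE_RATE  # 16 kHz, the rate Whisper works with

def transcribe_speech_intervals(audio_file_path, speech_intervals, model, language=None):
    # speech_intervals is a list of (start_sec, end_sec) pairs produced by a VAD model
    audio = whisper.load_audio(audio_file_path)
    segments = []
    for start_sec, end_sec in speech_intervals:
        # cut out the speech-only fragment and transcribe it separately
        fragment = audio[int(start_sec * SAMPLE_RATE):int(end_sec * SAMPLE_RATE)]
        result = model.transcribe(fragment, language=language)
        for seg in result["segments"]:
            # shift the timestamps back to the timeline of the full audio
            segments.append({
                "start": seg["start"] + start_sec,
                "end": seg["end"] + start_sec,
                "text": seg["text"],
            })
    return segments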

Fixing Textual Anomalies

Sometimes (usually with non-natural speech, such as loud screams) a line generated by Whisper can be filled with a repetitive pattern (for example, AAAAAAAAAAAAAAAAAAAA…). Quite often, such a line will span a segment of audio that also contains some normal speech, effectively removing the subtitles for the latter. I call these cases “textual anomalies”.

I have developed a solution to fix this problem:

  1. First, I detect textual anomalies with a heuristic algorithm (a sketch of such a check is shown after this list) and remove them from the subtitles
  2. Then, I generate new subtitles for the fragment that was previously covered by the anomaly. To prevent the anomaly from appearing again, I iteratively shift the beginning of the fragment forward until the result no longer contains an anomaly
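
Here is a sketch of what the detection heuristic in step 1 could look like; the thresholds are arbitrary illustration values, not the exact ones used in my implementation:

def looks_like_anomaly(text: str, min_length: int = 20, max_char_share: float = 0.5) -> bool:
    # a long line dominated by a single repeated character (e.g. "AAAAAAAA...")
    # is treated as a textual anomaly
    stripped = text.replace(" ", "")
    if len(stripped) < min_length:
        return False
    most_common_share = max(stripped.count(ch) for ch in set(stripped)) / len(stripped)
    return most_common_share > max_char_share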

Translating Subtitles

Translating subtitles represents a challenge: if we translate each subtitle line individually, then the overall translation quality is going to be poor because the translation model doesn’t have enough context.

Let’s consider an example: the German sentence “Eine Katze hatte Bekanntschaft mit einer Maus gemacht.” means “A cat had made the acquaintance of a mouse.”, and that’s exactly what Google Translate outputs. However, imagine if this sentence was split into three parts: “Eine Katze hatte”, “Bekanntschaft mit einer” and “Maus gemacht.”. “Eine Katze hatte” turns into “had a cat”, “Bekanntschaft mit einer” translates to “acquaintance with one” and “Maus gemacht.” becomes “mouse made.”. If we put them together we get “had a cat acquaintance with one mouse made.” which sounds bad.

I have developed a solution to this problem. My algorithm first groups subtitle lines into large blocks of text. These blocks are then fed into the translation model, so the model has plenty of context and can perform well. Then, the resulting texts are separated into small fragments that correspond to the initial subtitle lines. This separation algorithm is based on processing the LaBSE token embeddings.
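
As a sketch, the grouping step can be as simple as greedily merging consecutive lines up to a character limit (the limit here is an arbitrary illustration value); splitting the translated block back into per-line fragments with LaBSE token embeddings is the harder part and is not shown:

def group_lines_into_blocks(lines, max_block_chars=2000):
    # greedily merge consecutive subtitle lines into larger blocks so that
    # the translation model sees enough context
    blocks, current, current_len = [], [], 0
    for line in lines:
        if current and current_len + len(line) + 1 > max_block_chars:
            blocks.append(" ".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += len(line) + 1
    if current:
        blocks.append(" ".join(current))
    return blocks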

Contacts

If you have any questions or suggestions regarding this project, feel free to contact us via email at Info@akvelon.com.
