SeamlessM4T vs. Whisper — A Speech-to-Text Benchmark

Jeremy K
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
Sep 22, 2023

Recently, Meta AI unveiled their latest creation: “SeamlessM4T,” a groundbreaking, all-in-one multimodal translation model. This remarkable innovation encompasses Speech-to-Text, Speech-to-Speech, Text-to-Speech, and Text-to-Text functionalities. Notably, it can process input in nearly 100 languages and produce speech output in 35 of them.

In September 2022, OpenAI introduced us to Whisper, which has since become the premier open-source Speech-to-Text model. With these monumental strides in technology, it's imperative to conduct a thorough benchmark and performance comparison between these two titans.


The significance of Speech-to-Text technology

Speech-to-text technology finds crucial application in diverse fields. It powers transcription services for content creators, aids individuals with hearing impairments, facilitates real-time communication for customer service centers, and assists open-source researchers in their investigations. In academic settings, it assists students in capturing lectures, and in multilingual environments, it acts as an essential tool for language translation.

At its core, Speech-to-text technology employs advanced algorithms to analyze audio input. It deciphers phonetic patterns, linguistic context, and acoustic cues to accurately transcribe spoken language into written text. This process relies on machine learning models that have been trained on vast datasets of diverse speech samples.

For all these reasons, Speech-to-text is not just valuable, but truly vital.

Embarking on the Whisper journey

As outlined on their official website:

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.

Whisper models at a glance

Whisper offers a range of models, each designed to cater to specific requirements:


+--------+------------+--------------------+--------------------+---------------+----------------+
| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
+--------+------------+--------------------+--------------------+---------------+----------------+
| tiny   | 39 M       | ✓                  | ✓                  | ~1 GB         | ~32x           |
| base   | 74 M       | ✓                  | ✓                  | ~1 GB         | ~16x           |
| small  | 244 M      | ✓                  | ✓                  | ~2 GB         | ~6x            |
| medium | 769 M      | ✓                  | ✓                  | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | ✓                  | ~10 GB        | 1x             |
+--------+------------+--------------------+--------------------+---------------+----------------+

In addition, a version 2 of the large model is available; it was trained for 2.5 times more epochs and with additional regularization for improved performance.

A benchmark conducted on the French subset of the Minds14 dataset shows that using large-v2 yields roughly a 25% improvement in transcription accuracy (WER and CER) within a similar computational time frame.
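
Using large-v2 with the Transformers library only requires pointing to the corresponding checkpoint name; here is a minimal sketch, assuming the Hugging Face hub id "openai/whisper-large-v2" (the same checkpoint is used in the benchmark script later in this article):

from transformers import WhisperProcessor, WhisperForConditionalGeneration

# only the checkpoint name changes compared to the base large model
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")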

Transcription with Whisper

Transcribing audio with Whisper is a straightforward process, especially with the Transformers implementation.

To begin, install the required packages:

pip install transformers datasets

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset
import torch

# load model and processor
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
processor = WhisperProcessor.from_pretrained("openai/whisper-large")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large").to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

sampling_rate = 16000

# load PolyAI dataset, train split, French audio
dataset = load_dataset("PolyAI/minds14", "fr-FR", split="train")
# resample the dataset from 8 kHz to 16 kHz
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))

#Take the first audio file and transcribe it
input_speech = dataset[0]['audio']
input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features.to(device)
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
#Result: Je souhaite changer mon adresse.
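
To run the same pipeline on one of your own recordings rather than a dataset sample, a minimal sketch could look like the following, assuming a 16 kHz mono WAV file named audio.wav and the soundfile package (not part of the install line above):

import soundfile as sf

# read a local 16 kHz mono recording (resample it beforehand if needed)
speech, sr = sf.read("audio.wav")

input_features = processor(speech, sampling_rate=sr, return_tensors="pt").input_features.to(device)
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])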

Getting started with Seamless

This is an extract from Seamless’s paper:

we introduce SeamlessM4T — Massively Multilingual & Multimodal Machine Translation — a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations, dubbed SeamlessAlign. Filtered and combined with human-labeled and pseudo-labeled data (totaling 406,000 hours), we developed the first multilingual system capable of translating from and into English for both speech and text. On Fleurs, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous state-of-the-art in direct speech-to-text translation.

Seamless models at a glance

Seamless is released in two versions:

  • Medium: 1.2 billion parameters. 6.4 GB.
  • Large: 2.3 billion parameters. 10.7 GB.

Transcription with Seamless

Leveraging the script provided on Hugging Face's website, transcription is again a straightforward process.

To begin, install a few packages:

git clone https://github.com/facebookresearch/seamless_communication.git
cd seamless_communication
pip install .
pip install soundfile torchmetrics

import torch
import torchaudio
from seamless_communication.models.inference.translator import Translator
import datasets
from datasets import Audio
import soundfile as sf

#Load device and model
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
translator = Translator(
    model_name_or_card="seamlessM4T_large",
    vocoder_name_or_card="vocoder_36langs",
    device=device,
    dtype=torch.float32,
)

sampling_rate = 16000

dataset = datasets.load_dataset("PolyAI/minds14", "fr-FR", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))

audio = dataset[0]['audio']['array']
#save file locally before transcription
sf.write('test.wav', audio, sampling_rate)
transcribed_text, _, _ = translator.predict('test.wav', "asr", 'fra')
print(str(transcribed_text))
#Result: Je souhaite changer mon adresse.

Benchmark methodology

In this benchmark, we will focus on comparing the large versions of the models (large-v2 for Whisper) and will employ the following methodology:

Metrics employed

For each model, we will compute their Word Error Rate (WER) and Character Error Rate (CER) against several datasets. Additionally, we’ll measure the time taken to process the audio files.

To calculate the WER, we will utilize the package ‘evaluate’ and the Text Normalizer provided by Whisper. This normalization step is crucial as it ensures that factors like punctuation and casing do not skew the results, as exemplified in the code snippet below:

import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = evaluate.load("wer")
normalizer = BasicTextNormalizer()

prediction = "Je souhaite changer mon adresse."
reference = "je souhaite changer mon adresse"

wer = 100 * wer_metric.compute( references=[reference], predictions=[prediction] )

print("Non-normalized WER:", wer)
#Non-normalized WER: 40.0

prediction = normalizer(prediction)
reference = normalizer(reference)

wer = 100 * wer_metric.compute( references=[reference], predictions=[prediction])

print("Normalized WER:", wer)
#Normalized WER: 0.0

For the CER, we will utilize the ‘torchmetrics.text’ package.
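
As an illustration, here is a minimal sketch mirroring the WER example above; the second sentence is deliberately misspelled so the metric is non-zero, and note that CharErrorRate expects predictions first, then references:

from torchmetrics.text import CharErrorRate

cer_metric = CharErrorRate()

prediction = "je souhaite changer mon adresse"
reference = "je souhaite changer mon addresse"

# a single character difference relative to the reference
cer = 100 * cer_metric([prediction], [reference]).item()
print("CER:", cer)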

Utilized datasets

For this benchmark, we will leverage several datasets with varying characteristics:

  • Common Voice 13: crowdsourced read speech covering many languages
  • Minds14 (PolyAI): short e-banking voice queries recorded at 8 kHz telephone quality
  • AMI: multi-speaker meeting recordings with challenging acoustic conditions

The AMI dataset was deliberately selected for its sub-optimal speech conditions.

Evaluated languages

Our assessment will cover the following languages:

  • English. Datasets: AMI + Common Voice 13
  • French. Datasets: Minds14 + Common Voice 13
  • Russian. Dataset: Common Voice 13

For each language, we will transcribe up to 3000 recordings.
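
The cap itself is not shown in the scripts below; one simple way to apply it, assuming the Hugging Face datasets API, is to truncate the split before the transcription loop:

# keep at most 3000 recordings per language (illustrative snippet, not part of the original script)
max_samples = 3000
dataset = dataset.select(range(min(max_samples, dataset.num_rows)))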

Benchmark script

The evaluation will include the following steps:

1 — Load the dataset and prepare it: resample to 16 kHz, filter out files longer than 30 seconds or shorter than 1.5 seconds

2 — Looping over the dataset

3 — For each sample, transcribe the audio and store the result

4 — Normalize predictions and references

5 — Compute the WER

Here is an example for Whisper on the Minds14 dataset.

Please bear in mind that this script must be tailored to the dataset you are using. Additional scripts can be provided upon request. Do not hesitate to reach out.

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import Audio, load_dataset
import evaluate
from tqdm import tqdm
from transformers.models.whisper.english_normalizer import BasicTextNormalizer
from torchmetrics.text import CharErrorRate
import torch
import time
import re

# load model and processor
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2").to(device)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")

#Variables for preprocessing the dataset
sampling_rate = 16000
max_input_length = 30   # maximum duration in seconds
min_input_length = 1.5  # minimum duration in seconds

# load dataset and resample to 16Khz
dataset = load_dataset("PolyAI/minds14", "fr-FR", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))


#function to filter out recordings that are too long or too short
def is_audio_in_length_range(length):
    return min_input_length < length < max_input_length

#function to add a length field to the dataset
def prepare_dataset(example):
    audio = example["audio"]
    # compute input length of audio sample in seconds and add a column to the dataset
    example["input_length"] = len(audio["array"]) / audio["sampling_rate"]
    return example

#apply the prepare_dataset function
dataset = dataset.map(
    prepare_dataset
)

#remove items outside the allowed duration range
dataset = dataset.filter(
    is_audio_in_length_range,
    input_columns=["input_length"],
)

#Metrics
wer_metric = evaluate.load("wer")
#Normalizer provided by Whisper
normalizer = BasicTextNormalizer()
#Character error rate
cer_metric = CharErrorRate()

length = dataset.num_rows

all_predictions = []
all_references = dataset['transcription'][0:length]

t = 0
#for each item in the dataset, transcribe and store results in all_predictions
for i in tqdm(range(0, length)):
    input_speech = dataset[i]['audio']
    t0 = time.time()
    input_features = processor(input_speech["array"], sampling_rate=input_speech["sampling_rate"], return_tensors="pt").input_features.to(device)
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    t += (time.time() - t0)
    all_predictions.append(transcription[0])


#normalize both predictions and references. Also remove additional whitespace at the end of the predictions
all_predictions = list(map(lambda x: normalizer(x), all_predictions))
all_predictions = list(map(lambda x: re.sub(' $', '',x), all_predictions))
all_references = list(map(lambda x: normalizer(x), all_references))

wer = 100 * wer_metric.compute(
    references=all_references, predictions=all_predictions
)

cer = 100 * cer_metric(all_predictions, all_references).item()


print('WER:', wer)
print('CER:', cer)
print('Time:', t)

Outcome of the benchmark

Word Error Rate (results chart)

Character Error Rate (results chart)

Processing Time (results chart)

Analysis

In this benchmark, Seamless exhibited superior performance compared to Whisper, with inference roughly 50% to 60% faster.

Seamless performed particularly well on the Common Voice dataset, achieving strong Word Error Rate (WER) and Character Error Rate (CER) scores.

However, in environments with higher levels of noise, such as the AMI dataset, Whisper demonstrated a slight edge over Seamless. It’s worth noting that both models would benefit from fine-tuning to further enhance their accuracy when dealing with noisy audio files.

One noteworthy observation is that Whisper occasionally suffers from what can be termed “hallucinations”: instances where the model generates excessively long text when it is uncertain about the actual content. An example in English vividly highlights this:

Hallucination Example with Whisper:

REFERENCE: “Due to the expense of production it is not being used for this purpose.”

PREDICTION: “I said go go go, do what you want, do what you want, do what you want…” (repeated multiple times)

The resulting Word Error Rate (WER) for this sample with Whisper reached a substantial 2542, significantly influencing the overall evaluation. In contrast, Seamless achieved a WER of 7.14 for the same audio sample.

Limitations

While this benchmark provides valuable insights, it’s important to acknowledge its limitations. The assessment encompassed a diverse range of datasets, yet the number of languages evaluated remained limited.

Furthermore, a more comprehensive evaluation could be conducted with an increased volume of audio files.

Lastly, filtering out predictions containing hallucinations would change the outcome of the benchmark.
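
As a purely hypothetical illustration (this heuristic is not part of the benchmark), hallucinated outputs could be flagged by comparing the length of each prediction with its reference and dropping pairs where the prediction is disproportionately long:

# hypothetical filter: drop pairs whose prediction is far longer than the reference
max_length_ratio = 3.0  # arbitrary threshold, for illustration only

filtered_pairs = [
    (pred, ref)
    for pred, ref in zip(all_predictions, all_references)
    if len(pred.split()) <= max_length_ratio * max(len(ref.split()), 1)
]
filtered_predictions = [pred for pred, _ in filtered_pairs]
filtered_references = [ref for _, ref in filtered_pairs]

wer = 100 * wer_metric.compute(references=filtered_references, predictions=filtered_predictions)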

Conclusion

In this benchmark, Seamless emerged as the front runner, displaying impressive speed and accuracy, particularly in the Common Voice dataset. However, it’s essential to acknowledge Whisper’s strength in handling noisy environments, underscoring its adaptability.

It's worth noting that Whisper does encounter occasional challenges, as demonstrated by the instances of “hallucinations”. These moments, while infrequent, weigh heavily on its overall performance.

To wrap up, this benchmark already provides solid insights, and further validation on additional languages would give a fuller picture of what these models can truly achieve.

#AI #Speech2text #Benchmark
