ASR state-of-the-art: Wav2Vec, Whisper, DeepSpeech

Milana Shkhanukova
Dec 4, 2022 · 5 min read


In this tutorial we will cover three state-of-the-art models for ASR and run inference with them on stuttering speech. It is important that the ASR systems used in today’s voice assistants understand all types of speech, so that people who stutter can use them freely and without embarrassment.

Image source: https://stutterfree.org/campaigns-usually-require-a-campaign-manager/

We will collect five samples of stuttering speech from the dataset of people who stutter collected by Apple data scientists. This dataset includes samples with five stuttering event types: Blocks, Prolongations, Sound Repetitions, Word/Phrase Repetitions, and Interjections.

Sound representation

I’ve taken five samples, one of which is part of a podcast with a woman who repeats the sound ‘ag’ in the word ‘again’. You can listen to it with the IPython API.

import IPython

IPython.display.Audio("/content/stuttering_samples/WomenWhoStutter_6_7.wav")

The mel spectrogram of this sample shows the repetition: the first syllable is repeated, and its frequency representation at the beginning is the same as inside the full word (frames 40–60). This can confuse the ASR process.

MelSpectrogram of stuttering speech
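
For reference, here is a minimal sketch of how such a mel spectrogram can be computed and plotted with librosa (the path is the sample above; n_mels and the display settings are my own choice, not necessarily those used for the figure):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# load the sample at 16 kHz, compute a mel spectrogram and convert it to dB
audio, sr = librosa.load("/content/stuttering_samples/WomenWhoStutter_6_7.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)

plt.figure(figsize=(10, 4))
librosa.display.specshow(mel_db, sr=sr, x_axis="frames", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel spectrogram of the stuttering sample")
plt.show()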

Sadly, transcripts are not available in this dataset, so I wrote my own five transcripts for these samples. I’m not a native English speaker, so if you find any mistakes, please write them in the comments.

Let’s check how difficult it is for current ASR models to transcribe such samples.

Wav2Vec

Wav2Vec 2.0 is one of the current state-of-the-art models for automatic speech recognition, based on self-supervised pre-training. It consists of three parts:

  1. Raw waveform representation: convolutional layers that process the raw waveform to obtain a latent representation.
  2. Context part: transformer layers that create a contextualised representation.
  3. Linear layer: a linear projection to the output. (A sketch right after this list maps these parts onto the Hugging Face model.)
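
The mapping onto the Hugging Face implementation loaded later in this post looks roughly like this (a quick sanity check of my own; attribute names as in recent transformers versions):

from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

print(model.wav2vec2.feature_extractor)  # 1. conv layers over the raw waveform
print(model.wav2vec2.encoder)            # 2. transformer context network
print(model.lm_head)                     # 3. linear projection to the character vocabulary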

The most important part is the first training phase. During pre-training (self-supervised training) we do not train the linear projection and instead focus on learning a comprehensive audio representation. The loss consists of two terms: a contrastive loss and a diversity loss.

  • The contrastive loss requires identifying the true quantized latent speech representation for a given context network output among other (distractor) quantized representations.
  • The diversity loss is designed to increase the use of the quantized codebook that the contrastive loss relies on. Since we have multiple codebooks with n code words each, we want all of them to be used. By maximizing the entropy of the code-word distribution we encourage the model to take advantage of all code words, because entropy reaches its maximum when the distribution is uniform. (A toy sketch of the contrastive term follows below.)
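
To make the contrastive term concrete, here is a toy PyTorch sketch of the idea (my own simplification, not the actual wav2vec 2.0 code): for each time step, the context output has to identify the true quantized latent among K distractors.

import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, negatives, temperature=0.1):
    # context:   (T, D) context-network outputs
    # quantized: (T, D) true quantized latents
    # negatives: (T, K, D) distractors sampled from other time steps
    pos = F.cosine_similarity(context, quantized, dim=-1)               # (T,)
    neg = F.cosine_similarity(context.unsqueeze(1), negatives, dim=-1)  # (T, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1) / temperature    # (T, 1 + K)
    targets = torch.zeros(context.size(0), dtype=torch.long)            # true latent is class 0
    return F.cross_entropy(logits, targets)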

Further reading:
  • an explanation of wav2vec
  • the wav2vec 2.0 article
  • another explanation of wav2vec 2.0

We first load the pretrained wav2vec 2.0 model with the Transformers API.

from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Then we use a function from the wav2vec fine-tuning tutorial and create a Hugging Face dataset. Each sample is processed by the Wav2Vec2 processor, which handles both the audio data and the text data.

import librosa
from datasets import Dataset

def prepare_dataset(batch):
    # load the audio at 16 kHz, the sampling rate wav2vec 2.0 expects
    audio, sr = librosa.load(batch["audio_paths"], sr=16000)

    info = processor(audio, sampling_rate=sr)
    batch["input_values"] = info.input_values[0]
    batch["input_length"] = len(batch["input_values"])

    # tokenize the reference transcript into label ids
    with processor.as_target_processor():
        batch["labels"] = processor(batch["texts"]).input_ids
    return batch

wav2vec_dataset = Dataset.from_pandas(test_data)
wav2vec_dataset = wav2vec_dataset.map(prepare_dataset, num_proc=2)

To get predictions we first pad our data, compute the logits, take the most probable token at each frame (greedy CTC decoding) and then decode the result into words with the processor.

import torch

def get_preds(batch):
    padded_batch = processor.pad([{"input_values": batch["input_values"]}],
                                 return_tensors="pt")

    with torch.no_grad():
        logits = model(**padded_batch).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # the processor collapses repeated tokens and removes CTC blanks
    transcription = processor.batch_decode(predicted_ids)[0]

    return transcription
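
Since the results below are compared by word error rate (WER), here is a minimal sketch of how it can be computed with the jiwer library (an extra dependency, pip install jiwer; I also lowercase both sides because wav2vec 2.0 outputs uppercase text):

import jiwer

norm = lambda s: s.lower().strip()
predictions = [norm(get_preds(sample)) for sample in wav2vec_dataset]
references = [norm(t) for t in wav2vec_dataset["texts"]]

print("WER:", jiwer.wer(references, predictions))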

I also decided to check the results on my own speech. I don’t stutter, but my pronunciation is not perfect. The intention was to check whether the model struggles with stuttering specifically or simply does not understand certain phrases.

results of wav2vec 2.0 on stuttering and my speech

Whisper

The new ASR model Whisper was released in 2022 and, as of this writing, shows state-of-the-art results. The main purpose was to create an ASR model that ‘works reliably without the need for dataset specific fine-tuning to achieve high-quality results on specific distributions’. It was trained on 680,000 hours of multilingual and multitask supervised data collected from the web; such an amount of data had never been used for supervised training before.
The Whisper architecture is an encoder-decoder Transformer. The decoder is conditioned on additional tokens such as the language and the task (transcription or translation), which the model can also predict itself.
I discussed other interesting ideas behind this network in the previous post.

Luckily, OpenAI made the prediction interface easy to use, so we just load the model and transcribe the samples.

!pip install git+https://github.com/openai/whisper.git

import whisper
from tqdm import tqdm

model_whisper = whisper.load_model("medium")

predicted_texts = []
for audio_path in tqdm(stuttering_dataset['audio_paths'].to_list()):
    prediction = model_whisper.transcribe(audio_path)["text"]
    predicted_texts.append(prediction)

results of whisper on stuttering and my speech
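
If you want explicit control over the language and task tokens mentioned above, Whisper also exposes a lower-level decoding API; here is a sketch following the openai/whisper README (the path is the sample used earlier):

import whisper

# Whisper works on 30-second windows of log-mel spectrograms
audio = whisper.load_audio("/content/stuttering_samples/WomenWhoStutter_6_7.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model_whisper.device)

# condition the decoder explicitly on the language and the task
options = whisper.DecodingOptions(language="en", task="transcribe", fp16=False)
result = whisper.decode(model_whisper, mel, options)
print(result.text)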

DeepSpeech

DeepSpeech was one of the first breakthroughs in end-to-end ASR pipelines. The model is based on a recurrent neural network combined with an additional n-gram language model. Notably, this first model was much simpler than current ones but still achieved relatively high results. The authors also used a rather unique jittering technique on the audio files: “translate the raw audio files by 5ms (half the filter bank step size) to the left and right, then forward propagate the recomputed features and average the output probabilities.”
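
A rough NumPy sketch of that jittering idea (my own interpretation; predict_probs is a hypothetical stand-in for a forward pass of the acoustic model, not DeepSpeech’s actual code):

import numpy as np

def jittered_probs(audio, sr, predict_probs, shift_ms=5):
    # translate the waveform 5 ms to the left and right, run the model on all
    # three versions and average the per-frame output probabilities
    shift = int(sr * shift_ms / 1000)                                    # 5 ms in samples
    left = audio[shift:]                                                 # shifted left
    right = np.concatenate([np.zeros(shift, dtype=audio.dtype), audio])  # shifted right
    probs = [predict_probs(a) for a in (left, audio, right)]
    min_len = min(p.shape[0] for p in probs)                             # crop to a common length
    return np.mean([p[:min_len] for p in probs], axis=0)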

To use DeepSpeech we first install it and download the corresponding acoustic model and language-model scorer.

!pip install -U deepspeech
!sudo apt-get install sox

!mkdir -p ./some/workspace/path/ds093
%cd ./some/workspace/path/ds093/
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
!curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
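
After downloading, we create the model object and attach the language-model scorer (a minimal sketch with the deepspeech Python API; the file names are the ones downloaded above). This model_ds object is what we pass as model to the function below.

import deepspeech

model_ds = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model_ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")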

Then we read the audio file, transform it into a 16-bit int array and get the speech-to-text prediction.

import wave
import numpy as np

def predict_deepspeech(audio_path, model):
    # read the 16 kHz wav file and convert it to a 16-bit int array
    w = wave.open(audio_path, 'r')
    buffer = w.readframes(w.getnframes())

    data = np.frombuffer(buffer, dtype=np.int16)

    predicted_text = model.stt(data)

    return predicted_text

results of deepspeech

Further reading: the DeepSpeech article

Results

Although the best-performing model according to WER is DeepSpeech, I was more fascinated by the results of Whisper, as it captured the details of the speech, adding ‘T-T-T’ for sound repetitions and ‘…’ for pauses. Moreover, Whisper is the most recent model, so I believe we are moving in the right direction, towards technologies that are available to everyone regardless of their health condition.

The code is also available on Kaggle via the link. Note that the data is accessible only inside the dataset itself.
