Fine-tune OpenAI’s Whisper Automatic Speech Recognition (ASR) Model

Graphcore · Aug 9, 2023 · 6 min read

Get the most out of Whisper by optimising it for new use cases, including better comprehension of specific languages and dialects, as well as technical and industry-specific terminology.

by Goran Katalinic

Whisper — the open source automatic speech recognition (ASR) model created by OpenAI — is incredibly powerful out of the box.

It is trained on 680,000 hours of labelled audio data, 117,000 hours of which cover 96 languages other than English, meaning that it can be applied to a wide range of applications with great results.

The vanilla version of Whisper is available to run for inference in a Paperspace Gradient Notebook, powered by Graphcore IPUs.

There are also good reasons to fine-tune Whisper for a particular use case. This could include accounting for the complex and sometimes subtle differences in speech and vocabulary as influenced by:

  • A less common spoken language
  • Locale and dialect
  • A particular domain, such as scientific, medical, and legal

Where can I get audio data for fine-tuning Whisper?

Some organisations may have large amounts of proprietary audio data that can be used in the fine-tuning process. For others, gathering the audio necessary for fine-tuning is not a trivial undertaking.

Thankfully, there are several open-sourced speech recognition datasets available, covering multiple languages. The largest of these include Mozilla's Common Voice and Multilingual LibriSpeech.

There are also smaller datasets covering many more languages and dialects, such as:

  • VoxPopuli: 1,800 hours, 16 languages
  • Fleurs: 12 hours per language, 102 languages
  • The individual datasets hosted by OpenSLR

In our Paperspace Gradient Notebook, we demonstrate fine-tuning using the Catalan subset of OpenSLR.

How to fine-tune Whisper on Graphcore IPUs

Get started by running the Whisper Small Fine Tuning notebook on Paperspace.


For each code block below, you can simply click to run the block in Paperspace — making any modifications to code/parameters, where relevant. We explain how to run the process in environments other than Paperspace Gradient Notebooks at the end of this blog.

Install dependencies

# Install optimum-graphcore from source 
!pip install git+https://github.com/huggingface/optimum-graphcore.git@v0.7.1 "soundfile" "librosa" "evaluate" "jiwer"
%pip install "graphcore-cloud-tools[logger] @ git+https://github.com/graphcore/graphcore-cloud-tools"
%load_ext graphcore_cloud_tools.notebook_logging.gc_logger
import os

# Number of IPUs available in this environment and a cache directory for compiled IPU executables
n_ipu = int(os.getenv("NUM_AVAILABLE_IPU", 4))
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/") + "/whisper"
# Generic imports
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import evaluate
import numpy as np
import torch
from datasets import load_dataset, Audio, Dataset, DatasetDict

# IPU-specific imports
from optimum.graphcore import (
    IPUConfig,
    IPUSeq2SeqTrainer,
    IPUSeq2SeqTrainingArguments,
)
from optimum.graphcore.models.whisper import WhisperProcessorTorch

# HF-related imports
from transformers import WhisperForConditionalGeneration

Load dataset

The dataset we use here is the Catalan subset of OpenSLR (SLR69), a crowdsourced corpus of Catalan speakers recorded reading short sentences. 🤗 Datasets enables us to easily download the data and prepare the training and evaluation splits.

OpenSLR provides only a single train split, so we carve out our own evaluation set with train_test_split. The dataset is not gated, so no 🤗 Hub authentication is needed (hence token=False in the call to load_dataset).

dataset = DatasetDict()
split_dataset = Dataset.train_test_split(
    load_dataset("openslr", "SLR69", split="train", token=False), test_size=0.2, seed=0
)
dataset["train"] = split_dataset["train"]
dataset["eval"] = split_dataset["test"]
print(dataset)

The columns of interest are:

  • audio: the raw audio samples
  • sentence: the corresponding ground truth transcription.

We drop the path column.

dataset = dataset.remove_columns(["path"])

Since Whisper was pre-trained on audio sampled at 16 kHz, we must ensure the OpenSLR samples are resampled to the same rate.

dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
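The resampling happens lazily when an example is accessed. As a quick, optional sanity check, you can decode one example and confirm the sampling rate and shape of the audio array:

# Indexing into the dataset decodes (and resamples) the audio on the fly
sample = dataset["train"][0]["audio"]
print(sample["sampling_rate"])  # 16000
print(sample["array"].shape)    # 1-D NumPy array of audio samples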

Prepare dataset

We prepare the datasets by extracting features from the raw audio inputs and encoding the ground-truth transcriptions, after some basic processing, into label token IDs.

The feature extraction is provided by 🤗 Transformers WhisperFeatureExtractor. To decode generated tokens into text after running the model, we will similarly require a tokenizer, WhisperTokenizer. Both of these are wrapped by an instance of WhisperProcessor.

MODEL_NAME = "openai/whisper-small"
LANGUAGE = "catalan"  # the OpenSLR SLR69 recordings are in Catalan
TASK = "transcribe"
MAX_LENGTH = 224

processor = WhisperProcessorTorch.from_pretrained(MODEL_NAME, language=LANGUAGE, task=TASK)
processor.tokenizer.pad_token = processor.tokenizer.eos_token
processor.tokenizer.max_length = MAX_LENGTH
processor.tokenizer.set_prefix_tokens(language=LANGUAGE, task=TASK)

def prepare_dataset(batch, processor):
    # Compute log-Mel input features from the raw audio array
    inputs = processor.feature_extractor(
        raw_speech=batch["audio"]["array"],
        sampling_rate=batch["audio"]["sampling_rate"],
    )
    batch["input_features"] = inputs.input_features[0].astype(np.float16)

    # Encode the ground-truth transcription into label token IDs
    transcription = batch["sentence"]
    batch["labels"] = processor.tokenizer(text=transcription).input_ids
    return batch

columns_to_remove = dataset.column_names["train"]
dataset = dataset.map(
    lambda elem: prepare_dataset(elem, processor),
    remove_columns=columns_to_remove,
    num_proc=1,
)

train_dataset = dataset["train"]
eval_dataset = dataset["eval"]
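If you want to inspect a prepared example, note that 🤗 Datasets stores the mapped features as nested lists, so an optional check looks like this:

example = train_dataset[0]
print(len(example["input_features"]))     # 80 log-Mel bins
print(len(example["input_features"][0]))  # 3000 frames (30 seconds of audio)
print(len(example["labels"]))             # number of label token IDs for this sentence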

Lastly, we pre-process the labels by padding them with values that will be ignored during fine-tuning. This padding is to ensure tensors of static shape are provided to the model. We do this on the fly via the data collator below.

@dataclass
class DataCollatorSpeechSeq2SeqWithLabelProcessing:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        batch = {}
        batch["input_features"] = torch.tensor([feature["input_features"] for feature in features])

        # Pad the label sequences to a multiple of MAX_LENGTH, then replace padding positions
        # with -100 so they are ignored by the loss
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(
            label_features, return_tensors="pt", padding="longest", pad_to_multiple_of=MAX_LENGTH
        )
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch
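To see the collator in action, you can batch a couple of prepared examples by hand; the label padding positions are set to -100, which the loss ignores (a quick, optional check):

collator = DataCollatorSpeechSeq2SeqWithLabelProcessing(processor)
batch = collator([train_dataset[0], train_dataset[1]])
print(batch["input_features"].shape)    # torch.Size([2, 80, 3000])
print(batch["labels"].shape)            # labels padded to a multiple of MAX_LENGTH, e.g. torch.Size([2, 224])
print((batch["labels"] == -100).sum())  # number of padded label positions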

Define metrics

The performance of our fine-tuned model will be evaluated using the word error rate (WER): the number of word-level substitutions, insertions, and deletions divided by the number of words in the reference transcription.

metric = evaluate.load("wer")


def compute_metrics(pred, tokenizer):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # Replace -100 with the pad_token_id so the sequences can be decoded
    pred_ids = np.where(pred_ids != -100, pred_ids, tokenizer.pad_token_id)
    label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # Also compute WER on normalised text (lower-cased, punctuation stripped)
    normalized_pred_str = [tokenizer._normalize(pred).strip() for pred in pred_str]
    normalized_label_str = [tokenizer._normalize(label).strip() for label in label_str]

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    normalized_wer = 100 * metric.compute(predictions=normalized_pred_str, references=normalized_label_str)

    return {"wer": wer, "normalized_wer": normalized_wer}

Load pre-trained model

model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)
model.config.max_length = MAX_LENGTH
model.generation_config.max_length = MAX_LENGTH

Ensure language-appropriate tokens, if any, are set for generation. We set them on both the config and the generation_config to ensure they are used correctly during generation.

model.config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(
language=LANGUAGE, task=TASK
)
model.config.suppress_tokens = []
model.generation_config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(
language=LANGUAGE, task=TASK
)
model.generation_config.suppress_tokens = []
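If you are curious what these prefix tokens look like, get_decoder_prompt_ids returns a list of (position, token_id) pairs covering the language, task, and no-timestamps tokens; the exact IDs depend on the tokenizer:

# Expect something like [(1, <language token>), (2, <transcribe token>), (3, <no-timestamps token>)]
print(processor.tokenizer.get_decoder_prompt_ids(language=LANGUAGE, task=TASK))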

Fine-tuning Whisper on the IPU

The model can be directly fine-tuned on the IPU using the IPUSeq2SeqTrainer class.

The IPUConfig object specifies how the model will be pipelined across the IPUs.

For fine-tuning, we place the encoder on two IPUs and the decoder on two IPUs. whisper-small has 12 encoder layers and 12 decoder layers, so the "layers_per_ipu": [5, 7, 5, 7] setting below splits the encoder 5/7 across the first two IPUs and the decoder 5/7 across the remaining two.

For inference, the encoder is placed on one IPU and the decoder on a different IPU, as specified by "inference_layers_per_ipu": [12, 12].

replication_factor = n_ipu // 4
ipu_config = IPUConfig.from_dict(
    {
        "optimizer_state_offchip": True,
        "recompute_checkpoint_every_layer": True,
        "enable_half_partials": True,
        "executable_cache_dir": executable_cache_dir,
        "gradient_accumulation_steps": 16,
        "replication_factor": replication_factor,
        "layers_per_ipu": [5, 7, 5, 7],
        "matmul_proportion": [0.2, 0.2, 0.6, 0.6],
        "projection_serialization_factor": 5,
        "inference_replication_factor": 1,
        "inference_layers_per_ipu": [12, 12],
        "inference_parallelize_kwargs": {
            "use_cache": True,
            "use_encoder_output_buffer": True,
            "on_device_generation_steps": 16,
        },
    }
)

Lastly, we specify the arguments controlling the training process. Note that the total number of steps is divided by the replication factor and the learning rate is multiplied by it, so that the amount of data seen during fine-tuning stays roughly the same regardless of how many replicas are used.

total_steps = 1000 // replication_factor
training_args = IPUSeq2SeqTrainingArguments(
    output_dir="./whisper-small-ipu-checkpoints",
    do_train=True,
    do_eval=True,
    predict_with_generate=True,
    learning_rate=1e-5 * replication_factor,
    warmup_steps=total_steps // 4,
    evaluation_strategy="steps",
    eval_steps=total_steps,
    max_steps=total_steps,
    save_strategy="steps",
    save_steps=total_steps,
    logging_steps=25,
    dataloader_num_workers=16,
    dataloader_drop_last=True,
)

Then, we just need to pass all of this together with our datasets to the IPUSeq2SeqTrainer class:

trainer = IPUSeq2SeqTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorSpeechSeq2SeqWithLabelProcessing(processor),
    compute_metrics=lambda x: compute_metrics(x, processor.tokenizer),
    tokenizer=processor.feature_extractor,
)

To gauge the improvement in WER, we run an evaluation step before fine-tuning.

trainer.evaluate()

All that remains is to fine-tune the model! The fine-tuning process should take between 6 and 18 minutes, depending on how many replicas are used, and achieve a final WER of around 10%.

trainer.train()
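Once fine-tuning completes, you may want to keep the resulting weights and the processor for later use; checkpoints are also written to output_dir during training. A minimal sketch, assuming the standard 🤗 Trainer-style save_model API that IPUSeq2SeqTrainer mirrors (the directory name is just an example):

save_dir = "./whisper-small-finetuned"  # example path
trainer.save_model(save_dir)            # save the fine-tuned model weights and config
processor.save_pretrained(save_dir)     # save the feature extractor and tokenizer alongside them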

Fine-tuning on IPUs in non-Paperspace environments

To run the Whisper Small fine-tuning demo using IPU hardware other than in a Paperspace Gradient Notebook, you need to have the Poplar SDK enabled.

Refer to the Getting Started guide for your system for details on how to enable the Poplar SDK. Also refer to the Jupyter Quick Start guide for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

Conclusion

In this notebook, we demonstrated how to fine-tune Whisper for multi-lingual speech recognition and transcription on the IPU.

We used a single replica on a total of four IPUs. To reduce the fine-tuning time, more than one replica, and hence more IPUs, would be required. On Paperspace, you can use either an IPU Pod16 or a Bow Pod16, both with 16 IPUs. Please contact Graphcore if you need assistance running on larger platforms.

For all available notebooks, check IPU-powered Jupyter Notebooks to see how IPUs perform on other tasks.

Have a question? Please contact us on our Graphcore community channel.

Graphcore
We invented the Intelligence Processing Unit (IPU) designed specifically for AI compute to let innovators create the next advances in machine intelligence.