Whisper Goes to Wall Street: Scaling Speech-to-Text with Ray on Vertex AI — Part I

Ivan Nardini
Google Cloud - Community
10 min read · Jun 25, 2024

Introduction

Transcribing audio recordings is one of the most common (and exciting) language processing tasks. Despite well-known limitations, Whisper is still one of the most attractive models when you want to build speech recognition and translation applications.

While Whisper exhibits exceptional performance in transcribing and translating high-resource languages, its accuracy drops for low-resource languages, which typically have limited training data, and for specialized domains. For instance, transcribing customer interactions in a bank could be challenging for Whisper due to the diversity of spoken languages and the domain-specific nature of the conversations. To improve Whisper's performance in these scenarios, you can fine-tune the model, even on a limited amount of data.

This article is Part I of a blog series about tuning and serving Whisper with Ray on Vertex AI. It shows how Ray on Vertex AI enables scaling the training process for transformer models like Whisper. By the end of this article, you'll have a deeper understanding of the distributed process for fine-tuning a foundation model with Ray on Vertex AI. The article builds on the Hugging Face Audio course and the Whisper Fine-Tuning Event content. Basic knowledge of the HuggingFace ecosystem, including Transformers and its DeepSpeed integration, is assumed. Also, if you're not familiar with Ray on Vertex AI, check out this Medium article list for a quick introduction.

Distributed training with HuggingFace, Deepspeed and Ray on Vertex AI

Efficient training of large transformer models like Whisper may require a significant amount of GPU memory. For example, whisper-small, a model with 244M parameters, can require up to about 36 GB of memory. This memory is mainly consumed by the model itself and by the training stages, which include gradient calculation (gradients), the backward pass (activations and their gradients) and the optimizer step (for example, the Adam optimizer stores not only the current gradients but also running averages of past gradients and of their squares). Depending on your GPU, this may cause the CUDA 'out-of-memory' error.
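To build intuition for where that memory goes, here is a minimal back-of-the-envelope sketch (an illustration, not an exact accounting) based on the common rule of thumb of roughly 16 bytes per parameter for mixed-precision Adam training: fp16 weights and gradients plus fp32 master weights and the two Adam moments. Activations and temporary buffers, which often dominate the peak, are excluded.

# Back-of-the-envelope sketch: memory for model and optimizer states under
# mixed-precision Adam training, using the ~16-bytes-per-parameter rule of
# thumb (fp16 weights + fp16 gradients + fp32 master weights + fp32 momentum
# + fp32 variance). Activations and temporary buffers are NOT included.
def estimate_state_memory_gb(num_params: int, bytes_per_param: int = 16) -> float:
    return num_params * bytes_per_param / 1024**3

whisper_small_params = 244_000_000  # ~244M parameters
print(f"Model + optimizer states: ~{estimate_state_memory_gb(whisper_small_params):.1f} GB")
# ~3.6 GB for the states alone; activations for long audio inputs and larger
# batch sizes push the peak memory requirement far higher.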

Addressing this potential issue may require parallelism, with different techniques such as data parallelism, tensor parallelism and pipeline parallelism. Data parallelism is a divide-and-conquer technique that splits the training data into multiple batches. Each batch is processed independently by a different worker (for example, a GPU) on which the model parameters are replicated. Each worker calculates gradients on its assigned batch of data, and these gradients are averaged across all workers to update the model parameters. One implementation of data parallelism is DistributedDataParallel (DDP), a PyTorch module designed to distribute model training, sketched below.
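For intuition, here is a minimal, self-contained DDP sketch (a toy linear model on random data, not the Whisper training code): each process holds a full replica of the model, trains on its own batch, and DDP averages the gradients across processes during the backward pass. You would typically launch it with torchrun.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every worker process
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

    # Toy model; each worker holds a full replica of the parameters
    model = torch.nn.Linear(128, 2).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if torch.cuda.is_available() else None)
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)

    # Each worker trains on its own (here random) batch of data;
    # DDP averages the gradients across workers during backward()
    for _ in range(10):
        inputs = torch.randn(32, 128, device=device)
        targets = torch.randint(0, 2, (32,), device=device)
        loss = torch.nn.functional.cross_entropy(ddp_model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc-per-node=<num_gpus> ddp_sketch.py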

While the data parallelism technique is simple to implement and scales well with the number of workers, it may not solve the ‘out-of-memory’ issue with the Whisper model.

DeepSpeed is an open-source deep learning optimization library for PyTorch that enables training massive models, even with only one GPU. DeepSpeed leverages ZeRO (Zero Redundancy Optimizer), which eliminates the memory redundancy of data parallelism. ZeRO can partition optimizer states (ZeRO Stage 1), gradients (ZeRO Stage 2) and model parameters (ZeRO Stage 3) into smaller fractions that are distributed across multiple GPUs, or even offloaded to CPUs or disk storage (Optimizer and Param Offload). This frees up a significant amount of GPU memory, letting you train much larger models. HuggingFace provides the Transformers library, which integrates PyTorch DistributedDataParallel (DDP) and DeepSpeed to distribute the training of Transformers models.
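To make the ZeRO stages concrete, here is a hypothetical DeepSpeed configuration sketch (expressed as a Python dict) that enables ZeRO Stage 2 with optimizer-state offload to CPU. The "auto" values follow the convention used by the HuggingFace Transformers integration to inherit batch size, precision and gradient settings from the TrainingArguments; your actual configuration may differ.

# Hypothetical DeepSpeed configuration: ZeRO Stage 2 partitions optimizer
# states and gradients across workers and offloads optimizer states to CPU.
# Switching "stage" to 3 would also partition the model parameters.
DEEPSPEED_CONFIG = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    # "auto" lets the HuggingFace Transformers integration fill these in
    # from the (Seq2Seq)TrainingArguments at runtime.
    "fp16": {"enabled": "auto"},
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}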

While HuggingFace with PyTorch DDP and DeepSpeed makes distributed training efficient, these libraries don't have built-in mechanisms for launching and managing the training processes across workers. This is particularly relevant for resource management and fault tolerance. In this case, you need a distributed processing framework that lets you easily deploy and scale your training across a large number of workers and that handles node failures, ensuring the continuity of your training process even on large clusters.

Ray is an open-source, scalable, distributed computing framework for Python designed to make it easy to build and run distributed ML applications. On top of PyTorch DistributedDataParallel (DDP) and DeepSpeed, Ray serves as a powerful orchestration layer, simplifying scaling across clusters and providing fault tolerance. With Ray on Vertex AI, you can focus on developing and scaling your AI workloads without managing the required infrastructure. Below is a representation of the relationship between HuggingFace, DeepSpeed and Ray on Vertex AI.

Figure 1 — Behind the scenes — Image from author

Now that you know how distributed training with HuggingFace, DeepSpeed and Ray works, let's see how to tune Whisper using Ray on Vertex AI.

Tune Whisper using HuggingFace, DeepSpeed and Ray on Vertex AI

To tune Whisper using Ray on Vertex AI, start with creating a Ray cluster. Ray on Vertex AI lets you deploy a customized Ray cluster on Vertex AI using a custom container image.

Having a custom Ray cluster lets you add Python dependencies that aren’t included in the prebuilt container images.

To create a customized Ray cluster, you can build a Docker container image using Cloud Build, with the Ray on Vertex AI prebuilt images as the base image, and then register the image in Artifact Registry. Below are the requirements file, the Dockerfile and the gcloud commands you'd use in this scenario.

# Requirement file
./requirements.txt
--extra-index-url https://download.pytorch.org/whl/cu118
ipython==8.22.2
torch==2.2.1
torchaudio==2.2.1
ray==2.10.0
ray[data]==2.10.0
ray[train]==2.10.0
datasets==2.17.0
transformers==4.39.0
evaluate==0.4.1
jiwer==3.0.0
accelerate==0.28.0
deepspeed==0.14.0
soundfile==0.12.1
librosa==0.10.0
pyarrow==15.0.2
fsspec==2023.10.0
gcsfs==2023.10.0
etils==1.7.0

# Dockerfile file
./Dockerfile
FROM us-docker.pkg.dev/vertex-ai/training/ray-gpu.2-9.py310:latest
ENV PIP_ROOT_USER_ACTION=ignore
COPY requirements.txt .
RUN pip install -r requirements.txt

# Create a Docker image repository
! gcloud artifacts repositories create your-repo --repository-format=docker --location='your-region' --description="Tutorial repository"

# Build the image
! gcloud builds submit --region='your-region' --tag=your-region-docker.pkg.dev/your-project/your-repo/train --machine-type=your-build-machine --timeout=3600 ./

After the build completes, the resulting image is available in Artifact Registry.

Figure 2 — Cluster image in Artifact Registry — Image from author

Once you have the customized Ray cluster image, create the associated Ray cluster using the Vertex AI SDK for Python. Assuming that the entire cluster shares the same image, you can set the custom container using NodeImages as shown below.

import vertex_ray
from vertex_ray import NodeImages, Resources

# Assumes the Vertex AI SDK has already been initialized, for example with
# aiplatform.init(project='your-project', location='your-region').
CLUSTER_NAME = 'your-cluster-name'

# Head node configuration
HEAD_NODE_TYPE = Resources(
    machine_type='your-head-machine-type',
    node_count=1
)

# Worker node configuration (three GPU workers)
WORKER_NODE_TYPES = [
    Resources(
        machine_type='your-worker-machine-type',
        node_count=3,
        accelerator_type='your-accelerator-type',
        accelerator_count=1
    )
]

# Use the same custom container image for the head and worker nodes
CUSTOM_IMAGES = NodeImages(
    head='your-region-docker.pkg.dev/your-project/your-repo/train',
    worker='your-region-docker.pkg.dev/your-project/your-repo/train',
)

# Create the Ray cluster
ray_cluster_name = vertex_ray.create_ray_cluster(
    head_node_type=HEAD_NODE_TYPE,
    worker_node_types=WORKER_NODE_TYPES,
    custom_images=CUSTOM_IMAGES,
    cluster_name=CLUSTER_NAME,
)


# [Ray on Vertex AI]: Cluster State = State.PROVISIONING
# Waiting for cluster provisioning; attempt 1; sleeping for 0:02:30 seconds
# ...
# [Ray on Vertex AI]: Cluster State = State.RUNNING

After the Ray cluster is created, you can access it in the Google Cloud console.

Figure 3 — Ray Cluster in Ray on Vertex AI UI — Image from author

With the Ray cluster up and running, you can focus on developing the Whisper distributed training application using HuggingFace, DeepSpeed and Ray Train. Start by defining a training function that wraps the HuggingFace Transformers code for the Whisper model, as you can see in the pseudocode below.

# Training libraries
import config as constants
import os
import torch
from huggingface_hub import login
from datasets import Audio, load_dataset, concatenate_datasets
from transformers import set_seed
from transformers import WhisperTokenizer, WhisperProcessor, WhisperForConditionalGeneration
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
import evaluate

# Ray libraries
import ray.train.huggingface.transformers as ray_transformers

# Define the training function
def train_func(config):

    # Read dataset
    dataset = load_dataset(...)

    # Preprocess dataset
    ...
    train_dataset = dataset["train"]
    eval_dataset = dataset["test"]

    # Setup training
    training_args = Seq2SeqTrainingArguments(
        constants.MODEL_ID,
        ...
        deepspeed=constants.DEEPSPEED_CONFIG
    )

    trainer = Seq2SeqTrainer(...)

    # Report metrics and checkpoints to Ray Train
    callback = ray_transformers.RayTrainReportCallback()
    trainer.add_callback(callback)

    # Prepare the HuggingFace trainer for distributed execution with Ray
    trainer = ray_transformers.prepare_trainer(trainer)
    trainer.train()

HuggingFace's Whisper tuning code leverages the integration with DeepSpeed; you can find the DeepSpeed configuration in this GitHub repo. Ray Train requires adding some logic to hand data, checkpoints and metrics over to Ray, via the RayTrainReportCallback and prepare_trainer methods.

Once you have the training function, use Ray Train to define how to distribute your tuning job. The following is pseudocode for distributing Whisper tuning with Ray Train for an ASR application.

import ray
import ray.train.huggingface.transformers
from ray.train import ScalingConfig, RunConfig, CheckpointConfig
from ray.train.torch import TorchTrainer

def main():

    # Initialize the Ray session
    ray.init()

    # Training configuration
    train_loop_config = {...}
    scaling_config = ScalingConfig(...)
    run_config = RunConfig(checkpoint_config=CheckpointConfig(...))

    # Data-parallel PyTorch trainer that runs train_func on every worker
    trainer = TorchTrainer(
        train_loop_per_worker=train_func,
        train_loop_config=train_loop_config,
        run_config=run_config,
        scaling_config=scaling_config
    )
    result = trainer.fit()
    ray.shutdown()

After initiating the Ray session, define the train_loop_config, scaling_config and run_config parameters to specify the tuning parameters, the desired number of workers and GPUs, and the checkpointing and synchronization behavior. Then, initialize the TorchTrainer for data-parallel PyTorch training. The trainer launches multiple workers (scaling_config), sets up the execution of the training run (run_config) and, finally, runs the training function (train_loop_per_worker) with its configuration (train_loop_config) across all workers.
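For illustration, a hypothetical set of configurations for the three-GPU-worker cluster created earlier might look like the sketch below; the storage path, metric name and hyperparameters are placeholders, not the values used in this series.

from ray.train import ScalingConfig, RunConfig, CheckpointConfig

# Hypothetical tuning parameters handed to train_func(config) on each worker
train_loop_config = {
    "batch_size": 16,
    "learning_rate": 1e-5,
    "num_epochs": 3,
}

# Three workers, one GPU each, matching the Ray cluster created above
scaling_config = ScalingConfig(num_workers=3, use_gpu=True)

# Persist results to Cloud Storage and keep only the best checkpoints by WER
run_config = RunConfig(
    name="train",
    storage_path="gs://your-bucket-name/experiments",
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="eval_wer",
        checkpoint_score_order="min",
    ),
)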

At this point, you can launch the distributed tuning job on Ray on Vertex AI, which supports several ways to run a Ray application on a cluster. The following is pseudocode for submitting the Whisper tuning job using the Ray Jobs API via the cluster's dashboard address.

from ray.job_submission import JobSubmissionClient

# Initiate the Ray Jobs client, pointing at the cluster's dashboard address
client = JobSubmissionClient(
    address="vertex_ray://{}".format(ray_cluster.dashboard_address)
)

# Define the training entrypoint and runtime environment
train_entrypoint = f"python3 trainer.py --experiment-name=your-experiment-name --num-workers=3 --use-gpu"
train_runtime_env = {
    "working_dir": "./train_script_folder",
    "env_vars": {
        "HF_TOKEN": HF_TOKEN,
        "TORCH_NCCL_ASYNC_ERROR_HANDLING": "3"},
}

# Submit the job (submit_and_monitor_job is a small custom helper; see the sketch below)
job_status = submit_and_monitor_job(
    client=client,
    submission_id=train_submission_id,
    entrypoint=train_entrypoint,
    runtime_env=train_runtime_env
)
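Note that submit_and_monitor_job is not part of the Ray API: it's a small custom helper. A minimal sketch of what such a helper could look like, built on the Ray Jobs JobSubmissionClient, is shown below.

import time
from ray.job_submission import JobSubmissionClient, JobStatus

# Hypothetical helper: submit a Ray job and poll its status until it reaches
# a terminal state, then print the job logs.
def submit_and_monitor_job(client: JobSubmissionClient, submission_id: str,
                           entrypoint: str, runtime_env: dict,
                           poll_secs: int = 30) -> JobStatus:
    client.submit_job(
        submission_id=submission_id,
        entrypoint=entrypoint,
        runtime_env=runtime_env,
    )
    while True:
        status = client.get_job_status(submission_id)
        print(f"Job {submission_id}: {status}")
        if status in {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}:
            break
        time.sleep(poll_secs)
    print(client.get_job_logs(submission_id))
    return status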

After submitting the job, you can monitor the distributed training using the Ray OSS Dashboard on Ray on Vertex AI.

Figure 4 — Ray Dashboard — Image from author

You can also monitor the status of the tuning (loss, metrics, iterations and more) through the integration with Vertex AI TensorBoard. While the tuning process is running, you can load the TensorBoard logs stored in Cloud Storage into a Vertex AI TensorBoard instance using the following code:

from google.cloud import aiplatform as vertex_ai

# Upload the TensorBoard logs stored in Cloud Storage to a Vertex AI TensorBoard instance
vertex_ai.upload_tb_log(
    tensorboard_id=tensorboard.name,
    tensorboard_experiment_name="your-tb-experiment",
    logdir="gs://your-bucket-name/your-log-dir")

The following dashboard shows the tuning process.

Figure 5 — Vertex AI TensorBoard — Image from author

After the distributed tuning ends, use the `ExperimentAnalysis` API to retrieve the URI of the best checkpoint according to the relevant metric and mode. In this case, you want the checkpoint that minimizes the Word Error Rate (WER) metric.

from ray.tune import ExperimentAnalysis

# Load the experiment results persisted in Cloud Storage
experiment_analysis = ExperimentAnalysis('gs://your-bucket-name/experiments/train')

# Find the trial and checkpoint that minimize the evaluation WER
best_trial = experiment_analysis.get_best_trial(metric="eval_wer", mode="min")
model_checkpoint = experiment_analysis.get_best_checkpoint(best_trial, metric="eval_wer", mode="min")

With the best checkpoint, you can test the tuned model and validate the tuning process against a reference evaluation dataset of expected transcriptions. The following is an example of such an evaluation.

import evaluate

# Load the WER metric
wer = evaluate.load("wer")

# Compute the metric for the base and the tuned model
whisper_wer_metric = wer.compute(predictions=[transcriptions['whisper']],
                                 references=[transcriptions['reference']])

tuned_whisper_wer_metric = wer.compute(predictions=[transcriptions['tuned_whisper']],
                                       references=[transcriptions['reference']])

# Calculate the relative WER difference (negative means the tuned model improves on the base model)
error_difference = ((tuned_whisper_wer_metric - whisper_wer_metric) / whisper_wer_metric) * 100
print("Base Whisper vs Tuned Whisper - WER differences")
print(f"WER difference: {error_difference:.3f}%")

# WER difference: -31.345%

Conclusions

This article presents a compelling challenge: how to tune Whisper to better transcribe banking user interactions. As of today, Whisper remains one of the most attractive models for this kind of task, but adapting it to your application requires extensive computing resources.

This article showed how to speed up Whisper tuning using HuggingFace, DeepSpeed and Ray on Vertex AI. By now, you should have a better understanding of how HuggingFace, DeepSpeed and Ray form a powerful stack for training language models, while Vertex AI provides scalable, managed infrastructure for developing and deploying your AI applications at scale.

If you’re interested in exploring Ray on Vertex AI, I highly recommend checking out the Vertex AI documentation. Additionally, I encourage you to check out the following Medium blog series on this topic!

Scale AI on Ray on Vertex AI Series

This article is part of the Scale AI on Ray on Vertex AI series where you learn more about how to scale your AI and Python applications using Ray on Vertex.

And, follow me, as more exciting content is coming your way!

  1. Scale AI on Ray on Vertex AI: Let’s get it started
  2. Is it Pop or Rock? Classify songs with Hugging Face 🤗 and Ray on Vertex AI

Thanks Ann Farmer for feedback and suggestions!

Thanks for reading

I hope you enjoyed the article. If so, please clap or leave your comments. Also let’s connect on LinkedIn or X to share feedback and questions 🤗
