Whisper Goes Wall Street: Serving Speech-to-Text with Ray Serve and Cloud Run — Part II

Ivan Nardini
Google Cloud - Community
8 min read · Jul 26, 2024

Transcribing audio recordings is one of the most common (and exciting) language processing tasks. Despite well-known limitations, Whisper is still one of the most attractive models when building speech recognition and translation applications.

While Whisper exhibits exceptional performance in transcribing and translating high-resource languages, its accuracy drops for languages with few resources (i.e., documents and transcripts) to train on. To improve Whisper's performance on such languages, you can fine-tune the model on limited data, but doing so requires extensive computing resources to adapt the model to your application. In Part I of this blog series about tuning and serving Whisper with Ray on Vertex AI, you learned how to speed up Whisper tuning using HuggingFace, DeepSpeed and Ray on Vertex AI to improve audio transcription in a banking scenario.

Serving the tuned Whisper model also has significant computational demands (often GPUs or specialized AI accelerators) during inference, the process of generating transcriptions from audio. Running the model efficiently requires those resources to scale with the volume of requests. As of today, Ray on Vertex AI does not support Ray Serve.

This article explores the integration of Ray Serve and Cloud Run for serving a fine-tuned Whisper model on Google Cloud. The primary goal here is to introduce Ray Serve and Cloud Run rather than provide a template for production use cases. By the end of the article, you should have a better understanding of how Ray Serve and Cloud Run can offer a user-friendly interface and an easy-to-use infrastructure for serving models like Whisper at scale.

The article is based on content from the Hugging Face Audio course. It requires basic knowledge of the HuggingFace ecosystem, including Transformers. Also, if you're not familiar with Ray on Vertex AI, check out this Medium article list for an introduction to Ray on Vertex AI.

Ray Serve + Cloud Run = ❤️

If you’ve attempted to deploy a model to production, you may have encountered several challenges. Initially, you consider web frameworks like Flask or FastAPI on virtual machines for easy implementation and rapid deployment. However, achieving high performance at low cost in production environments can be challenging. To optimize performance, you then consider building your own model server with technologies like TensorFlow Serving, TorchServe, Rust, and Go, running on Docker and Kubernetes. Mastering this stack offers portability, reproducibility, scalability, reliability, and control, but its steep learning curve limits accessibility for many teams. Finally, you look at specialized systems like Seldon, BentoML and KServe, designed for serving in production. However, these frameworks may limit flexibility, making development and management complex.

To solve this serving dilemma, you need a model serving framework and infrastructure that seamlessly integrates with your existing Python-based ML workflows, lets you scale in an efficient way, and empowers you with the flexibility to deploy diverse models with complex business logic in production.

Ray Serve is a powerful model serving framework built on top of Ray, a distributed computing platform. With Ray Serve, you can easily scale your model serving infrastructure horizontally, adding or removing replicas based on demand, which ensures optimal performance even under heavy traffic. Ray Serve is designed as a Python-native, framework-agnostic library, which means you can serve diverse models (for example, TensorFlow, PyTorch, scikit-learn) and even custom Python functions within the same application using various deployment strategies. In addition, you can optimize serving performance using stateful actors for managing long-lived computations or caching model outputs, and by batching multiple requests to your models. To learn more about Ray Serve and how it works, check out Ray Serve: Scalable and Programmable Serving.
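To make this concrete, here is a minimal Ray Serve sketch, separate from the Whisper application in this article: a deployment class wrapping a HuggingFace sentiment pipeline, scaled to two replicas. The model and names here are illustrative assumptions, not part of the original setup.

```python
# Minimal Ray Serve sketch (illustrative only; model and names are assumptions).
from ray import serve
from starlette.requests import Request
from transformers import pipeline


@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 1})
class SentimentModel:
    def __init__(self):
        # Any Python model can be loaded here (Transformers, PyTorch, scikit-learn, ...).
        self.model = pipeline("sentiment-analysis")

    async def __call__(self, request: Request) -> dict:
        # Parse the JSON body and run the model on the "text" field.
        text = (await request.json())["text"]
        return self.model(text)[0]


# Bind the deployment into an application and serve it over HTTP.
app = SentimentModel.bind()
serve.run(app)
```

Once running, the deployment answers HTTP POST requests (for example, a JSON body like {"text": "I love this"}) on the Ray Serve HTTP endpoint, and replicas can be added or removed without changing the handler code.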

Cloud Run is a serverless platform you can use for model deployment. With Cloud Run, you focus on your model serving code and simply provide a containerized application; Cloud Run handles scaling and resource allocation automatically. Because of that, Cloud Run enables swift deployment of your model services, accelerating time to market. With its pay-per-use model, you only pay for the resources consumed during request processing, making it an economical choice for many use cases. You can find more information about Cloud Run in the Google Cloud documentation.

Together, Ray Serve and Cloud Run offer a great solution for ML model serving: you get the flexibility and scalability of Ray Serve coupled with the ease of use, rapid deployment and cost-effectiveness of Cloud Run. This combination empowers you to build robust, high-performance model serving systems without the complexity of managing the underlying infrastructure.

Figure 1 — Ray Serve running on Cloud Run — Image from author

Now that you know how to benefit from Ray Serve and Cloud Run, let’s see how you can serve the tuned Whisper model to better transcribe banking user interactions.

Serving tuned Whisper with Ray Serve on Cloud Run

To serve Whisper with Ray Serve on Cloud Run, you can leverage the integration of Gradio with Ray Serve to scale the model in an ASR application. Start by preparing the serving script as shown below.

# Import libraries
from transformers import pipeline
import ray
from ray import serve
from ray.serve.gradio_integrations import GradioServer
import gradio as gr
import time


def gradio_transcriber_builder():

    # Load the tuned Whisper checkpoint as an ASR pipeline
    stt_pipeline = pipeline(model="./checkpoint",
                            task="automatic-speech-recognition",
                            chunk_length_s=30)

    def transcribe(audio, record):

        # Get the audio file (recorded clip or uploaded file)
        if audio:
            filepath = audio
        else:
            filepath = record

        # Run prediction
        prediction = stt_pipeline(
            filepath,
            generate_kwargs={
                "task": "transcribe",
                "language": "en",
            },
            return_timestamps=False,
        )['text']

        return prediction

    return gr.Interface(
        fn=transcribe,
        inputs=[
            gr.Audio(sources="microphone", type="filepath", label="Record"),
            gr.Audio(sources="upload", type="filepath", label="Audio file"),
        ],
        outputs=gr.Textbox(label='Transcription'),
        title='ASR with Tuned Whisper model',
        description='This demo shows how to use the tuned Whisper model to transcribe audio.',
        allow_flagging='never',
        theme=gr.themes.Base(),
    )


# Start Ray and Ray Serve on the container port
ray.shutdown()
ray.init(_node_ip_address="0.0.0.0")
serve.start(
    http_options={"host": "0.0.0.0", "port": 8080}
)

# Wrap the Gradio app builder in a Serve deployment with 8 CPUs
app = GradioServer.options(ray_actor_options={"num_cpus": 8}).bind(
    gradio_transcriber_builder
)

serve.run(app)

# Keep the process alive so the container does not exit
while True:
    time.sleep(5)  # to fix a startup issue

As you can see, you define a gradio_transcriber_builder function, which returns a Gradio application that uses the HuggingFace Transformers pipeline to generate a transcription from either a recorded audio clip or an uploaded audio file. Using the integration of Gradio with Ray Serve, you then bind the Gradio ASR application within a Serve deployment. This deployment acts as a container for the fine-tuned Whisper model: it handles incoming requests and scales out across a Ray cluster, so the model can serve a higher volume of requests. Ray Serve provides a GradioServer class which wraps the Gradio ASR app, lets you serve it as an HTTP server on Ray Serve, and scales it without changing your code. In fact, you can directly define the resources (CPU and/or GPU) available to the application.
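Because GradioServer is itself a regular Serve deployment, you could also scale the same Gradio ASR app across multiple replicas through .options() without touching the Gradio code. The sketch below is illustrative only; the replica count and per-replica CPUs are assumptions you would adapt to your own resources.

```python
# Sketch: scale the Gradio ASR app to multiple replicas (values are illustrative).
app = GradioServer.options(
    num_replicas=2,                     # two copies of the Gradio ASR app
    ray_actor_options={"num_cpus": 4},  # CPUs reserved per replica
).bind(gradio_transcriber_builder)

serve.run(app)
```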

Once you create the Ray Serve application, you can build a Docker container image using Cloud Build and push it to Artifact Registry. Below are the requirements file, Dockerfile and gcloud commands you’d use in this scenario.

# Requirement file
./requirements.txt
torch==2.2.1
ray==2.10.0
ray[serve]==2.10.0
transformers==4.39.0
soundfile==0.12.1
ffmpeg==1.4
gradio==4.19.2

# Dockerfile file
./Dockerfile
FROM rayproject/ray:2.10.0

# Install dependencies.
RUN sudo apt-get update -y && sudo apt-get install -y ffmpeg

# Install serving libraries.
ENV PIP_ROOT_USER_ACTION=ignore
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy files
COPY checkpoint checkpoint
COPY serve_predictor_script.py serve_predictor_script.py

# Expose the port.
EXPOSE 8080

# Run the serving script.
CMD bash -c 'python serve_predictor_script.py'

# Create a Docker image repository
gcloud artifacts repositories create your-repo --repository-format=docker --location='your-region' --description="Tutorial repository"

# Build the image
gcloud builds submit --region='your-region' --tag=your-region-docker.pkg.dev/your-project/your-repo/serve --machine-type=your-build-machine --timeout=3600 ./

The following shows the resulting image in Artifact Registry.

Figure 2 — Serving image in Artifact Registry — Image from author

Once you have your serving image built with Ray Serve, you can deploy the Gradio application to Cloud Run using the gcloud command line as shown below.

gcloud run deploy tuned-whisper-bank-tn-it --image={your-serving-image} \
    --cpu=8 --max-instances=2 --memory=4Gi --port=8080 \
    --region={your-region} --project={your-project} \
    --allow-unauthenticated

It’s important to allocate the right amount of resources to the application because of the Whisper model’s high resource usage. In this case, you allocate 8 vCPUs (virtual CPUs) and 4 GB of memory to each instance, with a maximum of 2 instances. For demonstration purposes, you make the service publicly accessible without requiring users to authenticate (the --allow-unauthenticated flag). Use this with caution, especially if your service handles sensitive data.

After you deploy the Gradio application, you can monitor it from the Cloud Run logs in the console.

Figure 3 — Cloud Run logging — Image from author
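If you prefer to pull the same Cloud Run logs programmatically rather than through the console, a small sketch with the google-cloud-logging client could look like the following. The project ID and service name here are assumptions matching the deploy command above.

```python
# Sketch: read recent Cloud Run logs for the service (names are assumptions).
from google.cloud import logging

client = logging.Client(project="your-project")
log_filter = (
    'resource.type="cloud_run_revision" '
    'AND resource.labels.service_name="tuned-whisper-bank-tn-it"'
)

# Print the 20 most recent log entries for the ASR service.
for entry in client.list_entries(
    filter_=log_filter, order_by=logging.DESCENDING, max_results=20
):
    print(entry.timestamp, entry.payload)
```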

If you visit the application URL, you get an ASR application like the one shown below, ready to transcribe your audio.

Figure 4 — ASR Gradio application — Image from author

Conclusions

This article is part II of a blog series about tuning and serving Whisper with Ray on Vertex AI. This article explored the integration of Ray Serve and Cloud Run for serving a fine-tuned Whisper model on Google Cloud.

It’s important to highlight that serving the model with Ray Serve on Cloud Run is only one possibility, and it should be considered for experimentation only. You may experience temporary interruptions in the model service and some delays while new ASR application instances roll out. Also, at the time of writing, Cloud Run does not support GPUs.

With that, if you’re interested in exploring Ray on Vertex AI, I highly recommend checking out the Vertex AI documentation. Additionally, I encourage you to check out the following Medium blog series on this topic!

Scale AI on Ray on Vertex AI Series

This article is part of the Scale AI on Ray on Vertex AI series, where you learn more about how to scale your AI and Python applications using Ray on Vertex AI.

And, follow me, as more exciting content is coming your way!

Thanks for reading

I hope you enjoyed the article. If so, please clap or leave your comments. Also let’s connect on LinkedIn or X to share feedback and questions 🤗

Thanks Ann Farmer for feedback and suggestions!

