Deploying “Whisper” from the HF Model Hub onto Azure ML

Ozgur Guler
4 min read · Mar 25, 2024

Whisper is a transformer-based encoder-decoder model with a “deceptively simple” architecture, comprising a small stem of convolutional layers over the log-Mel spectrogram input followed by a symmetric transformer encoder/decoder stack.

Whisper model architecture (source: OpenAI, “Robust Speech Recognition via Large-Scale Weak Supervision”)

Unlike standard ASR (automatic speech recognition) models, Whisper employs a unique generative inference procedure: the decoder conditions on the audio embeddings of 30-second voice segments as well as on the text generated for previous segments. Leveraging transformer attention mechanisms, Whisper’s contextual understanding is superior to that of off-the-shelf ASR models, and, given the contextual richness and diversity of its training data, it performs better under challenging audio conditions, e.g. different accents, dialects and background noise.
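
To make this inference flow concrete, here is a minimal local sketch using the open-source checkpoint and the Hugging Face transformers library (the checkpoint name “openai/whisper-large-v2” and the audio path “sample.wav” are illustrative placeholders):

import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load an open-source Whisper checkpoint (placeholder choice).
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Whisper expects 16 kHz audio; the processor pads/trims it to a 30-second
# window and converts it to log-Mel spectrogram features.
audio, _ = librosa.load("sample.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# The decoder generates text tokens autoregressively, conditioned on the
# encoder's audio embeddings.
forced_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")
predicted_ids = model.generate(inputs.input_features, forced_decoder_ids=forced_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])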

Whisper deployment options on Azure

Currently, there are three ways to deploy Whisper on Azure:

  • Whisper on Azure OpenAI
  • Whisper on Azure Speech
  • Whisper on Azure ML: the open-source Whisper model deployed from the Hugging Face model hub onto Azure ML

Whisper on Azure OpenAI is usually for simpler use cases: files up to 25 MB are supported, and the service is currently limited by a 3K tokens-per-region quota.
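
For reference, a call against an Azure OpenAI Whisper deployment looks roughly like the sketch below (assuming the openai Python SDK v1; the deployment name “whisper”, the endpoint, key and file path are all placeholders):

from openai import AzureOpenAI

# Placeholders: endpoint, key, API version and deployment name.
client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR-KEY",
    api_version="2024-02-01",
)
with open("sample.wav", "rb") as f:
    result = client.audio.transcriptions.create(model="whisper", file=f)
print(result.text)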

Using Whisper with Azure Speech adds capabilities such as diarization (keeping track of individual speakers), word-level timestamps (important for syncing captions with audio/video), a batch inference API (files up to 1 GB) and an upcoming fine-tuning capability.
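
As a sketch, submitting a batch transcription job with these capabilities enabled could look like the following (assuming the Speech-to-text REST API v3.1; the region, key and audio URL are placeholders, and the payload fields should be checked against the current API reference):

import requests

# Placeholders: region, subscription key and audio URL.
region, key = "eastus", "YOUR-SPEECH-KEY"
resp = requests.post(
    f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions",
    headers={"Ocp-Apim-Subscription-Key": key},
    json={
        "displayName": "whisper-batch-job",
        "locale": "en-US",
        "contentUrls": ["https://example.com/audio.wav"],
        "properties": {
            "diarizationEnabled": True,
            "wordLevelTimestampsEnabled": True,
        },
    },
)
print(resp.json()["self"])  # poll this URL for the job status and results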

What if you don’t want to be bound by regional service quotas, and would rather manage your own infrastructure for more deterministic performance and latency on Azure? With a Provisioned Throughput option for Whisper not yet on the horizon, you can deploy the open-source Whisper model from the Hugging Face model hub on Azure ML. Below is a step-by-step description of how to deploy the Whisper model on Azure ML.

Azure ML Model Catalog

Go to ml.azure.com and filter for the Whisper models in the model catalog.

Next, choose one of the models and deploy it as a “real-time endpoint”.
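
If you prefer to script this step rather than use the studio UI, a rough sketch with the azure-ai-ml (v2) SDK is below (the endpoint name, instance type and registry model URI are assumptions; copy the exact asset ID from the model card in the catalog):

from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

# Placeholders: subscription, resource group and workspace names.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="YOUR-SUB-ID",
    resource_group_name="YOUR-RG",
    workspace_name="YOUR-WORKSPACE",
)

# Create a managed online endpoint with key-based auth.
endpoint = ManagedOnlineEndpoint(name="whisper-ep", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Deploy a Whisper model from the Azure ML registry onto the endpoint.
deployment = ManagedOnlineDeployment(
    name="openai-whisper-large",
    endpoint_name="whisper-ep",
    # Assumed catalog URI; verify against the model's asset ID.
    model="azureml://registries/azureml/models/openai-whisper-large/labels/latest",
    instance_type="Standard_NC6s_v3",  # a GPU SKU; pick one you have quota for
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()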

Once you select your compute cluster, the model is deployed behind a standard REST API. You can find the endpoint URL and the API key under the “Consume” tab in the endpoint settings. Your endpoint is now ready to accept inference requests.
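
The same details can also be fetched programmatically (a sketch, reusing the ml_client and the endpoint name assumed above):

# Programmatic equivalent of the "Consume" tab: scoring URI and auth key.
ep = ml_client.online_endpoints.get(name="whisper-ep")
keys = ml_client.online_endpoints.get_keys(name="whisper-ep")
print(ep.scoring_uri, keys.primary_key)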

Below is sample inference code you can use to test the endpoint.

import urllib.request
import json
import os
import ssl

def allowSelfSignedHttps(allowed):
    # Bypass server certificate verification on the client side
    if allowed and not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None):
        ssl._create_default_https_context = ssl._create_unverified_context

allowSelfSignedHttps(True)  # this line is needed if you use a self-signed certificate in your scoring service

# Request data goes here
# The example below assumes JSON formatting, which may need updating
# depending on the format your endpoint expects.
# More information can be found here:
# https://docs.microsoft.com/azure/machine-learning/how-to-deploy-advanced-entry-script
data = {
    "input_data": {
        "audio": ["https://www2.cs.uic.edu/~i101/SoundFiles/gettysburg.wav", "https://www2.cs.uic.edu/~i101/SoundFiles/preamble.wav"],
        "language": ["en", "en"]
    }
}

body = str.encode(json.dumps(data))

url = 'https://XXX/score'
# Replace this with the primary/secondary key or AMLToken for the endpoint
api_key = 'XXX'
if not api_key:
    raise Exception("A key should be provided to invoke the endpoint")

# The azureml-model-deployment header forces the request to go to a specific deployment.
# Remove this header to have the request follow the endpoint's traffic rules.
headers = {'Content-Type': 'application/json', 'Authorization': ('Bearer ' + api_key), 'azureml-model-deployment': 'openai-whisper-large-15'}

req = urllib.request.Request(url, body, headers)

try:
    response = urllib.request.urlopen(req)
    result = response.read()
    print(result)
except urllib.error.HTTPError as error:
    print("The request failed with status code: " + str(error.code))
    # Print the headers - they include the request ID and the timestamp, which are useful for debugging the failure
    print(error.info())
    print(error.read().decode("utf8", 'ignore'))

Hope this helps, and please reach out if you have any questions about running Whisper on Azure.

References:

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I. “Robust Speech Recognition via Large-Scale Weak Supervision.” OpenAI, 2022. https://arxiv.org/abs/2212.04356
