Getting started with Chirp, Google’s Universal Speech Model (USM), on Vertex AI

Ivan Nardini
Google Cloud - Community
6 min read · Jun 27, 2023

written in collaboration with G. Hussain Chinoy

Chirp on Vertex AI — Image from author

Introduction

Foundational models are changing the way we solve real-world challenges, including Automatic Speech Recognition (ASR). On Google Cloud Vertex AI, you can access Chirp, a 2B-parameter speech model that achieves significant accuracy improvements in speech recognition across many languages.

As a developer interested in audio, you might be curious about how Chirp works and how to use it to build GenAI applications. The Chirp programming interface is still under development, so the documentation does not yet include instructions for using it programmatically.

This article shares the three most exciting facts about Chirp and provides a step-by-step guide on getting started with Chirp on Vertex AI using the Cloud Speech-to-Text API (v2), including how to transcribe long audio files (longer than one minute). By the end, you will know everything you need to use Chirp for long audio transcription.

Three most exciting facts about Chirp

Before you get your hands dirty with Chirp, here are three reasons to be excited about it. First, Chirp is built with self-supervised training on millions of hours of audio and 28 billion sentences of text spanning 100+ languages. Second, Chirp is not just a 2B-parameter model; it is trained with a new approach: the encoder is pre-trained on unsupervised audio data from 100+ languages, and the model is then fine-tuned for transcription in each specific language with small amounts of supervised data. Third, Chirp delivers large quality improvements across languages and accents, reaching 98% speech recognition accuracy in English and over 300% relative improvement in several other languages. If you want to learn more about the USM research behind Chirp, see the Google Research paper here. And now, let’s see how to use Chirp for long audio transcription.

Getting started with Chirp using Cloud Speech-to-Text API (v2)

To transcribe long audio using Cloud Speech-to-Text API (v2), you have to cover the following steps:

  1. Upload audio files to a Google Cloud Storage bucket
  2. Create a Chirp Recognizer
  3. Create a Batch transcription request
  4. Submit the transcription request and get the transcriptions from the Cloud Storage bucket

Step 1: Upload audio files to a Google Cloud Storage bucket

To process long audio files, Chirp requires them to be in a Google Cloud Storage bucket. In this tutorial, you use vr.wav, a sample audio file available in the cloud-samples-tests bucket. The snippet below copies it locally so you can inspect it before transcription.

long_audio_path = '<your-local-path>/data/audio.wav'  # local copy for inspection
long_audio_uri = 'gs://cloud-samples-tests/speech/vr.wav'  # sample audio in Cloud Storage

# copy the sample audio locally so you can validate it before transcription
!gsutil cp {long_audio_uri} {long_audio_path}

Make sure your audio is longer than one minute and uses one of the supported audio formats. For example, you can use librosa to validate the audio length, as shown below.

import librosa

# the batch transcription flow in this tutorial targets audio longer than one minute
long_audio_duration = librosa.get_duration(path=long_audio_path)
if long_audio_duration < 60:
    raise Exception(f"The audio is less than 1 min. Actual length: {long_audio_duration}")

Step 2: Create a Chirp Recognizer

Create a Recognizer by issuing a CreateRecognizerRequest. The CreateRecognizerRequest defines a recognizer configuration and some associated metadata, including a name (recognizer_id), the model you want to use, and the language of the audio to process (as a BCP-47 language tag). Below you can see how to create a Recognizer that uses the Chirp model for audio files in English.

from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech

language_code = 'en-US'
recognizer_id = f'chirp-{language_code.lower()}'

# initialize the client with a regional endpoint (Chirp is available in us-central1)
client = SpeechClient(
    client_options=ClientOptions(api_endpoint='your-region-speech.googleapis.com')
)

# define the recognizer request
recognizer_request = cloud_speech.CreateRecognizerRequest(
    parent="projects/your-project-id/locations/your-region",
    recognizer_id=recognizer_id,
    recognizer=cloud_speech.Recognizer(
        language_codes=[language_code],
        model="chirp",
    ),
)

# create the recognizer (a long-running operation)
create_operation = client.create_recognizer(request=recognizer_request)
recognizer = create_operation.result()

Notice that Chirp is currently available only in the us-central1 region and supports a specific set of languages (see the supported languages list in the documentation). You can also define a default configuration on the recognizer and reuse it across multiple requests, as sketched below. See the documentation for more.
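As a minimal sketch, assuming the v2 API’s Recognizer.default_recognition_config field, you could attach defaults at creation time so that individual requests don’t need to repeat them. The recognizer_id used here is illustrative.

# Sketch only: attach a default recognition configuration to the recognizer.
# Field names assume the v2 Recognizer.default_recognition_config message.
recognizer_with_defaults_request = cloud_speech.CreateRecognizerRequest(
    parent="projects/your-project-id/locations/your-region",
    recognizer_id=f"chirp-{language_code.lower()}-defaults",
    recognizer=cloud_speech.Recognizer(
        language_codes=[language_code],
        model="chirp",
        default_recognition_config=cloud_speech.RecognitionConfig(
            auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
            features=cloud_speech.RecognitionFeatures(
                enable_automatic_punctuation=True,
            ),
        ),
    ),
)

Requests that omit these settings then fall back to the recognizer’s defaults.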

Step 3: Create a Batch transcription request

After you create the Recognizer, process your long audio files with a batchRecognize request, which handles audio from 1 minute up to 8 hours. To run the request, you need to set:

  • a recognition configuration, which indicates features and audio metadata
  • audio files with file metadata
  • a recognition output configuration (where to output the transcripts of each file)

In the following example, you can see how to define the batch recognition request.

# recognition configuration: enabled features and audio decoding
long_audio_recognition_config = cloud_speech.RecognitionConfig(
    features=cloud_speech.RecognitionFeatures(
        enable_automatic_punctuation=True,
        enable_word_time_offsets=True,
    ),
    auto_decoding_config={},
)

# where to write the transcription output
long_recognition_output_config = {
    "gcs_output_config": {
        "uri": "your-bucket-uri/transcriptions"
    }
}

# audio files to transcribe, each with its own configuration
long_audio_files = [{
    "config": long_audio_recognition_config,
    "uri": long_audio_uri
}]

long_audio_request = cloud_speech.BatchRecognizeRequest(
    recognizer=recognizer.name,
    recognition_output_config=long_recognition_output_config,
    files=long_audio_files,
)

long_audio_operation = client.batch_recognize(request=long_audio_request)

At the time of writing, Chirp supports only automatic punctuation and word time offsets. If you are looking for more advanced capabilities, check out Speech Studio.

Step 4: Submit the transcription request and get transcriptions from the Cloud Storage bucket

The batch_recognize call in the previous step submits the request and returns a long-running operation. Calling the operation’s result method blocks until the transcription completes.

long_audio_result = long_audio_operation.result()
print(long_audio_result)
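Batch requests can cover audio up to eight hours, so the operation may take a while. If you prefer not to block indefinitely, result accepts an optional timeout (in seconds), and you can poll the operation instead; a small sketch:

# wait at most 15 minutes for the transcription to finish (raises if it takes longer)
long_audio_result = long_audio_operation.result(timeout=900)

# or check completion without blocking
if long_audio_operation.done():
    long_audio_result = long_audio_operation.result()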

After the operation finishes successfully, it returns the recognition results, which look like the example below.

results {
  key: '<your-bucket>/data/audio.wav'
  value {
    uri: "<your-bucket>/transcriptions/audio_transcript_xxx.json"
  }
}
total_billed_duration {
  seconds: 68
}

As you can see, the recognition results are stored in a JSON file with the following schema, including the transcript, word timings, and language:

[{'alternatives': [{'transcript': "your-transcript",
    'words': [{'startOffset': '1.600s', 'endOffset': '1.800s', 'word': 'word1'},
              {'startOffset': '1.800s', 'endOffset': '2.160s', 'word': 'word2'},
              {'startOffset': '2.160s', 'endOffset': '2.280s', 'word': 'word3'},
              {'startOffset': '2.280s', 'endOffset': '2.360s', 'word': 'word4'},
              {'startOffset': '2.360s', 'endOffset': '2.480s', 'word': 'word5'},
              ...
              {'startOffset': '27.440s', 'endOffset': '27.960s', 'word': 'wordN'}]}],
  'languageCode': 'en-US'},
 {...}]

You can parse the output file path to get your transcriptions. Below is an example of how to read the transcriptions when you mount Cloud Storage with Cloud Storage FUSE (gcsfuse).

import json

# map the gs:// URI to the local mount point used by Cloud Storage FUSE
transcriptions_file_path = long_audio_result.results[long_audio_uri].uri.replace("gs://", "/gcs/")

# read the transcription JSON and join the transcript of each result
with open(transcriptions_file_path, 'r') as f:
    transcriptions = json.load(f)['results']
transcriptions = [transcription['alternatives'][0]['transcript'] for transcription in transcriptions]
long_audio_transcription = " ".join(transcriptions)
print(long_audio_transcription)

And here is the resulting transcription.

so okay, so what am I doing here? why am I here at GDC talking about VR video? um, it's because I believe um, my favorite games, I love games, I believe in games, my favorite games are the ones that are all about the stories, I love narrative game design, I love narrative-based games and I think that when it comes to telling stories in VR, bringing together capturing the world with narrative-based games and narrative-based game design,  is going to unlock some of the killer apps and killer stories of the medium, so I'm really here looking for people who are interested in telling those sort of stories, that are planning projects around telling those types of stories, um and I would love to talk to you, so if this sounds like your project, if you're looking at blending VR video and interactivity to tell a story, I want to talk to you, um, I want to help you, so if this sounds like you, please get in touch, please come find me, I'll be here all week, I have pink  I work for Google um and I would love to talk with you further about um VR video, interactivity and storytelling. So
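If you do not mount the bucket with Cloud Storage FUSE, you can read the output directly with the google.cloud.storage client. The sketch below performs the same parsing; read_transcript_from_gcs is a hypothetical helper name, not part of the API.

import json

from google.cloud import storage


def read_transcript_from_gcs(gcs_uri: str) -> str:
    """Hypothetical helper: download a transcription JSON from Cloud Storage and join its transcripts."""
    bucket_name, blob_name = gcs_uri.replace("gs://", "", 1).split("/", 1)
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    payload = json.loads(blob.download_as_text())
    return " ".join(
        result["alternatives"][0]["transcript"] for result in payload["results"]
    )


transcript_uri = long_audio_result.results[long_audio_uri].uri
print(read_transcript_from_gcs(transcript_uri))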

As you can imagine, this transcription process can be automated. In the simplest configuration, you can use a combination of Cloud Storage, Eventarc, and Cloud Run to trigger a batch transcription request each time a new audio file becomes available, as sketched below.
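As a minimal sketch of that idea (not a complete deployment), the handler below assumes a Cloud Run service subscribed to Cloud Storage "object finalized" events through Eventarc; the function name and the RECOGNIZER_NAME and OUTPUT_URI environment variables are illustrative assumptions.

import os

import functions_framework
from google.api_core.client_options import ClientOptions
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech


@functions_framework.cloud_event
def transcribe_new_audio(event):
    # Eventarc delivers the Cloud Storage object metadata in the event payload
    audio_uri = f"gs://{event.data['bucket']}/{event.data['name']}"

    client = SpeechClient(
        client_options=ClientOptions(api_endpoint="us-central1-speech.googleapis.com")
    )
    request = cloud_speech.BatchRecognizeRequest(
        recognizer=os.environ["RECOGNIZER_NAME"],  # full resource name of the Chirp recognizer
        recognition_output_config=cloud_speech.RecognitionOutputConfig(
            gcs_output_config=cloud_speech.GcsOutputConfig(uri=os.environ["OUTPUT_URI"])
        ),
        files=[
            cloud_speech.BatchRecognizeFileMetadata(
                uri=audio_uri,
                config=cloud_speech.RecognitionConfig(
                    auto_decoding_config=cloud_speech.AutoDetectDecodingConfig()
                ),
            )
        ],
    )
    # Submit and return: the long-running operation continues server side,
    # and the transcription lands in the configured output bucket.
    client.batch_recognize(request=request)

Writing results back to Cloud Storage keeps the service stateless, so downstream consumers only need to watch the output prefix for new transcripts.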

Summary

Chirp is a version of Google’s Universal Speech Model with 2B parameters that can transcribe over 100 languages in a single model. This article provided a step-by-step guide on how to get started with Chirp on Vertex AI using the Cloud Speech-to-Text API (v2).

Based on these initial results, Chirp looks like a very promising foundational model with several possible applications, such as content transcription and video captioning for subtitles.

In an upcoming article, I will show you how to use Chirp in combination with the PaLM 2 API and LangChain to build a content summarisation application. Also, watch this Git repository to learn more about how to use, develop, and manage Generative AI on Google Cloud.

In the meantime, I hope you found this article interesting. If so, clap or leave a comment. And feel free to reach out on LinkedIn or Twitter for further discussion or questions.

Ivan Nardini
Google Cloud - Community

Customer Engineer at @GoogleCloud, passionate about Machine Learning Engineering and Lead of the MLOps.community Engineering Lab.