Summarize Audio Like a Maestro with Langchain, Chirp, and PaLM2 on Vertex AI

Ivan Nardini
Google Cloud - Community
10 min read · Jul 13, 2023
Langchain, Chirp, PaLM2 for audio summarization — Image from author

In the Getting started with Chirp blog post, you learned about Chirp, Google Cloud's 2B-parameter speech model that can transcribe audio content in over 100 languages. The article provided a step-by-step guide on how to get started with Chirp on Vertex AI using the Cloud Speech-to-Text API (v2).

Imagine now that you want to develop an application that transcribes and summarizes audio content such as a TEDx episode. Each time a new audio file is available, you can use Chirp to transcribe it. Then you can summarize the resulting transcription using PaLM 2, the large language model on Vertex AI.

To navigate this entire process you need an orchestration framework. This is where LangChain shines. LangChain provides a framework that abstracts the sequence of required API calls and combines them into a single, coherent application.

This article shows you how to use Chirp in combination with the PaLM 2 API using LangChain. By the end, you will know what you need to build a simple audio summarization application.

Before you start…

Explaining how LangChain works is out of scope for this blog post. But it is important to know the following concepts if you want to build an audio summarization application with LangChain (a minimal sketch follows the list):

  • Prompts are the text used to tell the language model what to do and what output to generate.
  • Models are the underlying language models used by LangChain. LangChain supports various models, including the Vertex AI PaLM models.
  • Chains are sequences of calls to models and other tools. They are used to perform complex tasks such as summarizing text. Chains are made up of prompts, models, and tools.
  • Tools are third-party libraries or services that LangChain can use, for example to interact with the Cloud Speech-to-Text API (v2).
  • Agents are a type of chain for implementing reasoning with LLMs. They are used to create applications that interact with the defined environment (prompts, models, tools) to make decisions and take actions.
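To make these concepts concrete, here is a minimal sketch of a prompt, a Vertex AI chat model, and a chain wired together, assuming the LangChain API available at the time of writing; the prompt text and parameters are illustrative only.

# langchain
from langchain.chat_models import ChatVertexAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Prompt: tells the model what to do with a transcription (illustrative wording)
summary_prompt = PromptTemplate(
    input_variables=["transcription"],
    template="Summarize the following transcription in three sentences:\n{transcription}",
)

# Model: the Vertex AI PaLM chat model used by LangChain
llm = ChatVertexAI(model_name="chat-bison", temperature=0)

# Chain: combines the prompt and the model into a single callable step
summary_chain = LLMChain(llm=llm, prompt=summary_prompt)
# summary_chain.run(transcription="<your transcription here>")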

In this scenario, you can combine those concepts to build a ReAct agent that:

  1. For a given audio transcription and summarization task represented by an input prompt
  2. Uses a chain based on a Vertex AI chat model to decide how to get the audio to transcribe
  3. Chooses a custom tool that transcribes the audio using the Cloud Speech-to-Text API (v2)
  4. Observes the resulting transcription
  5. Uses the Vertex AI chat model-based chain to decide how to summarize the transcription

Looking at step (3), you realize that you need a custom tool: at the time of this writing, there is no built-in speech-to-text tool for using Chirp. But LangChain provides different ways of defining custom tools. Let’s see how you can build GoogleCloudSpeech2TextTool, a custom tool that interacts with the Cloud Speech-to-Text API by subclassing BaseTool.

Build the GoogleCloudSpeech2TextTool custom tool

To build a custom tool, see the LangChain documentation. There are many ways to build a custom tool depending on what you are trying to accomplish. In this blog, GoogleCloudSpeech2TextTool is a subclass of BaseTool. Defining a custom tool by subclassing BaseTool gives you more control over propagating results to nested chains or other tools. The GoogleCloudSpeech2TextTool subclass wraps the batch_recognize method described in the Getting started with Chirp blog post.

When the LangChain agent decides to use the GoogleCloudSpeech2TextTool, the tool first checks that the environment is set up to use the Google Cloud Speech-to-Text API (v2), using the Pydantic root_validator decorator.

# general
from typing import Any, Dict
from pydantic import root_validator

# langchain
from langchain.tools import BaseTool
from langchain.utils import get_from_dict_or_env


class GoogleCloudSpeech2TextTool(BaseTool):
    """
    Tool that queries the Google Cloud Speech-to-Text API.
    To know more about the API, visit https://cloud.google.com/speech-to-text/docs
    """

    # overwrite class attributes
    name = 'google_cloud_speech2text'
    description = 'Tool that queries the Google Cloud Speech-to-Text API to transcribe audio files.'
    project_id: str = ""
    region: str = ""
    bucket_name: str = ""
    credentials_file: str = ""
    recognizer_name: str = ""
    speech_client: Any = None
    recognizer: Any = None
    bucket_uri_output: str = ""

    @root_validator(pre=True)
    def _validate_environment(cls, values: Dict):
        """Check if the environment is set to use Google Cloud Speech-to-Text API."""

        # read the configuration from the input values or from environment variables
        project_id = get_from_dict_or_env(values, 'project_id', 'GOOGLE_CLOUD_PROJECT_ID')
        region = get_from_dict_or_env(values, 'region', 'GOOGLE_CLOUD_REGION')
        bucket_name = get_from_dict_or_env(values, 'bucket_name', 'GOOGLE_CLOUD_BUCKET_NAME')
        credentials_file = get_from_dict_or_env(values, 'credentials_file', 'GOOGLE_CLOUD_CREDENTIAL_FILES')
        recognizer_name = get_from_dict_or_env(values, 'recognizer_name', 'GOOGLE_CLOUD_RECOGNIZER_NAME')

        try:
            from google.api_core.client_options import ClientOptions
            from google.cloud import speech_v2
            from google.cloud.speech_v2.types import cloud_speech

            # create the Speech-to-Text client pointed at the regional endpoint
            values['speech_client'] = speech_v2.SpeechClient(
                client_options=ClientOptions(
                    api_endpoint=f'{region}-speech.googleapis.com',
                    credentials_file=credentials_file))
            values['recognizer'] = cloud_speech.Recognizer(
                name=recognizer_name,
            )
            values['bucket_name'] = bucket_name
            values['bucket_uri_output'] = f'gs://{bucket_name}/transcriptions'
        except ImportError:
            raise RuntimeError('Please set the environment to use Google Cloud Speech-to-Text API.')
        return values

The GoogleCloudSpeech2TextTool assumes that you already have a Chirp recognizer running on Google Cloud.

Next, the custom tool runs a Chirp-based batch recognition request to transcribe the audio using the _speech2text function below.

# general
import json

# speechtotext
from google.protobuf.json_format import MessageToJson


class GoogleCloudSpeech2TextTool(BaseTool):

    ...

    def _speech2text(self, audio_uri):
        """Transcribe the given audio file."""

        # import the required modules
        from google.cloud.speech_v2.types import cloud_speech

        # create a batch recognize request
        batch_request = cloud_speech.BatchRecognizeRequest(
            recognizer=self.recognizer.name,
            recognition_output_config={
                "gcs_output_config": {
                    "uri": self.bucket_uri_output
                }
            },
            files=[{
                'uri': audio_uri
            }],
        )

        # start the transcription
        operation = self.speech_client.batch_recognize(request=batch_request)
        # check if the operation is done
        self._check_status(operation)
        # get the URI of the resulting transcription file
        operation_result = operation.result()
        serialized_result = MessageToJson(operation_result._pb)
        result = json.loads(serialized_result)
        transcription_uri = result['results'][audio_uri]['uri']
        return transcription_uri
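Note that _speech2text calls a _check_status helper that is not shown above. Here is a minimal sketch of what such a helper might look like, assuming it simply polls the long-running batch recognition operation until it completes (the polling interval is an arbitrary choice):

# general
import time


class GoogleCloudSpeech2TextTool(BaseTool):

    ...

    def _check_status(self, operation, poll_interval: int = 5):
        """Poll the long-running batch recognition operation until it completes."""
        while not operation.done():
            print('Transcription in progress...')
            time.sleep(poll_interval)
        print('Transcription request is done.')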

Then it post-processes the resulting transcription using the _get_transcription method shown below.

# general
from urllib.parse import urlsplit


class GoogleCloudSpeech2TextTool(BaseTool):

    ...

    def _get_transcription(self, transcription_uri):
        """Download the transcription file and return the transcribed text."""
        # strip the gs://<bucket>/ prefix to get the object name
        transcription_filename = urlsplit(transcription_uri).path[1:]
        transcriptions_output = get_bucket_file(self.bucket_name, transcription_filename)
        transcriptions_spec = transcriptions_output['results']
        # keep the top alternative for each result that contains one
        transcriptions = [transcription['alternatives'][0]['transcript']
                          for transcription in transcriptions_spec
                          if 'alternatives' in transcription.keys()]
        transcription_text = " ".join(transcriptions)
        return transcription_text
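The _get_transcription method relies on a get_bucket_file helper that is not shown here. A minimal sketch, assuming it simply downloads the JSON transcription object from Cloud Storage and parses it:

# general
import json

# storage
from google.cloud import storage


def get_bucket_file(bucket_name: str, file_name: str) -> dict:
    """Download a JSON object from the given bucket and parse it into a dictionary."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(file_name)
    return json.loads(blob.download_as_text())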

According to the LangChain documentation, the GoogleCloudSpeech2TextTool requires both _run and _arun methods for running transcription steps in synchronous and asynchronous ways, respectively.

# general
from typing import Optional

# langchain
from langchain.callbacks.manager import AsyncCallbackManagerForToolRun, CallbackManagerForToolRun


class GoogleCloudSpeech2TextTool(BaseTool):

    ...

    def _run(
        self,
        query: str,
        run_manager: Optional[CallbackManagerForToolRun] = None,
    ) -> str:
        """Use the tool."""
        try:
            transcription_uri = self._speech2text(query)
            transcription_text = self._get_transcription(transcription_uri)
            return transcription_text
        except Exception as error:
            raise RuntimeError(f"Error while running GoogleCloudSpeech2TextTool: {error}")

    async def _arun(
        self,
        query: str,
        run_manager: Optional[AsyncCallbackManagerForToolRun] = None,
    ) -> str:
        """Use the tool asynchronously."""
        raise NotImplementedError("GoogleCloudSpeech2TextTool does not support async")

The GoogleCloudSpeech2TextTool is not a production-ready tool, but it gives you an idea of how you can build a tool the agent can use to take actions, including calling Chirp for audio transcription through the Google Cloud Speech-to-Text API (v2).
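Before wiring the tool into an agent, you may want to smoke-test it on its own. A minimal sketch, where the project, bucket, credentials, and recognizer values are placeholders:

# standalone smoke test of the custom tool (all configuration values are placeholders)
google_cloud_speech2text_tool = GoogleCloudSpeech2TextTool(
    project_id='your-project-id',
    region='us-central1',
    bucket_name='your-bucket-name',
    credentials_file='/path/to/credentials.json',
    recognizer_name='projects/your-project-id/locations/us-central1/recognizers/your-chirp-recognizer',
)

# the tool takes a Cloud Storage URI and returns the transcription text
transcription = google_cloud_speech2text_tool.run('gs://your-bucket-name/audio-files/tedx_audio.mp3')
print(transcription[:200])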

Now that you know how to build a custom tool for audio transcription using the Google Cloud Speech-to-Text API (v2), let’s see how you can use it in a LangChain agent.

How to use GoogleCloudSpeech2TextTool in a LangChain agent

To build a LangChain agent that transcribes and summarizes audio content using the GoogleCloudSpeech2TextTool, the following steps are required:

  1. Create a Chirp recognizer
  2. Initialize the ChatVertexAI model, the memory, and the GoogleCloudSpeech2TextTool
  3. Initialize the agent
  4. Upload a new audio file to transcribe
  5. Run the agent for audio summarization

To create a Chirp recognizer, see Getting started with Chirp, which shows how to use the Google Cloud Speech-to-Text API (v2) to create one. You can use a create_recognizer helper function, as shown below.

recognizer_name = create_recognizer(
    recognizer_config=RECOGNIZER_CONFIG,
    project_id=PROJECT_ID,
    region=REGION)

recognizer = get_recognizer(recognizer_name)

Here RECOGNIZER_CONFIG defines the model, the language, and some default configuration of the recognizer. As a result, you get an instance ready to transcribe your audio content with the Chirp model.
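The exact shape of RECOGNIZER_CONFIG depends on the create_recognizer helper from the Getting started with Chirp post; as a rough, hypothetical sketch it might look like this (the recognizer ID and language are illustrative):

# hypothetical recognizer configuration for Chirp; the exact shape depends on
# the create_recognizer helper from the Getting started with Chirp post
RECOGNIZER_CONFIG = {
    'recognizer_id': 'chirp-en-us-recognizer',   # ID of the recognizer resource
    'model': 'chirp',                            # use the Chirp speech model
    'language_codes': ['en-US'],                 # language of the audio content
}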

Next, you initialize the ChatVertexAI model, the memory, and the GoogleCloudSpeech2TextTool, respectively.

vertex_llm_chat = ChatVertexAI(
    model_name='chat-bison',
    max_output_tokens=256,
    temperature=0,
    top_p=0.8,
    top_k=40,
)

memory = ConversationBufferMemory(memory_key="chat_history", input_key='input', output_key="output")
memory.clear()

google_cloud_speech2text_tool = GoogleCloudSpeech2TextTool()

Notice that here you use chat-bison, the PaLM 2 foundation model fine-tuned for multi-turn conversation, available on Google Cloud Vertex AI.

After initializing the ChatVertexAI model, the memory, and the GoogleCloudSpeech2TextTool, you can initialize the agent.

tools = [google_cloud_speech2text_tool]

agent = initialize_agent(
    tools=tools,
    llm=vertex_llm_chat,
    agent=AgentType.STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION,
    memory=memory,
    verbose=True,
    return_intermediate_steps=True,
    return_messages=True
)

The agent uses the ChatVertexAI model in combination with STRUCTURED_CHAT_ZERO_SHOT_REACT_DESCRIPTION, a ReAct reasoning strategy that is particularly useful with more complex tool usage. The agent also stores intermediate steps, for example the audio transcription, so you can reuse them in a downstream evaluation.

At this point, everything is ready. You just need to upload the audio file to transcribe to a Google Cloud Storage bucket, as shown below.

audio_path = './tedx_audio.mp3'
audio_uri = 'gs://your-bucket-name/audio-files/tedx_audio.mp3'

!gsutil cp $audio_path $audio_uri
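If you prefer to stay in Python rather than shelling out to gsutil, a minimal sketch using the google-cloud-storage client would look like this (the bucket and object names are the same placeholders as above):

# alternative: upload the audio with the Cloud Storage Python client
# (equivalent to the gsutil command above; names are placeholders)
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('your-bucket-name')
blob = bucket.blob('audio-files/tedx_audio.mp3')
blob.upload_from_filename(audio_path)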

In this example, you have the audio from the TEDx episode “How friendship affects your brain” by Shannon Odell. Now you can ask the agent to transcribe and summarize the audio.

agent_response = agent(
    {
        "input": f"Generate transcriptions for the following audio: {audio_uri}. Then provide a summary of the resulting transcription by describing what the speaker covers."
    }
)

This is an example of the resulting reasoning implemented by the agent to summarize the content.

> Entering new  chain...
Action:
```
{
  "action": "google_cloud_speech2text",
  "action_input": "gs://your-bucket-name/audio-files/tedx_audio.mp3"
}
```
CreateRecognizerRequest is done.

Observation: Friendships can hold an exceptional place in our life stories.
What is it about these connections that make them so unique? Before we dive
into the science, let's first observe one in action. If I could somehow design
a best friend, you know, put together all the ideal qualities of my perfect

...

This is what psychologists call interpersonal synchrony. You first show signs of the ability to sync with
others as infants, synchronizing movements and babbling with your parents.
As you get older and spend more time outside the home, you increasingly show
this synchrony with your peers. For example, imagine walking down the street
with a friend. Often without consciously thinking, you stroll at the same pace
and follow the same path. You and your best friend may not be only on the same
page, but also scientifically in step.com

Thought:The speaker talks about the science of friendship.
They discuss how friendships are formed in adolescence,
how they are different from childhood friendships,
and how they can be more intimate. They also discuss the importance of
interpersonal synchrony in friendships.

> Finished chain.

Notice how the summary is generated using the ReAct approach, which uses chain-of-thought reasoning to mimic human-like task-solving patterns.
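Since the agent was initialized with return_intermediate_steps=True, you can also pull the raw transcription (the tool's observation) back out of the agent response, which is handy for the evaluation below. A minimal sketch, assuming LangChain's usual structure of intermediate_steps as (action, observation) tuples:

# pull the raw transcription (the tool's observation) out of the agent response;
# intermediate_steps is a list of (AgentAction, observation) tuples
intermediate_steps = agent_response['intermediate_steps']
generated_transcription = intermediate_steps[0][1]
print(generated_transcription[:200])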

Bonus: Evaluate the Chirp model

At this point you may wonder how you can evaluate the transcriptions the agent generates. Sure, you could listen to the audio and compare it with the ground truth, but that would take time. Luckily, you can use several quantitative metrics to evaluate automatic speech recognition systems. Below are some of the most common metrics to consider.

  • Word error rate (WER) is the most common metric. It is the number of incorrectly recognized words (substitutions, deletions, and insertions) divided by the total number of words in the reference transcript.
  • Word information lost (WIL) measures how much word-level information is lost in the transcription. It is computed from the number of correctly matched words relative to the lengths of the reference and the hypothesis.
  • Word information preserved (WIP) is the complement of WIL: it measures the proportion of word-level information that is preserved, again based on the number of correctly matched words in the reference and the hypothesis.

To calculate those metrics, you can use JiWER, a simple and fast Python package that supports several measures for evaluating automatic speech recognition systems. All you have to do is provide the ground truth and the generated transcriptions, and the library calculates the metrics for you. Below is a function you can define to calculate STT metrics.

# evaluation
import jiwer


def get_stt_metric(truth, transcription, metric='wer'):
    """Calculate common automatic speech recognition metrics with JiWER."""
    chirp_evaluation = jiwer.compute_measures(truth, transcription)
    return chirp_evaluation.get(metric)
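As a quick sanity check, you might first try the function on a toy pair of sentences (the strings below are hypothetical):

# toy sanity check with hypothetical sentences
toy_truth = 'friendships can hold an exceptional place in our life stories'
toy_hypothesis = 'friendships can hold an exceptional place in our life story'

print(get_stt_metric(toy_truth, toy_hypothesis, 'wer'))  # one substitution out of eleven words
print(get_stt_metric(toy_truth, toy_hypothesis, 'wil'))
print(get_stt_metric(toy_truth, toy_hypothesis, 'wip'))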

Once you define an evaluation function, you can use it to calculate WER, WIL and WIP as shown below.

audio_uris = [audio_uri]

truth_transcriptions = [
    """
Friendships can hold an exceptional place in our life stories. What is it about these connections that make them so unique?

...

"""
]

# the transcription produced by the agent (for example, pulled from its intermediate steps)
generated_transcriptions = [generated_transcription]

evaluations = []
for a, t, h in zip(audio_uris, truth_transcriptions, generated_transcriptions):
    evaluation = {}
    evaluation['audio_uri'] = a
    evaluation['truth'] = t
    evaluation['hypothesis'] = h
    evaluation['wer'] = get_stt_metric(t, h, 'wer')
    evaluation['wil'] = get_stt_metric(t, h, 'wil')
    evaluation['wip'] = get_stt_metric(t, h, 'wip')
    evaluations.append(evaluation)

For better representation, you can create a Pandas DataFrame where each row represents an audio transcription with all associated metrics.

import pandas as pd

evaluations_df = pd.DataFrame.from_dict(evaluations)
evaluations_df.reset_index(inplace=True, drop=True)
evaluations_df

Here is the resulting evaluation you can get with the audio from the TEDx episode.

According to those metrics, Chirp does a pretty good job at transcribing the TEDx episode!

Summary

As previously discussed, Chirp is a very promising foundation model with several possible applications. This article showed how to use Chirp in combination with the PaLM 2 API and LangChain to build an audio summarization application.

Although LangChain allows you to quickly prototype an agent, you might face several issues: debugging LangChain can be difficult, the documentation is incomplete, and the strong dependencies on Hugging Face and OpenAI make it harder to generalize. So be sure to make a conscious choice when deciding which orchestration framework to use for your GenAI application.

That being said, the agent (let’s call it M.A.I.A) is just one component of a possible cognitive system capable of dealing with audio content.

What if I want an agent that transcribes audio content each time it becomes available on a website via RSS? What if I want to find audio content that is similar to another one I enjoyed? And what about answering questions related to the transcribed content? These are just some scenarios you may want to consider, and they require more than an orchestrator.

If you are interested in seeing how to build such a cognitive system, feel free to reach out to me on LinkedIn or Twitter for further discussion. In the meantime, I hope you found the article interesting. If so, clap or leave your comments.

Big kudos to Nishit Patel, Ann Farmer and all googlers for feedback and suggestions.
