Turn Google Docs into AI-Powered Podcasts with Google Cloud
Welcome!
In one of my previous YouTube tutorials, we explored how to use NotebookLM to generate an audio overview. The Audio Overview feature can turn documents, slides, charts and more into engaging discussions with one click. You can check out the tutorial given below to learn more about NotebookLM and generating audio overviews.
While NotebookLM’s Audio Overview feature is incredibly useful, there are several reasons you might want custom audio generation:
- Control and Customisations
- Scalability and Flexibility
- Better Integration
Though NotebookLM offers great functionality, it lacks customisation options, such as adjusting the script format, defining the number of speakers, and other elements that make podcasts more engaging.
Acknowledgment
This project was inspired by Building a Dynamic Podcast Generator: Inspired by Google’s NotebookLM and Illuminate on Medium. The original article provided insights into structuring AI-generated podcasts using Google Cloud services. This implementation builds upon those ideas with additional enhancements for speaker customization, automation, and cloud deployment.
Our Solution
After evaluating the tools currently available, I built a seamless solution to convert text into engaging audio podcasts/overviews using the following tools:
- Google Apps Script
- Cloud Run functions
- Gemini 1.5 Flash
- Google TTS API
- Volume Mounts
- Cloud Storage
Combining these tools, the workflow looks like this:
- The script reads the contents of your Google Doc.
- The text is processed by Gemini 1.5 Flash, which generates a structured podcast script.
- The script is converted to audio using Google’s TTS API and Volume Mounts.
- The final podcast is stored in a Cloud Storage bucket for easy access and distribution.
Now that we understand the workflow, let’s get started.
Sample Google Doc
For this tutorial, I’ll use a Google Doc as the source for my podcast content. Our goal is to extract the document’s contents and generate a structured podcast script using Gemini.
Cloud Run Function — generateaudio
Before proceeding, we need to deploy a Cloud Run function that will process the podcast script and generate the final audio.
Understanding the Python Script
The generateAudio service is a Flask-based API that processes a structured conversation, generates AI voices for different speakers, merges the audio files, and stores the final podcast in Google Cloud Storage.
Key Components
- Text-to-Speech Processing: Uses Google Cloud Text-to-Speech API to generate AI voices.
- Speaker Voice Mapping: Assigns different voices to different speakers.
- Audio Merging: Combines individual speech segments into a single podcast.
- Google Cloud Storage Upload: Stores the final MP3 file in the cloud.
- Flask API Endpoint: Accepts structured input via HTTP requests.
import os
import flask
import datetime
import re
import json
from google.cloud import storage, texttospeech
from pydub import AudioSegment
We start by importing the dependencies: Flask to create the API, Cloud Storage to store the final podcast, the Text-to-Speech client to convert text into spoken audio, and pydub to merge the audio segments.
FILE_DIR = '/tmp/' # Temporary storage in Cloud Run
BUCKET_NAME = 'aryan_bucket'
client = texttospeech.TextToSpeechClient()
Next, we configure the temporary storage and Cloud Storage bucket where the final audio will be stored.
speaker_voice_map = {
    "Sascha": "en-US-Wavenet-D",
    "Marina": "en-US-Journey-O"
}
Since I want different speakers to have unique voices, I create a dictionary mapping speaker names to AI-generated voices.
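One caveat: the synthesis function below looks speakers up in this dictionary directly, so a script containing an unmapped speaker name would raise a KeyError. A small fallback helper (hypothetical; not part of the original service) sidesteps that:

```python
# Fallback voice for any speaker missing from the map
# (hypothetical helper; the original service assumes every speaker is mapped).
DEFAULT_VOICE = "en-US-Wavenet-D"

speaker_voice_map = {
    "Sascha": "en-US-Wavenet-D",
    "Marina": "en-US-Journey-O",
}

def voice_for(speaker):
    # .get() returns the default instead of raising KeyError
    return speaker_voice_map.get(speaker, DEFAULT_VOICE)
```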
def synthesize_speech_google(text, speaker, index):
    synthesis_input = texttospeech.SynthesisInput(text=text)
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US",
        name=speaker_voice_map[speaker]
    )
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16
    )
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )
    filename = f"{FILE_DIR}{index}_{speaker}.wav"
    with open(filename, "wb") as out:
        out.write(response.audio_content)
    print(f'Audio content written to file "{filename}"')
Then, I define a function to convert text into speech using Google’s Text-to-Speech API. Each speaker’s dialogue is processed separately, and the resulting audio files are stored temporarily.
def merge_audios(audio_folder, output_file):
    combined = AudioSegment.empty()
    audio_files = sorted(
        [f for f in os.listdir(audio_folder) if f.endswith(".wav")],
        key=lambda x: int(re.findall(r'\d+', x)[0])
    )
    for filename in audio_files:
        audio_path = os.path.join(audio_folder, filename)
        print(f"Processing: {audio_path}")
        audio = AudioSegment.from_file(audio_path)
        combined += audio
    combined.export(output_file, format="mp3")
    print(f"Merged audio saved as {output_file}")
Once all individual speech segments are generated, they need to be combined into a single podcast file. For this, I use the pydub library to merge all the audio clips.
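One subtle point is the sort key: the segment files are ordered by the integer index embedded in each filename, not lexicographically (which would put 10_... before 2_...). A standalone sketch of that ordering, with hypothetical filenames:

```python
import re

def numeric_sort(filenames):
    """Order segment files by the integer index at the start of each name,
    skipping anything that isn't a .wav segment."""
    return sorted(
        [f for f in filenames if f.endswith(".wav")],
        key=lambda x: int(re.findall(r"\d+", x)[0]),
    )

files = ["10_Sascha.wav", "2_Marina.wav", "0_Sascha.wav", "notes.txt"]
print(numeric_sort(files))  # segments play back in recording order
```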
def upload_to_gcs(file_path, file_name):
    client = storage.Client()
    try:
        blob = client.bucket(BUCKET_NAME).blob(file_name)
        blob.upload_from_filename(file_path)
        print(f"File uploaded to GCS: {BUCKET_NAME}/{file_name}")
    except Exception as e:
        print(f"Error uploading to GCS: {e}")
        raise
After merging, the final podcast needs to be uploaded to Google Cloud Storage so it can be accessed easily.
def generate_audio(conversation, file_name_prefix):
    os.makedirs(FILE_DIR, exist_ok=True)
    for index, part in enumerate(conversation):
        speaker = part['speaker']
        text = part['text']
        synthesize_speech_google(text, speaker, index)
    audio_folder = FILE_DIR
    final_file_name = f"{file_name_prefix}_podcast.mp3"
    output_file_path = os.path.join(FILE_DIR, final_file_name)
    merge_audios(audio_folder, output_file_path)
    try:
        upload_to_gcs(output_file_path, final_file_name)
        print(f"Final podcast uploaded successfully: {final_file_name}")
    except Exception as e:
        print(f"Error during final upload: {e}")
        raise
With these functions in place, I create the main function that orchestrates the podcast generation workflow.
def hello_world(request):
    request_json = request.get_json(silent=True)
    if request_json and 'variable' in request_json:
        try:
            conversation = json.loads(request_json['variable'])
        except json.JSONDecodeError as e:
            return flask.make_response(f"Invalid input: 'variable' must be a valid JSON. Error: {e}", 400)
    else:
        return flask.make_response("Invalid input: Please provide a 'variable'.", 400)
    print(f"Received conversation script: {conversation}")
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    file_name_prefix = f"podcast_{timestamp}"
    try:
        generate_audio(conversation, file_name_prefix)
    except Exception as e:
        return flask.make_response(f"Error during podcast generation: {e}", 500)
    response_message = f"Podcast generated and uploaded to GCS with prefix: {file_name_prefix}_podcast.mp3"
    return flask.make_response(response_message, 200)
To expose this as an API, I create a Flask-based endpoint that takes the conversation script as input and triggers the podcast generation process.
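Note that the endpoint expects the conversation list serialized inside the 'variable' field, so the body is effectively JSON-encoded twice. A quick sketch of the request body shape (sample dialogue lines are mine):

```python
import json

conversation = [
    {"speaker": "Marina", "text": "Welcome to the show!"},
    {"speaker": "Sascha", "text": "Glad to be here."},
]

# The conversation list is serialized *inside* the 'variable' field,
# so the body is JSON-encoded twice on the way in...
request_body = json.dumps({"variable": json.dumps(conversation)})

# ...and decoded twice on the receiving end, mirroring hello_world():
request_json = json.loads(request_body)
decoded = json.loads(request_json["variable"])
assert decoded == conversation
```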
Once deployed, I can trigger the podcast generation using a simple HTTP request with structured conversation data. To learn more about deploying a service to Cloud Run, check out the resources given below.
To learn more about Cloud Run Volume mounts, check out the resources given below.
This completes the Cloud Run service setup. The next step is to integrate this with Google Apps Script to allow direct podcast generation from Google Docs meeting notes.
Write the Automation Script
Now that the Cloud Run function is deployed and the endpoint is active, we can proceed with implementing the code to access the Google Doc, generate the script, and send it to the Cloud Run function for processing.
While you are in the Google Doc, let’s open up the Script Editor to write some Google Apps Script. To open the Script Editor, follow these steps:
1. Click on Extensions and open the Script Editor.
2. This brings up the Script Editor as shown below.
Now that we’re in the Script Editor, let’s code.
function onOpen() {
  var ui = DocumentApp.getUi();
  ui.createMenu('PodcastGenPro')
    .addItem('Launch Sidebar', 'showSidebar')
    .addToUi();
}
We start off by creating a custom menu in the Google Docs interface. The onOpen() function runs automatically when the document is opened. We create and name the menu with ui.createMenu(); it contains a single item labelled Launch Sidebar, which triggers the showSidebar() function when clicked.
function showSidebar() {
  var html = HtmlService.createHtmlOutputFromFile('Sidebar')
    .setTitle('PodcastGeneratorPro')
    .setWidth(800);
  DocumentApp.getUi().showSidebar(html);
}
This function is responsible for displaying the custom sidebar within the Google Doc interface. It loads an HTML file named Sidebar using the HtmlService, sets the title of the sidebar to PodcastGeneratorPro, and specifies its width as 800 pixels. The sidebar is then displayed to the user.
<!DOCTYPE html>
<html>
  <head>
    <base target="_top">
  </head>
  <body>
    <h1>Generate Content</h1>
    <h2>Select the text you want to create a script out of</h2>
    <button type="button" onclick="generateContent()">Generate</button>
    <script>
      function generateContent() {
        google.script.run.withSuccessHandler(function(selectedText) {
          // You may need to add logic here for voice selection
          google.script.run.sendToGemini(selectedText);
        }).getSelectedText();
      }
    </script>
  </body>
</html>
This code is used to create the sidebar interface for the PodcastGeneratorPro tool within Google Docs. This interface allows users to select text from their document and generate a podcast.
function getSelectedText() {
  var doc = DocumentApp.getActiveDocument();
  var selection = doc.getSelection();
  if (selection) {
    var elements = selection.getRangeElements();
    var selectedText = elements.map(function (element) {
      return element.getElement().asText().getText();
    }).join("\n");
    return selectedText;
  }
  return "";
}
We start off by fetching the currently active Google Doc with the DocumentApp.getActiveDocument() function. From the active doc, we fetch the user’s current selection using getSelection(). Before proceeding, the function checks whether any text is selected; if not, it returns an empty string. Otherwise, it maps over each selected element, converts it to text, and extracts the text content. The results are joined with newline characters (\n) to form a single string, which is returned to the caller.
function sendToGemini(selectedText) {
  const GEMINI_KEY = 'YOUR_API_KEY';
  const GEMINI_ENDPOINT = `https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-latest:generateContent?key=${GEMINI_KEY}`;
We then declare the API key and the Gemini endpoint, both obtained from Google AI Studio. For this tutorial we are using the Gemini 1.5 Flash model (note the gemini-1.5-flash-latest model name in the endpoint). To find out more about the Gemini API and all the models available, check out the link given below.
  var requestBody = {
    "contents": [
      {
        "parts": [
          {
            "text": `You are an AI scriptwriter for a podcast called *The Machine Learning Engineer*. Your task is to generate a natural, engaging, and dynamic podcast script in a conversational format between two speakers: Marina and Sascha. The discussion should feel spontaneous and lively, with realistic interactions.

Use the following format:
[
  {
    "speaker": "Marina",
    "text": "<Marina's opening line>"
  },
  {
    "speaker": "Sascha",
    "text": "<Sascha's response>"
  }
]

### Topic:
${selectedText}

Use the information provided above as the core discussion material for this episode. Marina will introduce the topic, and Sascha will explain it in detail, with back-and-forth interactions that make the conversation engaging.

### Tone & Style:
- Friendly, engaging, and conversational.
- Natural flow, like two experts chatting informally.
- Keep it dynamic with occasional expressions of surprise, humor, or curiosity.

### Instructions for Response:
- The response should be a structured JSON list of dialogue exchanges.
- Maintain a lively discussion with a smooth flow of ideas.
- Ensure Sascha provides clear and engaging explanations while Marina asks insightful follow-up questions.

Return the response in the specified JSON format, ensuring it stays structured correctly while keeping the dialogue engaging and informative.
`
          }
        ]
      }
    ]
  };
This is the prompt that I’m going to be using for this tutorial. The prompt instructs Gemini to generate a script based on the document content. It’s designed to produce a clear and engaging podcast.
Note: You can always adjust the prompt based on the contents of your Google Doc and the type and style of podcast you want to generate.
  var headers = {
    "Content-Type": "application/json",
  };
  var options = {
    "method": "POST",
    "headers": headers,
    "payload": JSON.stringify(requestBody),
    "muteHttpExceptions": true
  };
We now prepare the Gemini API call with the necessary headers and payload: we set the content type to JSON and assemble the request options.
  try {
    var response = UrlFetchApp.fetch(GEMINI_ENDPOINT, options);
    var data = JSON.parse(response.getContentText());
    Logger.log("Gemini API Response: " + JSON.stringify(data));
    // Extract the JSON part from the response
    if (data.candidates && data.candidates[0].content.parts[0].text) {
      var scriptText = data.candidates[0].content.parts[0].text;
      // Remove code block markers and parse JSON
      var startIndex = scriptText.indexOf('[');
      var endIndex = scriptText.lastIndexOf(']');
      var cleanScriptText = scriptText.substring(startIndex, endIndex + 1);
      var conversationScript = JSON.parse(cleanScriptText);
      Logger.log(conversationScript);
We then call the Gemini API with the UrlFetchApp.fetch() function, parse the response, and extract the generated text. Because Gemini may wrap its answer in markdown code fences, we slice out everything between the first [ and the last ] before parsing it as JSON, leaving the script data ready for further use.
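The same bracket-slicing idea, sketched in Python for clarity (the Apps Script above is what actually runs; the helper name and sample response are mine):

```python
import json

def extract_script(raw):
    """Slice out the JSON array between the first '[' and the last ']',
    dropping any surrounding markdown code fences the model may emit."""
    start = raw.find("[")
    end = raw.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("No JSON array found in the model response")
    return json.loads(raw[start:end + 1])

# A fenced response, as Gemini often returns it
raw = '```json\n[{"speaker": "Marina", "text": "Hi!"}]\n```'
script = extract_script(raw)
```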
      const url1 = 'YOUR_CLOUDRUN_URL';
      var payload1 = {
        'variable': JSON.stringify(conversationScript)
      };
      var options1 = {
        'method': 'post',
        'contentType': 'application/json',
        'payload': JSON.stringify(payload1),
        'muteHttpExceptions': true
      };
We now declare the Cloud Run endpoint and prepare the payload and options for the API call.
      var cloudFunctionResponse = UrlFetchApp.fetch(url1, options1);
      Logger.log("Script sent to Cloud Run");
      Logger.log("Cloud Run Response: " + cloudFunctionResponse.getContentText());
    } else {
      Logger.log("Unexpected Gemini API response format: " + JSON.stringify(data));
    }
  } catch (e) {
    Logger.log("Error during Gemini API or Cloud Run call: " + e.message);
  }
}
We then call the Cloud Run service we deployed, again using the UrlFetchApp.fetch() function. Once the script is sent, we log a confirmation message followed by the response received from the Cloud Run function.
Our code is complete and good to go.
Check the Output
It’s time to see whether the code can launch the sidebar, grab the selected text, generate a script, and send it to the Cloud Run function, which generates the podcast and uploads it to the Cloud Storage bucket.
On successful execution of the onOpen() function, the sidebar is launched. To generate the podcast, select the text you want to send as context for the script generation and click Generate.
Here you can see that we successfully generate the podcast script with Gemini and produce the podcast. To listen to it, head over to your Cloud Storage bucket and download the audio file.
Conclusion
By leveraging Google Cloud’s AI capabilities, you can transform any text document into a structured, engaging podcast. This solution offers customisation, automation, and scalability for various use cases, from content creation to educational materials.
You can check out the code and various other Google Apps Script automation scripts on the link given below.
Feel free to reach out if you have any issues/feedback at aryanirani123@gmail.com.