Unleash Your Inner Director: Generate AI-Powered Short Story Videos

Esther Irawati Setiawan
Google Cloud - Community
5 min read · Mar 31, 2024

In this article, you’ll explore how to build a creative project that combines the power of text generation, image creation, and video editing—all powered by various AI tools! We’ll walk through the process of building a Streamlit application that takes a user prompt, generates a short story using the Gemini API, creates image prompts based on the story, uses the Imagen API to generate those images, and finally combines everything into a short story video using moviePy and a text-to-speech library.

The steps in creating this project are as follows:

  • Setting up Streamlit. Streamlit is a Python library that allows you to create web apps effortlessly. We’ll use Streamlit to create a user interface where users can input their prompts.
  • Story generation with Gemini API. The Gemini API is a powerful tool for generating creative text formats, including short stories. Our app will integrate the Gemini API to take the user prompt and develop a captivating short story.
  • Extracting Image Prompts. We’ll feed the story to Gemini again and ask it to extract critical elements or scenes that can be translated into visuals.
  • Generating Images with Imagen API. The Imagen API is a cutting-edge tool from Google AI that allows you to create images based on text descriptions. We’ll feed the extracted prompts from the story into the Imagen API, resulting in multiple images that depict various scenes from the narrative.
  • Creating the Video with moviePy and Text-to-Speech. moviePy is a Python library for video editing. We’ll utilize moviePy to combine the generated images and the text-to-speech audio (derived from the short story) into a cohesive video that brings the story to life.

Create a Streamlit App

First, install the streamlit library (pip install streamlit) and import it.

import streamlit as st

Then, create a text field for the prompt. You can design the UI any way you prefer.

st.title('Story Generator')
input = st.text_input("what's it about?")

Story Generation

Install the google-cloud-aiplatform library and import the GenerativeModel class.

from vertexai.preview.generative_models import GenerativeModel

Next, we need to set up the application's default credentials to access the gcloud API. To do that, we must first create a GCloud service account and give it access to VertexAI. Then, we use that service account as the application's default credentials. For more information, you can visit this webpage https://cloud.google.com/docs/authentication/provide-credentials-adc.
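If you would rather point the application at a specific key file instead of relying on the gcloud CLI, you can set the standard GOOGLE_APPLICATION_CREDENTIALS environment variable before any client is created. The path below is a placeholder; replace it with the location of your own key.

```python
import os

# Point Application Default Credentials at a service account key file.
# The path here is a placeholder -- substitute your own key's location.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/key.json"
```

This must run before the first Google Cloud client is constructed, since the libraries read the variable at client-creation time.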

import vertexai

# Initialize the Vertex AI SDK with your project and region first
vertexai.init(project="YOUR_PROJECT_NAME", location="us-central1")

config = {
    "max_output_tokens": 2048,
    "temperature": 0.9,
    "top_p": 1
}
model = GenerativeModel("gemini-pro")
chat = model.start_chat()

With the chat session started, we can ask Gemini Pro to write a short story.

prompt = f"""generate a short story about {input}"""
with st.spinner('Writing story…'):
    story_response = chat.send_message(prompt, generation_config=config)
    story = story_response.text
    st.write(story)

Image Prompt Extraction

Next, we want to extract image prompts from the generated story. We can do that easily by continuing the same chat session and asking Gemini to create image prompts based on the story.

prompt = """
make several prompts for an image generator that fits the story progress from start to finish and put it in a json list
(note: only output the raw json list. don't use character names for the prompt unless they are a popular character, instead describe their physical appearance for each prompt they appear in).
ex: [{"prompt":"image 1 prompt"},{"prompt":"image 2 prompt"},{"prompt":"image 3 prompt"}]
"""
import re
import json

with st.spinner('Making prompts…'):
    # Generate image prompts
    im_prompts_response = chat.send_message(prompt, generation_config=config)
    # Strip the Markdown code fences Gemini sometimes wraps around JSON
    im_prompts_str = re.sub(r"```json", "", im_prompts_response.text)
    im_prompts_str = re.sub(r"```", "", im_prompts_str)
    prompts = json.loads(im_prompts_str)
    st.json(prompts, expanded=False)
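Models occasionally add prose around the JSON or format the fences differently, so the simple substitution above can still fail. A slightly more defensive parser (a sketch, not part of the original app) pulls out the first JSON list it finds:

```python
import json
import re

def extract_json_list(text):
    """Parse a JSON list from model output, tolerating ```json fences
    and surrounding prose."""
    # Strip Markdown code fences if present
    cleaned = re.sub(r"```(?:json)?", "", text)
    # Grab the first [...] span so leading/trailing chatter is ignored
    match = re.search(r"\[.*\]", cleaned, re.DOTALL)
    if match is None:
        raise ValueError("no JSON list found in model output")
    return json.loads(match.group(0))
```

You could then replace the two re.sub calls with prompts = extract_json_list(im_prompts_response.text).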

Imagen API Integration

After getting the prompts for the images, we need to connect to the Imagen API. As of this writing, there was no official Python client for Imagen, so the requests library was used to call the REST API directly.

Before that, we need to obtain a bearer token, which we can do with the code below.

import google.auth.transport.requests
from google.oauth2 import service_account

# Load the service account key and request a token with the cloud-platform scope
credentials = service_account.Credentials.from_service_account_file(
    './key.json',
    scopes=['https://www.googleapis.com/auth/cloud-platform'])
auth_req = google.auth.transport.requests.Request()
credentials.refresh(auth_req)

The key.json here is the cloud service account key, the service account which was used earlier. For more information on how to get service account keys, you can visit this page https://cloud.google.com/iam/docs/keys-create-delete.

import base64
import requests
from io import BytesIO

def generate_image(prompt):
    url = "https://us-central1-aiplatform.googleapis.com/v1/projects/YOUR_PROJECT_NAME/locations/us-central1/publishers/google/models/imagegeneration:predict"
    body = {
        "instances": [
            {
                "prompt": prompt
            }
        ],
        "parameters": {
            "sampleCount": 1
        }
    }
    # A POST request to the API
    response = requests.post(url, json=body, headers={"Authorization": f"Bearer {credentials.token}"})
    # Parse the response
    response_json = response.json()
    if "predictions" not in response_json:
        return None
    base64_str = response_json["predictions"][0]["bytesBase64Encoded"]
    # Decode the base64 string to bytes
    image_bytes = base64.b64decode(base64_str)
    return BytesIO(image_bytes)
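Image generation calls can fail transiently (quota limits, timeouts, or safety-filtered responses with no predictions). A small retry helper, purely a sketch and not part of the original code, can wrap generate_image so one flaky request doesn't lose a scene:

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn() up to `attempts` times, sleeping `delay` seconds between
    failures; re-raise the last error if every attempt fails."""
    last_error = None
    for _ in range(attempts):
        try:
            result = fn()
            if result is not None:
                return result
        except Exception as err:
            last_error = err
        time.sleep(delay)
    if last_error is not None:
        raise last_error
    return None
```

Usage would look like with_retries(lambda: generate_image(p["prompt"])).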

To make things easier, create a function that accepts a prompt and returns the bytes of the image generated. The image bytes can be opened easily with the Image class from the PIL library.

from PIL import Image
import numpy as np
np.array(Image.open(generate_image(prompt)))

We can get the images from each prompt easily with this.

images = [generate_image(p["prompt"]) for p in prompts]
images = [np.array(Image.open(im)) for im in images if im is not None]
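Imagen normally returns images at one resolution, but concatenate_videoclips expects every frame to share the same size. As a defensive measure (a sketch, not in the original code), a small numpy helper can zero-pad each frame to the largest height and width in the batch before building clips:

```python
import numpy as np

def pad_to_common_size(frames):
    """Zero-pad a list of HxWxC frames so they all share the
    largest height and width found in the batch."""
    max_h = max(f.shape[0] for f in frames)
    max_w = max(f.shape[1] for f in frames)
    padded = []
    for f in frames:
        canvas = np.zeros((max_h, max_w, f.shape[2]), dtype=f.dtype)
        # Paste the original frame into the top-left corner
        canvas[:f.shape[0], :f.shape[1]] = f
        padded.append(canvas)
    return padded
```

Run images = pad_to_common_size(images) before the video step if you see size mismatches.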

Video Creation

Now that the images are done, we need to generate the text-to-speech audio before combining everything into a video.

from google.cloud import texttospeech

def generate_tts(text):
    client = texttospeech.TextToSpeechClient()
    name = "en-US-Standard-F"
    language_code = "en-US"
    voice = texttospeech.VoiceSelectionParams(
        language_code=language_code,
        name=name)
    # Synthesize the speech as MP3 at a slightly faster speaking rate
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3,
        speaking_rate=1.2)
    synthesis_input = texttospeech.SynthesisInput(text=text)
    response = client.synthesize_speech(input=synthesis_input, voice=voice, audio_config=audio_config)
    return response.audio_content

tts = generate_tts(story)
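One caveat: the Text-to-Speech API caps each request at roughly 5,000 bytes of input, so a long story needs to be split across several requests. A simple sentence-based splitter might look like the sketch below (the 4,500-byte budget is an assumption to leave headroom; a single sentence longer than the budget is passed through as-is):

```python
def chunk_text(text, max_bytes=4500):
    """Split text into chunks of at most max_bytes of UTF-8,
    breaking on sentence boundaries where possible."""
    sentences = text.split(". ")
    chunks, current = [], ""
    for sentence in sentences:
        piece = sentence if not current else current + ". " + sentence
        if len(piece.encode("utf-8")) <= max_bytes:
            current = piece
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed through generate_tts and the resulting MP3 bytes concatenated; most players handle back-to-back MP3 segments, though re-encoding the joined audio is safer.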

Next, we calculate the audio duration and combine the image clips with the TTS audio.

import tempfile

import soundfile as sf
from moviepy.editor import *

with tempfile.NamedTemporaryFile(suffix=".mp3") as tmp_audio:
    with open(tmp_audio.name, "wb") as audio_file:
        audio_file.write(tts)  # Write audio to the temporary file
    # Fallback duration in case the audio file can't be read
    tts_duration = len(images) * 3
    with sf.SoundFile(tmp_audio.name) as audio:
        tts_duration = 1 + audio.frames / audio.samplerate
    clips = []
    transition_duration = 1.5
    for image in images:
        # Give each image an equal share of the audio duration
        image_clip = ImageClip(image, duration=(tts_duration / len(images)))
        image_clip = image_clip.fadein(transition_duration).fadeout(transition_duration)
        clips.append(image_clip)
    final_clip = concatenate_videoclips(clips)
    audio_clip = AudioFileClip(tmp_audio.name)
    final_clip = final_clip.set_audio(audio_clip)
    final_clip.fps = 8
    final_clip.write_videofile("VIDEO_FILE_PATH", fps=final_clip.fps)

Putting It All Together

Within your Streamlit app, run these in sequence.

  • Upon receiving a user prompt, trigger the story generation, prompt extraction, image generation, and video creation functions.
  • Display the final video within the Streamlit app for the user to view.
  • The above examples are just the bare minimum for the program. Add other features as you see fit!
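The flow is strictly linear, so one way to organize it, purely a sketch, is to pass each stage in as a callable so the pipeline can be tested with stubs before wiring in the real Gemini, Imagen, and video functions (the names below are illustrative, not from the original code):

```python
def make_story_video(user_prompt, gen_story, gen_prompts, gen_image, build_video):
    """Run the full pipeline: story -> image prompts -> images -> video.
    Each stage is an injected callable matching the functions built above."""
    story = gen_story(user_prompt)
    prompts = gen_prompts(story)
    # Skip any prompt whose image generation returned nothing
    images = [img for img in (gen_image(p["prompt"]) for p in prompts)
              if img is not None]
    return build_video(story, images)
```

In the app itself, each stage would be the corresponding function from the earlier sections, wrapped in st.spinner calls as shown.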

Conclusion

This project demonstrates the exciting possibilities of combining various AI tools to create interactive and engaging experiences. With Streamlit for user interaction, Gemini and Imagen for creative content generation, and moviePy for video editing, you can build a unique platform for users to explore storytelling through AI.
