Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Veo 3 Character consistency: a multi-modal, forensically-inspired approach

7 min read · Jul 17, 2025


Introduction

The synthesis of personalized video content from text and image prompts represents a significant frontier in generative AI. However, a critical and persistent challenge is maintaining the facial consistency and identity of a subject throughout a video sequence — a problem we term “identity drift.”

This article presents a novel, multi-modal workflow that systematically addresses this challenge. Our approach leverages a pipeline of interconnected models: Gemini 2.5 Pro for deep semantic analysis and structured data extraction, Imagen 3.0 for high-fidelity image synthesis and editing, and Veo for video generation. The core innovation is a two-step “forensic analysis” process that creates a robust, multi-faceted identity vector for the subject. This method provides a strong, persistent signal that significantly mitigates identity drift, enabling the creation of high-quality, personalized videos with unprecedented character consistency.

The full code for this approach can be found on GitHub:

https://github.com/GoogleCloudPlatform/vertex-ai-creative-studio/tree/main/experiments/veo3-character-consistency

1. The Technical Challenge: Semantic Ambiguity and Identity Drift


The core difficulty in generating a consistent video of a specific person stems from the inherent ambiguity in the latent space of generative models. A single reference image, while providing a visual anchor, represents a high-dimensional data point that captures not just the subject’s core identity features (facial structure, eye color) but also transient, state-dependent attributes (lighting, expression, pose, specific attire).

When a model is tasked with generating a new scene, it can struggle to disentangle these features. This “feature entanglement” often leads to identity drift, where the model preserves transient features (like the color of a shirt) but loses the core facial identity. The model understands what the image contains, but its understanding of who the person is remains shallow and unstable. This is because the prompt “a man on a beach” provides a strong semantic pull that can easily overpower the weaker, more nuanced signal of the subject’s identity contained within the reference image alone.

Our hypothesis was that, to solve identity drift, we needed to provide the generative model with a guidance signal that is not only visual but also deeply semantic and structurally explicit.

2. Our Hypothesis: Structured Forensic Data as a Robust Identity Vector


We hypothesized that we could create a more robust and persistent identity signal by deconstructing a subject’s appearance into a set of objective, disentangled features. The goal was to move from a single, entangled data point (the image) to a multi-modal “identity vector” that would be less susceptible to semantic drift.

The inspiration for our approach came from the field of forensic science, specifically the creation of composite sketches. A forensic artist doesn’t just look at a photo; they break a face down into a standardized set of components. We aimed to replicate this process using a large language model.

Our solution is a pipeline that generates a FacialCompositeProfile: a highly detailed, structured JSON object that serves as a machine-readable “facial fingerprint.” This structured data, when combined with the original image and a natural language translation of the profile, forms a powerful, multi-faceted guidance system for the subsequent generative steps.

3. System Architecture: A Deep Dive into the Generation Pipeline

Our workflow is a carefully orchestrated, six-stage pipeline. Each stage is designed to progressively build upon the last, ensuring the final video output is anchored to the subject’s core identity.


3.1. Stage 1: Structured Feature Extraction with Gemini 2.5 Pro

The process begins with forensic analysis. For each reference image provided, we task Gemini 2.5 Pro with a specific role: “You are a forensic analyst.” We instruct it to analyze the image and populate a Pydantic schema named FacialCompositeProfile (defined in utils/schemas.py).

Rationale:

  • Forcing Objectivity: By constraining the model’s output to a rigid JSON schema, we force it to move from subjective interpretation to objective analysis. It cannot simply say “a man with brown hair”; it must classify the hair length, texture, and hairline according to a predefined enumeration.
  • High Granularity: The schema is intentionally exhaustive, containing nested objects for everything from HeadAndFaceStructure (face shape, jawline) to EyeAndEyebrowFeatures (eye shape, color, details). This creates an incredibly rich and detailed feature set.
  • Disentanglement: This process effectively disentangles the subject’s core identity from the image’s transient state. The resulting JSON is a pure representation of the person’s features, independent of the original photo’s lighting or mood.
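To make this concrete, here is a minimal sketch of what Stage 1 can look like using the google-genai SDK’s structured-output support. The schema fields and enum values below are illustrative placeholders (the real, exhaustive FacialCompositeProfile lives in utils/schemas.py in the linked repository), and the project ID and file paths are assumptions.

```python
# Minimal sketch of Stage 1: constrained forensic extraction.
# Schema fields here are illustrative; see utils/schemas.py for the real one.
import enum
from pydantic import BaseModel
from google import genai
from google.genai import types

class HairLength(str, enum.Enum):
    SHORT = "short"
    MEDIUM = "medium"
    LONG = "long"

class HeadAndFaceStructure(BaseModel):
    face_shape: str   # e.g. "oval", "square"
    jawline: str

class FacialCompositeProfile(BaseModel):
    head_and_face_structure: HeadAndFaceStructure
    hair_length: HairLength
    hair_texture: str
    hairline: str

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

with open("reference.jpg", "rb") as f:
    ref_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        types.Part.from_bytes(data=ref_bytes, mime_type="image/jpeg"),
        "Analyze this face and populate every field of the profile.",
    ],
    config=types.GenerateContentConfig(
        system_instruction="You are a forensic analyst.",
        response_mime_type="application/json",
        response_schema=FacialCompositeProfile,  # forces output into the schema
    ),
)
profile = FacialCompositeProfile.model_validate_json(response.text)
```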

3.2. Stage 2: Semantic Bridging for Image Synthesis

The structured FacialCompositeProfile is not well suited for direct consumption by Imagen’s primary text prompt interface, nor by the description field of the image reference API. Therefore, the JSON object is fed back into Gemini 2.5 Pro with a new instruction: translate this structured data into a descriptive natural-language paragraph.
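A sketch of this bridging step, continuing from the Stage 1 snippet above (the prompt wording is an assumption, not the repository’s exact instruction):

```python
# Sketch of Stage 2: turn the structured profile into prose Imagen can use.
bridge = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        "Translate this structured facial profile into one densely "
        "descriptive natural-language paragraph describing the person:\n\n"
        + profile.model_dump_json(indent=2)
    ],
)
subject_description = bridge.text
```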

3.3. Stage 3: High-Fidelity Image Generation with Imagen 3.0

We now have a powerful, multi-modal set of inputs for Imagen 3.0:

  1. The original reference images.
  2. The forensically-derived natural language description of the subject.
  3. The user’s desired scene prompt (e.g., “in the desert wearing a spiderman outfit”).

Imagen’s edit_image function with the SUBJECT_TYPE_PERSON configuration is used to synthesize four candidate images.
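A sketch of this subject-customization call follows, continuing from the earlier snippets. The model ID, prompt template, and reference wiring are assumptions based on the public Imagen 3.0 customization API, not necessarily the repository’s exact code.

```python
# Sketch of Stage 3: subject-driven generation with Imagen 3.0 customization.
subject_ref = types.SubjectReferenceImage(
    reference_id=1,
    reference_image=types.Image.from_file(location="reference.jpg"),
    config=types.SubjectReferenceConfig(
        subject_description=subject_description,  # from Stage 2
        subject_type="SUBJECT_TYPE_PERSON",
    ),
)
result = client.models.edit_image(
    model="imagen-3.0-capability-001",  # assumed model ID
    prompt=(
        f"A photo of the person [1] ({subject_description}) "
        "in the desert wearing a spiderman outfit"
    ),
    reference_images=[subject_ref],
    config=types.EditImageConfig(number_of_images=4),
)
candidate_paths = []
for i, img in enumerate(result.generated_images):
    path = f"candidate_{i}.png"
    img.image.save(path)
    candidate_paths.append(path)
```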

3.4. Stage 4: Automated Curation and Quality Control

Due to the inherent stochasticity of generative models, not all four outputs will be equally successful. We introduce an automated curation step, again using Gemini 2.5 Pro. The model is presented with the original reference images and the four generated candidates. Its task is to select the best_image_path based on which candidate most faithfully reproduces the subject’s core facial identity.

Rationale:

  • Handling Variance: This step is a pragmatic approach to quality control. It ensures that only the highest-quality, most consistent output from the image generation stage is passed down the pipeline.
  • Closing the Loop: Using the same multi-modal model for curation that we used for analysis creates a consistent evaluative framework.
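This curation step can reuse the same structured-output pattern as Stage 1; a sketch (the selection schema and prompt are illustrative, and it continues from the earlier snippets):

```python
# Sketch of Stage 4: Gemini as judge of identity fidelity.
class CurationResult(BaseModel):
    best_image_path: str
    justification: str

contents: list = [
    "The first image is the reference subject. Each candidate below is "
    "labeled with its file path. Return the path of the candidate that "
    "most faithfully reproduces the subject's core facial identity.",
    types.Part.from_bytes(data=ref_bytes, mime_type="image/jpeg"),
]
for path in candidate_paths:
    with open(path, "rb") as f:
        contents += [path, types.Part.from_bytes(data=f.read(), mime_type="image/png")]

verdict = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=contents,
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=CurationResult,
    ),
)
best_image_path = CurationResult.model_validate_json(verdict.text).best_image_path
```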

3.5. Stage 5: Cinematic Scene Expansion via Outpainting

The selected candidate image is generated at a 1:1 aspect ratio. To prepare it for video, we use Imagen 3.0’s outpainting capability to expand the image to a cinematic 16:9 aspect ratio. The original, high-fidelity generation is placed within this new canvas, and the model intelligently fills in the surrounding environment based on the prompt.

Rationale:

  • Compositional Control: This gives us explicit control over the final scene composition. It creates a stable, cinematic backplate for the video generation, preventing unpredictable camera framing or movements that might arise from generating a video directly from a square image.
  • Preserving Fidelity: By using the generated image as an anchor, we ensure the core subject remains untouched and consistent while the background is expanded.
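A sketch of the outpainting step, again continuing from the earlier snippets. The canvas and mask preparation uses Pillow; the model ID and mask-based outpainting call are assumptions based on the public Imagen editing API.

```python
# Sketch of Stage 5: pad the 1:1 image to 16:9, then let Imagen outpaint
# the new borders while the original pixels stay masked off.
from PIL import Image as PILImage

src = PILImage.open(best_image_path)   # selected 1:1 candidate
w, h = src.size
canvas_w = round(h * 16 / 9)
x0 = (canvas_w - w) // 2

canvas = PILImage.new("RGB", (canvas_w, h))
canvas.paste(src, (x0, 0))
canvas.save("padded.png")

# Mask convention: white = regions Imagen may repaint, black = keep as-is.
mask = PILImage.new("L", (canvas_w, h), 255)
mask.paste(0, (x0, 0, x0 + w, h))
mask.save("mask.png")

raw_ref = types.RawReferenceImage(
    reference_id=1,
    reference_image=types.Image.from_file(location="padded.png"),
)
mask_ref = types.MaskReferenceImage(
    reference_id=2,
    reference_image=types.Image.from_file(location="mask.png"),
    config=types.MaskReferenceConfig(mask_mode="MASK_MODE_USER_PROVIDED"),
)
outpainted = client.models.edit_image(
    model="imagen-3.0-capability-001",  # assumed model ID
    prompt="in the desert wearing a spiderman outfit",
    reference_images=[raw_ref, mask_ref],
    config=types.EditImageConfig(edit_mode="EDIT_MODE_OUTPAINT"),
)
outpainted.generated_images[0].image.save("outpainted.png")
```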

3.6. Stage 6: Temporally Coherent Video Synthesis with Veo

Finally, the 16:9 outpainted image is passed to Veo. We use Gemini 2.5 Pro one last time to generate a rich, cinematic prompt for Veo based on the image content. Veo then uses the high-fidelity image as a starting frame to generate an 8-second video.
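A sketch of this final stage, continuing from the earlier snippets. The Veo model ID and polling cadence are assumptions; video generation is a long-running operation in the google-genai SDK.

```python
# Sketch of Stage 6: prompt authoring + image-to-video with Veo.
import time

with open("outpainted.png", "rb") as f:
    frame_bytes = f.read()

# One last Gemini call writes a cinematic prompt grounded in the frame.
veo_prompt = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        types.Part.from_bytes(data=frame_bytes, mime_type="image/png"),
        "Write a rich, cinematic video prompt describing motion, camera "
        "work, and mood for this exact scene.",
    ],
).text

operation = client.models.generate_videos(
    model="veo-3.0-generate-preview",  # assumed model ID
    prompt=veo_prompt,
    image=types.Image.from_file(location="outpainted.png"),
    config=types.GenerateVideosConfig(aspect_ratio="16:9"),
)
while not operation.done:              # poll the long-running operation
    time.sleep(15)
    operation = client.operations.get(operation)

operation.response.generated_videos[0].video.save("final.mp4")
```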

4. Analysis of Results and Future Directions

This multi-stage, multi-modal pipeline has proven to be exceptionally effective at mitigating identity drift. By front-loading the workflow with a deep, structured analysis of the subject’s identity, we create a powerful and persistent guidance signal that anchors the entire generative process. The result is a final video that exhibits a remarkable level of facial consistency, even across changes in motion and expression.

5. Conclusion

The challenge of identity consistency in generative video is not merely a matter of model scale, but of architectural ingenuity. Our work demonstrates that by decomposing the problem and leveraging a multi-modal, forensically-inspired approach, we can create a robust solution that largely solves the problem of identity drift. This pipeline represents a significant step towards a future where anyone can create high-quality, personalized, and, most importantly, believable video content.

About me

I’m Chouaieb Nemri, a Generative AI BlackBelt Specialist at Google with over a decade of experience in data, cloud computing, AI, and electrical engineering. My passion lies in helping executives and tech leaders turbocharge their cloud-based AI, ML, and Generative AI initiatives. Before Google, I worked at AWS as a GenAI Lab Solutions Architect and served as an AI and Data Science consultant at Capgemini and Devoteam. I also led cloud data engineering training at the French startup DataScientest, directly collaborating with their CTO. Outside of work, I’m dedicated to mentoring aspiring tech professionals — especially people with disabilities — and I hold a 5-star mentor rating across platforms like MentorCruise and ADPList.

