EdgeCloud: Introducing Image-to-Video Generative AI Model

Theta Labs · Published in Theta Network · Jun 13, 2024

The Theta team is excited to release the newest addition to the EdgeCloud AI showcase: an image-to-video AI model card for the Stable Video Diffusion model from Stability AI! It has also been added to the EdgeCloud Model Explorer dashboard.

Background

Generative AI, particularly image-to-video models, represents a significant technical advance, and its applications could fundamentally transform the media landscape. These AI models convert static images into dynamic video content using advanced algorithms and deep learning techniques. This transformation from image to video is not only groundbreaking but also opens up a myriad of opportunities across industries, from entertainment and advertising to education and healthcare.

Many of today’s algorithmic genAI image-to-video concepts draw on Theta’s own experience from almost a decade ago. In 2015, Jieyi Long and the team developed cutting-edge image and video technologies in the VR space, captured by this patent, along with an example spherical 360 output video here, generated from within the Unity game engine using an array of virtual cameras, with multiple image frames spliced together along a time dimension.

“…a video recorded using a virtual camera array during a game play of a source computer game. Next, upscaling the received video to a higher resolution, and interpolating neighboring video frames of the upscaled video for insertion into the upscaled video at a server. Finally, generating a spherical video from the interpolated video for replay in a virtual reality environment. The virtual camera array includes multiple virtual cameras each facing a different direction, and the video is recorded at a frame rate and a resolution lower than those of the source computer game. The spherical videos are provided on a video sharing platform.”

How Image-to-Video Generative AI Models Work

Image-to-video generative AI models leverage neural networks, most notably Diffusion Transformers (DiTs), Generative Adversarial Networks (GANs), and Video Latent Diffusion Models (Video LDMs). These models are trained on vast datasets of images and videos, learning patterns and motion so they can predict and generate video frames from a static image.

DiTs: Diffusion Transformers are a class of diffusion models built on the transformer architecture, which aim to improve the performance of diffusion models by replacing the commonly used U-Net backbone with a transformer. The impressive Sora demo from OpenAI is rumored to be powered by a DiT network.
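
To make the idea concrete, here is a minimal PyTorch sketch of a single DiT block using adaLN-style timestep conditioning, in the spirit of the original DiT paper. The class and parameter names are illustrative assumptions, not the architecture behind Sora or any EdgeCloud model.

```python
# Minimal sketch of one DiT block with adaLN timestep conditioning.
# Names and shapes are illustrative, not a production implementation.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Map the diffusion-timestep embedding to per-block scale/shift/gate params.
        self.ada_ln = nn.Linear(dim, 6 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) patchified latents; t_emb: (batch, dim) timestep embedding.
        s1, b1, g1, s2, b2, g2 = self.ada_ln(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x

# A full DiT stacks many such blocks over patchified (video) latents and is
# trained to predict the noise added at each diffusion step.
```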

GANs: Consist of two neural networks, a generator and a discriminator, which work in tandem. The generator creates video frames, while the discriminator evaluates their realism, refining the output through iterative training.
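
The adversarial objective is easy to express in code. Below is a minimal PyTorch sketch of one GAN training step; the generator and discriminator are placeholder modules, shown only to illustrate how the two losses interact.

```python
# Minimal sketch of one GAN training step. `generator` and `discriminator`
# are hypothetical nn.Module instances; this shows the objective, not a
# specific video-generation architecture.
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real_frames, noise):
    # 1) Discriminator update: real frames labeled 1, generated frames labeled 0.
    fake_frames = generator(noise).detach()
    real_logits = discriminator(real_frames)
    fake_logits = discriminator(fake_frames)
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
        + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    )
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Generator update: try to make the discriminator label fakes as real.
    gen_logits = discriminator(generator(noise))
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```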

Video LDMs: These models train the main generative model in a latent space of reduced computational complexity. Many Video LDMs start from a pretrained text-to-image model and insert temporal mixing layers of various forms into the pretrained architecture, producing a model that can easily be fine-tuned for image-to-video generation or multi-view synthesis.
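
As a rough illustration, here is a minimal PyTorch sketch of the kind of temporal mixing layer a Video LDM interleaves with a frozen, pretrained image model's spatial layers. Shapes and names are illustrative assumptions, not Stability AI's implementation.

```python
# Minimal sketch of a temporal mixing layer: attention runs along the frame
# axis only, leaving the pretrained spatial pathway untouched via a residual.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) latent features for a short clip.
        b, f, t, d = x.shape
        # Fold spatial tokens into the batch so attention mixes frames only.
        h = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        hn = self.norm(h)
        h = self.attn(hn, hn, hn, need_weights=False)[0]
        h = h.reshape(b, t, f, d).permute(0, 2, 1, 3)
        return x + h  # residual keeps the pretrained image behavior intact
```

The residual connection is the key design choice: at initialization the layer can act as a near-identity, so the pretrained image model's behavior is preserved and only the new temporal parameters need fine-tuning.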

Based on this general approach, the Theta team is excited to release its newest image-to-video AI model card for the Stable Video Diffusion model from Stability AI, now available in the EdgeCloud AI showcase and the EdgeCloud Model Explorer dashboard. Note that this technology is still in early development, and video quality will improve over time. Leveraging EdgeCloud, any AI developer can now access this model and begin experimenting with novel media and entertainment applications.
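
For readers who want to experiment outside the web showcase, Stability AI's open release of Stable Video Diffusion can also be run with the Hugging Face diffusers library. The sketch below follows the public diffusers documentation; it does not use the EdgeCloud API, and defaults may change between library versions.

```python
# Usage sketch: Stable Video Diffusion via Hugging Face diffusers.
# Requires a CUDA GPU; "input.png" is an illustrative placeholder path.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

image = load_image("input.png")
image = image.resize((1024, 576))  # the checkpoint was trained at 1024x576

# decode_chunk_size trades VRAM for speed when decoding latent frames.
frames = pipe(image, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```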

Market Opportunities

The market for image-to-video generative AI is rapidly growing, driven by advancements in AI research and increasing demand for dynamic content. Several tech giants and startups are investing heavily in this technology, developing new applications and improving existing models. Companies like Google, NVIDIA, and OpenAI are at the forefront, integrating these models into their products and services, while startups are emerging to develop specialized services and applications.

A number of industries can use image-to-video AI technology and tools to create original content more quickly and efficiently:

Entertainment and Media: A significant market segment, leveraging AI to create animations, visual effects, and content generation.

Advertising: Marketers use AI-generated videos for personalized and targeted advertising campaigns.

Education: Creation of educational content, including instructional videos and virtual tutors.

Healthcare: Applications in medical imaging, patient education, and virtual health consultations.

The Theta team invites you to try the new image-to-video AI showcase and have fun experimenting!


Creators of the Theta Network and EdgeCloud AI — see www.ThetaLabs.org for more info!