How will multimodal AI redefine film and storytelling?

Gerui Wang
Turing AI & Arts Forum
Jun 21, 2024

On 13 May 2024, OpenAI hit the news with the announcement of its newest model, GPT-4o, which can respond to audio, vision, and text input in real time. Earlier in the month, a music video created with OpenAI's video-generation model Sora, Washed Out's "The Hardest Part", was published on YouTube: in just one day it garnered over 38k views. Directed by LA-based artist Paul Trillo, the video still involved human labor and intentionality in its making. Yet viewers were startled: "So a multimodal AI system can now generate film-like videos without real human performers or footage shot by a real camera in real locations?" People wonder about the mesmerising capacity of AI models: what's next?

Film still, The opening school bus scene, "The Hardest Part," by Washed Out, From the album, 'Notes from a Quiet Life' (Official Video), Credit: Director/Editor: Paul Trillo, Video Production House: Trillo Films, paultrillo.com

Leading AI scientists have forecast that multimodal AI systems will be a key growth area in 2024 and the years that follow.[1] How this will impact creative industries such as film, animation, music, and gaming draws attention from professionals across industries and the public alike.

Generative AI video tools challenge our conceptions of media culture. Applications such as Runway AI, Stable Video Diffusion, Emu Video (Meta), and Pika Labs (which raised $55 million from leading venture capital firms within roughly half a year of its founding) can generate high-resolution videos of 4–10 seconds from text, image, and video prompts. They frequently release new filters and functions, such as lip sync, depth extraction, and 3D capture, that achieve complex visual effects in moments. As computing efficiency and AI chips improve in the coming years, the length of videos created by AI models may increase substantially.

How will this change storytelling — a fundamental way in which humans make sense of, and express ideas about, the world?

AI-generated or AI-augmented films and videos can break open existing modes of storytelling in cinema and other media arts. One feature of AI videos is that image frames tend to jump-cut from one scene to another, in sequences that disrupt logic and rationale.

In "The Hardest Part", for instance, young students dressed in hippie style run through spaces oddly "glued" together: from a school bus interior to classrooms, from narrow laundry-room hallways to the interiors of 1960s-styled flying cars, from water-inundated caves to a garage on fire, and so on. The seemingly human-like characters walk with an unsteady, unnatural gait, crashing through endless spaces and creating a dizzying effect for viewers. While the main characters are clearly a couple, the narrative trajectory of what they are doing and why is hard to grasp. Such a visual effect induces what film and media studies scholar Shane Denson has termed 'discorrelation' between the image and human experience: images rendered by the hidden labour of computer algorithms dissolve the 'intentional relation between a perceiving subject and an image object'.[2]

Film still, The couple traverse an inundated area before appearing immersed in water and, soon after, in a train, "The Hardest Part," by Washed Out, From the album, 'Notes from a Quiet Life' (Official Video), Credit: Director/Editor: Paul Trillo, Video Production House: Trillo Films, paultrillo.com

This may stem from the AI model being trained visually and formally on large quantities of images, videos, and texts without understanding the context of each image and video, especially the logic behind the sequencing of scenes in films and videos. The resulting AI video therefore gravitates toward a collage of scenes and frames; a potentially coherent story melts into fragments.

The machine learning models used in AI training reduce and compress input data into vectors in a latent space. This process extracts an image or video from its rich narrative context and groups it with visually similar objects: only the main object in question is detected, and the rest of the information is blurred. Moreover, a single label captures only a partial, and sometimes inaccurate, sense of what is going on in an entire image. In the music video "The Hardest Part", for example, many scenes show classrooms; similar "classroom" objects from the training data are clumped together. What exactly happens in each classroom? How does what happens in each room shape a segment of a storyline? The viewer does not know.
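To make the idea of latent-space compression concrete, here is a minimal, purely illustrative sketch in Python. It is not the pipeline of Sora or any particular model: the "frames" are synthetic arrays and the "encoder" is just an average-pooling stand-in, but it shows how frames with a similar coarse layout land close together in a latent space regardless of their narrative roles.

```python
# Toy sketch (not any real video model): compress frames into small latent
# vectors and compare them. All frame data and the "encoder" are made up
# for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def encode(frame: np.ndarray, dim: int = 8) -> np.ndarray:
    """Stand-in encoder: average-pool the frame down to a dim-length latent vector."""
    h, w = frame.shape
    pooled = frame.reshape(dim, h // dim, dim, w // dim).mean(axis=(1, 3))
    return pooled.mean(axis=1)  # one value per horizontal band of the image

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two "classroom" frames share a coarse layout (bright upper half, dark desks
# below) even though they come from unrelated moments; a "cave" frame does not.
classroom_a = np.vstack([np.ones((32, 64)), 0.2 * np.ones((32, 64))]) + 0.05 * rng.normal(size=(64, 64))
classroom_b = np.vstack([np.ones((32, 64)), 0.2 * np.ones((32, 64))]) + 0.05 * rng.normal(size=(64, 64))
cave        = 0.1 * np.ones((64, 64)) + 0.05 * rng.normal(size=(64, 64))

za, zb, zc = encode(classroom_a), encode(classroom_b), encode(cave)
print("classroom vs classroom:", round(cosine(za, zb), 3))  # high: the frames clump together
print("classroom vs cave:     ", round(cosine(za, zc), 3))  # lower: separated in latent space
```

In a real video model the encoder is a learned neural network and the latent space is far richer, yet the same basic compression is what lets visually similar scenes be clumped together while their narrative context falls away.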

Notice how the images of students look generalized, mushy, and blurred. They are generic "objects", perhaps labelled along lines such as "students studying at desks in classrooms", appearing close to the object "classroom" in the training data. But what do the students look like? What are they doing, how are they interacting with each other, what is the relationship between the main characters and these students, and what connects one room to the next? Contextual information and representations of real-world experience evaporate. Layers of loosely associated objects leave viewers with a sensation of vertigo.

Film still, The first classroom scene following the opening scene of a school bus interior, “The Hardest Part,” by Washed Out, From the album, ‘Notes from a Quiet Life’ (Official Video), Credit: Director/Editor: Paul Trillo, Video Production House: Trillo Films, paultrillo.com
Film still, The second classroom scene following the first, “The Hardest Part,” by Washed Out, From the album, ‘Notes from a Quiet Life’ (Official Video), Credit: Director/Editor: Paul Trillo, Video Production House: Trillo Films, paultrillo.com
Film still, Another classroom scene at around 0:24, "The Hardest Part," by Washed Out, From the album, 'Notes from a Quiet Life' (Official Video), Credit: Director/Editor: Paul Trillo, Video Production House: Trillo Films, paultrillo.com

Although this resistance to realistic representation resonates with 20th-century art movements such as Surrealism and Dadaism, the complete destabilization of a storyline makes AI-generated videos a challenge to meaning-making. As media philosopher Joanna Zylinska argues in her book AI Art: Machine Visions and Warped Dreams, generative art premised on the "banality of looking" turns human perception into "visual consumption" and reduces art to "mild bemusement".[3]

One advantage of AI models is their modular production, a core feature of modern mass culture, as Lev Manovich has noted.[4] Video-generation tools make it possible to combine various movements with various backgrounds; that is why, in the music video "The Hardest Part", the same couple appear in distinct outfits roaming through fast-changing spaces. Pika Labs also recently introduced style features ranging from anime and watercolour to natural and claymation, which will allow video production in multiple styles and increase model efficiency. However, the modularity of AI videos may also produce formulaic visuals and narratives. Portraying authentic scenes and moods will be critical for artists making impactful AI videos, and it opens new avenues for creativity.

Video introducing new features on Pika Labs' X channel, "Introducing styles", screenshot by the author, June 14, 2024

Last year, Refik Anadol's Machine Hallucination became the newest acquisition of digital art in MoMA's permanent collection. The event symbolises major cultural institutions' recognition of machine learning and AI models as a legitimate new medium for artistic creation. History offers myriad examples of new technologies giving rise to new artistic genres. In the 16th century, multi-color woodblock prints did not replace paintings in East Asia but thrived as a unique artistic medium spreading knowledge, art, and popular culture. In the 19th century, the invention of photography did not make paintings obsolete but led to the flourishing of Impressionism.[5] Towards the end of that century, motion-picture cameras did not replace photography but created cinema. In the early 20th century, the advent of animation did not replace human acting but was hailed as a novel way of telling stories.

So multimodal AI tools such as Sora, Runway, and Pika Labs may not replace human-made films, but they are on a path to dramatically change our screen culture. Aesthetically, it is important for AI filmmakers to consider: what can large datasets of images, texts, and videos enable in the filmmaking process, and what limitations do they entail? If one wants to efficiently generate varied backgrounds for specific scenes in which humans still do the acting, AI can be a good facilitator; if one wants to create special visual effects, AI can also assist. However, if one wants to make a movie with a cogent narrative built from sequences of well-connected scenes, one may encounter real challenges.

Human intentionality and emotion are key to developing meaningful AI films. Filmmakers, playwrights, artists, and creators can take this opportunity to hone storytelling skills that bring out the nuances of human interactions, behaviours, and sensibilities. They can also co-create with AI to understand contexts and enrich storytelling rather than replace it.

Refik Anadol, Machine Hallucination, Credits: Refik Anadol Studio, Alex Morozov, Carrie He, Christian Burke, Daniel Seungmin Lee, Efsun Erkilic, Kerim Karaoglu, Pelin Kivrak, Ho Man Leung, Nidhi Parsana, Raman K. Mustafa, Rishabh Chakrabarty, Toby Heinemann, Yufan Xie

[1] Shana Lynch, “What to Expect in AI in 2024: Seven Stanford HAI faculty and fellows predict the biggest stories for next year in artificial intelligence,” 8 December 2023, online at: https://hai.stanford.edu/news/what-expect-ai-2024

[2] Shane Denson, Post-Cinematic Bodies (Lüneburg: Meson Press, 2023), 29–30.

[3] Joanna Zylinska, AI Art: Machine Visions and Warped Dreams (London: Open Humanities Press, 2020), 81.

[4] Lev Manovich, “Remixability and Modularity,” (2005): 1–12. Online at: http://manovich.net/index.php/projects/remixability-and-modularity

[5] Ziv Epstein, et al., “Art and the science of generative AI,” Science 380, (2023):1110–1111.


Gerui Wang is a Lecturer at Stanford University.