The Multimodal AI Revolution:
Why I’m excited about applications and infrastructure built on top of, and for, the multimodal world
As humans, we experience the world through a variety of mediums, navigating effortlessly across different modes of information — text, audio, images, video. Multimodal AI* is the orchestration of this reality, a natural evolution of AI that mirrors the complexity of human perception and interaction.
The last few years (starting from 2017 if you’re counting from when the transformer architecture was introduced, or 2022 if you’re counting from when ChatGPT brought LLMs into mainstream consciousness) were all about LLMs. Alongside the rapid improvements in LLMs, image, video, and speech models have also been improving.
We now have the building blocks to create performant applications that stitch together existing text, image, speech, and video models. Namely, we have open-source and closed-source models that handle text-to-speech (TTS/speech synthesis, e.g. Play.ht, ElevenLabs, WellSaid Labs), speech-to-text (speech recognition, e.g. Whisper), text-to-image (image generation, e.g. Stable Diffusion, Midjourney, Imagen, DALL-E), image-to-text (image captioning, e.g. GPT-4V, LLaVA), text-to-video (video synthesis, e.g. Sora), and video-to-text (video transcription). As these models continue to improve, they become increasingly “good enough” to support and augment humans.
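To make "stitching together" concrete, here is a minimal Python sketch of a voice pipeline that chains speech-to-text, a text model, and text-to-speech. Everything in it is an assumption for illustration: `VoicePipeline`, the type aliases, and the `fake_*` functions are hypothetical stand-ins, not any vendor's API. In a real product each stub would be replaced by a call to an actual model or service.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases: each stage maps one modality to another.
SpeechToText = Callable[[bytes], str]
TextToText = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    """Chains speech-to-text, a text model, and text-to-speech."""

    transcribe: SpeechToText
    respond: TextToText
    synthesize: TextToSpeech

    def run(self, audio_in: bytes) -> bytes:
        transcript = self.transcribe(audio_in)  # audio -> text
        reply = self.respond(transcript)        # text -> text
        return self.synthesize(reply)           # text -> audio


# Stand-in stubs so the sketch runs without any model or network access.
def fake_transcribe(audio: bytes) -> str:
    # Pretend the audio bytes are simply the spoken words.
    return audio.decode("utf-8")


def fake_respond(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    # Pretend the encoded text is the synthesized audio.
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_transcribe, fake_respond, fake_synthesize)
audio_out = pipeline.run(b"hello")
```

Because each stage is just a function with a fixed input and output modality, any one model can be swapped for another without touching the rest of the pipeline.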
So what are the different approaches to building products and businesses in this multimodal world? Not unlike other industries, you can choose to build at the application layer (serving business or consumer end users), the API layer (serving developers), or the infrastructure layer.
At the application layer, there are several choices to make. The first is the breadth of the product: do you build (1) horizontally across industries and use cases, possibly as an API-first product; (2) vertically around specific industries, e.g. healthcare, fintech, or supply chain; or (3) functionally around specific job functions, e.g. customer support, sales, or business operations? The second is the shape and size of the ideal customer profile (ICP), e.g. enterprise or mid-market/SMB. Startups here will compete on the ease of integration with existing workflows (and, as always, on distribution), and ultimately need to figure out how to deliver (and show) ROI in terms of employee hours saved and/or new revenue generated.
At the API layer, the value proposition is seamless integration of multimodal capabilities into existing platforms. Startups here will compete on quality, price, and developer experience (and, as always, on distribution). The starting point on quality will be higher than it is in other spaces, because the building blocks being stitched together are widely available. Additional improvements in quality will have to come from model optimizations and from the coordination and orchestration of these different models.
At the infrastructure layer, there are opportunities to tackle various technical challenges (many of which parallel opportunities seen in the LLMOps world), including how to process and manage multimodal data at scale, how to deploy large models efficiently in production, and how to evaluate model performance and robustness. Talented AI engineers and research scientists will have the opportunity to create, evangelize, and productize best practices as more multimodal applications are brought to life.
Thank you to ChatGPT and colleagues, friends, and founders in the space for the wonderful conversations that have helped shape these ideas.
*Footnote: A quick note to describe the difference between two related concepts — multimodal AI and multimodal ML. ChatGPT did a pretty good job here:
“Multimodal AI involves the integration of multiple modalities or types of data, such as text, images, and audio, to build systems that can understand, interpret, and generate information across different sensory channels.
Multimodal ML refers to the use of machine learning techniques in handling and processing data from multiple modalities. It is a subset of machine learning that deals with the integration of information from diverse sources.”