AI Arms Race Enters New Phase of Scale and Multimodality

What the latest updates from OpenAI and Google tell us about the state of consumer-facing AI

Richard Yao
IPG Media Lab
10 min read · May 16, 2024

--

Key Takeaways

  • OpenAI beats Google to the buzz with a faster, multimodal GPT-4o model.
  • Google’s biggest competitive advantage remains its scale, both in terms of infrastructure capability and consumer reach, as evidenced by its plan to integrate Gemini models into its suite of services and Android devices.
  • Google is rolling out AI Overviews to all U.S. users, further abstracting the search experience and potentially kneecapping the larger digital ecosystem.
  • Multimodality is the new battleground for AI companies as major players seek to break AI out of the predominantly text-based interface.
  • The AI arms race enters a new era, with all eyes on what Microsoft and Apple will announce at their respective developer events in the coming weeks.
Image: Google

Google’s annual I/O developer conference kicked off on Tuesday. As was the case last year, the opening keynote was all about AI, with the search giant announcing a slew of updates to its Gemini-branded large language models (LLMs) over a belabored two-hour presentation.

Yet OpenAI beat Google to the punch by announcing a series of “Spring Updates” for ChatGPT on Monday, which included a new GPT-4o model (the “o” stands for “omni”) with faster speeds and multimodal support. Crucially, it will also be made available for free to all ChatGPT users in the coming weeks. Previously, non-paying users could only access the GPT-3.5 model, and GPT-4o promises to be a significant upgrade in terms of both capability and speed.

Contrasting the two major AI updates makes one thing clear: the AI arms race that OpenAI kicked off with the launch of ChatGPT over 18 months ago has entered a new era of rapid commodification. As competition heats up, roughly along the battle line of Google vs. Everyone Else, the search giant will have to step up its game on product delivery to retain its core search business while figuring out how to monetize AI.

Need for Speed and Scale

The top hurdles for consumer-facing AI right now are the scale of (free) access and the speed of AI responses. Both the limited scale and the sluggish response times contribute to a widening consumer-AI trust gap, which is further exacerbated by an AI-skeptical press. Through this lens, it is fascinating to see both OpenAI and Google take a crack at these issues.

On OpenAI’s side, the new GPT-4o model will not only be made available to everyone for free; it also represents a huge step toward much more natural human-computer interaction, thanks to faster response times. As OpenAI explained in a blog post about GPT-4o:

The new GPT-4o model accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.
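
For a sense of what “any combination” of inputs looks like in practice, here is a minimal sketch of a multimodal request using OpenAI’s Python SDK. This is an illustration, not OpenAI’s internal implementation; the image URL is a placeholder, and at launch the public API supported text and image inputs, with audio capabilities expected to follow.

```python
# Minimal sketch: a single multimodal request to GPT-4o mixing text and an
# image. Assumes `pip install openai` and OPENAI_API_KEY in the environment;
# the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```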

OpenAI is also trying to broaden ChatGPT’s accessibility by opening up the GPT Store to all users for free, including access to custom GPTs, just four months after launching the store for paid subscribers. This move, while eliminating an important incentive for paid subscriptions, will likely help scale up the GPT Store.

While OpenAI is doing everything it can to hold onto its first-mover advantage in the generative AI space, scale remains Google’s biggest competitive advantage. As the search giant emphasized again and again throughout its keynote event, its robust global network of cloud computing infrastructure, coupled with the unparalleled reach of its suite of products ranging from Gmail to YouTube, still puts the Alphabet company in the prime spot to bring consumer AI products to the mass market worldwide.

Case in point, during its I/O keynote, Google did a couple of live demos of how the latest Gemini models would be integrated into the likes of Gmail and Google Chat to facilitate AI summaries, automated spreadsheet creation, and AI-assisted replies. Here’s one of the demos where Gemini helps a parent summarize emails from their kids’ school into a neat list of must-know items without leaving Gmail.

Source: Google

Features like this will be crucial to the wider adoption of LLM-powered AI features. Helping people simplify and streamline the mundane, repetitive administrative tasks that everyone has to contend with in the digital age might just be the type of killer use case that will help bring generative AI to the mainstream.
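
The Gmail integration itself is a product feature rather than a public API, but the underlying summarization step can be approximated with Google’s generative AI SDK. Below is a minimal sketch, assuming `pip install google-generativeai` and a Gemini API key; the sample emails, model choice, and prompt are all stand-ins for illustration.

```python
# Hypothetical sketch of the summarization step behind the Gmail demo, using
# Google's public generative AI SDK. The Gmail integration shown at I/O is a
# product feature; this only approximates the underlying idea.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model choice

# Stand-in email bodies; in the real feature these would come from Gmail.
emails = [
    "Subject: Field trip Friday. Please sign the permission slip by Wednesday.",
    "Subject: PTA bake sale. Volunteers needed for Saturday morning.",
]

prompt = (
    "Summarize these school emails into a short list of must-know items "
    "and deadlines:\n\n" + "\n\n".join(emails)
)

response = model.generate_content(prompt)
print(response.text)
```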

As impressive as Google’s on-stage demos were, however, real-world scenarios are much more complex, and it will be interesting to see how Google safeguards against the harm that AI hallucinations might cause in use cases that are this deeply personal to begin with.

Moreover, Google plans to eventually integrate Gemini into Android, the widely used mobile OS. However, because Gemini is designed more for data centers than for mobile phones, Google will start small with the Pixel 8 Pro, which is powered by Gemini Nano, a slimmed-down version of Gemini designed for on-device local computing.

AI Overviews Are Coming to Everyone

Source: Google

Search is another key channel through which Google is looking to scale up Gemini and stay ahead of the competition. With the rollout of AI Overviews, which will be made available to all U.S. users in the coming weeks and to a billion users globally by the end of 2024, Google is not only revamping the search interface but also redefining how users interact with web content, all under a catchy new slogan: “let Google do the Googling for you.”

AI Overviews will use Gemini to deliver comprehensive summaries and insights derived from multiple sources, giving users a richer, more informed starting point for their inquiries. Google also showcased an ambitious attempt to have Gemini handle complex planning tasks like trip itineraries: by maintaining and updating plans based on dynamic variables such as wake-up times or weather conditions, Gemini exemplifies how AI could transform the search experience.

Despite emerging competitors in the search engine market, such as the ChatGPT-powered Bing and Perplexity AI, and even rumblings of a potential OpenAI search product, it will take a lot more than buzz for any challenger to unseat Google’s continued dominance of the search market. The wide-scale rollout of AI search not only keeps Google at the forefront of AI but also challenges competitors to match its pace.

Ironically, the only one that might be able to unravel Google’s search dominance is Google itself. The rollout of Gemini-powered AI Overviews to supplement and, in some cases, replace the search results, could easily backfire if not handled carefully, and hurt Google’s bottom line. For now, Google is simultaneously milking and killing its cash cow (search) while trying to figure out how to monetize consumer-facing AI.

Moreover, this shift towards AI-generated content comes with significant implications for the traditional web ecosystem. As Google Search leans into AI-generated answers, creators and publishers who rely on search traffic to run their businesses are threatened with extinction, and some analyses suggest that search-driven ad revenue is about to fall off a cliff. Market research firm Gartner predicts that web traffic from search engines will decrease 25% by 2026. As Google redefines search, it may not only be leading innovation but also prompting a reevaluation, and perhaps a transformation, of how the digital ecosystem operates.

Multimodality As the Next Frontier

Another remarkable thing about GPT-4o is the personality it brings to the interaction. Take the taped demo below, for example: ChatGPT was able to bring a friendly, and somewhat flirty, tone to the conversation, which calls to mind the Scarlett Johansson-voiced AI companion in the 2013 film Her, in which the lonely protagonist, played by Joaquin Phoenix, falls in love with said AI companion.

While the improved speed is a big part of facilitating a naturalistic experience for users, much of the perceived intimacy comes from a natural speech pattern that includes pauses and giggles, as well as the model’s multimodal ability to gather information via visual input. In other words, because GPT-4o is natively multimodal, it can “see,” “hear,” and “speak” in an integrated way with almost no delay. It can see what you are doing, react to it, and even respond to interruptions, just as a real person would.
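
For context, OpenAI has described the previous Voice Mode as a pipeline of three separate models: one transcribes audio to text, a text-only model generates the reply, and a third converts that reply back to speech. Each hop adds latency and discards tone, pauses, and background sound, which is exactly what a single natively multimodal model avoids. Here is a rough sketch of that older cascaded pattern, using OpenAI’s public transcription and text-to-speech endpoints; the file paths and model choices are placeholders.

```python
# Sketch of the cascaded pipeline that native multimodality replaces:
# speech-to-text, then a text-only LLM, then text-to-speech. Each hop adds
# latency and strips non-textual cues. File paths are placeholders.
from openai import OpenAI

client = OpenAI()

# 1) Transcribe the user's audio to plain text (tone and pauses are lost).
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("user_question.mp3", "rb"),
)

# 2) Generate a reply from the transcribed text only.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3) Synthesize the text reply back into audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("assistant_reply.mp3")
```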

Not to be outdone, Google repeatedly emphasized that Gemini is “natively multimodal” throughout the I/O keynote and showcased many multimodal use cases. For example, a Live feature rolling out with the Gemini 1.5 Pro model in the coming months will allow users to converse with Gemini in near real time. Judging by the demo, it functions much like what OpenAI showed with GPT-4o, but lacks the same anthropomorphic personality. Another feature coming to Google Search will let users ask questions via video, merging conversational interactions with visual search.

In addition, Google announced Project Astra, a real-time AI agent prototype that can see, hear, and perform tasks on a user’s behalf. An on-stage demo showcased a voice assistant responding to what it sees and hears, including code, images, and video, with advanced reasoning and recall. Google also showed off “AI teammates,” agents that can answer questions about emails, meetings, and other data within Workspace. Google says public access to Astra is expected to launch via the Gemini app later this year.

Source: Google

Google also unveiled a suite of new AI creative tools, including the Veo video model, which can generate 1080p videos over 60 seconds long from text, image, and video prompts, and the Imagen 3 text-to-image model, which offers improved detail, text rendering, and natural language understanding. There is also a new VideoFX tool that allows for scene-by-scene storyboard creation and adding music to videos. VideoFX is currently launching in a private preview for select U.S. creators, while ImageFX, which incorporates Imagen 3, is now available via a waitlist.

As impressive as these demos of Gemini’s upcoming multimodal features are, Google was quite vague about their delivery timelines, which, in turn, made its repeated emphasis on multimodality feel somewhat reactive to GPT-4o. Google is a great software company, but it often lacks the patience to stick with imperfect products that don’t take off at launch, resulting in a long list of scrapped projects that were half-heartedly launched and quickly discarded. This culture of organizational impatience is going to hurt Google in this new era of the AI arms race, where competitors are eager to pounce on a paradigm shift.

The Dynamic AI Competitive Landscape

It’s unsurprising that Google is leveraging its massive scale to muscle its way into the consumer AI market. Yet, despite its cutting-edge lead in AI research, the company has been suffering under the weight of being a market incumbent, trying to defend against the agile advances of a Microsoft-aided OpenAI while navigating multiple AI-related controversies. If LLMs truly turn out to be a paradigm-shifting technology, then Google can’t simply rest on its massive infrastructure and scale to push Gemini. The AI arms race is only just starting, and Google is already losing its edge.

Looking ahead, all eyes are on what Microsoft and Apple will announce at their respective developer events in the coming weeks. Early reports say Apple is overhauling Siri to catch up with ChatGPT after the company realized how outdated the voice assistant looked in comparison. In the meantime, Apple is reportedly finalizing a deal with OpenAI to bring ChatGPT features to the iPhone.

Some may read Apple’s very visible presence at OpenAI’s event on Monday (all the demos were done on iPhones, and some presenters wore an Apple Watch) as a clue that Apple and OpenAI are close to formalizing a partnership, even though Apple is reportedly still talking to a number of companies about LLM partnerships.

One key question that remains is whether hardware will become a factor in LLM-powered products going mainstream. As previously mentioned, Google is already planning to integrate Gemini natively into Android, starting with its Pixel phones. One can’t help but wonder whether the iOS vs. Android mobile duopoly will be replicated on the consumer AI front, with battle lines drawn along the same divide.

This hardware question also extends to the chip level. During its I/O keynote, Google announced Trillium, its sixth-generation TPU, which delivers 4.7x the performance of the previous generation for training and serving advanced AI models. Meanwhile, Apple introduced its new M4 chip alongside the latest iPad Pro models earlier this month, a chip designed specifically with on-device LLM processing in mind.

For those looking for a third-party candidate, Amazon is placing a multibillion-dollar bet on Anthropic’s Claude models, alongside work on its own forthcoming Olympus model. Anthropic just launched Claude in Europe and as an iOS app, but its marketing has been largely nonexistent, and the app got a lukewarm reception compared to ChatGPT, with only around 157,000 total downloads in its first week on the App Store. Perhaps some of Amazon’s investment should be leveraged to help get Claude’s name out there.
