GPT-4 Omni — So much more than just a voice assistant

katerinaptrv
3 min read · May 14, 2024


Today we had the OpenAI Spring announcement, and it was mind-blowing; I guess we can all agree on that. I spent most of the night playing with my new voice assistant (the comparisons to the movie HER are very accurate in this scenario).

But as revolutionary, incredible, and world-changing as the voice capability is, the whole GPT-4o model goes so much further than that.

I must confess, the voice feature really got me, so it took me a while to actually stop and read the technical announcement for this model. When I finally did, just now, my mind was blown again.

I had been thinking GPT-4o was just a better-optimized version of GPT-4 Turbo, this time with better reasoning, lower latency, and training for voice conversations, and that they had basically chained the tech they already had, Whisper and TTS, around calls to the new optimized model and integrated it all into ChatGPT in a very effective way (roughly the pipeline sketched below).
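For context, this is roughly what that chained pipeline looks like with the OpenAI Python SDK. It is a minimal sketch of my assumption, not anything from the announcement; the file names, prompt, and voice are illustrative placeholders.

```python
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech-to-text: transcribe the user's audio with Whisper.
with open("question.mp3", "rb") as audio_file:  # placeholder file name
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Reasoning: send the transcribed text to GPT-4 Turbo.
reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text-to-speech: turn the text answer back into audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # illustrative voice choice
    input=reply.choices[0].message.content,
)
Path("answer.mp3").write_bytes(speech.read())
```

Three separate models and three round trips, and the intonation and emotion in the original audio are lost at the transcription step — which is exactly what an end-to-end model avoids.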

But after reading the technical report of the model, I saw this:

With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations.

A single new model with end-to-end text, audio, and vision multimodality!

A single model that can take text/audio/image as input and produce text/audio/image as output.

I know I posted before that the future of GenAI was multimodality across all modalities, and that we were already seeing initiatives in that direction.

But I never thought that this future would arrive in May 2024, with a model that is able to process and generate all major modalities while still keeping a good response time.

Now, this is beyond revolutionary. OpenAI clears the board again: no one has anything close to this, and the possibilities of such a model are so big that our minds have trouble processing them.

We have to revise all our previous concepts and ideas, because limitations that kept them from being reality before might not exist today, and we also have to prepare our minds for renewed ideas about solutions we never even imagined possible before.

PS1: They have not released access to all modalities in their API yet; right now we have text and image. So we can already think about solutions using the other modalities, but we have to wait for their release, which has no defined date so far.
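For reference, here is a minimal sketch of what is available through the API today — text and image in, text out — using the OpenAI Python SDK; the prompt and image URL are placeholders I made up for illustration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One request mixing a text part and an image part; GPT-4o handles both.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this picture?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},  # placeholder URL
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```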

PS2: Judging by the examples, the model can also generate 3D images.

PS3: It is worth mentioning that the model today costs half the price of GPT-4 Turbo, so it does far more than Turbo ever did while being more efficient and costing less.

Here are some benchmarks showing performance on par with the most recent top models:

You can see more about this model, including examples of use, here:

Hello GPT-4o | OpenAI
