OpenAI GPT-4o: What’s new?

Woyera
May 20, 2024

OpenAI has once again pushed the boundaries of artificial intelligence with the introduction of GPT-4o, their most advanced model yet. Building upon the successes of its predecessors, GPT-4o (“o” for omni) represents a significant step forward in multimodal AI capabilities, combining text, image, and audio processing into a single powerful model.

What’s New?

Multimodal Capabilities

The most striking feature of GPT-4o is its ability to seamlessly handle multiple input modalities. It accepts any combination of text, audio, image, and video as input, and it can generate any combination of text, audio, and image as output.

I know what you’re going to ask: how is this different from the existing Voice Mode? Before GPT-4o, Voice Mode worked as a pipeline of three separate models: one transcribed the audio to text, GPT-4 took that text and produced a text reply, and a third model converted the reply back to audio.
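To make that pipeline concrete, here is a minimal sketch of the three-model flow using OpenAI’s Python SDK. This only illustrates the general approach, not OpenAI’s actual internal implementation, and the file name question.mp3 is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: transcribe the user's audio to text with a speech-to-text model
with open("question.mp3", "rb") as audio_file:  # placeholder file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: generate a text reply with GPT-4
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# Step 3: convert the reply back to audio with a text-to-speech model
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("answer.mp3")
```

Every hop in that chain adds latency, and the transcription step throws away tone, background noise, and speaker identity before GPT-4 ever sees the input.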

GPT-4o, on the other hand, is a single model trained end-to-end across text, vision, and audio, so all inputs and outputs are processed by the same neural network. Because GPT-4o is the first model to combine all these modalities, it can directly observe tone, distinguish multiple speakers, and understand background noise, and it can even output laughter, singing, and emotional expression. None of the previous models could do that.

The new flagship model can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, similar to human response time in a conversation, which makes conversing with the AI feel natural and real-time. That is a massive improvement over the 5.4-second average latency of the old Voice Mode with GPT-4.
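At launch, the API exposes GPT-4o’s text and vision capabilities, with audio and video support still to roll out, so here is a minimal sketch of a multimodal request sending text and an image to the one model in a single call. The image URL is a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One request, one model: text and an image in the same message
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this picture?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},  # placeholder
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```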

Free User Access

OpenAI has also made it their mission to bring advanced AI tools to users of the free version of ChatGPT. Although usage limits apply, ChatGPT Free users will be able to use GPT-4o and get access to many of its new features.

Language Tokenization

GPT-4o’s enhanced language tokenization is a significant step forward in natural language processing efficiency. It uses a refined tokenizer that reduces the number of tokens needed to represent text in many languages. This means faster processing, lower computational cost, and more efficient text generation, especially outside English.
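As an illustration, you can compare the two tokenizers yourself with OpenAI’s tiktoken library, which ships both GPT-4’s cl100k_base encoding and GPT-4o’s new o200k_base encoding. The sample sentences below are arbitrary choices, not OpenAI’s own examples:

```python
import tiktoken  # pip install tiktoken (o200k_base requires a recent version)

# GPT-4 used the cl100k_base encoding; GPT-4o introduces o200k_base
old_enc = tiktoken.get_encoding("cl100k_base")
new_enc = tiktoken.get_encoding("o200k_base")

samples = {
    "English": "Hello, how are you doing today?",
    "Hindi": "नमस्ते, आज आप कैसे हैं?",
}

for language, text in samples.items():
    before = len(old_enc.encode(text))
    after = len(new_enc.encode(text))
    print(f"{language}: {before} tokens -> {after} tokens")
```

Fewer tokens per sentence translates directly into lower cost and faster responses, since usage is billed and rate-limited by the token.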

Conclusion

OpenAI’s latest offering is exciting, to say the least. The live demos on their website give us an idea of what to expect as the OpenAI team rolls out features in the coming weeks, including the new Voice Mode and video capabilities.

As GPT-4o becomes more widely available, it is sure to change the way humans and AI interact. The seamless conversation between human and AI in the demo is quick and natural, pushing the boundaries of what is possible in AI.

Give us a follow to receive updates on GPT-4o and similar content. Also hit us up at www.woyera.com if you have any questions regarding AI or chatbots!

