Exploring the new GPT-4o model capabilities

Guilherme Rocha
Indicium Engineering

--

Last month, on the 13th, OpenAI took everyone by surprise by announcing their latest model, GPT-4o, a day before the Google I/O event.

Let’s take a closer look!

What is GPT-4o?

GPT-4o, with the “o” standing for “omni”, is a next-generation model designed to handle text, audio, and video inputs and to generate outputs in text, audio, and image formats.

This unified approach means that all inputs are processed by the same neural network, which results in more coherent and smooth interactions.

It matches GPT-4 Turbo performance on English text and coding tasks, and it is significantly better at processing text in non-English languages. In the API, it is also much faster and 50% cheaper.

Currently, the API only supports text and image inputs with text outputs, similar to GPT-4 Turbo. But additional modalities, like audio, are coming soon.
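
As a quick illustration of what is available today, a plain text-only request to GPT-4o goes through the same Chat Completions endpoint as before (a minimal sketch; the prompt is just an example):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Text in, text out, just as with GPT-4 Turbo
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize GPT-4o in one sentence."}],
)
print(response.choices[0].message.content)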

Multimodal capabilities, real-time interaction and responsiveness

Before GPT-4o, Voice Mode for ChatGPT relied on a pipeline of three separate models:

  • Transcribe the audio to text
  • Take the text in and generate a text response
  • Convert that text back to audio

Because the main model only ever saw text, it missed out on nuances like tone, multiple speakers, and background noise, and it could not output laughter, singing, or emotion effectively.

GPT-4o integrates all of these capabilities into a single model. It understands and expresses emotion more effectively, and it mimics human response times in conversation, averaging around 320 milliseconds.
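
To make the contrast concrete, the old pipeline can be approximated with three separate API calls, roughly like this (a minimal sketch, not the actual ChatGPT implementation; the file names and model choices are placeholder assumptions):

from openai import OpenAI

client = OpenAI()

# 1. Transcribe audio to text with a dedicated speech-to-text model
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2. Feed the transcribed text to the language model and get a text reply
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Convert the text reply back to audio with a dedicated text-to-speech model
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.write_to_file("answer.mp3")

With GPT-4o, those three steps collapse into one model that handles audio natively, so no information is lost in the intermediate text conversions.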

Practical examples — Image processing

Let’s take this image processing example from OpenAI’s cookbook. First, install or upgrade the OpenAI Python SDK and initialize the client:

pip install --upgrade openai

from openai import OpenAI
import os

# Initialize the client, reading the API key from the environment when available
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>"))

GPT-4o can directly process images and take intelligent actions based on the image. We can provide images in two formats: Base64 encoded or URL.

Let’s first view the image we’ll use:

from IPython.display import Image, display, Audio, Markdown
import base64

IMAGE_PATH = "data/triangle.png"

# Preview image for context
display(Image(IMAGE_PATH))
[Image: an example triangle with sides measuring 6, 5, and 9]

Now we can encode it to Base64 and send it to the client; for instance, let’s ask for the area of the triangle:

# Open the image file and encode it as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image(IMAGE_PATH)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)

And we get the correct answer in our console:

To find the area of the triangle, we can use Heron's formula. First, we need to find the semi-perimeter of the triangle.

The sides of the triangle are 6, 5, and 9.

1. Calculate the semi-perimeter \( s \):
\[ s = \frac{a + b + c}{2} = \frac{6 + 5 + 9}{2} = 10 \]

2. Use Heron's formula to find the area \( A \):
\[ A = \sqrt{s(s-a)(s-b)(s-c)} \]

Substitute the values:
\[ A = \sqrt{10(10-6)(10-5)(10-9)} \]
\[ A = \sqrt{10 \cdot 4 \cdot 5 \cdot 1} \]
\[ A = \sqrt{200} \]
\[ A = 10\sqrt{2} \]

So, the area of the triangle is \( 10\sqrt{2} \) square units.
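
For completeness, the second input format mentioned earlier also works: instead of Base64, we can pass the image by URL (a sketch; the URL below is a placeholder, not part of the cookbook example):

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown."},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/triangle.png"}}
        ]}
    ],
    temperature=0.0,
)
print(response.choices[0].message.content)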

Conclusion

Combining multiple input modalities, including audio, vision, and text, improves the model’s performance across a wide range of tasks.

This approach allows for more comprehensive understanding and interaction, mirroring how humans perceive and process information more closely.
