15 Minutes of Presenting with ChatGPT 4o’s Voice and I’m Never Typing Again

Peter Dresslar
Published in Kinomoto.Mag AI
May 16, 2024

Header image: Self / Midjourney

Dogs are sort of the polar opposite of AIs. So, naturally I had to start the meeting by:

  • Taking a picture of my dog, Aspen.
  • Uploading the picture of Aspen to ChatGPT.
  • Talking with ChatGPT’s new voice mode about Aspen and her life-defining commitment to finding all the socks.

I was hosting a hybrid Meetup event for Hawai‘i Center for AI. People started piling into the chairs and the Zoom right around noon.

No matter that it was about 50 hours after the big OpenAI Spring Update announcements. No matter that the desktop app hadn’t hit my MacBook yet, and that I was forced, much to my dismay, to hold a phone up to my computer to make the new voice features work over the Zoom video. There was absolutely no other way to start a meeting for a group of people eager to see the shiny new features of ChatGPT than with a few minutes of “who’s a good dog? Aspen’s a good dog!”

“ChatGPT, can you tell me what this dog’s name is?” “Sure! It’s Instant Audience Connection.”

I had to improvise just to get the meeting started in a way that made this possible. Basically, my presentation setup was: first, log in to an OpenAI account on the web/desktop (with the window shared in Zoom), then log in with the same account on mobile. With a bunch of the demo data preloaded, the phone then worked more or less like a primary controller.

Looking at a phone while presenting is empirically bad, but that is somewhat offset by the magic of a two-way conversation with an AI. That is, except for my repeated need to fiddle with the AI’s controls on the phone, since doing so forcefully dispels the magic. While I could use those down moments to follow along on the desktop view, hitting the refresh button on the desktop browser to sync with the phone inspired some trepidation; OpenAI is obviously working overtime right now to figure out which bits to send where for screen repaints.

Once we finished these canine-heavy introductions, I proceeded to work through some newly impressive examples from our back-catalogue of sample AI workflows. In one of them, I ask the chatbot to write a letter to “the boss” asking about paternity leave. I’ve run this flow at least ten times at past meetings. This time, though, rather than typing it out, I spoke my detailed request aloud. Dutifully, Juniper (the ChatGPT voice I had selected) started rattling off a totally-ChatGPT email with state laws and timetables.

I interjected. (!)

“Four, that email really doesn’t sound like me. Let me read you my most recent email to my boss, and maybe you can do it again in my voice? Here’s the old email:”

And then, again aloud, “Hi Susan, Quick update on the ACME Corp. widget bids. We got three of them and I put them all in the file. Do you want to call Glen to see if they are still planning to bid?”

While planning the talk, I was very much on the fence about reading that email example aloud. I could have cut and pasted it in a few seconds. But in the moment, it was clear that the extra time didn’t matter: it was far better to read the email text aloud for the audience.

Not surprisingly, the AI kind of crushed it:

Subject: Paternity Leave Request

Hey Susan,

Hope you’re doing well. Just wanted to give you a heads up — I’m going to need some paternity leave soon. The baby’s due on [due date], so I’m looking at taking my leave starting around then for about four weeks.

Hawaii’s got that Family Leave Law (HFLL) which gives me up to four weeks off for the birth of a child. I’ll make sure everything’s in order before I go and can help plan so things run smoothly while I’m out.

Let me know if you need any more info or if there’s anything else I should do to get this sorted.

Thanks!

ChatGPT already did this sort of thing well, months ago. I don’t think the output (or the blazing speed) is even the big deal.

As the room listened to the entire email output, there were smiles and convincing nods of understanding and connection — ones that have been challenging to achieve for audiences on the topic of fine-tuning AI output. Speaking with a few of the folks after the meeting confirmed this: people actually got it.

Another purported highlight of the Spring Update, the newly enhanced ChatGPT Memory, is awesome, works better than ever, and is absolutely spectacular when combined with a voice modality. Until it stops working.

I had practiced a helpful sequence several times during my workshop prep:

  • Upload a 30-page history of an architectural firm
  • Ask ChatGPT to remember it, including dates, projects, names, and key events
  • Start a new chat with voice prompts
  • Ask aloud, “Hey, Chat, when was that project we did at the military base completed? What year?”

This was a scene-stealer during my practice runs; you could imagine this exact conversation around a conference table piled with beers at the end of a long week.

Needless to say, during our actual Meetup session, our AI friend entirely forgot who I was and why I was asking.

In the moment, I could figure out only one way to remind ChatGPT what I was talking about: use the keyboard and mouse to re-upload the history document, which was only available on the desktop device. It was at this point that I realized I had gone ten-plus minutes into an LLM demonstration without so much as typing a word. Suddenly, typing in front of a crowd felt really weird, carrying just a faint whiff of moving back in with the parents.

Eventually I was able to recover the situation, and people still seemed responsive to the demonstration of enhanced ChatGPT Memory. Still, it would have been way easier if the phone app didn’t drop everything from the interface when you are working with voice. There are obvious workarounds to this; the voice animation doesn’t need to lock up the whole UI. And, at this point, OpenAI can spend the next few months whittling away at any required interactions that can’t use voice instead.

Not for nothing, OpenAI’s actual flagship Large Language Model has been, with some possible exceptions, drastically improved. And, yes, the speed is just undeniable. It’s mana from heaven not having to wait thirty real-time seconds in a quiet conference room to get a five-paragraph wall of text back from ChatGPT explaining the Late Cretaceous or some similar arbitrary nugget of vertebrate experience.

Our workshop crowd was suitably impressed when I asked for the most appropriate funding options for a fictional Food Truck Safety and Security Nonprofit, and even more impressed that the response included links on screen that mostly (though not always) worked. In particular, when prompted for three very near-term opportunities that would be most applicable to the hypothetical organization, the chatbot returned some very solid leads.

A respectful hush descended as I uploaded the 2023 Honolulu CEDS document, a 117-page, 23 MB, graphics-bedazzled monolith of a PDF. “Chat, can you please summarize this document” worked more or less flawlessly in under ten seconds.
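For anyone who wants to recreate that trick outside the ChatGPT app, a rough do-it-yourself analogue looks something like the sketch below, assuming the pypdf library and the OpenAI Python SDK (the file name is hypothetical, and this is emphatically not how ChatGPT handles uploads internally; it just scrapes the raw text out of the PDF and asks gpt-4o for a summary):

```python
# Rough DIY analogue of "Chat, can you please summarize this document."
# Assumes: `pip install pypdf openai` and an OPENAI_API_KEY in the environment.
# The file name below is made up for illustration.

from pypdf import PdfReader
from openai import OpenAI

reader = PdfReader("honolulu_ceds_2023.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Crude guard so the prompt stays inside the model's context window.
text = text[:300_000]

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You summarize long planning documents."},
        {"role": "user", "content": f"Please summarize this document:\n\n{text}"},
    ],
)
print(response.choices[0].message.content)
```

Of course, plain text extraction drops most of the charts and tables along the way, which is a big part of why documents like this one stay hard for any LLM to reason about.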

However, when I started chatting up Juniper about the tome I’d just up-gested into the conversation, elation turned to awkwardness pretty quickly.

Clearly I will need a voice coach to use AI in the future

I had to pick and choose a bit to find questions that the chatbot could answer correctly about the text. Now, the example I used seems like an extremely difficult document for any LLM** (or most humans) to make sense of, since it is riddled with all styles of graphical data displays, tables, layout changes, surveys, and anything else a PDF can hurl at the screen.

Nonetheless, the past few days of working with 4o* have honestly reminded me of the early days of GPT-4, with mistakes, weird chat names, and occasional nonsense.

Worst of all, my gut feeling is that I have seen more hallucination in the last two days than I had in the prior month with the previous version. To be fair, I’ve hardly been using ChatGPT at all since Claude 3 Opus hit. For our workshop audience, we had a slide to collect all our “bummers” from the first few days of 4o.

Our “official” GPT-4o-bummers slide, as of 15 May 2024

Bummers aside, I’d call ChatGPT 4o borderline revelatory. Yes, four weeks ago you could already have built a very rough facsimile of the bot’s new multimedia features using a patchwork of third-party tools and approximately a D&D party’s worth of techies. (Definitely load up on fireball if you’re planning to try that in the future.)
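For the curious, one version of that patchwork, here using OpenAI’s own building blocks rather than true third-party tools, is sketched below: speech-to-text, then a chat completion, then text-to-speech, with the file names invented for illustration. Every hop adds latency and another chance to fumble on stage, which is exactly the gap the integrated voice mode closes.

```python
# The old "patchwork" approach: three separate hops instead of one voice mode.
# Assumes `pip install openai`, an OPENAI_API_KEY, and a prerecorded question.wav.

from openai import OpenAI

client = OpenAI()

# 1. Speech to text.
with open("question.wav", "rb") as audio_in:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_in)

# 2. Think.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
reply = chat.choices[0].message.content

# 3. Text back to speech.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
speech.stream_to_file("reply.mp3")

print(reply)
```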

But, it’s the near-seamless orchestration of all OpenAI’s new features working together — and, critically, the AI model itself having increasing information about that orchestration — that really points the way to the future.

Thinking about that future, one of the most interesting lessons of the day was this: using my phone and my desktop at the same time was… just about the perfect multimodal approach? Even with all the issues with user flow, it was extremely effective to use the phone as a microphone and speaker and the laptop screen as a text readout. This would be a killer app for phone-laptop sync, which presumably is what has Apple brass calling Sam Altman every morning (or something like that). Multimodal probably means many eyes, many ears, many mouths, and even many faces.

4o is still ChatGPT. I’d sometimes rather be talking with Claude, though the new version has clearly closed the gap quite a bit. I’ll probably continue to use Gemini for all my workiest-work communications.

As a working partner that can help me do stuff every day, though, ChatGPT seems to be sui generis. And our workshop participants can dabble with it right on the spot, for free. What else do you need to hear?

Peter Dresslar is Executive Director of Hawai‘i Center for AI and a Program Manager at American Samoa Community College.

* Unsatisfied with the high-but-not-high-enough level of consumer confusion with their branding, OpenAI cleverly decides to use a linear term to name their next product release. Hey, people everywhere love algebra, right?
** I’ve seen the new Gemini Advanced (1.5), and it’s definitely impressive. However, I can’t get Gemini’s voice recognition to work for me on a treadmill, while I can do honest-to-goodness work with Juniper on a Stairmaster at this point. Also, while I suspect that Gemini now does a touch better with large documents, it also fails every prompt that we saw ChatGPT similarly failing with the massive CEDS document. Incidentally, if anyone has read this far, that particular CEDS document has a tremendous logical challenge for future AIs to tackle buried deep within. One of the analyses is subtly broken; email me if you find the one I am speaking of.
