Talking to My AI (Part #1) — Exploring the surprisingly useful illusion of human-like conversation
by Adam Connors
The AI-Augmented Life project is exploring ways that AI might change our everyday lives in the coming years. One theme that stands out across all the interviews, experience sampling, and workshops we’ve conducted so far is the idea of a personal, shared-context assistant.
It’s a common vision. An assistant to help us with the minutiae of our lives: scheduling appointments; taking notes during meetings; reminding us of the names of our best friend’s children. We imagine our own personal Jarvis helping to iron out the wrinkles in life so that we can focus on the things that matter most to us.
It’s no surprise then that both Google and OpenAI have started to put a heavy focus on mobile multimodal assistants that can have human-like conversations with us.
But whilst the natural back and forth of these conversations is super impressive, I’ve been wondering how these agents will evolve into something that is genuinely useful. Once the novelty passes, how much time will we really spend talking to our AI assistants?
I set myself a task:
What happens if I talk to my personal AI for at least 10 minutes every morning?
- What will we talk about?
- Will I find the conversations useful?
- How will AI assistants need to evolve for us to get the most out of them?
I built a very basic voice-to-voice prototyping platform to support this work (see endnotes) and this is my very first conversation with it.
The ELIZA Effect
Looking past the obvious clunkiness of this particular implementation, there are glimpses of something joyful here. Under the hood, Gemini doesn’t know it’s having a voice-based conversation, but talking to it already feels different from a typed conversation.
Here’s a snippet. It’s mundane enough, but in the moment a small part of me forgot that I was talking to a chatbot and began to focus instead on the conversation itself.
Me: If I were to make a habit of talking to an AI every morning, how would I get the best learnings from that process?
AI: There are a few things you can do to get the best learnings from talking to an AI every morning. Set clear goals for your conversations. Take notes or record your conversations. This will help you track your progress and see how your thinking has evolved over time.
Me: It sounds like a diary study. Is there a diary study template I could use to make sure I’m capturing the right information?
AI: There are many templates you could use to capture the right information from your interactions with the AI. Here’s a simple template you could use…
For example, when I say, “Is there a diary study template I could use?”, the AI anticipates my intent and leads straight into giving examples: “There are many templates you could use, here’s a simple template…”
Humans love interacting with other humans, and in the moments when the illusion flickers into life — when the AI meets my unspoken expectations — I find myself leaning forward, engaging more deeply than I would have if I were typing. My thinking becomes more nimble, my thoughts flow more quickly.
Voice + language models is much more potent than I’d first imagined.
This tendency to anthropomorphize conversational technology dates back to the 1960s and is known as the ELIZA effect. There are risks associated with anthropomorphizing AI, and Google DeepMind has recently released a major paper on this topic. But there’s a flip-side as well. If we can create the illusion of something that feels just real enough, I think it might add a cognitive supercharge to whatever task we’re trying to complete.
It used to be that talking to a computer was too clunky to create this illusion effectively, but language models have changed all that. Even in this simple experiment with off-the-shelf public APIs, it’s clear that language models make voice much more potent than it was before, and that this will bring about a fundamental shift in how we think about human-AI interaction. The next generation of AI assistants will provide much more than the short, transactional exchanges we’ve seen before, and that gives them the potential to become a truly ubiquitous part of our lives.
After all, humans have been talking to each other for far longer than they’ve been reading each other’s words.
Beyond first impressions
Out of the box, however, the illusion is fleeting. And even the improved latency and fluidity of duplex streaming or a more expressive voice will only take us so far. The content of the responses is crucial.
There’s a useful analogy here with books and films. Books and films work because they create the illusion of interacting with real people (or aliens, unicorns, elves, etc.), and they fall flat when we accidentally break that illusion. Many of the clichés of writing courses (show, don’t tell, for example) are born of our need to avoid breaking the illusion that there’s a real intelligence behind the words.
It’s the same with voice agents. When the illusion of a human-like interaction flickers into life, my engagement (my ability to think) rises. But it doesn’t take much for the illusion to come crashing down.
There are broader versions of this question in the context of evaluating model performance, like: Does this conversation feel natural? But I find “breaking the illusion” a more intuitive framing, one that makes it easier to enumerate the moments when the conversation stopped feeling natural (a rough sketch of how these checks might be automated follows the list):
- If the AI repeats the question I just asked.
- If the AI gives too long of an answer.
- If the AI gives me a set of bullet-points.
- If the AI repeats the same word within a couple of sentences.
- If the AI falls back on the same stock phrase again and again (e.g. “That’s a great question!”).
- If the AI tries to answer the question without enough information (humans do this too, by the way).
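These failures are mechanical enough that many of them can be caught automatically. Here’s a minimal sketch of what such a checker might look like; the thresholds and phrase list are illustrative guesses, not values from my experiments:

```python
import re

# Illustrative heuristics only: real evaluation would need human raters
# or a judge model, but even crude checks catch the worst offenders.
def breaks_illusion(question: str, answer: str) -> list[str]:
    """Flag the illusion-breaking behaviors enumerated above."""
    issues = []

    # Parroting: the answer opens by restating the question.
    q_words = set(question.lower().split())
    first_sentence = answer.lower().split(".")[0]
    if q_words and len(q_words & set(first_sentence.split())) / len(q_words) > 0.7:
        issues.append("repeats the question")

    # Monologuing: far longer than a spoken turn should be.
    if len(answer.split()) > 80:
        issues.append("too long for speech")

    # Visual formatting that makes no sense read aloud.
    if re.search(r"^\s*(?:[-*]|\d+[.)])\s", answer, re.MULTILINE):
        issues.append("bullet points or numbered list")

    # Stock phrases that grate with repetition.
    for phrase in ("that's a great question", "as an ai"):
        if phrase in answer.lower():
            issues.append(f"stock phrase: {phrase!r}")

    return issues
```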
I spent some time prompting my AI to encourage it to break the illusion less often, and I had some modest successes.
Me: If I were to make a habit of talking to my AI every morning, how can I get the best learnings from that process?
Unprompted model: There are a few things you can do to get the best kind of learnings from talking to an AI every morning. [1] Set clear goals for your conversations: What do you want to learn from the AI? Do you want to improve your knowledge of a particular subject? Do you want to get help with a specific problem? Or do you simply want to have interesting and thought-provoking conversations? [2] Be open-minded and willing to explore new ideas. The AI can help you see things from a different perspective and expand your thinking. Don’t be afraid to ask questions and challenge your own assumptions.
[1] Bad: repeated the question
[2] Bad: bullet-points and overly long
Prompted model: To get the best learnings from talking to your AI every morning, focus on specific goals or topics you want to explore. [1] Ask thoughtful questions, engage in discussions, and reflect on the insights you gain. Consistency and active participation will enhance your learning experience. [2]
[1] A more natural response
[2] Better: bullet-points are (somewhat) smoothed out and the response is shorter
But it took a lot of laborious interaction: noting down moments where the illusion broke so that I could update my prompt and try to stamp out those behaviors.
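For concreteness, the style rules that accumulated in my prompt looked roughly like this (an illustrative reconstruction, not the exact wording I used):

```python
# An illustrative reconstruction of the style rules that accumulated in
# the system prompt; the real wording evolved with every broken illusion.
VOICE_STYLE_PROMPT = """\
You are having a spoken conversation, not writing a document.
Never repeat the user's question back to them.
Keep answers to two or three sentences, then stop.
Never use bullet points, numbered lists, or headings.
Avoid stock openers like "That's a great question!".
If you don't have enough information to answer, ask a short
follow-up question instead of guessing.
"""
```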
If we’re going to get the most out of voice-to-voice agents — if they’re truly going to become a part of our day-to-day lives — they’ll need to maintain the illusion of human-like interactions more consistently and over a longer time period than they currently can.
That probably means prompting, training, tuning, and evaluating them differently from text-based models. “Breaking the illusion” is a useful yardstick for the quality of a voice-based interaction, and it might prove helpful to anyone crafting prompts or pulling together training datasets for such agents.
Conclusion
At first glance, adding voice on top of a language model seems like a rather small addition, but even this simple prototype surprised me with how compelling it is to talk to an AI.
To reach their full potential, however, AI assistants need to go beyond single-turn transactional exchanges and be able to maintain the illusion of a real conversation over much longer, multi-turn activities. There’s a lot of nuance yet to be understood in what this entails, and new benchmarks, datasets, and methodologies are going to be needed in order to optimize language models for this kind of interaction.
Over the next few weeks I’m going to use my agent on a regular basis and I’ll report back on how it goes.
Endnotes — The Conversational Toolkit
The test-rig I used to create this blog was really very simple. I used the Web Speech API to convert voice to text in the browser, sent the text to a Gemini model on Vertex AI, and then used the Cloud Text-to-Speech API to convert the response back to voice.
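In code, the server half of that loop is only a few lines. Here’s a minimal sketch, assuming the browser’s Web Speech API has already produced a transcript; the project, model, and voice names are placeholders:

```python
# Minimal sketch of the server half of the loop. The browser's Web
# Speech API sends us transcribed text; we reply with synthesized audio.
# Project, model, and voice names are illustrative placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel
from google.cloud import texttospeech

vertexai.init(project="my-project", location="us-central1")
chat = GenerativeModel("gemini-1.0-pro").start_chat()  # keeps turn history
tts = texttospeech.TextToSpeechClient()

def reply(transcript: str) -> bytes:
    """One conversational turn: user text in, spoken audio out."""
    answer = chat.send_message(transcript).text
    audio = tts.synthesize_speech(
        input=texttospeech.SynthesisInput(text=answer),
        voice=texttospeech.VoiceSelectionParams(
            language_code="en-GB", name="en-GB-Neural2-B"
        ),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    return audio.audio_content  # ship back to the browser for playback
```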
More sophisticated implementations (like Astra and ChatGPT’s mobile app) have been engineered with real-time interactions in mind, and have more expressive voices, but the underlying content of the interactions is much the same. The great advantage of the Conversational Toolkit is that it is open source and uses public APIs, so it is very easy to hack around with.
Language models work best when they’re carefully prompted or embedded in a larger system (a prompt-chain), and AI assistants will require broader integrations in order to be genuinely helpful. The Conversational Toolkit is available here, and it’s designed to make it super easy to build custom assistants with a single Python class.
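To give a flavor of what that looks like (an illustrative shape, not the toolkit’s actual interface), a custom assistant reduces to little more than a persona prompt and a respond method:

```python
from vertexai.generative_models import GenerativeModel

# Illustrative only: the rough shape of a single-class custom assistant,
# not the Conversational Toolkit's actual interface. Assumes
# vertexai.init(...) has been called as in the earlier sketch.
class MorningAssistant:
    SYSTEM_PROMPT = (
        "You are a concise spoken-word assistant. Help the user reflect "
        "on their goals for the day. Two sentences per turn, no lists."
    )

    def __init__(self):
        # Send the persona as the opening turn so it shapes every reply.
        self.chat = GenerativeModel("gemini-1.0-pro").start_chat()
        self.chat.send_message(self.SYSTEM_PROMPT)

    def respond(self, transcript: str) -> str:
        """One turn: user text in, assistant text out."""
        return self.chat.send_message(transcript).text
```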
There’s often a gap between what is technically possible and what’s ready to be a deployed product, and we hope that by making it easier to play around with what’s technically possible in this space, we’ll learn more about where the really transformational changes might come from.
Subscribe, and watch this space for a future post that will walk through the process of building a voice assistant using this toolkit in more detail.