The Future Will Not Be Voice First

Daivik Goel
Published in uWaterloo Voice
Apr 28, 2024 · 7 min read

In the last couple of weeks, there’s been a real discussion about whether voice will be the primary input in an AI future. The team at Humane and Naval Ravikant really believe so.

I tend to like both Bethany Bongiorno and Naval, but I really disagree with this take. Here’s why:

  • A multi-tier approach with voice as an optional secondary input just makes more sense to me
  • Having voice as the sole or primary input misses how 95% of people live

I’m happy to be proven wrong, but building that limitation into your product severely caps its growth potential, and it can never truly compete with a product that takes a multi-tier approach.

That’s alright for an experiment like AirChat, but it’s fundamentally flawed when you’re releasing a product that asks customers for $700 after raising over $280 million in VC funding.

Why do I feel so strongly about this?

What inherent benefits make Naval and Humane believe voice is the future?

Why do I think having voice as the primary input is a move that will fall flat?

Before I dive in, it’s important to reiterate that this isn’t a criticism of voice as an input method but rather of it being the sole or primary input in the future. Let’s discuss:

Usage within a Day

In an ideal world with flawless speech recognition, there’s no doubt that speech would be the fastest way to convey information. But we don’t live in an ideal world, and there are countless scenarios where voice as a primary or only input mechanism fails.

In fact, I’ll go so far as to say it would fail for 80% of the average person’s day.

Think about it:

  • How many people want to be talking out loud at work?
  • Do I really want Jimmy from the office knowing I sent a text to my ex?
  • You want to ask your AI pin to send stuff while you’re on the toilet?
  • How many people want to say anything on a train or plane?

How many scenarios are there where you can realistically say whatever you want without disturbing or being disturbed by people around you, or without them knowing what you’re thinking and sending?

And is the 5 seconds you save in those moments really worth buying a whole product for $700?

If I’m with my nephew and want to take a photo of him, I don’t want to tell the Humane AI pin to “take a photo of my nephew” and risk it not recognizing what I’m saying because he’s running around screaming.

I’d much rather discreetly take out my phone and capture the moment without him noticing than go through that whole voice command flow.

In the same vein, writing something profound sometimes takes multiple revisions. Having to redo an entire tweet or thought rather than editing specific parts is super annoying. It’s easy to say people will just re-record the parts they want to change, but as anyone who’s created video content will tell you, that workflow is painful.

I don’t see why or how people would prefer that over just editing the specific parts they want to change.

Despite these arguments, I’ll agree that in the 5% of scenarios where voice input is beneficial, the low latency between communicating something and getting a response is definitely a plus.

But that makes it all the more baffling to me that the AI Pin was released with such insane latency, when low latency was its biggest inherent benefit!

If you can’t provide an experience faster than a smartphone, even early adopters won’t pay $700 + $24 per month.

Context

I think the context carried in your voice is definitely a factor that can’t be replicated via keyboard. It is fundamentally the justifying property for something like AirChat. It’s cool and truly inarguable.

But do you need to add voice context for everything you say online? How often is it a real value-add?

If Twitter released a similar feature, would voice context on every post justify sticking with AirChat?

Do you care about voice context for 95% of the content you consume?

Are you willing to pop in headphones or fire up your speakers every time you read something to get context?

My guess is no.

Accessibility

For people with disabilities, this technology can absolutely be groundbreaking. There’s no argument there. However, voice as a primary input isn’t being pitched as just for people with disabilities but for the general public too. So we need to look at it from that perspective.

There’s definitely an inherent benefit for people speaking different languages, and real-time translation is a cool feature.

But how many scenarios do you encounter that require this on a daily basis?

Is that enough to justify buying a primary voice device rather than having an app on your phone that does the same thing?

So, why do legendary product builders like Naval and the Humane team feel differently about this?

Why didn’t they adopt a multi-tiered approach where the user can choose their preferred input method?

After racking my brain trying to figure out why, I’ve landed on a simpler explanation than you might think: it’s just something that differentiates them from the status quo of devices and platforms.

A Humane AI Pin with a keyboard and screen is essentially an AI-first smartphone that’s worn on your shirt rather than kept in your pocket, always watching and listening.

AirChat with the ability to type is essentially just Twitter with the option for voice-recited tweets.

We’ve reached a point where the existing giants are really, really good at what they do. Creating a new product to compete nowadays is hard, even more so when you’re asking people to change the way they interact with technology.

What facilitated those behavioral changes in the past was a massive inherent benefit that people would miss out on if they didn’t switch. And even then, these changes still take time. People kept using BlackBerry devices for years, solely because they were comfortable typing on physical keyboards, despite the availability of touchscreen smartphones.

So, my big question for voice-first products like AI Pin and AirChat is: what’s the massive inherent benefit that only a voice-first product can truly offer that can’t be achieved with a multi-tier approach?

Until that question is answered, I really don’t see a future for products that embrace being voice-first outside of niche usage.

In my opinion, the only way to create a successful voice-first product or application is when it’s used in situations where voice is the most convenient input method, like when you’re driving, cooking, or your hands are occupied.

But justifying those moments with an entire product or application rather than an additional feature on your current products and apps?

That’s a tough sell for me.

What do I think the next breakthrough in input will be?

Honestly, I have no idea, but there are a couple of interesting things on the horizon.

My friend Chris is working on a headset that tracks your mouth movements to decipher what you’re saying without actually saying it out loud.

Obviously, it’s not perfect in its current state, but if the technology gets good enough, combining it with AR glasses that have capabilities similar to the Vision Pro could be something interesting, though we’d need significant strides to get there 🤷‍♂️

The best real path I see is something like Neuralink, which, as a consumer product, would be an insane paradigm shift for everything. Maybe we’ll all end up looking like Thufir from Dune while scrolling Instagram.

Perhaps whatever Jony Ive and Sam Altman are cooking up will be the next big thing. I did a deep dive on that a couple of months ago.

I’m definitely curious to see what happens, and I really respect the attempt to try something different. But my bet is that voice-first won’t be the primary input method in the future.

Check out this Podcast Episode!

You can also listen on Spotify, Apple Music, and Google Podcasts.

Join us for an insightful episode with Utkarsh Sengar, Director of Engineering at Webflow, as he delves into how Webflow is transforming website building.

Uncover the strategies that propelled Webflow to dominate the market and explore how the platform navigates the line between no-code and traditional coding functionality.

I recently started a bi-weekly newsletter about IRL events, startups, and building products! These articles will be posted there first, so feel free to subscribe.

Thanks for reading,

Daivik Goel
