Building picks and shovels for voice

Before there was the printing press, the pen or even the papyrus scroll, there were stories: stories of wisdom, ideas and understanding of the world, passed down through generations as oral history. Around campfires in ancient caves and water coolers in modern offices, voice is, and always has been, the most natural medium for people to communicate.

Anup Gosavi
Spext
Jul 5, 2018

Voice as a second-class citizen

However, when the computer revolution happened, voice became a second-class citizen on the web, as computers found it notoriously hard to understand speech. The voice content we create in interviews, videos or podcasts largely remains in a silo: it is not indexed, and computers cannot easily search or discover it.

There are billions of hours of voice content that are dark: not searchable or easily accessible by search engines.

The second coming

On one hand, there is a huge amount of voice content that is not accessible; on the other, voice is having a second coming as a way of interaction. With devices like Alexa, Google Home and Apple AirPods, the friction in accessing audio content is falling and, in turn, more and more people are consuming voice content.

So the demand for content is rising. A lot of evergreen content already exists and more will be created. We are moving to a world where voice will be a popular way of interacting with computers. So, what’s missing?

Picks and shovels

Whenever a new method of interaction appears, there is explosive demand for a new type of content, but meeting it requires easy-to-use tools (picks and shovels) that anyone can use to create this content.

We have seen this before. With the rise of smartphones, touch screens became important and the world needed easy-to-use tools to create visual content, giving rise to companies like Canva, InVision and Sketch. The word processor did the same for text. Such tools are still missing for voice.

The core issues are:

1. Processing voice is hard — you need to look at waveforms to make any changes to it. What works for music does not really work for voice.

2. Processing voice is expensive — you need to work with transcriptionists and audio engineers who charge by the hour.

If we are to meet the demand for voice content, we need to make interacting with voice as easy as text.

The good news is that speech-to-text technology is improving and its accuracy is approaching human levels. Like cloud storage, this technology will become a commodity and greatly reduce transcription costs.

Source: http://www.kpcb.com/internet-trends (Page 48)

So, finally, after more than 75 years of computing, we are at a stage where the natural mediums of communication for computers (text) and humans (voice) can come together.

This fusion of text and voice will fundamentally change media and content infrastructure. Billions of hours of voice content in meetings, webinars, conferences, lectures and panel discussions will be unlocked for sharing, repurposing, analysis and dissemination. And it will be available via voice assistants.

In the future, all you will have to do is say, “Alexa, show me where John talks about Paris.”

Alexa will know who John is, which videos he talks about Paris in, and when and for how long he does so, and will automatically play that portion.
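To make that query concrete, here is a minimal sketch in Python of how such a lookup might work once speech has been turned into text. It assumes a transcript already split into time-stamped segments labelled by speaker; the `Segment` type, the `find_mentions` function and the sample data are all hypothetical illustrations, not a description of how Alexa or Spext actually work.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of speech in a transcript (hypothetical structure)."""
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def find_mentions(segments, speaker, keyword):
    """Return (start, duration) pairs for segments where `speaker` says `keyword`."""
    keyword = keyword.lower()
    hits = []
    for seg in segments:
        # Normalize words: drop surrounding punctuation and ignore case.
        words = [w.strip(".,!?\"'").lower() for w in seg.text.split()]
        if seg.speaker.lower() == speaker.lower() and keyword in words:
            hits.append((seg.start, seg.end - seg.start))
    return hits


# Made-up sample transcript, for illustration only.
transcript = [
    Segment("John", 0.0, 6.5, "Thanks for having me on the show."),
    Segment("Mary", 6.5, 12.0, "Have you been to Paris recently?"),
    Segment("John", 12.0, 21.5, "Yes, Paris was the highlight of my trip."),
    Segment("John", 21.5, 30.0, "After that we flew to Rome."),
]

mentions = find_mentions(transcript, "John", "Paris")
print(mentions)  # [(12.0, 9.5)]
```

Each result gives the offset at which to start playback and how long to play. The hard parts in practice (accurate transcription and knowing who is speaking) are exactly what this sketch takes as given.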

The future of voice is exciting, and at Spext we are building picks and shovels to accelerate toward that future. We are excited to see what others build. If you want to discuss ideas on the future of voice, drop me a mail at anupgosavi [at] gmail.


Perpetually curious. Simplifier. Co-Founder of Spext.