Robocalls Are About To Get A Lot Worse

Dan Stieglitz · Published in The Signal · 4 min read · Mar 31, 2023

The combination of realistic chat and voice cloning is going to cause a major shift in how we trust anyone on the phone

The other day I got a phone call. Normally, my phone filters out a good deal of spam calls (although not enough for my liking), but I get a lot of business calls, so I end up answering some of them. Usually against my better judgment.

The call started with a prolonged silence, which is never a good sign.

“Hello?” I said dutifully.

After another pause, the caller responded: “Hello. My name is Juan.”

Another pause, and I said “Can I help you?”

“I’m sorry. Hello? I can’t hear you. I’m sorry. Let me call you back.”

And Juan promptly hung up.

Sounds innocuous enough, and the number did call back, but this time it was a real estate promotion of some kind (I found an actual spam report for the number on the web). To be clear, it was the kind of call I didn’t want to listen to, and that’s when I hung up. But the call got me thinking about my buddy Juan, and what he really wanted when he called me. I can only guess right now, but I’m pretty sure Juan (if that is your real name) was some kind of calling probe bot, making sure I was a human so that the promotional call would get through to someone who would pick up the phone and listen. Fucking brilliant, frankly.
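To make the tactic concrete, here’s a speculative sketch of how such a probe might work. Everything in it is hypothetical (the function names, the telephony hooks, the heuristic); real spam operations run on commercial dialer platforms, but the logic is the same: stay quiet, check for a live human, and queue the number for the pitch.

```python
import time

live_numbers = []  # numbers confirmed to reach a real person

def sounds_human(transcript: str) -> bool:
    # Crude heuristic: a live person answers with something short like "Hello?"
    greetings = ("hello", "hi", "yes", "who is this")
    return any(g in transcript.lower() for g in greetings)

def probe_call(number: str, get_transcript) -> None:
    # Open with silence: voicemail greetings keep talking, humans say "Hello?"
    time.sleep(2)
    transcript = get_transcript()  # hypothetical speech-to-text hook
    if sounds_human(transcript):
        live_numbers.append(number)  # hand off to the promo dialer later
        # Bow out politely so the target doesn't get suspicious
        print("I'm sorry, I can't hear you. Let me call you back.")

# Simulating my call with "Juan":
probe_call("+15555550123", get_transcript=lambda: "Hello? Can I help you?")
print(live_numbers)  # ['+15555550123'], queued for the real estate pitch
```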

With the recent GPT and generative A.I. announcements, however, I am now pretty sure this tactic can and will be used for much more nefarious ends, and we need to start educating ourselves about what’s coming down the pike. Most of us already know about deepfakes, and we’ve seen some of the recent ones circulating on social media. But right now deepfakes are still in the realm of “someone else’s problem,” and that is changing fast.

Back in June 2022, I attended VidCon in Anaheim, CA. At my company Stainless A.I., we do a lot of work building computer vision applications with a video production focus, and VidCon is a creator conference filled with folks who produce content for platforms like YouTube, TikTok, and Vimeo. I sat down for a quick lunch near the food trucks. Two very polite young guys asked me if they could sit at the other side of the bench; I nodded, and they sat down to their lunch and started talking in what I could only identify as some Eastern European language. I asked them where they were from. “Poland,” was the response, and we began to chat.

They were founders of a company called Eleven Labs, they explained, and their tech produces synthetic voices. Maybe you’ve read about them in the news.

To be clear, I don’t think these guys are “up to no good,” but their technology is representative of a class of A.I. applications that can replicate any voice, including the cadence, stutters, pauses, and all the other human parts of language we’ve come to expect from members of our species. If you want to hear their demo, they have one on YouTube (and the tech has advanced since that early demo).

Eleven Labs can replicate your voice with a sample as short as 1 minute and 30 seconds, and there is video of the process online.

I don’t think Eleven Labs will intentionally put this technology to malicious use, but everyone needs to be educated on what’s out there and what’s coming, because there is no doubt that someone will use this tech for evil. If two kids from Poland can build this technology with a small team, you can be sure other bad actors with greater resources will do so.

I don’t post much video or audio to social media, but hundreds of millions of people do. All that high-quality audio can easily be pushed through a cloning process like this one, producing a voice realistic enough to fool someone on the phone, or even in a video chat once head-cloning tech becomes mainstream (and as people who develop video A.I. tech, we see it coming to fruition soon). Even my good friend Juan could have recorded my responses on the phone and put them in a database for future use. Once my voice is cloned, people can use LLMs trained on my public social media posts to create realistic-sounding text in my style, and that audio can be generated in real time, letting them convince someone they are talking to me. Couple that with the fact that almost all of our personal data is online and available to fuel convincing conversations, and the odds are stacked against us. You can read my thoughts on LLMs in a separate post.
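To show the shape of that pipeline, here’s a minimal sketch. Every function in it is a stand-in I made up for this post; real cloning services and LLM APIs look different, but the flow is the point: scrape public media, clone the voice, generate text in the target’s style, and speak it back in real time.

```python
def scrape_public_media(handle: str) -> tuple[bytes, list[str]]:
    # Stand-in: pull audio clips and text posts from public profiles
    audio_sample = b"\x00" * (16000 * 90)   # pretend: ~90 seconds of speech
    text_corpus = ["post one", "post two"]  # pretend: years of public posts
    return audio_sample, text_corpus

def clone_voice(audio_sample: bytes) -> str:
    # Stand-in: a cloning service trains a synthetic voice from the sample
    return "voice-profile-123"

def generate_reply(text_corpus: list[str], heard: str) -> str:
    # Stand-in: an LLM conditioned on the target's public writing style
    return "Hey, it's me. I'm in a jam. Can you wire me some money?"

def speak(voice_profile: str, text: str) -> bytes:
    # Stand-in: real-time text-to-speech in the cloned voice
    return text.encode()

# The loop an attacker would run live, once per conversational turn:
def impersonate(handle: str, heard_on_call: str) -> bytes:
    audio_sample, text_corpus = scrape_public_media(handle)
    profile = clone_voice(audio_sample)
    reply = generate_reply(text_corpus, heard_on_call)
    return speak(profile, reply)  # played back over the call

print(impersonate("@dan", "Dan, is that you?"))
```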

As a technology entrepreneur, I really don’t like to post these kinds of gloom-and-doom scenarios. These last few weeks, however, have changed the game as to what we can do with A.I. technology, and our lives need to adapt, fast. We will need shifts all over the place, economically, politically, and culturally, to really mitigate the danger of what’s now possible. We will need counter-technology to enhance trust, and even low-tech solutions like the old-fashioned spy trick of phrase-and-response.

Until the powers that be really understand the plot here (and there isn’t any indication they are paying attention right now), we need to arm ourselves with knowledge and vigilance. Do you have a phrase and response that only you and your family know? Create one, and use it to verify identity when family calls. Don’t post it anywhere. Your voice, and soon live video of you, will not be enough.
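For what it’s worth, a family phrase is just a shared-secret challenge-response, the same idea behind HMAC authentication in software. Here’s a sketch of the digital analogue (the secret and the flow are illustrative, not a product recommendation):

```python
import hashlib
import hmac
import secrets

SHARED_SECRET = b"agreed-on-in-person-never-posted"  # exchanged offline only

def make_challenge() -> bytes:
    return secrets.token_bytes(16)  # fresh nonce, so recordings can't be replayed

def respond(challenge: bytes, secret: bytes) -> str:
    return hmac.new(secret, challenge, hashlib.sha256).hexdigest()

def verify(challenge: bytes, response: str, secret: bytes) -> bool:
    expected = respond(challenge, secret)
    return hmac.compare_digest(expected, response)

# You issue a challenge; only someone holding the secret can answer it.
# A cloned voice, on its own, can't.
challenge = make_challenge()
answer = respond(challenge, SHARED_SECRET)  # what the real family member computes
print(verify(challenge, answer, SHARED_SECRET))  # True
```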


Dan is the CEO of Stainless AI, Inc., which provides cognitive computing solutions to businesses through machine learning and artificial intelligence.