UX Case study: AI voice replicating app

An app that transforms your voice into an AI-generated replica. You could call it a deepfake.

RaghavPrasannaUX · Bootcamp · 8 min read · Jan 17, 2024


Fresh screens

One of my friends was starting his content creation journey. Typical Gen-Z stuff. I was helping him along the way, and hey, I'm Gen-Z too!

But there was a big issue. He was… let's just say shy, anxious and timid. Too self-conscious to record a voice-over for his video or audio content. My UX brain started tingling. I started digging into the core of this issue and how to solve it. That's when I realized my friend was not the only one in this situation. There are billions (yes, billions) of peeps who struggle on that front.

Whenever there’s a problem, you can find a UX designer sniffing around.

User Research

Initially, I conducted a survey involving more than 50 individuals from various demographic backgrounds, encompassing differences in age, location, profession, gender, and more.

There is a particular way, or set of rules, that I follow to create user surveys/research:

  • Use open-ended questions.
  • Avoid leading questions that suggest a particular answer.
  • Avoid double-barreled questions.
  • Randomize question order to reduce bias (see the small sketch after this list).
  • (Key tip) Keep the questions easy to understand.
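
To make that randomization point concrete, here's a tiny sketch of shuffling the question order per respondent. This is purely my own illustration (the questions here are made up), not the actual survey tooling I used:

```python
import random

# A hypothetical pool of survey questions (illustrative only).
questions = [
    "How do you currently create voice-overs for your content?",
    "What frustrates you most about text-to-speech tools?",
    "How important is it that your content sounds like you?",
    "What equipment do you use when recording audio?",
]

def questions_for_respondent(pool):
    shuffled = pool[:]        # copy so the master list stays in order
    random.shuffle(shuffled)  # each respondent sees a different order
    return shuffled

print(questions_for_respondent(questions))
```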

Everyone I interacted with regarding this issue told me that they use text-to-speech models. Devices like Amazon Echo and Google Home are becoming increasingly popular, and they use similar technology.

Research pie charts
There is a lot to unpack here. Let’s go!

Personas

User personas build empathy and guide my design decisions by capturing the needs, behaviors, and goals of target users.

I understand that case studies can feel like a lengthy wall of text to some. So, I have included just three of the personas that fit well into this project.

Persona cards

There is one piece of bad news: I wasn’t able to find individuals with speech impediments for user research. That is something I genuinely regret.

Problem Statement

After the research process, I came to the conclusion that people needed more uniqueness in their content, mainly to differentiate themselves from others. It's hard to stand out with the typical AI voice. They needed authenticity more than colorful features. They needed an ‘identity’, more than anything else.

Here are a few pain points to remember:
- Lack of authenticity in text-to-speech models
- Emotional limitations
- Lack of equipment
- Difficulty with accents

We could also see why generic text-to-speech won't hold up in the long run. This presents an opportunity for us to discover a solution. 🕳️
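
To show what users are up against, here is a minimal sketch of the kind of stock text-to-speech the survey respondents described, using the off-the-shelf pyttsx3 library. This is my own example (not the app's actual stack), and the voice it produces is exactly the 'generic AI voice' the pain points above complain about:

```python
# pip install pyttsx3
import pyttsx3

engine = pyttsx3.init()          # uses the system's default TTS voice
engine.setProperty("rate", 170)  # speaking speed (words per minute)

# Every creator using this gets the same voice: functional, but no identity.
engine.say("Welcome back to my channel, today we are reviewing a new phone.")
engine.runAndWait()
```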

Choosing the Right Approach

The existing issues come from a lack of apps available to users and from the limits of the underlying technology behind text-to-speech models. We will have to rely on AI for the quickest solution.

I think this is where AI truly excels!

Since the 2022 AI boom, we have seen development in that sector increase tenfold. Here are some articles that outline some of the stories.

Phewww. That's a lot, innit! (Yes, I'm in the U.K., so the slang is coming in hot.)

AI models trained on voice data have gone through a massive transformation in the past year. For example, if you want a glimpse of what lies ahead for AI voices, download ChatGPT on your mobile device and enable voice mode. The voices start to sound natural and real after 2–3 conversations. This breakthrough opens up a major market for innovative applications in the AI voice sector, as all these links demonstrate.


Information Architecture

The information architecture includes a navigation diagram and the flow of information within the application.

IA

“Simplicity often proves to be the most elegant solution, maximizing efficiency and achieving optimal results. This is true for many successful tasks, as it minimizes complexity and focuses on the core elements driving success.”
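
As a rough companion to the diagram above, here is how I would sketch the same navigation tree in code. The screen names are my own shorthand for the flow described later (home → record → review → upload → text to voice), not the app's exact labels:

```python
# A simple nested dict standing in for the navigation diagram.
# Names are illustrative shorthand, not the app's final labels.
information_architecture = {
    "Home": {
        "Create your voice": ["Record samples", "Review samples", "Upload & process"],
        "Edit your voice": ["Rename voice", "Re-record samples"],
        "Text to voice": ["Select voice", "Write text", "Export audio"],
    },
}

# Print the tree one level at a time, mirroring how a user drills down.
for screen, sections in information_architecture.items():
    print(screen)
    for section, children in sections.items():
        print("  " + section)
        for child in children:
            print("    " + child)
```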

Low Fidelity Wireframe

Wireframe

Wireframes, ahh. Boxes and boxes. Boring, ayee?

This is where I had the idea of combining 3D and wireframes into a single screen for the high-fidelity prototype. Not long ago, mobile apps used to struggle with seamless image loading, but thanks to competition in the mobile market, devices have really elevated their game. Upgrades in chip technology and processing power arrive practically every month. Just look at 3D graphics: what was once a PC-exclusive experience is now readily available on your mobile device. Google Earth, with its millions of polygons, is a prime example.

High Fidelity Screens

Leveraging user research, information architecture, and numerous iterations through paper sketches, 3D prototypes, and usability testing, we developed a high-fidelity visual design for the app.

Now presenting… 3D screens! Tadaa!

Screens
Screens
Screens

Let’s break down the screens

I will try my best to showcase the user flow of a persona who wants to create their voice samples, from the moment they install the app. (Page by page.)

I wanted the home screen to have as little information as possible. It's easy for users to be knocked off their goal when we overwhelm them with the entire sitemap the moment they open the app.

Screens

2* This is where I tried to play with 3D UI. Instead of fading out the edit-your-voice button, I have partially submerged the button itself under the screen.

When it comes to 3D, it's a different game we have to play here, because the z-axis is added to the mix (going from 2D to 3D).

Screens
Screens

4* When they are done with their recordings, they can stop the recording session and go to the next screen.

  • Next. Users have the freedom to continue or conclude their session after it officially ends, either on their own or when the session time expires.
Screens

5* A screen for users to make any major changes before they upload their recordings to the cloud. They can edit their name and manage their recorded samples. They also have the option to delete a sample from their list, or re-record one that they feel unsure of.

Screens
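
Behind that screen sits a very simple data model. Here is a hedged sketch of how I imagine the sample list working; the class and field names are my own, purely for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class VoiceSample:
    sample_id: int
    file_path: str  # local recording before upload

@dataclass
class RecordingSession:
    voice_name: str
    samples: list[VoiceSample] = field(default_factory=list)

    def rename(self, new_name: str) -> None:
        self.voice_name = new_name

    def delete_sample(self, sample_id: int) -> None:
        self.samples = [s for s in self.samples if s.sample_id != sample_id]

    def replace_sample(self, sample_id: int, new_path: str) -> None:
        # Re-recording swaps the audio file but keeps the slot in the list.
        for s in self.samples:
            if s.sample_id == sample_id:
                s.file_path = new_path
```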

6* Upload and wait for the files to reach the server and be processed into your AI voice model. I'm a huge believer in showing progress as prominently as possible. This is where users will be staring at the screen wondering what will happen next, hoping they won't be shown an error. So I took some time designing a user-friendly screen for this step, especially because users are anxious while looking at it.

Even more screens for text to speech
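
Since the whole point of that screen is visible progress, here is a small sketch of the idea behind it: upload the recording in chunks and report a percentage after each one. The chunk size and the way chunks are sent are made up for illustration; the real app would stream to its own backend and drive the progress UI instead of printing:

```python
import os

CHUNK_SIZE = 256 * 1024  # 256 KB per chunk (illustrative)

def upload_with_progress(path, send_chunk, on_progress):
    """Read the recording in chunks, hand each to send_chunk,
    and report the percentage complete via on_progress."""
    total = os.path.getsize(path)
    sent = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            send_chunk(chunk)  # e.g. POST to the app's backend
            sent += len(chunk)
            on_progress(round(100 * sent / total))

# Example wiring: the UI would redraw a progress ring instead of printing.
# upload_with_progress("sample_01.wav",
#                      send_chunk=lambda c: None,
#                      on_progress=lambda pct: print(f"{pct}% uploaded"))
```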

After that step.

7* You will be shown a different screen with a different color tone. This is where you will actually be creating your text-to-voice output.

First, select the voice, then write your text in the empty box. Users then have multiple controls for exporting their text-to-speech file.

  • There are additional screens of information available, but they are not essential for the current task.
Some more screens
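
To ground that select-a-voice-then-export flow, here is a minimal sketch using the same pyttsx3 library as earlier. In the real app the selected voice would be the user's own AI replica rather than a system voice, but the flow (pick a voice, write the text, export an audio file) is the same:

```python
# pip install pyttsx3
import pyttsx3

engine = pyttsx3.init()

# 1. Select the voice: list what's installed and pick one.
voices = engine.getProperty("voices")
engine.setProperty("voice", voices[0].id)  # in the app, this would be the user's replica

# 2. Write the text.
script = "Hey everyone, welcome back to the channel!"

# 3. Export: render the speech to an audio file instead of the speakers.
engine.save_to_file(script, "voiceover.wav")
engine.runAndWait()
```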

That's the end of this little side project I was working on. Also, please click the Behance link below; Medium has compressed all the beautiful images into low-quality versions 💀

https://www.behance.net/gallery/189459873/3D-AI-voice-replicating-app-%28UX-CASE-STUDY%29?

Big ups to those who read this! Before you go:

  • 📰 View more content in my profile
  • 🎨🌏 You can check out my 3D portfolio: ArtStation & Instagram

Case study by — Raghavprasanna
Happy learning everyone!
