UX Case study: AI voice replicating app

An app that transforms your voice into an AI-generated replica. You could call it a deepfake.

RaghavPrasannaUX · Bootcamp · 8 min read · Jan 17, 2024


Fresh screens

One of my friends was starting his content creation journey. Typical Gen-Z stuff. I was helping him along the way, and hey, I'm Gen-Z too!

But there was a big issue. He was… let's just say shy, anxious and timid. Too self-conscious to record a voice-over for his video or audio content. My UX brain started tingling. I started digging into the core of this issue and how to solve it. That's when I realized my friend was not the only one in this situation. There are billions (yes, billions) of peeps who struggle on that front.

Whenever there’s a problem, you can find a UX designer sniffing around.

User Research

Initially, I conducted a survey involving more than 50 individuals from various demographic backgrounds, encompassing differences in age, location, profession, gender, and more.

There is a particular way, or set of rules, that I follow to create user surveys/research:

  • Use open-ended questions.
  • Avoid leading questions that suggest a particular answer.
  • Avoid double-barreled questions.
  • Randomize question order to reduce bias (see the small sketch after this list).
  • (Key tip) Keep the questions easy to understand.
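
To make that randomization point concrete, here's a tiny sketch of shuffling the question order per respondent. This is purely my own illustration (the questions here are made up), not the actual survey tooling I used:

```python
import random

# A hypothetical pool of survey questions (illustrative only).
questions = [
    "How do you currently create voice-overs for your content?",
    "What frustrates you most about text-to-speech tools?",
    "How important is it that your content sounds like you?",
    "What equipment do you use when recording audio?",
]

def questions_for_respondent(pool):
    shuffled = pool[:]        # copy so the master list stays in order
    random.shuffle(shuffled)  # each respondent sees a different order
    return shuffled

print(questions_for_respondent(questions))
```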

Everyone I interacted with regarding this issue told me that they use text-to-speech models. Devices like Amazon Echo and Google Home are becoming increasingly popular, and they use similar technology.

Research pie charts
There is a lot to unpack here. Let’s go!

Personas

User personas build empathy and guide my design decisions by capturing the needs, behaviors, and goals of target users.

I understand that case studies can feel like a lengthy wall of text to some. So, I have included just three of the personas that fit well into this project.

Persona cards

There is one piece of bad news: I wasn’t able to find individuals with speech impediments for user research. That is something I genuinely regret.

Problem Statement

After the research process, I came to the conclusion that people needed more uniqueness in their content, mainly to differentiate themselves from others. It's hard to stand out with the typical AI voice. They needed authenticity more than colorful features. They needed an ‘identity’, more than anything else.

Here are a few pain points to remember:
- Lack of authenticity in text-to-speech models
- Emotional limitations
- Lack of equipment
- Difficulty with accents

We could also see why generic text-to-speech won't hold up in the long run. This presents an opportunity for us to discover a solution. 🕳️
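
To show what users are up against, here is a minimal sketch of the kind of stock text-to-speech the survey respondents described, using the off-the-shelf pyttsx3 library. This is my own example (not the app's actual stack), and the voice it produces is exactly the 'generic AI voice' the pain points above complain about:

```python
# pip install pyttsx3
import pyttsx3

engine = pyttsx3.init()          # uses the system's default TTS voice
engine.setProperty("rate", 170)  # speaking speed (words per minute)

# Every creator using this gets the same voice: functional, but no identity.
engine.say("Welcome back to my channel, today we are reviewing a new phone.")
engine.runAndWait()
```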

Choosing the Right Approach

The existing issues come from a lack of apps available to users and from the limits of the underlying technology behind text-to-speech models. We will have to rely on AI for the quickest solution.

I think this is where AI truly excels!

Since the 2022 AI boom, we have seen development in that sector increase tenfold. Here are some articles that outline some of the stories.

Phewww. That's a lot, innit! (Yes, I'm in the U.K., so the slang is coming in hot.)

AI models trained on voice data have gone through a massive transformation in the past year. For example, if you want a glimpse of what lies ahead for AI voices, download ChatGPT on your mobile device and enable voice mode. The voices start to sound natural and real after 2–3 conversations. This breakthrough opens up a major market for innovative applications in the AI voice sector, as all these links demonstrate.


Information Architecture

The information architecture includes a navigation diagram and the flow of information within the application.

IA

“Simplicity often proves to be the most elegant solution, maximizing efficiency and achieving optimal results. This is true for many successful tasks, as it minimizes complexity and focuses on the core elements driving success.”
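
As a rough companion to the diagram above, here is how I would sketch the same navigation tree in code. The screen names are my own shorthand for the flow described later (home → record → review → upload → text to voice), not the app's exact labels:

```python
# A simple nested dict standing in for the navigation diagram.
# Names are illustrative shorthand, not the app's final labels.
information_architecture = {
    "Home": {
        "Create your voice": ["Record samples", "Review samples", "Upload & process"],
        "Edit your voice": ["Rename voice", "Re-record samples"],
        "Text to voice": ["Select voice", "Write text", "Export audio"],
    },
}

# Print the tree one level at a time, mirroring how a user drills down.
for screen, sections in information_architecture.items():
    print(screen)
    for section, children in sections.items():
        print("  " + section)
        for child in children:
            print("    " + child)
```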

Low Fidelity Wireframe

Wireframe

Wireframes, ahh. Boxes and boxes. Boring, ayee?

This is where I had the idea of combining 3D and wireframes into a single screen for the high-fidelity prototype. Not long ago, mobile apps used to struggle with seamless image loading, but thanks to competition in the mobile market, devices have really elevated their game. Upgrades in chip technology and processing power arrive practically every month. Just look at 3D graphics: what was once a PC-exclusive experience is now readily available on your mobile device. Google Earth, with its millions of polygons, is a prime example.

High Fidelity Screens

Leveraging user research, information architecture, and numerous iterations through paper sketches, 3D prototypes, and usability testing, we developed a high-fidelity visual design for the app.

Now presenting… 3D screens! Tadaa!

Screens
Screens
Screens

Let’s break down the screens

I will try my best to showcase the user flow of a persona who wants to create their voice samples, from the moment they install the app. (Page by page.)

I wanted the home screen to have as little information as possible. It's easy for users to be knocked off their goal when we overwhelm them with the entire sitemap the moment they open the app.

Screens

2* This is where I tried to play with 3D UI. Instead of fading out the edit-your-voice button, I have partially submerged the button itself under the screen.

When it comes to 3D, it's a different game we have to play here, because the z-axis is added to the mix (going from 2D to 3D).

Screens
Screens

4* When they are done with their recordings, they can stop the recording session and go to the next screen.

  • Next. Users have the freedom to continue or conclude their session after it officially ends, either on their own or when the session time expires.
Screens

5* A screen for users to make any major changes before they upload their recordings to the cloud. They can edit their name and manage their recorded samples. They also have the option to delete a sample from their list, or re-record one that they feel unsure of.

Screens
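
Behind that screen sits a very simple data model. Here is a hedged sketch of how I imagine the sample list working; the class and field names are my own, purely for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class VoiceSample:
    sample_id: int
    file_path: str  # local recording before upload

@dataclass
class RecordingSession:
    voice_name: str
    samples: list[VoiceSample] = field(default_factory=list)

    def rename(self, new_name: str) -> None:
        self.voice_name = new_name

    def delete_sample(self, sample_id: int) -> None:
        self.samples = [s for s in self.samples if s.sample_id != sample_id]

    def replace_sample(self, sample_id: int, new_path: str) -> None:
        # Re-recording swaps the audio file but keeps the slot in the list.
        for s in self.samples:
            if s.sample_id == sample_id:
                s.file_path = new_path
```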

6* Upload and wait for the files to reach the server and be processed into your AI voice model. I'm a huge believer in showing progress as prominently as possible. This is where users will be staring at the screen wondering what will happen next, hoping they won't be shown an error. So I took some time designing a user-friendly screen for this step, especially because users are anxious while looking at it.

Even more screens for text to speech
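
Since the whole point of that screen is visible progress, here is a small sketch of the idea behind it: upload the recording in chunks and report a percentage after each one. The chunk size and the way chunks are sent are made up for illustration; the real app would stream to its own backend and drive the progress UI instead of printing:

```python
import os

CHUNK_SIZE = 256 * 1024  # 256 KB per chunk (illustrative)

def upload_with_progress(path, send_chunk, on_progress):
    """Read the recording in chunks, hand each to send_chunk,
    and report the percentage complete via on_progress."""
    total = os.path.getsize(path)
    sent = 0
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            send_chunk(chunk)  # e.g. POST to the app's backend
            sent += len(chunk)
            on_progress(round(100 * sent / total))

# Example wiring: the UI would redraw a progress ring instead of printing.
# upload_with_progress("sample_01.wav",
#                      send_chunk=lambda c: None,
#                      on_progress=lambda pct: print(f"{pct}% uploaded"))
```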

After that step.

7* You will be shown a different screen with a different color tone. This is where you will actually be creating your text-to-voice output.

First, select the voice, then write your text in the empty box. Users then have multiple controls for exporting their text-to-speech file.

  • There are additional screens of information available, but they are not essential for the current task.
Some more screens
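
To ground that select-a-voice-then-export flow, here is a minimal sketch using the same pyttsx3 library as earlier. In the real app the selected voice would be the user's own AI replica rather than a system voice, but the flow (pick a voice, write the text, export an audio file) is the same:

```python
# pip install pyttsx3
import pyttsx3

engine = pyttsx3.init()

# 1. Select the voice: list what's installed and pick one.
voices = engine.getProperty("voices")
engine.setProperty("voice", voices[0].id)  # in the app, this would be the user's replica

# 2. Write the text.
script = "Hey everyone, welcome back to the channel!"

# 3. Export: render the speech to an audio file instead of the speakers.
engine.save_to_file(script, "voiceover.wav")
engine.runAndWait()
```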

That's the end of this little side project I was working on. Also, please click the Behance link below; Medium has compressed all the beautiful images into low-quality versions 💀

https://www.behance.net/gallery/189459873/3D-AI-voice-replicating-app-%28UX-CASE-STUDY%29?

Big ups to those who read this! Before you go:

  • 📰 View more content in my profile
  • 🎨🌏 You can check out my 3D portfolio: ArtStation & Instagram

Case study by — Raghavprasanna
Happy learning everyone!
