Decoding the Sound of Virality: A Deep Dive into Adversarial AI for Voice Conversion Tasks (on M1 Mac & Google Colab)

Jordan Harris
8 min read · Aug 29, 2023


Welcome to an in-depth explanation and reverse engineering of the Retrieval-based-Voice-Conversion-WebUI software: local preprocessing on an M1 Mac and inference with the HuBERT and So-VITS based architecture and its associated models, the same stack used to make viral TikTok voiceovers, song covers, and marketing campaigns.

Singing voice conversion task
Speaking voice conversion task

There are many things to speak of when considering the intersection of today’s digital age and the volatile geopolitical landscape of late-stage capitalism. Every nine months or so, a new scientific paper out of one of the developed nations completely reshapes the AI/ML academic landscape. On the other side of this coin, capitalism reigns supreme, with people deploying AI on the masses without proper research into its psychological or sociological effects. Through the influence of both dubious tech companies and altruistic open-source engineers and academics hoping to empower each other, technology in the generative AI space has taken leaps in the past few years. At the very least, it has given rise to hilarious and irreverent ‘dadaist’ takes on music and politics which, in retrospect, have been incredibly tame in the grander scheme.

The phenomenon of voice-manipulated content, as seen in countless TikTok memes, is increasingly being led by a younger audience. Many of them were ‘iPad babies’ who never knew, and often have no interest in, a world before this type of influence, much less how to recognize its effects. As this tech-savvy population ages, it will become a powerful voting bloc of experts in their own realm. As time passes they will continue to change our social landscape, for better or worse, but either way for our future. My call to action here is to spread an understanding of how these systems work at their base, to keep the power in the hands of the people.

Voice Conversion

Speaking to the basics of this code’s use case: the AI system uses a statistical, iterative approach to extract knowledge from a never-before-seen dataset. It analyzes small chunks of the dataset for specific features or aspects of the input, some preset, some learned as it works. It then remembers what it learned from each tiny slice of the input data and, in the end, makes a final ‘decision’ based on the totality of its exposure and learning. This ‘decisive’ output could be a yes-or-no answer, a .wav/.mp3 file containing a new song, or text that seems intelligently composed.

The important thing to remember is that the result is only ever the sum of its <input> parts.

The So-VITS-SVC (SoftVC VITS Singing Voice Conversion) task assesses an input dataset, usually an <input> artist’s albums sliced into many tiny parts, to learn all of the aspects of the singer’s voice. The key human problem is deciding which aspects of the singer’s voice the machine needs to learn. This is what separates a text generator or image generator from a voice generator: the specifics of what is being learned are different in each case, and the underlying architecture of the ML model needs to be engineered to be receptive to that specific type of information.

So-VITS architecture (source)

The difference between learning based on text, pixels, or sound waves?

The sound of music: What is being learned?

Preprocessing:

The prep for the encoding of the singer’s voice involves resampling the audio to 44.1 kHz, the standard rate for recorded albums. The “sample rate” of an audio file refers to the number of audio samples carried per second, measured in Hz or kHz. The audio is then sliced into clips and encoded for ML model consumption. You can also diarize the sound files beforehand to get a timestamped record of when the singer/speaker changes within a given input sample. (Using songs that have no featured artists removes the need for this.)

Here are some simple scripts that I used to do this locally on my M1 Mac or on Google Colab:
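To make that concrete, here is a minimal sketch of the resample-and-slice step using librosa and soundfile; it illustrates the idea rather than reproducing the repository’s own preprocessing code, and the clip length is just an example value:

import os
import librosa
import soundfile as sf

TARGET_SR = 44100     # 44.1 kHz, the rate the training pipeline expects
CLIP_SECONDS = 10     # clips in the 3-20 second range work well

def resample_and_slice(in_path, out_dir, sr=TARGET_SR, clip_seconds=CLIP_SECONDS):
    """Resample a song to the target rate and cut it into fixed-length clips."""
    audio, _ = librosa.load(in_path, sr=sr, mono=True)   # load + resample in one step
    os.makedirs(out_dir, exist_ok=True)

    samples_per_clip = sr * clip_seconds
    for i in range(0, len(audio), samples_per_clip):
        clip = audio[i:i + samples_per_clip]
        if len(clip) < sr * 3:   # drop fragments shorter than ~3 seconds
            continue
        out_name = f"clip_{i // samples_per_clip:04d}.wav"
        sf.write(os.path.join(out_dir, out_name), clip, sr)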

Here we train a sound generator on specific aspects of a singer's signature sound:

# Excerpt from the So-VITS training script (train.py). spec_to_mel_torch,
# mel_spectrogram_torch, and commons come from the repository's own modules.
import torch
from torch.cuda.amp import autocast


def train_and_evaluate(rank, epoch, hps, nets, optims, schedulers, scaler, loaders, logger, writers):
    # Unpack the generator/discriminator pair, their optimizers, and schedulers.
    net_g, net_d = nets
    optim_g, optim_d = optims
    scheduler_g, scheduler_d = schedulers
    train_loader, eval_loader = loaders
    if writers is not None:
        writer, writer_eval = writers

    half_type = torch.bfloat16 if hps.train.half_type == "bf16" else torch.float16

    # train_loader.batch_sampler.set_epoch(epoch)
    global global_step

    net_g.train()
    net_d.train()
    for batch_idx, items in enumerate(train_loader):
        # Each batch carries the preprocessed features described below: content (c),
        # pitch (f0), spectrogram, raw audio (y), speaker id, clip lengths,
        # voiced/unvoiced flags, and volume.
        c, f0, spec, y, spk, lengths, uv, volume = items
        g = spk.cuda(rank, non_blocking=True)
        spec, y = spec.cuda(rank, non_blocking=True), y.cuda(rank, non_blocking=True)
        c = c.cuda(rank, non_blocking=True)
        f0 = f0.cuda(rank, non_blocking=True)
        uv = uv.cuda(rank, non_blocking=True)
        lengths = lengths.cuda(rank, non_blocking=True)
        mel = spec_to_mel_torch(
            spec,
            hps.data.filter_length,
            hps.data.n_mel_channels,
            hps.data.sampling_rate,
            hps.data.mel_fmin,
            hps.data.mel_fmax)

        with autocast(enabled=hps.train.fp16_run, dtype=half_type):
            # Generator forward pass: conditioning features go in, a generated
            # waveform slice (y_hat) and latent statistics come out.
            y_hat, ids_slice, z_mask, \
                (z, z_p, m_p, logs_p, m_q, logs_q), pred_lf0, norm_lf0, lf0 = net_g(
                    c, f0, uv, spec, g=g, c_lengths=lengths,
                    spec_lengths=lengths, vol=volume)

            # Real and generated audio are compared in the mel-spectrogram domain.
            y_mel = commons.slice_segments(mel, ids_slice, hps.train.segment_size // hps.data.hop_length)
            y_hat_mel = mel_spectrogram_torch(
                y_hat.squeeze(1),
                hps.data.filter_length,
                hps.data.n_mel_channels,
                hps.data.sampling_rate,
                hps.data.hop_length,
                hps.data.win_length,
                hps.data.mel_fmin,
                hps.data.mel_fmax
            )
            y = commons.slice_segments(y, ids_slice * hps.data.hop_length, hps.train.segment_size)  # slice

Learning:

As mentioned before, the machine learning task takes many slices, around 3-20 seconds each, from a singer's discography (or speaker's public speeches) and checks each clip for:

1. c **(`filename.wav.soft.pt`)**: An encoded, statistical representation of the audio’s content, answering the question: what does the clip as a whole sound like? The script uses an encoder model (HuBERT-Soft) to encode the audio, transforming it from a time-domain waveform into a more compact, higher-level representation.

2. f0 **(`filename.f0.npy`)**: A numpy array of the fundamental frequency (f0), the perceived pitch, along with its relation to the unvoiced/voiced (UV) flags of the sound. (More on what an unvoiced and voiced flag is later on.) In voice conversion, pitch patterns vary between speakers, so learning their relation to other aspects of the sound allows for a deeper understanding. The So-VITS architecture uses the selected f0 predictor (“crepe” in my case) to compute the fundamental frequency and the corresponding unvoiced/voiced flags; a simplified stand-in for this step is sketched after this list.

Frequency: the number of cycles of a waveform that occur in a second. It is measured in Hertz (Hz) and determines the pitch of a sound. In the context of audio and voice signals, frequency represents the unique tones or pitches present in the sound. Pitch: the human perception of the frequency of a sound, that is, how high or low a sound is perceived to be. For instance, increasing the pitch can make a voice sound more childlike, while decreasing it can make it sound deeper.

3. Spec **(`filename.spec.pt`)**: A spectrogram of the signal’s frequency content over time. It provides a visual representation of the audio, capturing both pitch and timbre information. This representation conveys information that a single numerical summary cannot, which helps guide the generative model to accurately reproduce the nuances of a singer’s voice.

Source

The spectrogram in So-Vits is computed from the normalized audio using these provided parameters:

filter length: Refers to the number of samples in each segment of audio analyzed, using the Short-Time Fourier Transform (STFT), when generating a spectrogram. It affects the frequency resolution of the spectrogram.

hop length: Refers to the number of samples between successive frames in time-frequency analyses like the STFT. It dictates how much each frame or window of the audio signal is shifted relative to the previous frame. The overlap between consecutive segments impacts time resolution.

win length: The actual number of samples in each segment before any zero-padding. Determines the portion of the segment that contains data.

4. **Y**: The raw waveform of the input snippet. It may look redundant next to the encoded representation in item 1, but the raw audio serves as the ground truth that the generated audio is ultimately compared against.

5. **Spk**: The speaker identity, used in cases where the model is learning to sing different singers’ parts. For simplicity and specificity, I have limited the input to just one singer’s voice.

6. **Lengths**: The length (in frames) of each clip in the batch. Clips vary in duration, and the model uses these lengths to mask out padding so it only learns from real audio. That variation is itself informative: how long a singer holds a high note versus a low one, and in what sequence, matters. SZA is a perfect example of a singer with a very distinct approach to the ‘length of sounds’.

7. UV **(`filename.f0.npy`)**: A very interesting representation that marks which parts of sung words are unvoiced (like “s” or “f”) and which are voiced (like vowels). Singers all approach their instrument differently, providing a distinct ‘fingerprint’ in everything they produce, from regional accents to personal quirks like cutting off the ends of certain words or extending certain sung vowels. This is a key part of the task.

8. Volume **(`filename.vol.npy`)**: The volume or loudness levels of the audio segments. Again, when and where a singer chooses to sing forte or pianissimo is part of their sung ‘fingerprint’.
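To make items 2, 3, 7, and 8 concrete, here is a rough, self-contained sketch of how pitch, voicing flags, a spectrogram, and per-frame volume can be pulled out of a clip. It uses librosa’s pyin as a simple stand-in for the crepe predictor, so treat it as an illustration of the idea rather than the repository’s exact preprocessing:

import numpy as np
import librosa

def extract_features(path, sr=44100, n_fft=2048, hop_length=512, win_length=2048):
    """Illustrative stand-ins for the preprocessed features described above."""
    audio, _ = librosa.load(path, sr=sr, mono=True)

    # Item 2: fundamental frequency, plus item 7: voiced/unvoiced flags.
    f0, voiced_flag, _ = librosa.pyin(
        audio,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
        frame_length=n_fft,
        hop_length=hop_length,
    )
    uv = voiced_flag.astype(np.float32)   # 1.0 = voiced frame, 0.0 = unvoiced
    f0 = np.nan_to_num(f0)                # unvoiced frames carry no pitch

    # Item 3: linear-frequency spectrogram magnitude.
    spec = np.abs(
        librosa.stft(audio, n_fft=n_fft, hop_length=hop_length, win_length=win_length)
    )

    # Item 8: per-frame loudness, approximated here with RMS energy.
    volume = librosa.feature.rms(y=audio, frame_length=n_fft, hop_length=hop_length)[0]

    return f0, uv, spec, volume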

The Adversary:

Above, we explained the basics of the data transformations that go into training the Generator. Now let’s explain its ‘Adversary’ and how the two relate to each other. The base of the So-VITS approach is a Generative Adversarial Network (GAN). The base model used for this transformation is “sovits4.0-volemb-vec768”, made by the original So-VITS team.

Generator (G):
The generator’s task is apparent: it generates its own version of the sound features it learns from, which are then transformed into a unique converted voice sample that can be assessed. It takes in a voice sample and some conditioning information (e.g., target speaker identity) and outputs a transformed voice sample.

That assessment happens in the mel-spectrogram domain: the mel_spectrogram_torch function converts the generator’s raw audio output into a mel-scaled spectrogram, a sound representation tailored to the sensitivities of human auditory perception.
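The repository ships its own mel_spectrogram_torch, but the same idea can be sketched with torchaudio’s built-in transform; the parameter values below are illustrative examples, not the repo’s config:

import torch
import torchaudio

# Mel-spectrogram transform, analogous in spirit to the repo's mel_spectrogram_torch.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100,
    n_fft=2048,        # filter_length
    win_length=2048,
    hop_length=512,
    f_min=0.0,         # mel_fmin
    f_max=22050.0,     # mel_fmax
    n_mels=80,         # n_mel_channels
)

waveform = torch.randn(1, 44100)   # one second of dummy audio
mel = mel_transform(waveform)      # shape: (1, n_mels, frames)
print(mel.shape)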

Discriminator (D):
The adversarial part of this is the discriminator, an extra model that learns to distinguish between real and fake samples. It is given both the voice samples the generator produces and real snippets from the original dataset, and it makes a True or False call on whether each sound was produced by the original singer.
The input and output shapes of the generator and discriminator are tailored to each other and cannot simply be plugged into any other model architecture.

This process is repeated until convergence, meaning the generated sound files become indistinguishable (at least to the discriminator) from the original input samples.
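Stripped of all the voice-specific machinery, the adversarial loop between the two networks looks roughly like this bare-bones PyTorch sketch, not the actual So-VITS training step shown earlier; net_g, net_d, and cond stand in for any generator, discriminator, and conditioning input:

import torch
import torch.nn.functional as F

def gan_step(net_g, net_d, optim_g, optim_d, real_audio, cond):
    """One simplified adversarial update for a generator/discriminator pair."""
    # --- Discriminator update: push real clips toward 1, generated clips toward 0 ---
    fake_audio = net_g(cond).detach()          # detach so only D gets gradients here
    d_real = net_d(real_audio)
    d_fake = net_d(fake_audio)
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    optim_d.zero_grad()
    loss_d.backward()
    optim_d.step()

    # --- Generator update: try to make the discriminator call the fake "real" ---
    fake_audio = net_g(cond)
    d_fake = net_d(fake_audio)
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    optim_g.zero_grad()
    loss_g.backward()
    optim_g.step()

    return loss_d.item(), loss_g.item()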

Here is the code, forked from So-VITS, that I tailored to support M1 Mac and Google Colab runs. It supports preprocessing and inference locally, but unfortunately the more computationally expensive tasks of training and diarization need to be done using the Google Colab scripts.

!THIS IS A WORK IN PROGRESS!

Text-Encoder

I added a folder for text encoding that can produce .wav files from text input and plot the resultant spectrograms, so you can infer directly from your own input text without having to record speech to convert to the target voice.
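The plotting half of that folder is simple enough to sketch. Assuming a synthesized output.wav already exists (the filename here is just a placeholder), something like this will draw its spectrogram with librosa and matplotlib, rather than the exact code in the folder:

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Plot the spectrogram of a synthesized file; "output.wav" is a placeholder path.
audio, sr = librosa.load("output.wav", sr=None)
spec_db = librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max)

fig, ax = plt.subplots(figsize=(10, 4))
img = librosa.display.specshow(spec_db, sr=sr, x_axis="time", y_axis="hz", ax=ax)
fig.colorbar(img, ax=ax, format="%+2.0f dB")
ax.set_title("Spectrogram of synthesized speech")
plt.tight_layout()
plt.show()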

Here are some pre-trained models I have made!

Here’s some content I made powered by this system!

Final Product!

Thanks for reading!!!
The next article will be about how to sync facial movements to a given audio track!
