An AI Overview of Voice Conversion: From Model Training to Post-processing Techniques

Alexandre Cuevas
abbeal’s tech blog
7 min read · Jul 12, 2023

Voice conversion involves transforming a source voice to resemble a target voice while preserving linguistic content. In this article, we explore the AI concepts and operations behind voice conversion, from model training to post-processing techniques. We'll discuss the various models used for speech conversion, the role of pitch information, and the significance of Convolutional Neural Networks (CNNs) in analyzing audio signals, including how they work. Additionally, we'll touch upon the importance of post-processing techniques in refining pitch estimation. Join me as we delve into the fascinating world of voice conversion and its AI-driven advancements.

To convert a voice in a voice clip or a song, the first step is to train a conversion model. Several models, including Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), and deep learning-based models like Variational Autoencoders (VAEs) or Generative Adversarial Networks (GANs), can be used for speech conversion.
These models learn a mapping between source and target voice attributes so that the converted voice sounds more like the target speaker. During the conversion process, the CREPE model (or a comparable pitch estimation model) analyzes the source voice to extract pitch information, which is then used as a guide to alter the source voice so that it matches the target voice's prosody and intonation.
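
As an illustration of this pitch extraction step, here is a minimal sketch using the open-source crepe Python package (the file name is a placeholder, and viterbi=True enables CREPE's built-in temporal smoothing):

    import crepe
    from scipy.io import wavfile

    # Load the source voice clip (16 kHz mono works best for CREPE).
    sr, audio = wavfile.read("source_voice.wav")

    # Estimate pitch for every 10 ms step of the recording.
    time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

    # `frequency` (in Hz) can now guide how the source prosody is mapped to the target voice.
    print(time[:5], frequency[:5], confidence[:5])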

Now that we have a brief overview of voice conversion, let’s delve into the various models used for this purpose:

CREPE (Convolutional Representation for Pitch Estimation) is a pitch tracking algorithm that estimates the pitch, or fundamental frequency, of a sound. It is widely used in audio and music manipulation applications.
CREPE analyzes the audio signal and estimates the pitch of the dominant, most prominent sound. Here is a short description of how it works:

Convolutional Representation for Pitch Estimation

During preprocessing, the audio signal is divided into smaller segments known as frames, each covering a few tens of milliseconds of audio (CREPE uses 1024-sample windows at 16 kHz, roughly 64 ms). Many classical pitch trackers then compute a spectrogram, a time-frequency representation showing which frequencies are present in the signal over time; CREPE, by contrast, feeds each raw time-domain frame directly into its neural network and lets the network learn its own frequency-related features.
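
As an illustration of the framing step, here is a minimal sketch using librosa (the frame length and hop size below follow CREPE's default 1024-sample window and 10 ms hop at 16 kHz; the file name is a placeholder):

    import librosa

    # Load the audio and resample to 16 kHz, the rate CREPE expects.
    audio, sr = librosa.load("source_voice.wav", sr=16000)

    # Split the signal into overlapping 1024-sample frames with a 10 ms (160-sample) hop.
    frames = librosa.util.frame(audio, frame_length=1024, hop_length=160)
    print(frames.shape)   # (1024, number_of_frames)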

To analyze these frames and identify patterns associated with different pitches, CREPE employs a specialized type of neural network known as a convolutional neural network (CNN). The CNN is trained to map each frame to a distribution over candidate pitches, from which the most likely pitch is selected.

Now that we have a better idea of how CREPE operates, let's examine the convolutional neural network technology that underlies it.

Convolutional neural network (CNN):

Convolutional neural networks, a type of artificial neural network, excel at processing and analyzing visual input, such as images or audio spectrograms.

CNNs were inspired by the design and operation of the visual cortex in the human brain. They use an operation called convolution to automatically learn and extract features from input data. Here is a brief description of how CNNs operate:

The input data is first passed to a convolutional layer, which processes it through a number of filters, or kernels. Each filter performs a convolution: it slides across the input, multiplying its values by the corresponding values in the input and summing the results. This procedure helps identify local patterns or features.

After the convolution operation, an activation function is applied to give the network nonlinearity. For instance, the Rectified Linear Unit (ReLU), a popular activation function, leaves positive values alone while setting negative values to zero.
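
Here is a minimal sketch of this convolution-plus-activation step in PyTorch (a simplified illustration, not CREPE's actual architecture; the layer sizes are arbitrary):

    import torch
    import torch.nn as nn

    # A single convolution + ReLU block, as described above.
    # Input: a batch of audio frames, shape (batch, 1 channel, 1024 samples).
    conv_block = nn.Sequential(
        nn.Conv1d(in_channels=1, out_channels=16, kernel_size=64, stride=4),
        nn.ReLU(),  # keeps positive values, sets negative values to zero
    )

    frames = torch.randn(8, 1, 1024)   # 8 dummy frames of 1024 samples each
    features = conv_block(frames)
    print(features.shape)              # torch.Size([8, 16, 241])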

Non-linearity must be added to a neural network in order for it to learn and represent complex relationships in the data. Without non-linearity, the network would effectively be reduced to a linear model, which greatly limits its expressive power and its ability to discern complex patterns.

Convolutional neural networks are powerful tools for audio processing, but it’s crucial to understand the role of non-linearity in unlocking their full potential.

Unveiling the Importance of Non-linearity in Neural Networks for Complex Data Modeling

Real-world data is frequently characterized by complicated, non-linear relationships between the input features and the desired output. Non-linear activation functions such as ReLU, sigmoid, and tanh give the network the capability to describe and approximate these relationships, allowing it to learn and express increasingly complex patterns and correlations in the data.

Non-linearity also enables a network to recognize and represent a range of features at various levels of abstraction. Different components of an input may be more or less important, or may contribute differently to the result; non-linear activation functions help the network selectively activate and amplify relevant features while suppressing unimportant or noisy ones.

Linear models struggle with complex tasks when there are interactions or dependencies between the input features: linear combinations of the inputs can only produce linear outputs, which may not be adequate to capture the relationships in the data. Non-linear activation functions let the network describe and approximate highly non-linear mappings between inputs and outputs, escaping this constraint.
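
A small numerical illustration of this point (with arbitrary weights, purely for demonstration): stacking two linear layers without an activation function is equivalent to a single linear layer, while inserting a ReLU between them is not.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((4, 3))   # first "layer" weights
    W2 = rng.standard_normal((2, 4))   # second "layer" weights
    x = rng.standard_normal(3)         # an input vector

    # Two stacked linear layers...
    two_linear = W2 @ (W1 @ x)
    # ...are exactly one linear layer with combined weights W2 @ W1.
    one_linear = (W2 @ W1) @ x
    print(np.allclose(two_linear, one_linear))   # True

    # Inserting a ReLU between the layers breaks this equivalence.
    relu = lambda v: np.maximum(v, 0)
    with_relu = W2 @ relu(W1 @ x)
    print(np.allclose(with_relu, one_linear))    # False in general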

Moreover, non-linearity allows neural networks to build compositional representations of data. By combining non-linear activation functions with multiple layers, networks learn hierarchical representations of features: each layer captures increasingly complex and abstract features by combining the simpler ones discovered in earlier layers. This makes it possible for the network to represent and comprehend intricate concepts and data structures.

Deep Learning Layers for Powerful Neural Network Training and Predictions

In the process of training a convolutional neural network (CNN), several layers play crucial roles in extracting and transforming features. The pooling layer is responsible for downsampling and reducing the spatial dimensions of the data. By combining adjacent features and selecting the most significant ones, the pooling layer enhances the network’s resilience to input variations.

The fully connected layers, which come after the pooling layers, connect every neuron in one layer to every neuron in the next, making it easier to capture complex correlations between the extracted features. The output layer, the final layer of the CNN, produces the desired result for the task at hand, such as predicted pitch values in pitch estimation or class probabilities in image classification.
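
Putting these layers together, here is a minimal sketch in PyTorch of a small pitch-estimation CNN (a toy model for illustration, far smaller than CREPE's actual architecture; the 360 output bins mirror CREPE's quantization of pitch into 360 classes):

    import torch
    import torch.nn as nn

    class TinyPitchCNN(nn.Module):
        def __init__(self, n_pitch_bins: int = 360):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=64, stride=4),   # convolution
                nn.ReLU(),                                     # non-linearity
                nn.MaxPool1d(2),                               # pooling / downsampling
                nn.Conv1d(16, 32, kernel_size=16, stride=2),
                nn.ReLU(),
                nn.MaxPool1d(2),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),                                  # fully connected layers
                nn.Linear(32 * 26, 128),
                nn.ReLU(),
                nn.Linear(128, n_pitch_bins),                  # output layer: one score per pitch bin
            )

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(frames))

    model = TinyPitchCNN()
    logits = model(torch.randn(8, 1, 1024))   # 8 frames of 1024 samples
    print(logits.shape)                       # torch.Size([8, 360])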

During training, the CNN continuously adjusts the weights and biases of each layer in order to reduce the discrepancy between its predicted outputs and the expected ones. Backpropagation is the technique that allows the network to learn from its errors and tweak its parameters for better performance.
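
A hedged sketch of what such a training loop could look like for the toy model above (reusing the hypothetical TinyPitchCNN, with random tensors standing in for real labelled audio frames):

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Synthetic stand-ins for real training data: audio frames and their true pitch bins.
    frames = torch.randn(256, 1, 1024)
    target_bins = torch.randint(0, 360, (256,))
    loader = DataLoader(TensorDataset(frames, target_bins), batch_size=32, shuffle=True)

    model = TinyPitchCNN()                            # toy model from the previous sketch
    criterion = nn.CrossEntropyLoss()                 # discrepancy between predicted and true pitch bins
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for epoch in range(5):
        for batch_frames, batch_bins in loader:
            optimizer.zero_grad()
            loss = criterion(model(batch_frames), batch_bins)   # forward pass + error
            loss.backward()                                     # backpropagation: compute gradients
            optimizer.step()                                    # adjust weights and biases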

Optimizing Pitch Estimation with Post-processing

To determine the pitch for each frame, the trained CNN examines the audio frame by frame and outputs the most likely pitch of the dominant sound within that frame.

Post-processing methods can then be used to refine these frame-level pitch predictions. By drawing on information from nearby frames and smoothing the pitch trajectory over time, techniques such as median filtering or Viterbi decoding improve precision and remove spurious jumps.
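
One simple example of such post-processing, assuming a per-frame pitch track in Hz (the values below are made up): a median filter uses neighbouring frames to suppress an isolated octave error. (The crepe package also offers built-in temporal smoothing via crepe.predict(..., viterbi=True).)

    import numpy as np
    from scipy.signal import medfilt

    # Hypothetical per-frame pitch estimates in Hz, with one octave-error glitch.
    pitch_hz = np.array([220.1, 219.8, 220.4, 440.9, 220.2, 219.9, 220.3])

    # A 5-point median filter pulls the outlier back toward its neighbours (~220 Hz).
    smoothed = medfilt(pitch_hz, kernel_size=5)
    print(smoothed)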

In short, CREPE uses a convolutional neural network to analyze an audio signal frame by frame and determine the pitch, or frequency, of the dominant sound. It is a powerful method used in a variety of settings, such as music transcription, automatic tuning, and voice analysis.

Once model training is complete, the trained model is serialized and saved to a file, preserving the learned parameters, architecture, and other crucial information. This model file is then loaded whenever inference is needed.
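
A minimal sketch of this serialization step, assuming a PyTorch model such as the toy TinyPitchCNN above (the file name is arbitrary):

    import torch

    # Save the learned parameters to disk after training.
    torch.save(model.state_dict(), "pitch_model.pt")

    # Later, for inference: rebuild the architecture and load the saved weights.
    model = TinyPitchCNN()
    model.load_state_dict(torch.load("pitch_model.pt"))
    model.eval()   # switch to inference mode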

With our pitch estimates refined by post-processing, we can proceed to the next stage: inference with the trained model.

Efficient Inference with Trained Models: A Step-by-Step Guide

Using a pre-trained model for inference means applying the model to make predictions or produce outputs from fresh, unseen data. The process can be divided into the following steps:

  1. Model loading: the pre-trained model is loaded into memory in the runtime environment so that it can be used to make predictions or generate outputs. The model is reconstructed from the architecture and parameters saved during training.
  2. Preprocessing: new data that was not seen during training is preprocessed to ensure compatibility with the steps performed during training. Depending on the model and the task at hand, this may involve scaling, normalizing, reshaping, or encoding the data.
  3. Data input: the preprocessed data is fed to the loaded model, which then executes calculations and transformations according to its parameters and architecture.
  4. Forward pass: the data flows through the model layer by layer, transformed by the stored parameters. At the last layer, the model generates the required output, which, depending on the model's purpose, might be labels, probabilities, numbers, structured outputs or, in our case, pitch information.
  5. Post-processing: techniques can be applied to refine or interpret the model's predictions. Depending on the needs of the application, this may involve thresholding, scaling, mapping, or converting the raw outputs into a more intelligible format.

By carrying out these steps, inference with a pre-trained model puts the learned knowledge to work, generating outputs and predictions for new data samples.
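
A compact sketch of these five steps in Python, assuming the toy PyTorch pitch model and saved weights from the earlier sketches (librosa is used for loading and framing; the pitch-bin-to-Hz mapping is a simplified stand-in, not CREPE's exact quantization):

    import librosa
    import torch

    # 1. Load the trained model into memory.
    model = TinyPitchCNN()
    model.load_state_dict(torch.load("pitch_model.pt"))
    model.eval()

    # 2. Preprocess new, unseen audio: resample to 16 kHz and cut into 1024-sample frames.
    audio, sr = librosa.load("new_clip.wav", sr=16000)
    frames = librosa.util.frame(audio, frame_length=1024, hop_length=160).T
    frames = torch.from_numpy(frames.copy()).float().unsqueeze(1)   # (n_frames, 1, 1024)

    # 3-4. Feed the data through the model, layer by layer, and read off the output.
    with torch.no_grad():
        logits = model(frames)
    bins = logits.argmax(dim=1).numpy()

    # 5. Post-process: map pitch bins back to Hz (simplified 20-cent grid starting at C1).
    pitch_hz = 32.70 * 2.0 ** (bins * 20 / 1200)
    print(pitch_hz[:10])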

Audio Synthesis in Model Inference:

If the model's output is not a raw audio waveform, additional processing may be required to synthesize or decode the audio. This can involve model-specific techniques such as spectrogram inversion or vocoder-based waveform generation. Once generated (and possibly post-processed), the audio data, whether waveform samples or a numerical array, can be used for further processing or saved to an audio file.
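
For example, here is a minimal sketch of spectrogram inversion with the Griffin-Lim algorithm (a synthetic test tone stands in for the model's magnitude-spectrogram output; modern systems typically use neural vocoders instead):

    import numpy as np
    import librosa
    import soundfile as sf

    # Stand-in for a model output: a magnitude spectrogram (frequency bins x frames).
    mag_spec = np.abs(librosa.stft(librosa.tone(220, sr=16000, duration=1.0)))

    # Griffin-Lim iteratively estimates the missing phase and inverts to a waveform.
    waveform = librosa.griffinlim(mag_spec, n_iter=32)

    # Save the synthesized audio to a file.
    sf.write("converted.wav", waveform, 16000)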

Having walked through the entire voice conversion procedure, let's review the key takeaways and applications of AI-powered voice conversion methods.

AI-assisted voice conversion opens up exciting possibilities for altering voices while preserving linguistic content. Using AI techniques such as pitch estimation and convolutional neural networks, we can achieve astonishing results. However, ethical issues must be considered in order to use these technologies responsibly. By embracing the promise of AI, we can transform communication and human-technology interaction. The future of voice conversion holds limitless possibilities, so let's keep exploring it.
