State-of-the-art Singing Voice Conversion methods

Naotake Masuda
Published in Qosmo Lab
Sep 21, 2023

Introduction

Singing voice conversion (SVC) deals with the task of converting the vocals of singer A (the source vocals) to sound like singer B (the target singer). Such a technique not only lets you sing like your favorite singer, but might also let you morph voices and create new vocal timbres. In this article, we’ll cover the basics of singing voice conversion and explain the mechanisms of state-of-the-art models. We will not be going over how to run them on your PC in this article. For a demo of what these models can do and how to run them, check out this excellent video by Nerdy Rodent.

Tutorial and demo of RVC, a popular tool for singing voice conversion

SVC is similar to voice conversion (VC), which deals with normal speech, but SVC focuses on singing, with the pitch following the source vocals. Another related field is singing voice synthesis (SVS), which deals with generating a singing voice from a melody line and lyrics, similar to VOCALOID.

Measures of SVC performance

Let’s first consider what’s needed for a good SVC system. One goal is to increase the naturalness of the generated vocals. This includes sound quality, intonation and other factors that make the generated vocals sound like those of a real human. But a system may be effective at sounding natural yet fail to capture the qualities of the target singer. So another goal is to increase the similarity of the generated vocals to those of the target singer. Measuring these often requires subjective evaluation experiments with human listeners.

Parallel and Nonparallel SVC

Parallel (one-to-one) SVC refers to systems that can only convert songs from one specific singer to another specific singer, trained on parallel data. Such a system has limited use, and it’s difficult to prepare a dataset of two singers singing the same songs.

Nonparallel SVC is a more common approach. There are two types of nonparallel SVC: any-to-one and any-to-many. An any-to-one SVC system can convert anyone’s source vocals to those of a single target singer. On the other hand, an any-to-many SVC system can convert vocals by any singer to those of various target singers, with the singer ID often provided as a conditioning vector. An example of an any-to-many SVC system is FastSVC.

Most recent developments in SVC (as we’ll discuss later) tend to focus on any-to-one systems, with a single model being trained on a dataset of vocals from a single target singer. Multiple any-to-one SVC models can often be combined to create an any-to-many SVC system.

Cross-domain SVC

Sometimes, you may want to convert to the voice of a certain person who is not a singer. We must then train an SVC model with only speech data. This problem is referred to as cross-domain SVC. Cross-domain SVC is known to be more difficult, as speech doesn’t contain the higher registers that singing does. The same system can be used for both in-domain (trained on singing voice) and cross-domain SVC, but the recent Singing Voice Conversion Challenge 2023 showed that most systems tend to display lower naturalness and similarity for cross-domain SVC compared to in-domain SVC.

Disentangling Vocals

Vocals can be disentangled into several factors that lead to their generation.

The goal of SVC is to transform the vocals into those of another person while keeping the linguistic content, pitch (often transposed in the case of male-to-female or female-to-male conversion), and dynamics (loudness). So we need to separate (or, as we say in deep learning, “disentangle”) these elements somehow. In the case of any-to-many SVC, singer characteristics need to be encoded as a singer ID (speaker A = 0, speaker B = 1, …), whereas in any-to-one SVC, the generator network only models the single target singer, so no singer encoding is needed.
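
To make the singer ID idea concrete, here is a minimal PyTorch sketch of how an integer singer ID can be turned into a conditioning vector for the generator. The table size, embedding width and concatenation scheme are illustrative assumptions, not the setup of any particular SVC system.

```python
# Minimal sketch (assumed dimensions): look up a singer ID in an embedding table
# and concatenate the resulting vector to frame-level content features.
import torch
import torch.nn as nn

n_singers, singer_dim, content_dim = 4, 64, 256
singer_table = nn.Embedding(n_singers, singer_dim)

content = torch.randn(100, content_dim)            # frame-level content features
singer_id = torch.tensor(1)                        # "speaker B = 1"
singer_vec = singer_table(singer_id).expand(100, -1)   # same singer vector for every frame

generator_input = torch.cat([content, singer_vec], dim=-1)
print(generator_input.shape)                       # (100, content_dim + singer_dim)
```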

The rest of the factors can be seen as singer-independent and will be useful for SVC. Volume can be calculated as RMS, and pitch (or more accurately, fundamental frequency) can be computed with methods like Harvest (available in PyWORLD), CREPE (PyTorch implementation), or PENN. However, it is often difficult to obtain a representation of the linguistic information. A naive method would be to transcribe the source audio and obtain the linguistic content in the form of lyrics. However, some extra-linguistic content (like intonation, delivery, etc.) is lost this way (there are also problems like time alignment between text and audio). Neural representation learning methods like HuBERT and ContentVec can be used for a richer representation of speech.
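
As a concrete example, here is a minimal sketch of extracting these singer-independent features with librosa and PyWORLD. The file name, sample rate and frame settings are illustrative assumptions, not values from any specific SVC repository.

```python
# Extract loudness (RMS) and fundamental frequency (Harvest) from a vocal recording.
import librosa
import numpy as np
import pyworld

y, sr = librosa.load("source_vocals.wav", sr=44100, mono=True)  # assumed input file

# Loudness: root-mean-square energy per frame
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)[0]

# Pitch: fundamental frequency (f0) estimated with Harvest; 0 Hz marks unvoiced frames
f0, timestamps = pyworld.harvest(y.astype(np.float64), sr, frame_period=10.0)

print(rms.shape, f0.shape)
```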

Behind-the-Scenes of SotA SVC

Recently, singing voice conversion has had a “breakthrough” moment much like image generation did. Development of open-source software for SVC seems to be especially active in Chinese-speaking communities, and English documentation on how these tools work is sparse. We’ll go over the architecture of the models they use. If you’re not familiar with deep learning or just not that interested, it’s fine to skip this section. Note that these frameworks update frequently, and much of the architecture can change.

So-VITS-SVC

So-VITS-SVC is actually a combination of Soft-VC, VITS and a neural source filter (NSF). I’ll go over them one by one to explain how the whole system works.

Soft-VC

I previously mentioned that HuBERT can be used as a speech representation. A lot of previous research in SVC used a feature extractor network (like HuBERT) and then performed k-means clustering to obtain discrete speech units. This discretization is an effective way to remove the influence of the speaker from the speech units. But the authors of Soft-VC found that this discretization can actually destroy some linguistic information, which results in mispronunciations. So the authors propose using a linear projection layer, trained to predict the distribution over the discrete units, to produce soft speech units. All in all, Soft-VC improves the intelligibility and naturalness of the generated speech.
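
A toy sketch of the difference between discrete and soft units might look like this. The random “HuBERT” features, the k-means codebook and the projection layer are stand-ins with assumed dimensions, not the released Soft-VC components.

```python
# Contrast hard (discrete) and soft speech units, following the Soft-VC idea.
import torch
import torch.nn as nn

n_frames, feat_dim, n_units = 200, 768, 100
hubert_features = torch.randn(n_frames, feat_dim)    # stand-in for frame-level HuBERT embeddings
codebook = torch.randn(n_units, feat_dim)            # stand-in for k-means cluster centroids

# Discrete units: assign each frame to its nearest centroid (hard quantization)
discrete_units = torch.cdist(hubert_features, codebook).argmin(dim=-1)   # (frames,)

# Soft units: a linear layer predicts a distribution over the discrete units,
# so fine-grained linguistic detail is not thrown away by hard assignment
soft_projection = nn.Linear(feat_dim, n_units)
soft_units = soft_projection(hubert_features).softmax(dim=-1)            # (frames, n_units)
```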

VITS

VITS is a state-of-the-art speech synthesis method. It is rather complicated, so I can’t explain all the details. First, it utilizes a variational autoencoder (VAE, see this article for an intro) to learn a probabilistic latent representation of audio. It has been suggested that VAEs can improve generalization to unseen conditions in SVC. Then, it uses normalizing flows to learn an invertible mapping between the latent representation of audio and speech representations. At run time, speech representations are converted into VAE latents, which are then decoded into audio. It also features a discriminator network (à la GANs) to improve synthesis quality.
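
To give a feel for what an invertible mapping means here, below is a toy, single-layer affine flow in PyTorch. Real VITS flows are much deeper coupling-based networks, so this only illustrates the invertibility property.

```python
# Toy invertible mapping between two latent spaces: z' = z * exp(s) + b.
import torch
import torch.nn as nn

class AffineFlow(nn.Module):
    """Elementwise affine transform that can be inverted exactly."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_scale = nn.Parameter(0.1 * torch.randn(dim))
        self.shift = nn.Parameter(0.1 * torch.randn(dim))

    def forward(self, z):                       # z -> z'
        return z * self.log_scale.exp() + self.shift

    def inverse(self, z_prime):                 # z' -> z
        return (z_prime - self.shift) * (-self.log_scale).exp()

flow = AffineFlow(dim=192)
z = torch.randn(8, 192)
assert torch.allclose(flow.inverse(flow(z)), z, atol=1e-5)   # round trip recovers z
```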

Neural Source Filter

The neural source filter (NSF) is based on the source-filter model of the human voice. Basically, the vocal cords produce a buzzing pulse sound, which is then filtered by the vocal tract. NSF produces audio in a similar manner. A sine wave is generated according to the fundamental frequency. Then, harmonics are added using convolutional blocks. For the noise component of the voice, white noise is fed into the convolutional blocks instead of a sine wave.
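
Here is a minimal sketch of the “source” half of this idea: turning a frame-level f0 contour into a sine excitation plus a noise input. The frame rate, sample rate and amplitudes are assumptions, and the convolutional “filter” stages of a real NSF are omitted.

```python
# Synthesize a sine excitation from an f0 contour, as in the source part of an NSF.
import numpy as np

sr = 44100
hop = 512                           # samples per f0 frame (assumed)
f0 = np.full(200, 220.0)            # example f0 contour: 200 frames at 220 Hz

f0_per_sample = np.repeat(f0, hop)                     # upsample f0 to the sample rate
phase = 2 * np.pi * np.cumsum(f0_per_sample / sr)      # integrate frequency into phase
sine_source = 0.1 * np.sin(phase)                      # voiced excitation
sine_source[f0_per_sample == 0] = 0.0                  # unvoiced frames get no sine

noise = 0.003 * np.random.randn(len(sine_source))      # input for the noise branch
excitation = sine_source + noise
```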

The important thing to note is that NSFs can only deal with sounds that have a single, definite pitch. Thus, they are ideal for voices but can’t be used for polyphonic audio like piano performances.

So-VITS-SVC architecture

Overview of VITS architecture. Red line follows the inference process (run time).

So overall, the system relies on a version of VITS that has been adapted for SVC. The flow is different for training and inference. The training process goes like this: a variational autoencoder (VAE, left of the diagram) is trained by reconstructing the audio from the spectrogram. The VAE models the latent variable z of the audio. On the side, there is a feature extractor which extracts features important to SVC, like pitch, volume and HuBERT embeddings. These are fed into the content encoder (separate from the VAE encoder) to output z’. A mapping between z and z’ is learned by a normalizing flow.

The inference process is shown by the red arrows in the diagram. First, pitch (fundamental frequency), loudness (or simply volume in terms of RMS), and HuBERT embeddings are extracted from the input vocals. Then, these representations are fed into an encoder network. Using the VITS framework, an invertible flow has been learned between the encoder output and the VAE latent variable. Using this flow, we can calculate the latent variable and then feed it into the decoder network. The decoder uses a neural source filter model to better model speech signals. The neural source filter also takes pitch as input, because the source generator requires pitch to generate a pulse signal.
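
Putting the pieces together, the following toy traces the order of operations at inference time. Every module is a tiny random-weight stand-in rather than the real So-VITS-SVC networks; only the sequence (features, content encoder, inverse flow, decoder conditioned on f0) reflects the description above.

```python
# Toy end-to-end trace of the inference path (stand-in modules, assumed dimensions).
import torch
import torch.nn as nn

n_frames, hubert_dim, latent_dim = 100, 256, 192

content_encoder = nn.Linear(hubert_dim + 2, latent_dim)   # content + f0 + volume -> z'
flow_scale = torch.zeros(latent_dim)                      # toy affine flow: z' = z * exp(s) + b
flow_shift = torch.zeros(latent_dim)
decoder = nn.Linear(latent_dim + 1, 512)                  # stand-in for the NSF decoder (also takes f0)

hubert_emb = torch.randn(n_frames, hubert_dim)            # 1. extracted features
f0 = torch.full((n_frames, 1), 220.0)
volume = torch.rand(n_frames, 1)

z_prime = content_encoder(torch.cat([hubert_emb, f0, volume], dim=-1))   # 2. encode to z'
z = (z_prime - flow_shift) * torch.exp(-flow_scale)                      # 3. inverse of the toy flow
audio_frames = decoder(torch.cat([z, f0], dim=-1))                       # 4. decode, conditioned on f0
```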

RVC

RVC (which stands for Retrieval-based Voice Conversion) is very similar to So-VITS-SVC in terms of model architecture. One major difference is the “retrieval” part: it keeps an index of HuBERT embeddings calculated from the training-data vocals and mixes them back in at runtime. RVC does this to better preserve the singing style and delivery of the target singer in the training data.

Recall that HuBERT embeddings are supposed to be a representation of the linguistic information in speech. But it’s known that singing style can affect the HuBERT embeddings as well, and this can be a problem for conversion. For example, if you wanted to convert your soft vocals into death metal vocals, the SVC model would use HuBERT embeddings calculated from your soft voice. But these embeddings are very different from the HuBERT embeddings of the death metal vocals the model was trained on. As a result, the output might sound too soft compared to the death metal voice you were aiming for. RVC nudges the HuBERT embeddings closer to the embeddings of the target vocals via retrieval and mixing, so that the singing style is kept.
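
A rough sketch of this retrieval-and-mixing step could look like the following. The embedding sizes, the number of neighbors and the blend ratio are illustrative assumptions, though RVC exposes a similar user-facing “index rate” setting.

```python
# Pull each source-frame embedding toward its nearest neighbors in an index built
# from the target singer's training vocals, then blend the two.
import torch

train_embeddings = torch.randn(5000, 256)   # index built from target-singer training vocals
source_embeddings = torch.randn(300, 256)   # embeddings extracted from the source vocals
index_rate, k = 0.5, 4                      # blend ratio and number of neighbors (assumed)

dists = torch.cdist(source_embeddings, train_embeddings)     # (300, 5000) pairwise distances
nearest = dists.topk(k, largest=False).indices                # k nearest training frames per source frame
retrieved = train_embeddings[nearest].mean(dim=1)             # average of the retrieved neighbors

mixed = index_rate * retrieved + (1 - index_rate) * source_embeddings
```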

Shallow Diffusion

Diffusion models have been key to recent advancements in image generation, and they have been applied to the audio domain as well. Using diffusion models for singing voice conversion improves naturalness, but a full diffusion process is too slow to run in real time.

A typical diffusion model would take white noise as input and denoise it into vocals. The shallow diffusion technique instead takes as input a rough spectrogram generated by a fast network, performing denoising for only a limited number of steps. Basically, the diffusion model is only used to refine the outputs of another network. This shallow diffusion technique was introduced in DiffSinger, a singing voice synthesis system, and later adopted by SVC methods like DDSP-SVC, Diffusion-SVC and So-VITS-SVC.
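
Conceptually, shallow diffusion can be sketched like this: diffuse the rough spectrogram to an intermediate step k and run only the last k denoising steps. The noise schedule below is a generic linear one and `denoise_step` is an identity stand-in for a trained denoiser, so this only shows the control flow, not the DiffSinger implementation.

```python
# Shallow diffusion: start from a rough mel-spectrogram at step k instead of pure noise at step T.
import torch

T, k = 1000, 100                                    # full vs. shallow number of steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)               # simple linear noise schedule
alphas_cum = torch.cumprod(1.0 - betas, dim=0)

rough_mel = torch.randn(80, 200)                    # stand-in for the fast network's rough output

# Diffuse the rough mel to step k, i.e. sample from q(x_k | x_0)
x = alphas_cum[k - 1].sqrt() * rough_mel + (1 - alphas_cum[k - 1]).sqrt() * torch.randn_like(rough_mel)

def denoise_step(x, t):
    # stand-in for one reverse-diffusion step of a trained denoiser network
    return x

# Only run the last k denoising steps instead of all T
for t in reversed(range(k)):
    x = denoise_step(x, t)
refined_mel = x
```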

Conclusion

It’s an exciting time for singing voice conversion, but there are several concerns around cloning someone’s voice. It’s all fun and games to make politicians sing silly songs, but professional singers often do not want their voices to be cloned. While some people sell models trained on their own voices, there are many unauthorized voice conversion models of certain people circulating on the Internet. We hope to bring SVC or a voice morphing effect to our VST plugin neutone sometime soon, of course properly crediting the original singers and obtaining their consent.

Useful Resources

Repositories for SotA SVC

Some tools for running realtime voice conversion models

Cool new papers for realtime voice conversion

P-RAVE

  • A paper by iZotope that makes RAVE models work better for SVC.

Real-time Singing Voice Conversion Plugin

  • Another paper by iZotope, which powers the new SVC feature in their VST plugin Nectar.
  • Both this and P-RAVE run with very little latency and far fewer resources compared to RVC.


Naotake Masuda is an AI/ML Engineer at Qosmo, specializing in the music/audio domain. PhD @ UTokyo.