MusicGen from Meta AI — Model Architecture, Vector Quantization and Model Conditioning Explained

Shrinivasan Sankar
8 min read · Jun 26, 2023

MusicGen, introduced in the paper Simple and Controllable Music Generation, is the latest work from Meta AI that promises to generate exceptional-quality music with a single language-model architecture. It achieves this using an efficient token interleaving technique. Although it generates music for a maximum duration of 8 seconds, it can be conditioned on text or melody prompts to control the generated output.

The MusicGen architecture is built on top of the EnCodec project proposed by Meta in late 2022. On top of the encoder, quantizer and decoder from EnCodec, MusicGen adds conditioning modules to handle text and melody input prompts. This article dives deeper into the MusicGen architecture, starting from vector quantization, then the decoder, and finally the conditioning modules that integrate text and melody conditioning into the decoder.

Encoder

The 3 stages of the EnCodec model showing the encoder, quantizer and decoder (taken from the EnCodec paper, “High Fidelity Neural Audio Compression”)

As MusicGen is built on top of the EnCodec architecture, let's start by understanding EnCodec and see what has changed. The end-to-end architecture of EnCodec has 3 parts, namely the encoder, the quantizer and the decoder. The encoder is fairly straightforward, like in any other auto-encoder module, and it has not been changed much in MusicGen. It is a standard convolutional architecture that produces a vector representation for every frame of the input.

The next step after the encoder is Residual Vector Quantization (RVQ), which is a special type of Vector Quantization (VQ). So we review VQ before explaining RVQ.

Vector Quantization

Simply put, VQ is the process of mapping continuous (or high-cardinality discrete) vectors to a small, finite set of representative vectors. The key goal of quantization is to get a compact representation so that we achieve data compression. There are 3 steps involved in the process. To understand it, let's take a look at 2-dimensional data with dimensions x and y. First, we cluster the given data and take the centroid of each cluster, shown as blue dots. We can then put all these centroids together in a table. This resulting table is called a codebook. Needless to say, the more centroids there are, the larger the table has to be to capture all the centroid values. In this toy example, let's say we have 8 centroids. The number of bits needed to uniquely represent 8 centroids is 3, as 2³ is 8. So we need at least 3 bits per vector to compress this 2D data with 8 centroids; each data point is then stored simply as the index of its nearest centroid.

Note that the size of the codebook built this way depends on the target level of compression we need to achieve.
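To make this concrete, here is a minimal sketch of vector quantization with NumPy. It assumes we already have the 8 centroids from clustering; the centroid values below are made up purely for illustration.

```python
import numpy as np

# Toy codebook: 8 centroids of 2-D data (values are made up for illustration).
codebook = np.array([
    [0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [1.5, 1.5],
    [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [1.0, 1.0],
])

def quantize(x, codebook):
    """Return the index of the nearest centroid and the quantized vector."""
    distances = np.linalg.norm(codebook - x, axis=1)  # distance to every centroid
    idx = int(np.argmin(distances))                   # 3 bits are enough to store 8 entries
    return idx, codebook[idx]

idx, x_hat = quantize(np.array([0.9, 0.9]), codebook)
print(idx, x_hat)  # the 2-D point is stored as a single 3-bit index
```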

Limitations of VQ

Let's look at a practical example to understand the limitations of VQ. Say the target budget is 6000 bits per second, and we need to compress the input audio within this budget. Assume the input audio is sampled at 24,000 Hz and we downsample (stride) it by a factor of 320. As a result, we get 75 frames per second. So, in order to achieve 6000 bits per second at the output, we have to allocate 80 bits to each frame. This enforces a constraint on the codebook: its size, i.e. the number of centroids, has to be 2⁸⁰, which is an astronomically large number. Moreover, each frame arriving at the VQ stage comes from an encoder network, so it has a dimension of around 128, and this complexity only grows if we want to improve the quality of quantization. In summary, the complexity of plain VQ blows up very quickly, so we need a more practical quantization technique.
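The arithmetic behind these numbers is easy to verify:

```python
sample_rate = 24_000      # input audio sampled at 24 kHz
stride = 320              # downsampling factor of the encoder
target_bitrate = 6_000    # bits-per-second budget

frame_rate = sample_rate / stride             # 75 frames per second
bits_per_frame = target_bitrate / frame_rate  # 80 bits per frame
codebook_size = 2 ** int(bits_per_frame)      # 2**80 centroids for plain VQ

print(frame_rate, bits_per_frame, codebook_size)
```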

Residual Vector Quantization

The solution to the exponential codebook growth of VQ lies in RVQ. Here, "residual" implies multi-stage quantization. Instead of having one codebook, we now have Nq codebooks, where Nq is the number of quantizers and is a number we choose.

We illustrate with 4 codebooks, which is also the number used in the MusicGen implementation. The idea is that the input and the first codebook are used to get the first quantized output. This output is then subtracted from the input, giving the residual for that stage. The residual is passed to the next codebook to obtain its output, and so on at each stage. One thing to note is that the number of centroids per codebook drops from 2⁸⁰ to 2²⁰ when the 80-bit budget is split across 4 codebooks. If we chose Nq as 8, this number would further reduce to 2¹⁰, which is 1024. We thus get one output index per codebook. We can either sum the corresponding codebook entries into a single output or process them in other ways; MusicGen addresses this by using interleaving patterns.
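A minimal sketch of the RVQ idea with NumPy. The codebooks here are random and small purely for illustration (the 80-bit example above would give 2²⁰ entries per codebook with 4 codebooks); in practice they are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_codebooks, codebook_size = 128, 4, 1024

# One codebook per quantization stage (random here, learned in practice).
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(n_codebooks)]

def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage quantizes the residual of the previous one."""
    residual = x
    indices = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]  # pass what is left over to the next codebook
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entries of all codebooks."""
    return sum(cb[idx] for cb, idx in zip(codebooks, indices))

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)   # e.g. [k1, k2, k3, k4]
x_hat = rvq_decode(codes, codebooks)
```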

Interleaving Patterns

Different interleaving patterns for the codebook outputs (taken from the MusicGen paper, “Simple and Controllable Music Generation”)

Now that we get 4 outputs k1, k2, k3 and k4 per input frame from RVQ, there are different ways we can order, or interleave, them before feeding them into the decoder. Let's look at the standard way of doing this, which is simple flattening. For each time step t, for example t1, we get 4 outputs. We can simply flatten these to fill sequence steps s1 to s4. We then take the four outputs for time t2 and lay them out flat to fill sequence steps s5 to s8. This way of laying everything out in sequence is the flattening pattern.

If we instead stack the outputs of all 4 codebooks on top of each other, so that one sequence step s holds all the codebook outputs of a given time step t, we arrive at the parallel pattern. In the figure above, s1 holds all 4 outputs of time step t1.

In the case of the delay pattern, we introduce a one-step delay per codebook, so that a token's position in the sequence also indicates which codebook it comes from.

We can also resort to the pattern introduced in the VALL-E paper. Here we first lay out the output of the first codebook for all the time steps, and then switch to the parallel pattern for the remaining codebooks. For this reason, the VALL-E pattern takes twice as many sequence steps as the parallel or delay patterns.
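To make the patterns concrete, here is a minimal sketch in plain Python (the token values are placeholder strings) of how the flattening, parallel and delay patterns rearrange the codebook tokens. A real implementation would fill the gaps left by the delay pattern with special padding tokens.

```python
# codes[t][k] = token from codebook k at time step t (4 codebooks, 3 time steps here).
codes = [["k1_t1", "k2_t1", "k3_t1", "k4_t1"],
         ["k1_t2", "k2_t2", "k3_t2", "k4_t2"],
         ["k1_t3", "k2_t3", "k3_t3", "k4_t3"]]
T, K = len(codes), len(codes[0])

# Flattening: one sequence step per (time step, codebook) pair -> a 4x longer sequence.
flattened = [codes[t][k] for t in range(T) for k in range(K)]

# Parallel: one sequence step per time step; each step holds all K tokens at once.
parallel = [tuple(codes[t]) for t in range(T)]

# Delay: codebook k is shifted right by k steps, so one sequence step mixes tokens
# from different time steps; None marks gaps that become special tokens in practice.
delay = []
for s in range(T + K - 1):
    step = tuple(codes[s - k][k] if 0 <= s - k < T else None for k in range(K))
    delay.append(step)

print(flattened[:4])  # ['k1_t1', 'k2_t1', 'k3_t1', 'k4_t1']
print(parallel[0])    # ('k1_t1', 'k2_t1', 'k3_t1', 'k4_t1')
print(delay[1])       # ('k1_t2', 'k2_t1', None, None)
```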

Codebook Projection and Positional Embedding

Let's take the flattening pattern as an example to learn how these codebook patterns are used. Say we are at sequence step s2. We first note which codebooks are involved in that particular step; in the example in the figure above it is k2. As we have access to the entire codebook, we can always retrieve the values corresponding to the indices. We sum these values for each sequence step to form the representation of that sequence step. Additionally, a sinusoidal positional embedding is added to each sequence step, and the summed result is passed to the decoder.
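Here is a hedged sketch of this step in PyTorch: each token index is looked up in a table, the contributions active at a sequence step are summed, and a sinusoidal positional embedding is added. The sizes are illustrative, not the exact MusicGen hyperparameters, and the lookup tables are learned nn.Embedding modules for simplicity; per the description above, one could equally look up the EnCodec codebook vectors themselves.

```python
import math
import torch
import torch.nn as nn

n_codebooks, codebook_size, d_model = 4, 1024, 512  # illustrative sizes

# One embedding table per codebook to project token indices into the model dimension.
codebook_emb = nn.ModuleList(
    nn.Embedding(codebook_size, d_model) for _ in range(n_codebooks)
)

def sinusoidal_positions(seq_len, d_model):
    """Standard sinusoidal positional embeddings."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# tokens[s] lists the (codebook_id, index) pairs active at sequence step s.
# With the flattening pattern there is exactly one pair per step.
tokens = [[(0, 17)], [(1, 942)], [(2, 3)], [(3, 256)]]

steps = torch.stack([
    sum(codebook_emb[k](torch.tensor(idx)) for k, idx in step) for step in tokens
])
decoder_input = steps + sinusoidal_positions(len(tokens), d_model)
```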

Model Conditioning with Text and Audio

What makes MusicGen special is the ability to condition generation either on text or on a melody such as whistling or humming. If the conditioning is text, one of three text encoders is used. First, the authors explore T5 (Text-to-Text Transfer Transformer), a pretrained text encoder published in 2020. They also try FLAN-T5, an instruction-finetuned variant of T5. Finally, since intuition suggests that combining text and audio for conditioning would do a better job, they also explore CLAP, a joint text-audio embedding model.
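As a hedged sketch of the text-conditioning step, here is how a text prompt could be encoded with a pretrained T5 encoder via Hugging Face transformers. The checkpoint name and usage are illustrative; MusicGen's actual conditioning pipeline lives in Meta's audiocraft codebase.

```python
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-base")     # illustrative checkpoint
text_encoder = T5EncoderModel.from_pretrained("t5-base")

prompt = "lo-fi hip hop beat with mellow piano"
inputs = tokenizer(prompt, return_tensors="pt")
conditioning = text_encoder(**inputs).last_hidden_state  # (1, seq_len, hidden_dim)

# `conditioning` is what the decoder attends to through its cross-attention blocks.
```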

If we want to condition on a melody such as whistling or humming, we should train with that information as well. For this, one option is to use the chromagram of the conditioning signal. The chromagram consists of 8 bins. When the raw chromagram is used without modification for training, the model tends to overfit, so an information bottleneck is introduced by keeping only the dominant time-frequency bin at each time step before training.
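A hedged sketch of how such a chromagram could be computed with librosa. The file name, sample rate, bin count and bottleneck step here are illustrative and not necessarily what MusicGen uses internally.

```python
import numpy as np
import librosa

# Load the melody prompt (whistling, humming, ...) as a mono waveform.
y, sr = librosa.load("melody_prompt.wav", sr=32_000, mono=True)  # hypothetical file

# Chromagram: energy per pitch-class bin over time, shape (n_chroma, n_frames).
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

# Information bottleneck: keep only the dominant bin at each time step,
# so the model cannot simply copy the full chromagram and overfit.
dominant = np.argmax(chroma, axis=0)
condition = np.zeros_like(chroma)
condition[dominant, np.arange(chroma.shape[1])] = 1.0
```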

Decoder

These codebook projections, along with the positional embeddings, are passed to a transformer-based decoder. There are L layers in the decoder, each consisting of a causal self-attention block and a cross-attention block. The decoder additionally takes as input the conditioning C, which can be either text or melody. If it is text, the cross-attention block attends to the conditioning signal C after the text has been encoded by a standard text encoder such as T5. If it is a melody, the conditioning tensor C is converted to a chromagram, preprocessed, and passed as a prefix to the transformer input. As a result, we get generated music conditioned on C.
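A hedged sketch of this layer structure in PyTorch, using the built-in nn.TransformerDecoderLayer, which contains exactly a self-attention block followed by a cross-attention block over a "memory" tensor (the conditioning C here). The dimensions and mask construction are illustrative, not the exact MusicGen configuration.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 4     # illustrative sizes
seq_len, cond_len = 16, 10                 # token sequence and conditioning lengths

layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

tokens = torch.randn(1, seq_len, d_model)      # codebook projections + positional emb.
text_cond = torch.randn(1, cond_len, d_model)  # e.g. a T5-encoded text prompt

# Causal mask so each position attends only to itself and earlier positions.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Self-attention over the token sequence, cross-attention over the conditioning C.
out = decoder(tgt=tokens, memory=text_cond, tgt_mask=causal_mask)
print(out.shape)  # torch.Size([1, 16, 512])
```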

And that is how MusicGen generates music conditioned on either text or melody. I hope that was useful.

Video Illustration

A video is available here, for those readers who might prefer a more visual explanation of MusicGen.

Unless otherwise stated all images are created by the author.

References

Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez, “Simple and Controllable Music Generation” arXiv:2306.05284 [cs.SD], June 2023.

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi, “High Fidelity Neural Audio Compression” arXiv:2210.13438 [eess.AS], Oct. 2022.

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei, “Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers”. arXiv:2301.02111 [cs.CL], Jan 2023.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu, “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” arXiv:1910.10683 [cs.LG], Oct. 2019.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei, “Scaling Instruction-Finetuned Language Models” arXiv:2210.11416 [cs.LG], Oct. 2022.

