Make Music With Artificial Intelligence (OpenAI Jukebox)

Christopher McBride
Published in Analytics Vidhya
6 min read · Nov 19, 2020

The team over at OpenAI has been working on a neural net capable of generating novel music, and it is currently considered state of the art in this field. Their approach models raw audio waveforms, a departure from previous attempts that generated music symbolically (e.g. as piano rolls). This has a number of implications for the model’s output, the most important being the ability to reproduce a singing human voice. They’ve also made their code and the trained models publicly available so that we can try it out for ourselves.

The Jukebox website can be found here, but for convenience you can click this link to open a Google Colab page with a Python notebook containing all of the setup and example usage for generating music with the Jukebox model.

The notebook begins by installing the code from their GitHub repository, checking that a GPU is in use, and importing the necessary libraries.

!pip install git+https://github.com/openai/jukebox.git
!nvidia-smi
import jukebox
import torch as t
import librosa
import os
from IPython.display import Audio
from jukebox.make_models import make_vqvae, make_prior, MODELS, make_model
from jukebox.hparams import Hyperparams, setup_hparams
from jukebox.sample import sample_single_window, _sample, \
    sample_partial_window, upsample
from jukebox.utils.dist_utils import setup_dist_from_mpi
from jukebox.utils.torch_utils import empty_cache
rank, local_rank, device = setup_dist_from_mpi()

For the next part, you need to make one change to their notebook. If you want to save the music that you generate, you have to mount your Google Drive. To do that, create a new cell and copy/paste the following code.

from google.colab import drive
drive.mount('/content/drive')

When you run the code above, the output will contain a link that asks you to log in with your Google account and then gives you an authorization code that you must copy and paste into the text box in the output. Once you hit enter, it will tell you that you’ve successfully mounted your Google Drive. Now, all you have to do is change the ‘hps.name’ variable in the next code block to reference your Google Drive folder, as shown below.

model = "5b_lyrics" # or "1b_lyrics"
hps = Hyperparams()
hps.sr = 44100
hps.n_samples = 3 if model=='5b_lyrics' else 8
hps.name = '/content/drive/My Drive/'
chunk_size = 16 if model=="5b_lyrics" else 32
max_batch_size = 3 if model=="5b_lyrics" else 16
hps.levels = 3
hps.hop_fraction = [.5,.5,.125]
vqvae, *priors = MODELS[model]
vqvae = make_vqvae(setup_hparams(vqvae, dict(sample_length = 1048576)), device)
top_prior = make_prior(setup_hparams(priors[-1], dict()), vqvae, device)
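If you would rather not write every run straight into the root of your Drive, one option is to point hps.name at a fresh folder per run. The sketch below is not part of the notebook: make_run_dir and the "jukebox" prefix are hypothetical names chosen for illustration, and it assumes Jukebox will happily write into whatever folder hps.name points to.

```python
import os
from datetime import datetime

def make_run_dir(base_dir, prefix="jukebox"):
    """Create and return a timestamped folder under base_dir (hypothetical helper)."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    run_dir = os.path.join(base_dir, f"{prefix}-{stamp}")
    os.makedirs(run_dir, exist_ok=True)  # no error if the folder already exists
    return run_dir

# In Colab, after mounting Drive, this could be used as:
# hps.name = make_run_dir('/content/drive/My Drive/')
```

That way a second run does not land in the same folders as the first.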

The next code blocks specify additional parameters for your model. You can set sample_length_in_seconds, which determines how long the song you produce will be. Note that very long songs take a very long time to process and Colab has a 12-hour session timeout on its free tier, so it is recommended to stick with 60 seconds or less.

sample_length_in_seconds = 60
hps.sample_length = (int(sample_length_in_seconds*hps.sr)//top_prior.raw_to_tokens)*top_prior.raw_to_tokens
assert hps.sample_length >= top_prior.n_ctx*top_prior.raw_to_tokens, f'Please choose a larger sampling rate'
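To see what those two lines do, here is the same arithmetic with plain numbers. The values 128 for top_prior.raw_to_tokens and 8192 for top_prior.n_ctx are assumptions on my part (they are not printed anywhere in the notebook), although their product, 1048576, does match the sample_length passed to make_vqvae earlier.

```python
SAMPLE_RATE = 44100    # hps.sr in the notebook
RAW_TO_TOKENS = 128    # assumed value of top_prior.raw_to_tokens
N_CTX = 8192           # assumed value of top_prior.n_ctx

def rounded_sample_length(seconds, sr=SAMPLE_RATE, raw_to_tokens=RAW_TO_TOKENS):
    """Round the requested length down to a whole number of top-level tokens."""
    return (int(seconds * sr) // raw_to_tokens) * raw_to_tokens

min_length = N_CTX * RAW_TO_TOKENS   # shortest length the assert accepts
length = rounded_sample_length(60)

print(length)                        # 2645888
print(min_length)                    # 1048576
print(min_length / SAMPLE_RATE)      # roughly 23.8 seconds
```

Under these assumptions, a request for anything shorter than about 24 seconds would trip the assert, so "60 seconds or less" really means somewhere between roughly 24 and 60 seconds.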

Next you get to specify an artist, genre, and the lyrics that you want your artist to sing. To determine the available artists and genres that you can select from you can take a look at the Jukebox Sample Explorer.

Jukebox Sample Explorer — jukebox.openai.com

The search function allows you to browse the artists and genres that the model has been trained on; the model is specified in the left-hand column, so pick the entries that match whether you selected 5b_lyrics or 1b_lyrics. And don’t worry, there is a line of code in this section that will warn you if your selection is not available. As for the lyrics, I’ve found it rather amusing to put in text pulled from a random text generator, just to see how the model handles it. But if you’re looking to actually make decent music, you may want to put some effort into this step so that the lyrics contain language patterns the model is likely to have seen before.

metas = [dict(artist = "Death Cab for Cutie",
              genre = "Pop Rock",
              total_length = hps.sample_length,
              offset = 0,
              lyrics = """To shewing another demands to. Marianne property cheerful informed at striking at. Clothes parlors however by cottage on. In views it or meant drift to. Be concern parlors settled or do shyness address. Remainder northward performed out for moonlight. Yet late add name was rent park from rich. He always do do former he highly.""",),] * hps.n_samples
labels = [None, None, top_prior.labeller.get_batch_labels(metas, 'cuda')]

Next we get to some final parameters, including the sampling_temperature. This variable adjusts how much randomness gets incorporated into your song. A value of 1.0 samples from the model’s learned distribution as-is; lower values make the model stick more closely to the most likely musical patterns for the artist and genre, and higher values make the output more random. They recommend a value of .98 or .99.

sampling_temperature = .98
lower_batch_size = 16
max_batch_size = 3 if model == "5b_lyrics" else 16
lower_level_chunk_size = 32
chunk_size = 16 if model == "5b_lyrics" else 32
sampling_kwargs = [dict(temp=.99, fp16=True, max_batch_size=lower_batch_size,
                        chunk_size=lower_level_chunk_size),
                   dict(temp=0.99, fp16=True, max_batch_size=lower_batch_size,
                        chunk_size=lower_level_chunk_size),
                   dict(temp=sampling_temperature, fp16=True,
                        max_batch_size=max_batch_size, chunk_size=chunk_size)]
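For intuition about what a temperature does, here is a minimal sketch of the usual temperature-scaled softmax. This is not Jukebox's own sampling code; it assumes the model applies temperature in the standard way (dividing the scores by it before turning them into probabilities), and the scores are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, dividing by the temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # made-up scores for three candidate tokens
for temp in (1.0, 0.98, 0.5):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

As the temperature drops below 1.0, more of the probability goes to the highest-scoring option, which is why .98 is only a gentle nudge toward the "safe" choice while .5 would be a strong one.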

Finally, we get to the code block that generates the high-level musical sample. By high-level, I refer to the way in which Jukebox generates music: it operates on three levels of audio detail. The top level (level 2) works with the “grainiest”, least-detailed representation and is meant to capture the predominant features of the music. Levels 1 and 0 then add in the fine details of the song through a process called upsampling. Running this block of code takes all of the parameters previously specified and generates our level 2 sample. They say it will take around 10 minutes per 20 seconds of music, but be prepared for it to take up to 20 minutes per 20 seconds.

zs = [t.zeros(hps.n_samples,0,dtype=t.long, device='cuda') for _ in range(len(priors))]
zs = _sample(zs, labels, sampling_kwargs, [None, None, top_prior], [2], hps)
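To get a feel for why level 2 is so much quicker to generate than the lower levels, here is a rough count of how many tokens each level needs for a clip. The compression factors (8, 32 and 128 raw audio samples per token for levels 0, 1 and 2) are my assumption about how Jukebox is configured; they are not shown in the notebook.

```python
SAMPLE_RATE = 44100
# Assumed number of raw audio samples represented by one token at each level.
COMPRESSION = {0: 8, 1: 32, 2: 128}

def tokens_per_level(seconds, sr=SAMPLE_RATE):
    """Approximate token count per level for a clip of the given length."""
    n_samples = int(seconds * sr)
    return {level: n_samples // factor for level, factor in COMPRESSION.items()}

print(tokens_per_level(60))   # {0: 330750, 1: 82687, 2: 20671}
```

Under these assumptions, level 0 has sixteen times as many tokens to fill in as level 2 for the same minute of audio.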

After this section of code is done running, you will be able to access the song you just made in your Google Drive! It will be stored in a folder called level_2 and it will sound very rough, but it will give you an idea of whether you like the general sound before you go through the very time-intensive step of upsampling your song. If you do decide to continue with the upsampling process, the notebook will have you run a block of code to help free up some memory, shown below.

if True:
    del top_prior
    empty_cache()
    top_prior=None
upsamplers = [make_prior(setup_hparams(prior, dict()), vqvae, 'cpu') for prior in priors[:-1]]
labels[:2] = [prior.labeller.get_batch_labels(metas, 'cuda') for prior in upsamplers]

And finally, we get to the incredibly processing-intensive, very time-consuming single line of code that calls the “upsample” function. It is highly recommended that you run this line of code overnight, as it can take several hours. However, the result of all that processing is your final product: a novel song. It may sound a little janky, depending on your input parameters, but it will be your janky. If you wake up in the morning and find that your Google Drive does not contain a level_0 folder with your completed song, you were probably disconnected from the Colab server. In that case, you can try reducing the length of the song you’re trying to produce, or try a paid tier of Colab that is less likely to disconnect you.

zs = upsample(zs, labels, sampling_kwargs, [*upsamplers, top_prior], hps)
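If you come back to a disconnected session, a quick way to see how far things got is to check which level folders actually contain files. This is a small sketch rather than anything from the notebook: finished_levels is a hypothetical helper, and it assumes the output lands in folders named level_2, level_1 and level_0 under the path you gave hps.name.

```python
import os

def finished_levels(output_dir, levels=("level_2", "level_1", "level_0")):
    """Return the level folders that exist under output_dir and are not empty."""
    done = []
    for level in levels:
        path = os.path.join(output_dir, level)
        if os.path.isdir(path) and os.listdir(path):
            done.append(level)
    return done

# Example, using the Drive path from earlier in the article:
# print(finished_levels('/content/drive/My Drive/'))
```

If level_0 is missing from the result, the upsampling did not finish.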

That’s it; the whole process is actually quite straightforward. Good luck in your musical endeavors and may your end result not sound like a demon as some of mine did.
