Generating Ambient Music from WaveNet

Rachel Chen
20 min read · Dec 13, 2017


Stefan Bordovsky, Rachel Chen, Kyle Grier, Danny Sutanto

Introduction

In this post we will outline our motivation and approach for generating ambient music using Google DeepMind’s WaveNet, an audio-generative convolutional neural network. We describe how our team experimented with WaveNet by training on several categories of ambient music and include our insights on the generated results.

Motivation

Ambient sound is a fixture in everyday life that often goes unnoticed. While most people are probably painfully aware of the construction sounds booming, beeping, and banging from the new apartment building at 8am, they probably don’t always notice the gentle hum of a refrigerator at home, the clink of dishes at a coffee shop, or the soft roar from an interstate highway just down the road. Replace the unnoticed ambient sound with silence, however, and you dramatically change a setting’s character. Ambient sound seems to subtly, but significantly, alter the human experience, to the point that millions of people watch YouTube videos just to recreate the auditory experience of everyday activities like sitting in a coffee shop or pacing through a forest. Video game designers, too, have realized the importance of ambient sound to building an engrossing world: ambient sound and ambient music appear in most video games on the market, especially in the open-world genre (think Skyrim, the Witcher III, Goat Simulator).

Clearly, ambient music and ambient sound tracks have some value in modern-day society. According to PayScale, the average Sound Designer salary in the US is around $50,000. Our group members began to wonder: what if we could automate the process of ambient sound/music generation, redistributing these $50,000 salaries to our own bank accounts? Automatic ambient noise generation could be sold as a service, provided for free on a platform with ad-based income, or used as a building block for better generative models. Applications could range from study music to cinematic audio to music/noise for procedurally-generated video game worlds. Our team found all these uses for an ambient sound generator sufficiently motivating to take on the project for our Data Science Lab final project. We looked into Google DeepMind’s WaveNet model to implement our music generation.

What is WaveNet?

WaveNet is a deep generative model of raw audio waveforms. Training on raw audio (usually 16000–64000 samples per second) is computationally expensive. Although RNNs had been successful at generating music from symbolic encodings such as MIDI, no reliable method existed for training on large sets of raw audio samples before WaveNet. WaveNet takes the unorthodox approach of using convolutional neural networks to make predictions influenced by all previously seen observations. This kind of global context modeling is usually reserved for RNNs, but WaveNet captures it using dilated convolutions, which widen the receptive field and preserve long-term dependencies in the network. Put simply, every previous input can influence the next sampled data point. The figure below illustrates how inputs are combined to produce an output and how some inputs are skipped because of dilation. The example shows convolution with a kernel size of 3 (3x3). In the first image, all neighboring pixels participate in computing the output. The second image takes every 2nd input (that is, dilates the input) at a dilation rate (or factor) of 2. The third image takes every 4th input, for a dilation rate of 4. A regular convolution takes every input, so its dilation rate is 1.

Dilated Convolution
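To make the dilation idea concrete, here is a minimal NumPy sketch of our own (not code from the WaveNet implementation) of a 1D convolution whose kernel taps are spaced a dilation factor apart:

import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    # Naive 1D dilated convolution: kernel taps are spaced `dilation` samples apart.
    k = len(kernel)
    span = (k - 1) * dilation                   # how far the filter reaches across the input
    out = np.zeros(len(x) - span)
    for t in range(len(out)):
        taps = x[t : t + span + 1 : dilation]   # taps at t, t + dilation, t + 2*dilation, ...
        out[t] = np.dot(taps, kernel)
    return out

x = np.arange(16, dtype=float)                  # a toy "waveform"
kernel = np.array([1.0, 1.0, 1.0])              # kernel size 3

print(dilated_conv1d(x, kernel, dilation=1))    # every input: a regular convolution
print(dilated_conv1d(x, kernel, dilation=2))    # every 2nd input
print(dilated_conv1d(x, kernel, dilation=4))    # every 4th input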

By stacking dilated convolutional layers, one can rapidly grow the receptive field, creating a deep neural network in which a very large number of past inputs serve as predictors of the final output.

Dilated Convolutional Layers

The network above is a good way to visualize how dilated convolutions work on 1D inputs. Covering more samples of an audio signal requires more hidden layers, but because each successive layer increases its dilation, the receptive field grows exponentially with the number of hidden layers.
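As a rough back-of-the-envelope check of this exponential growth: with a kernel size of 2 and dilations that double at every layer (the scheme illustrated in the WaveNet paper), only ten layers are needed to cover roughly a thousand past samples.

# Receptive field of a stack of dilated convolutions with kernel size 2.
# Each layer at dilation d lets the output see d additional past samples.
dilations = [2 ** i for i in range(10)]    # 1, 2, 4, ..., 512
receptive_field = 1 + sum(dilations)
print(dilations)
print(receptive_field)                     # 1024 samples covered by just 10 layers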

Causal Convolution

For reference on how ordinary causal convolutions differ from dilated convolutions, see the figure above. Dilated layers draw on a much wider spread of the original inputs for each sample prediction, which gives the dilated convolution a close approximation of global context.

Due to the lack of obvious quantitative measures of performance, we compared the generated samples to the input audio fed into WaveNet using our own subjective sense of quality. This subjective measurement can't perfectly demonstrate how well we trained the network, but it does show what the network can capture and gives a good sense of the power of WaveNet.

Our Implementation

Below we describe how we collected data and used WaveNet to train on these samples and generate ambient music. Specifically, we walk through the pipeline we used to generate sad ambient music.

Ambient music contains many different subgenres, and we chose several of them for our project:

  • Sad
  • Fantasy
  • Fan
  • Happy
  • Forest
  • Cafe Noise
  • Bossa Nova

We discuss our success (or lack thereof) in training WaveNet on each subgenre. Additionally, in our experimentation we ventured outside the ambient sphere and tried our hand on training and generating Christmas music and Jazz Piano music. We have included our insights on the results and how they differ from those of the ambient outputs.

Data Scraping

To start, we collected sound samples of each subgenre on which we wanted to train the Tensorflow-based WaveNet model. We used YouTube's vast library of audio as the source of all our data, manually searching for relevant ambient tracks and looking for upwards of 8.5 hours' worth of unique (non-repeating) audio per subgenre. To download audio from the results, we wrote Python functions that leverage urllib (a library for formatting URL requests and parsing responses) and youtube_dl (a library for downloading video/audio from YouTube URLs). Our scraping code can automate the entire downloading process, yet we found that this automation was not ideal for all of our data acquisition needs. On inspecting the various links from which our YouTube scraping code collected ambient audio, we found two issues that could negatively impact our dataset:

  1. Many of the videos were multiple-hour-long loops of the same, short audio segment.
  2. A majority of ambient music videos are compilation videos, and frequently share songs with other videos in the same category of ambient music.

We were concerned with the first issue because training on repeated data would, in a sense, “overfit” an instance of WaveNet, keeping the model from a generalized understanding of a music/noise genre. The compilation video issue presents the same “overfitting” hazard, but adds the risk of downloading audio with silence between clips. To deal with both the looped-audio and compilation video problems, we manually vetted the videos we selected: we sampled each video’s audio at random intervals, read the description text, and perused the comments to confirm that the audio was unlooped, didn’t include songs already in our dataset, and had minimal periods of silence. The silence issue was further mitigated by the Tensorflow WaveNet implementation’s built-in ability to trim silences by specifying a threshold value during training, so silence in audio tracks was not hugely problematic; we also simply avoided picking tracks with long gaps of silence.
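For reference, here is a simplified sketch of the kind of download helper described above. It uses youtube_dl's documented Python API; the URL and output directory below are placeholders rather than links from our actual dataset.

import youtube_dl  # pip install youtube_dl (conversion to .wav also requires ffmpeg)

def download_as_wav(video_urls, out_dir="raw_audio"):
    # Download the audio track of each YouTube URL and convert it to .wav.
    ydl_opts = {
        "format": "bestaudio/best",
        "outtmpl": out_dir + "/%(title)s.%(ext)s",
        "postprocessors": [{
            "key": "FFmpegExtractAudio",
            "preferredcodec": "wav",
        }],
    }
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        ydl.download(video_urls)

# Hypothetical call; every URL was vetted by hand before being added to a list like this.
download_as_wav(["https://www.youtube.com/watch?v=XXXXXXXXXXX"])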

Once we had downloaded .wav files for our selected videos, we chunked the audio into 30-second .wav files using pydub’s AudioSegment utility. We chose 30-second chunks because we believed they would preserve song structure while reducing our audio files to more manageable sizes (making file transfer and AWS instance space management easier), though in retrospect we have little reason to believe that splitting the files into 30-second segments made much of a difference in performance. Using Linux’s scp command, we transferred our audio data to individual AWS p2.xlarge instances where we had cloned the Tensorflow WaveNet implementation. At this point, each member of our team began experimenting with their own dataset to see what generative results we could produce.
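A minimal version of the chunking step, using pydub's AudioSegment as described above (the input file name below is a placeholder):

import os
from pydub import AudioSegment  # pip install pydub (also requires ffmpeg)

def chunk_wav(path, out_dir, chunk_ms=30 * 1000):
    # Split one .wav file into consecutive 30-second .wav chunks.
    audio = AudioSegment.from_wav(path)
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(path))[0]
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk = audio[start:start + chunk_ms]   # pydub slices are indexed in milliseconds
        chunk.export(os.path.join(out_dir, "%s_%04d.wav" % (base, i)), format="wav")

chunk_wav("sad_ambient_compilation.wav", "wavfiles")  # hypothetical file name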

The samples we used were obtained by running searches like ‘sad ambient music’ on YouTube and randomly selecting more than 8.5 hours’ worth of videos playing sad ambient music. The following table lists the YouTube videos we used to train our WaveNet model for sad music.

Selected YouTube Audio Samples for Sad Ambient Music Training

Once we had selected videos, we downloaded them as .wav audio files and split them into 30-second samples using Python. The total duration of the audio we collected for the “sad ambient music” category was 9 hours 47 minutes, or about 6.61 GB of audio files.

Training WaveNet

Once the audio samples were collected, we began training our model using a Tensorflow implementation of WaveNet. The following command was used to train our WaveNet model:

python train.py --data_dir ~/wavfiles/ --num_steps 150000 --silence_threshold 0 --max_checkpoints 300 --checkpoint_every 300

The training command above tells our WaveNet model to run 150000 steps at the default learning rate of 0.001 on all the .wav files contained in the directory “~/wavfiles/”, and not to ignore any .wav files even if they are perceived as mostly silent. It also tells the training script to save the model as a ‘checkpoint’ every 300 steps and to keep at most 300 checkpoints at a time in order to save storage space. Additionally, we modified the train.py script to save a checkpoint whenever the calculated loss for that step dropped below an arbitrarily low value such as 1.0. The checkpoint feature was convenient because it allowed us to terminate the training script at any point and pick up where we left off to continue training our model.
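Our tweak to train.py was small; the sketch below shows the idea in isolation. Names like sess, loss, optim, and saver stand in for the corresponding objects in the Tensorflow implementation, so treat this as an illustration of the change rather than a drop-in patch.

LOSS_SAVE_THRESHOLD = 1.0   # the arbitrary "good enough" loss we chose

for step in range(start_step, args.num_steps):
    loss_value, _ = sess.run([loss, optim])          # one training step

    # Periodic checkpoint, as in the original script.
    if step % args.checkpoint_every == 0:
        saver.save(sess, checkpoint_path, global_step=step)

    # Our addition: also checkpoint whenever the loss dips below the threshold,
    # so an unusually good model is never lost between periodic saves.
    if loss_value < LOSS_SAVE_THRESHOLD:
        saver.save(sess, checkpoint_path, global_step=step)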

For this project, our team was only able to complete 118500 of the 150000 training steps due to time and monetary constraints. We used an Amazon Web Services p2.xlarge EC2 instance to train our WaveNet model on a GPU, which allowed us to train 118500 steps in approximately 3.5 days, with each step taking roughly 2.5 seconds. At that point, we were unable to continue training due to the growing cost of the AWS instance. We also could not continue training on our own laptops, which lacked the computing power to train the model at a reasonable speed (approximately 1 minute per step). Since the model required at least 20000 steps to generate something somewhat recognizable and around 80000 steps for something somewhat coherent, continuing on our own computers was not practical.

Generating a Model

In order to generate .wav files of sad ambient music using our trained model, we used the Python script ‘generate.py’, which is provided in the same GitHub repository as the Tensorflow implementation of WaveNet. The following command is an example of what we used to generate the output .wav files:

python generate.py --wav_out_path=generated20500.wav --samples 64000 ./logdir/train/2017-12-06T12-02-55/model.ckpt-20500

The command above tells our WaveNet model to generate 4 seconds of sad ambient music (--samples specifies how many audio samples to generate; 16000 corresponds to 1 second by default) using the model checkpoint saved at step 20500, located at “./logdir/train/2017-12-06T12-02-55/model.ckpt-20500”, and to name the output file “generated20500.wav”. On our laptops, the ‘generate.py’ script took about 5 minutes on average to generate 4 seconds of audio as a .wav file roughly 277 kB in size. Because the training script saves model checkpoints while training, we could generate .wav files from a saved checkpoint without terminating the training script, training and generating simultaneously. To get an idea of how the increasing number of steps was affecting the model, we generated .wav files at different step counts and compared their outputs.
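Since we repeated this command for many checkpoints, a small driver script made the comparison easier. Here is a sketch using Python's subprocess module; the checkpoint steps and log directory are placeholders that follow the command format above.

import subprocess

# Hypothetical checkpoint steps to compare; adjust to whatever exists in your logdir.
steps = [600, 20500, 40000, 117780]
logdir = "./logdir/train/2017-12-06T12-02-55"

for step in steps:
    subprocess.run([
        "python", "generate.py",
        "--wav_out_path=generated%d.wav" % step,
        "--samples", "64000",                 # 64000 samples = 4 seconds at 16 kHz
        "%s/model.ckpt-%d" % (logdir, step),
    ], check=True)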

Sad Ambient Results

When we compared the output .wav files generated at different numbers of steps during training, we found that the output .wav files got better as the number of steps increased. The table below provides a brief description of what we observed in the output .wav files at different numbers of steps:

Observed .wav file output at various steps

The table of observed .wav outputs above gives a general idea of how our model was training on the audio samples it was given. In the first 600 steps, the model had yet to learn much about what sad ambient music sounds like. After 20000 steps, however, it slowly started to pick up simple, faint traces of pitch from the training samples. As the number of steps increased, the model learned more nuanced elements of the training samples and its output became increasingly complex. We can hear in the output .wav files that the model started generating faint, random melodies at roughly 40000 steps, which became more layered with multiple pitches and varying sounds/timbres as training continued. At 117780 steps, our model produced something that resembled a flute playing simple notes in mild rain. It was interesting to note, however, that the model made only slight improvements after reaching a certain number of steps (around 40000). Below is an example of a 4-second sample that our model generated at 117780 steps:

Fantasy Ambient Results

This model was trained on ambient music from the video game The Witcher 3. The music set can be found here.

This data was fed into WaveNet, but this network was one of the few we trained in which we changed the sample_size parameter to 64000. The network was trained for 14500 steps, with losses that oscillated around 2. A sample generated with this network is below.

Fantasy Ambient Generation

As can be heard, WaveNet seems to capture a hint of the string sounds used in the training samples. That is a promising result, especially since it was generated using a sample_size parameter value much smaller than what we usually used. Perhaps by limiting the value of the sample_size parameter and increasing the number of training steps, we could get a cleaner generation that is far more representative of the training set. Fantasy ambient is quite complicated in its musical structure, so the fact that WaveNet was capable of capturing decently representative sounds from the training set speaks to its discerning power.

Fan Ambient Results

This model was trained on ambient sound taken from an hour-long recording of a fan. The fan noise set can be found here. The network was trained for 13900 steps, with losses that oscillated around 1–2. The generated sample from this network is below:

Fan Ambient Generation

The interesting thing about this sound is that, unlike the other genres we tried, it obviously contains no melody or instrument sounds, yet it comes out sounding much like the training examples. We were expecting the final generated product from WaveNet to just sound noisy, but it clearly picked up on the idiosyncrasies of a fan’s sound. This was perhaps our best result in terms of accurately replicating the original training data.

Happy Ambient Results

Happy ambient music is another popular form of ambient music. It typically consists of positive-sounding instrumentals, soft melodies, and uplifting rhythms. We decided to experiment with Happy Ambient not only because tracks of this subgenre are sought out (the videos we trained on had an average of 1.3 million views), but also because we wanted to evaluate WaveNet’s ability to handle music more complex than the piano music it was trained on in this publication.

To find Happy Ambient music, we navigated through YouTube search results for the query “happy ambient music”. The playlist of videos we used can be found here. After the clips were downloaded as .wav files and split into 30-second samples, we manually vetted the tracks for repeated sections, keeping only unique samples.

We then trained on roughly 1000 30-second samples using the following command:

python train.py --data_dir=/home/ubuntu/data/happy_ambient/ --silence_threshold=0.1

Every so often during training, we saved the model at that point and used it to generate 3 seconds of “Happy Ambient” music. From these generated tracks we then chose the best ones to generate longer 10-second tracks. The full list of our results can be found here.

It took several thousand steps for the generated music to sound like anything beyond random clicks and a loud noise resembling wind, but the model produced some decent results fairly early on.

Here you can distinguish a bit of piano riffing and the overall tone sounds positive. It got a bit better with more steps of training as you can hear through this track:

The notes are clearer and less messy-sounding. We hoped that with more training steps the results would keep improving, but that was not the case.

Nevertheless, we kept training, and we stopped when the model saved at step 25650 produced what we consider our best Happy Ambient result:

Happy Ambient was generally hard to generate with WaveNet. With its layered melodies, instruments, and beats, this training set was vastly more complex than the simple classical piano used in the aforementioned paper. Even so, we were able to generate tracks that captured happy and positive tones.

Forest Ambient Results

Initially we trained on nature noises, composing a collection of sounds like birds chirping, flowing water, rain, and thunderstorms. However, since a disproportionate portion of the training data was bird sounds, the generated tracks ended up sounding mostly like birds. When we rebalanced the training set evenly between bird, water, and rain sounds, the model produced results that sounded more like water. We had expected the generations to somehow encapsulate all of these noises, but realized this was going to be quite difficult.

So, we pursued a different direction. Instead of creating a training corpus of disjoint sounds, we found tracks that had all the tones we were looking for intrinsically layered into their structure. We found this in forest recordings that featured bird, water, and wind sounds in both the foreground and background. The tracks we trained on can be found here.

We trained on 934 30-second samples using the following command:

python train.py --data_dir=/home/ubuntu/data/happy_ambient/ --silence_threshold=0.01

A full set of the results can be found here. In the beginning the generated tracks were not reminiscent of forest noises, but resembled a synthesis of crumpled paper and a flowing river:

As the model trained over more steps, we got tracks that contained random shrills, perhaps distorted versions of bird noises heard in training, and long periods of loud gushing water. Some element of flowing water was quite prominent across samples at different training steps. In this one you can hear flowing/trickling water in the foreground and soft bird chirps in the background:

With more training steps, the sounds of water were distinguishably clear, as heard through this sample:

Overall, the results were more successful (they resembled the training data more closely) than those of Happy Ambient, probably because this style of audio had less variance in its training data. The sounds of a rainforest are more consistent than the various kinds of Happy Ambient music the previous model had to train on.

Christmas Music

Given that Christmas songs have recently infiltrated the radio, department stores, and other sacred sound spaces (this report was written in December), we thought it fitting to experiment with training WaveNet on Christmas music. The WaveNet paper had generated English and Mandarin speech and classical piano separately; given the lyrical nature of Christmas music, we wanted to see how WaveNet would fare when the training samples contained both elements.

We found about 9 hours’ worth of Christmas music and trained WaveNet on the .wav files procured from these YouTube videos.

The results were quite interesting, and depending on who you talk to, resemble Christmas music. The full list of the results can be found here.

Like the other subgenres, the early results of training were quite rough. The sound of loud gushing wind was quite prevalent in the tracks WaveNet generated for this subgenre, as you can clearly hear here:

With more training steps, you can decipher a jolly tone amidst the loud wind with faint voices in the background through this track:

The wind noise starts to fade with more training steps. In this sample the tone of the music definitely sounds more holiday-ish (peppy and happy), and the voices are more distinguishable:

However, as mentioned before, more training steps don’t necessarily mean that the models will produce better results. Wind-like shrills and unintelligible noise were still generated at advanced stages of training. The final result we got was at step 57050, which yielded this sample:

While you can still hear a bit of wind, the overall tune sounds jolly and holiday-ish, and the high-pitched melody sounds a bit like the familiar song Rudolph the Red-Nosed Reindeer!

Christmas music was definitely hard to recreate with WaveNet. Clear voices were not generated, but the overall feel of a holiday tune was captured by models at advanced training steps. It would be interesting to see what this model would generate with more training time and more training data.

Cafe Noise

Besides fan and forest ambient noise, our team also had a go at generating ambient coffee shop noise. We collected a few hours of coffee shop noise and set about training a WaveNet model, generating samples at occasional intervals and observing loss to monitor progress. To hear a progression of generation results, view the following playlist on SoundCloud.

The numbers within the range ~10,000 to ~60,000 refer to the number of steps WaveNet trained on cafe noise before generating the specific sample. For generated audio with step values in the 10,000s, we heard mostly static when generating audio from a random vector. In the 20,000s, the noise produced seemed a little less like random static. In the sample taken at training step 20250, some of the noise sounds like a closing door. The 27400-step sample almost sounds like faint murmuring at times, though it develops into a whistling wind sound. Beyond this point, almost all generated samples featured wind sounds even more prominent than in the 27400-step sample.

With the parameters we used, WaveNet did not seem well suited to generating this kind of sound. The loss while training on cafe noise converged near a mean value of 2.7 (with variance down to about 1.8 and up to 4.0). Comparing the quality of our generated samples with our earlier results from fan noise generation, we wondered how strongly loss values correlate with sample quality. We also speculated about what sort of audio might yield faster and better convergence in loss and quality. The obvious difference between fan noise and cafe noise is sound complexity: fan noise is repetitive and consists of a single sound feature. Cafe noise, on the other hand, may be repetitive in the sense that common noises like glass clinking and background chatter come and go fairly consistently, but the timing of these sounds is irregular, and the sounds themselves can vary significantly based on distance, sound source, and other factors. We inferred from our results that soundscape complexity factors into WaveNet’s loss convergence and generative success. Given our time constraints, we did not attempt hyperparameter tuning with the cafe noise dataset, and instead pursued an investigation of sonic complexity and its effects on WaveNet generation.

Jazz Piano & Bossa Nova: A study of training data complexity

To conduct our investigation into sound complexity’s impact on WaveNet’s success, we selected jazz piano music and bossa nova (jazz piano with drums and bass) to control for sound genre, in a sense, while varying soundscape complexity. Launching model training sessions on separate AWS instances for jazz piano training data and bossa nova training data, we recorded the variation of loss over time and generated occasional samples to observe our models’ progress. Some results of this generation can be found in the two SoundCloud playlists linked below:

We made a few key observations while examining the differences between our jazz piano- and bossa nova-trained models. First, the model trained on jazz piano seemed to reflect key characteristics of the training data earlier than the bossa nova model, already capturing the sound of piano chords by 5700 steps. By the time the bossa nova model represented the piano clearly (around 8450 steps), the piano was overlaid with noisy elements, perhaps due to the sounds of a bossa nova drum kit. Though generated samples suggested some elements of bass and even a ride cymbal after 58100 steps of training on bossa nova, we did not see much improvement in sound representation beyond this point. We witnessed a quality plateau with the jazz piano model, too, with peak quality arriving after as few as 5700 steps.

As previously mentioned, we recorded loss data while training our two models and have included plots of loss versus step number (see Figures 1 and 2). Since steps increase linearly with time, the two plots below show the convergence of loss over time during training. For jazz piano, the average loss quickly shrinks to below 2 and exhibits much smaller variance than for the bossa nova model, which converges to a mean loss of around 2.7. Based on our findings from the generated audio, we surmise that the difference in average loss relates to the quality of generated samples, with lower losses roughly equating to higher-fidelity sonic representations of the training data. Our results suggest that WaveNet much more reliably captures elements of sound from less complex soundscapes within a small number of steps. If this is the case, we wonder whether WaveNet might be limited in its ability to generate complex audio soundscapes even at high step counts. Given more time and AWS credits, we would happily set a few instances to run for long periods to get a better understanding of how WaveNet performs at high step counts.

Figure 1: Loss values from the WaveNet model trained on jazz piano music
Figure 2: Loss values from the WaveNet model trained on bossa nova music
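For completeness, here is a sketch of the kind of plotting code that could reproduce plots like the ones above, assuming the per-step losses were saved to a two-column “step,loss” CSV (our own logging convention, not something the WaveNet repository provides); the file names are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

def plot_loss(csv_path, title):
    # Plot training loss vs. step from a "step,loss" CSV logged during training.
    steps, losses = np.loadtxt(csv_path, delimiter=",", unpack=True)
    plt.plot(steps, losses, linewidth=0.8)
    plt.xlabel("Training step")
    plt.ylabel("Loss")
    plt.title(title)
    plt.show()

plot_loss("jazz_piano_loss.csv", "Loss: jazz piano")    # hypothetical file names
plot_loss("bossa_nova_loss.csv", "Loss: bossa nova")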

Conclusion

In the course of this project we experimented with many factors: training with various kinds of ambient music, generating from models at different training steps, and tuning using WaveNet’s sample size parameter.

Here are our main findings:

  • More training steps do not necessarily mean better generated results, but overall, samples generated from models with more training steps offered more complex and clearer tunes and sounds.
  • Training on complex genres (sad, fantasy, happy, Christmas, bossa nova) did not produce as high-quality generated audio; WaveNet produced tracks that sounded more like the training samples when the input was simpler (fan, forest, cafe, piano).
  • Training took a while on our AWS instances, with each step taking roughly 2 seconds; it took about 2–3 days per genre to get the results we did.

As Google DeepMind’s blog post about WaveNet states, their speech-generation models were trained on audio and conditioned on speaker identity. For DeepMind’s music implementation, their models were trained on classical music and conditioned on instrument type. In this project, we experimented with more sonically diverse data across various audio genres and found that results differed based on sound genre. We wonder what more time and more training resources would do if we continued training the WaveNet models on our ambient subgenres. Did we hit a generative quality plateau? Could longer-trained models produce better results? If we were to continue this project, we would like to investigate these questions, and we’d like to experiment with the following:

  • Altering training sample size
  • Controlling for musical complexity
  • Controlling for training step size
  • Using transfer learning from a pre-trained network

We would also have liked to use global conditioning on WaveNet for music generation. DeepMind “found that training on many speakers made it better at modelling a single speaker than training on that speaker alone, suggesting a form of transfer learning.” If this property carries over to music, it is possible that a globally-conditioned WaveNet model trained on all of our input ambient music could have performed better than our genre-specific models.

While we (or rather, our computers) are far from becoming the next big music producers, we still had a lot of fun generating music from WaveNet and hearing the interesting results it produced!

Thanks for reading! If you’re curious about any of our technical implementation or would like to try your hand at generation using our models, refer to our public GitHub repo here: https://github.com/sbordov/EE379K/tree/master/FinalProject.
