Deep DJ: Musical Score Generation for Video

Kaylee Burns
12 min read · May 12, 2018

--

Authors: Kaylee Burns, Vinitra Swamy, and Patrick Yang

Problem Statement and Background

Experienced filmmakers understand that the background music in a scene can dramatically change or augment the sentiment expressed by the visual elements alone. This is why large film studios and well-known directors frequently hire the same Hollywood composers (John Williams for George Lucas, Hans Zimmer for Christopher Nolan). However, amateur content creators, like YouTubers, often lack the capital to produce, commission, or purchase the rights to background music tracks. Our goal is to develop a tool that effectively synthesizes unique background music tracks consistent with the sentiment of the backing visual medium. Such a tool would let amateur content creators greatly improve the quality of their final product at low cost.

In this report, we will discuss how we can exploit the setting of amateur content creation to remove many of the difficulties of raw audio generation to automatically create new, unique scores for frames as we do below:

We will also review the process we undertook in designing an automated music generator, as well as the challenges and difficulties we encountered. Specifically, we will discuss the inherent technical difficulties in raw music generation and dual-medium sentiment embeddings, as well as the ill-defined nature of metrics for generated audio, which leads to ambiguity when evaluating our results.

Background: Raw Music Generation

We determined from our initial experimentation that an end-to-end model that generates raw audio files directly from videos would be unrealistic given our time and resource constraints. If you’ve encountered examples of music synthesis using Recurrent Neural Networks or Hidden Markov Models, you may be surprised at our doubt. Indeed, these approaches can produce very compelling results.

Those models rely on discretized formats of music, like MIDI files, as opposed to the more common but significantly more obfuscated WAV or MP3 file formats. Because we didn't want our model inputs to be limited by this file format, we explored raw audio generation. Here, results are less promising. A state-of-the-art approach like WaveNet can produce a convincing piano score, but its results for genres with multiple simultaneous instruments leave room for improvement:

Results from WaveNet trained on piano music.
Results from WaveNet trained on ambient music.

These two examples are from Piotr Kozakowski and Bartosz Michalak’s blog post on WaveNet. We will revisit the second example during our evaluation of our neural synthesis module, Deep DJ, later in the report.

Background: Aligning Sentiment Analysis

Existing research has explored sentiment-based analysis for various domains, including text, audio, and video. However, most previous work has analyzed sentiment within a single medium without regard to multi-domain applications, and no work to our knowledge has examined matching based on sentiment embeddings trained across multiple distinct sets of media.

Instead of naively combining sentiment analysis from different domains, we propose a unified embedding trained across both domains. We believe this embedding model will be more flexible than multiple disjoint models, since it derives information from both domains at training time. In principle, this allows for more accurate training, because the final model has access to both domains for which it is trained.

A Refined Problem Statement

Given these challenges, we refined our problem statement to focus on 2 key questions that stand between the current findings and a “virtual Hans Zimmer”:

  1. Can we match audio and video based on high-level, subjective concepts, such as sentiment?
  2. Can we improve existing methods of novel music synthesis?

Dataset

One option we considered during development involved extracting background music from videos, attempting to predict the audio directly from the frames, and developing a loss function based on a distance metric between the generated and target songs. However, this is conceptually flawed because multiple songs could be appropriate for the same video. Furthermore, what is appropriate for a particular video could change depending on the emotions the content creator is trying to elicit.

Motivated by this reasoning, we chose to collect a disjoint and unrelated pair of visual and audio datasets, both augmented with emotional annotations. As a proof of concept, we worked with images rather than video, because finding such a pair of disjoint datasets with similar emotion labels proved difficult, and we reasoned that training and inference on images is much faster.

Visual Datasets

We use the Image Emotion dataset: photos of abstract and artistic images were given multiple emotion labels (anger, disgust, fear, sad, amusement, awe, contentment, and excitement) by a number of human annotators. Here are some examples from the dataset:

An image with the label contentment
An image with the label anger
An image with the label sad

Notably, the dataset contains artistic, heavily edited images, which may have introduced unexpected artifacts into our results.

Audio Dataset

We use the GEMS audio dataset, which consists of 40 song excerpts across 4 genres. Each song was annotated with up to 3 categories on the GEMS (Geneva Emotional Music Scale): amazement, solemnity, tenderness, nostalgia, calmness, power, joyful activation, tension, and sadness. Below we provide some examples from the dataset:

Electronic music with the tension annotation
Pop music with the calmness annotation

Other Dataset Considerations

We considered using the YouTube-8M dataset for video data and the Million Song Dataset for audio data. However, upon narrowing our broad problem statement into a music synthesis task rather than a raw music generation task, we chose to focus on the data repositories above.

Evaluation

To evaluate our models with regard to the questions in our problem statement, our metrics measure both the ability of our model to correctly match audio and video emotion annotations and the quality of the generated music relative to the source tracks. We use (1) Percent Recognition, based on the recall of selecting songs with the correct sentiment label given a video, and (2) Audio Spectrogram Quality, based on audited comparisons between source and generated audio spectrograms, for the two halves of our project, respectively.

Our Approach: Deep DJ

Data Preparation

Emotions are difficult labels to work with because of their high subjectivity and inconsistency across tasks. Notice that the emotion labels in the GEMS and Image Emotion datasets don't correspond directly. Because our model relies on comparing visual and audio data with these annotations, we devise a mapping between the GEMS emotion annotations and the Image Emotion Dataset labels grounded in psychology theory. To simplify the problem, we assign a single emotion label to each data point (image or song) by summing across human annotations and selecting the label with the maximum score.
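As a rough sketch of this collapsing step, the code below shows the shape of the data; the mapping dictionary is purely illustrative (it is not our actual psychology-grounded mapping) and the helper names are our own:

```python
import numpy as np

# Illustrative mapping from GEMS categories to Image Emotion labels.
# Our actual mapping is grounded in psychology theory; this dictionary
# only demonstrates the data structure.
GEMS_TO_IMAGE_EMOTION = {
    "joyful activation": "excitement",
    "tension": "fear",
    "sadness": "sad",
}

def collapse_annotations(annotation_counts):
    """Collapse per-annotator votes into a single emotion label.

    annotation_counts: dict mapping emotion label -> number of
    annotators who chose it. Returns the label with the highest total.
    """
    labels = list(annotation_counts.keys())
    scores = np.array([annotation_counts[label] for label in labels])
    return labels[int(np.argmax(scores))]

# Example: collapse_annotations({"sad": 5, "awe": 2}) -> "sad"
```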

Baseline

Our baseline model uses the emotion annotations to randomly select two songs that share the image's label. We use neural style transfer for audio, as described by Dmitry Ulyanov, to combine these songs into a novel track for that image. We choose to randomize song selection so that users can continue to query the model for frame-relevant songs; this is the serendipity of a non-deterministic system!

A visual representation of our baseline model

The biggest drawback of this model is that it requires emotion annotations at inference time. However, as we will show later, it alone is a powerful proof of concept for novel music generation in this setting: to create music we don’t need to generate audio from scratch when we have access to a library of royalty free songs that we can combine in novel ways.
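For concreteness, here is a minimal sketch of the baseline's song-selection step, assuming the song library is a list of (path, label) pairs; the names are hypothetical and the style transfer itself is handled by Ulyanov's code:

```python
import random

def baseline_select(image_label, song_library):
    """Randomly pick two distinct songs that share the image's emotion label.

    song_library: list of (path, emotion_label) tuples.
    Sampling is deliberately random, so repeated queries return
    different song pairs for the same image.
    """
    candidates = [path for path, label in song_library if label == image_label]
    content_song, style_song = random.sample(candidates, 2)
    return content_song, style_song

# The selected pair is then fed to audio style transfer
# (Ulyanov's neural-style-audio code) to synthesize the new track.
```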

Final Model

For our final model we experiment with modifications of song selection and music combination. Specifically, we develop a more informed distribution over our library of songs that doesn’t require annotations at inference time and we use a larger, more complex, pretrained model to extract embeddings for neural style transfer for audio.

Our final model.

Sentiment Aligned Embeddings

To avoid using emotion annotations at inference time in our final model, we create image and sound embeddings that are aligned on sentiment. Our aim is to learn embeddings such that the dot product between an audio feature and an image feature is high if both have the same emotion label. The dot product provides a score for each song in our library, which we use to sample inputs for Deep DJ. This is in contrast with our baseline model, which sampled uniformly at random among songs with the same annotation.
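A minimal sketch of this scoring-and-sampling step, assuming precomputed embeddings stored as NumPy arrays; the softmax sampling and the temperature parameter are our own illustrative choices:

```python
import numpy as np

def score_and_sample(image_emb, song_embs, temperature=1.0):
    """Score each song by its dot product with the image embedding,
    then sample a song index from the resulting softmax distribution.

    image_emb: (d,) array; song_embs: (num_songs, d) array.
    """
    scores = song_embs @ image_emb                # (num_songs,)
    logits = scores / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```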

We train using a Noise Contrastive Estimation (NCE) loss: at each training step, we randomly sample one song with the same emotion as the input image and k songs with different emotion annotations.
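In simplified form, and assuming the image and song embeddings are already computed, the per-example loss looks roughly like the sketch below, where the matching song is treated as class 0 of a (k+1)-way softmax over dot-product scores (function name and exact TensorFlow calls are our own):

```python
import tensorflow as tf

def nce_loss(image_emb, pos_song_emb, neg_song_embs):
    """NCE-style loss for one training example.

    image_emb:     (d,)   embedding of the input image
    pos_song_emb:  (d,)   embedding of a song with the same emotion
    neg_song_embs: (k, d) embeddings of k songs with other emotions
    """
    # Stack the matching song (class 0) with the k noise songs.
    candidates = tf.concat([pos_song_emb[tf.newaxis, :], neg_song_embs], axis=0)  # (k+1, d)
    logits = tf.linalg.matvec(candidates, image_emb)                              # (k+1,)
    return tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=tf.constant(0, dtype=tf.int64), logits=logits)
```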

Our image embedding uses an Inception V3 architecture and we initialize the weights from a model pre-trained on ImageNet. Similarly, our sound embedding uses a SoundNet architecture and we initialize the weights from a model pre-trained on the audio from over 2 million videos.

Deep DJ

The baseline model applies a style transfer loss to audio features extracted from a one-layer convolutional network with random weights. In our final model, we optimize the style transfer loss using audio features extracted from SoundNet.
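A rough sketch of the style transfer objective being optimized, where the feature tensors stand in for activations from SoundNet (final model) or the random one-layer convolution (baseline); the weighting constant alpha is illustrative:

```python
import tensorflow as tf

def gram_matrix(feats):
    """feats: (time, channels) activations for one audio clip."""
    return tf.matmul(feats, feats, transpose_a=True)   # (channels, channels)

def style_transfer_loss(gen_feats, content_feats, style_feats, alpha=1e-2):
    """Content + style loss over audio feature maps.

    The feature tensors are activations from a fixed extractor:
    SoundNet in our final model, a random one-layer convolution in
    the baseline. alpha is an illustrative weighting constant.
    """
    content_loss = tf.reduce_mean(tf.square(gen_feats - content_feats))
    style_loss = tf.reduce_mean(
        tf.square(gram_matrix(gen_feats) - gram_matrix(style_feats)))
    return alpha * content_loss + style_loss
```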

Results

From our experimentation, we determined that generation of raw music of a reasonable quality using neural style transfer was possible, but aligning image and audio embeddings based on sentiment would require more experimentation.

Results: Sentiment Alignment

To our knowledge, no prior work exists on generating embeddings aligned on sentiment in our use case. Because of the subjectivity of the task, we had to be creative with our evaluation, and we developed the Percent Recognition metric for this purpose. For each image in our test dataset, we randomly select k songs with a different sentiment label and 1 song with the same sentiment label. Percent Recognition measures the percent of test images that successfully select the song with the same sentiment label (i.e., the dot product between the image embedding and the matching song's sound embedding is greatest).
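A minimal sketch of how Percent Recognition can be computed from precomputed embeddings (array shapes and names are our own):

```python
import numpy as np

def percent_recognition(image_embs, pos_song_embs, neg_song_embs):
    """Fraction of test images whose matching song scores highest.

    image_embs:    (n, d)    one embedding per test image
    pos_song_embs: (n, d)    embedding of the same-sentiment song
    neg_song_embs: (n, k, d) embeddings of k different-sentiment songs
    """
    pos_scores = np.sum(image_embs * pos_song_embs, axis=1)          # (n,)
    neg_scores = np.einsum("nd,nkd->nk", image_embs, neg_song_embs)  # (n, k)
    return float(np.mean(pos_scores > neg_scores.max(axis=1)))
```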

Our results are averaged over 10 different random initializations each trained for 1000 iterations. We compare against a random baseline for k = 3 noise songs:

Baseline: 25% | Sentiment Aligned Embeddings: 32%

For the art images in the Image Emotion dataset we save 10% for a validation split and 10% for a test split. The model reported in our results uses a learning rate of 1e-3 and a batch size of 32. Unfortunately our results are not much better than baseline. However, we believe that using a larger, less edited dataset could lead to more promising results.

We analysed examples of the audio preferred and not preferred by particular image embeddings. We found that there were no egregious errors and often there was a strong case for audio files selected that did not have the correct emotion label. The performance was likely harmed by our choice to only use the most common emotion annotation for a particular audio track. An interesting elaboration of our model would be to attempt a similar framework that allows multiple emotion labels for a single music source.

Results: Music Synthesis

To analyse the quality of our music generation, we performed an experiment comparing the outputs of WaveNet, Neural Audio Style Transfer, and Deep DJ (style transfer + SoundNet embeddings) across 3 music metrics.

First we compared music generated by WaveNet, Neural Audio Style Transfer, and Deep DJ. WaveNet was trained on ambient music, while Neural Audio Style Transfer and Deep DJ were given the following sounds to synthesize:

Content input for Deep DJ and Neural Style Transfer
Style input for Deep DJ and Neural Style Transfer

Qualitative Results

We present our virtual composers below:

WaveNet trained on ambient music
Neural Style Transfer on above input songs
Deep DJ on above input songs

Harmonic-Percussive Spectrogram Analysis

We analyze the Harmonic and Percussive components of each track separately, splitting apart the Mel Power Spectrogram into two parts. This helps us understand which specific percussive and harmonic features of the two input songs are displayed in the output — without analyzing these two spectrograms separately, it is hard to conclusively attribute features to a certain song.
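This analysis can be reproduced with librosa's harmonic-percussive source separation; the sketch below uses a placeholder file path for one of the generated tracks:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# "generated_track.wav" is a placeholder for one of the outputs above.
y, sr = librosa.load("generated_track.wav")
y_harmonic, y_percussive = librosa.effects.hpss(y)

for title, component in [("Harmonic", y_harmonic), ("Percussive", y_percussive)]:
    # Mel power spectrogram of each component, shown in decibels.
    S = librosa.feature.melspectrogram(y=component, sr=sr)
    plt.figure()
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max),
                             sr=sr, x_axis="time", y_axis="mel")
    plt.title(title + " Mel power spectrogram")
    plt.colorbar(format="%+2.0f dB")
plt.show()
```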

Harmonic-Percussive Spectrogram for WaveNet

While there is some periodicity to the percussive element of the spectrogram, there are no apparent patterns in the music generated by WaveNet.

Harmonic-Percussive Spectrogram for Neural Style Transfer

For the music generated by Neural Style Transfer, there is a pattern to both the harmonic and percussive elements of the spectrogram. The percussive element in particular has sharp, well-defined bursts of energy.

Harmonic-Percussive Spectrogram for Deep DJ

Deep DJ produces a lot of high-frequency noise that lacks an interpretable pattern. This was a surprise to us because Deep DJ can take advantage of high-level music information that vanilla Neural Style Transfer does not have access to.

Chromagram (Pitch Class Features)

We use the harmonic component of the spectrogram (in order to avoid pollution from transients) to generate a chromagram for each sample, representing the energy in each chromatic pitch class as a function of time. These give us a way to compare the pitch differences between generated output and original input. Brighter regions on the chromagram correspond to a greater intensity at that pitch class.
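A corresponding sketch using librosa, again with a placeholder file path:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

# "generated_track.wav" is again a placeholder path.
y, sr = librosa.load("generated_track.wav")
y_harmonic, _ = librosa.effects.hpss(y)

# Chromagram of the harmonic component only, to avoid transient pollution.
chroma = librosa.feature.chroma_cqt(y=y_harmonic, sr=sr)
librosa.display.specshow(chroma, sr=sr, x_axis="time", y_axis="chroma")
plt.title("Chromagram (harmonic component)")
plt.colorbar()
plt.show()
```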

Chromagram for WaveNet
Chromagram for Neural Audio Style Transfer
Chromagram for Deep DJ

Notice how in WaveNet there is no regularity or pattern in the pitch classes.

Music from neural style transfer has sharper and more defined notes, indicated by the long streaks of high activation in one pitch class.

To our surprise, Deep DJ produces music with pitch classes that are less well defined than WaveNet's.

We were surprised by Deep DJ's poor music generation quality, especially considering that the baseline style transfer model uses a single-layer convolution with random weights. Baseline style transfer is reliably better across inputs, so we think it would be interesting to explore why such a simple model is so successful.

Tools

Our code was implemented in TensorFlow and built on top of the SoundNet and Neural Style Audio libraries. We utilized the publicly available TensorFlow retrain code for fine-tuning image features.

For audio processing and analysis, we used NumPy for scientific computing, Matplotlib for visualizations, and the audio processing library librosa.

Lessons Learned

When we first embarked on this project, our goals were lofty and our expectations high. As we became more aware of our time and resource constraints, we were forced to narrow down the problem to the most important questions. Learning how to break down a problem into accomplishable tasks was the most important lesson we learned from this project. Even though there is still a lot of opportunity to improve the quality of background music generation, our project answers key questions in accomplishing this goal.

Our work demonstrates a promising foundation on which future work may improve. In particular, although our Percent Recognition rate ended lower than we would have liked, we're confident that such results could be greatly improved simply by expanding the amount of ground truth data for training. Additionally, our results with Deep DJ demonstrate a vast improvement over the compared WaveNet results, both visually in the spectrograms and audibly in the quality of the synthesized clips, suggesting that our model may be a better fit for synthesizing music from existing samples.

One possible opportunity to further explore domain-specific music generation could involve exploring a multi-label setting. Our model currently relies on a single unified emotion label for each item in both domains. As mentioned before, both images and audio could yield multiple emotion annotations, and it would be interesting to investigate generating an embedding from melded emotions.

Team Contributions

Kaylee Burns: 60%

Cleaned data. Implemented baseline, sentiment aligned embeddings, and neural style audio transfer with SoundNet embeddings. Experimented with pipeline for end-to-end audio prediction using YouTube 8M dataset. Wrote final report, created poster, worked on slides.

Vinitra Swamy: 30%

Researched and identified datasets. Investigated music generation metrics. Created music analysis plots. Experimented with modifications to neural style transfer. Worked on slides, edited final report.

Patrick Yang: 10%

Assisted with poster creation, edited final report.

Please visit our github page: https://github.com/vinitra/music-score-gen.
