MusicVAE — Understanding Google's Work on Interpolating Long Music Sequences


Motivation behind Music VAE:

When a painter creates a work of art, she first blends and explores color options on an artist’s palette before applying them to the canvas. This process is a creative act in its own right and has a profound effect on the final work.

Musicians and composers have mostly lacked a similar device for exploring and mixing musical ideas, but we are hoping to change that. Below we introduce MusicVAE, a machine learning model that lets us create palettes for blending and exploring musical scores.

The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations of data. However, it has thus far seen limited application to sequential data, and existing recurrent VAE models have difficulty modeling sequences with long-term structure.

1. Traditional LSTM decoders are unable to model long sequences, such as music, due to the posterior collapse problem.

2. Posterior collapse: the vanishing influence of the latent state as the output sequence is generated.

Solution :

Hierarchical Decoders:

  • Sampled latent vectors are passed through multiple levels of decoder rather than a single flat decoder.
  • The scope of the core bottom-level decoder is reduced by propagating state only within each subsequence/bar, as shown in the diagram below:

Dataset and Data Preprocessing:

  • The MusicVAE model takes in MIDI files, a widely used format for music.
  • Each musical sample is quantized to 16 notes per bar (sixteenth notes). To give some background:
  • In music, one whole note spans 4 beats in time, which equals one bar, i.e. 1 whole note = 4 beats = 1 bar.
  • An eighth note is played for one eighth of the duration of a whole note, i.e. 1 eighth note = 1/8 of 4 beats = 1/2 beat.
  • Similarly, a sixteenth note is played for half the duration of an eighth note, i.e. 1 sixteenth note = 1/16 of 4 beats = 1/4 beat.
  • Since one bar has 4 beats, it takes 16 sixteenth notes to fill one bar of music (see the short sketch after this list).
  • For a 16-bar melody, MusicVAE uses a bidirectional LSTM encoder and hierarchical unidirectional LSTM decoders.
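To make the quantization arithmetic above concrete, here is a minimal sketch in plain Python (the constant names are illustrative, not taken from the Magenta code base):

```python
# Minimal sketch of the quantization arithmetic described above.
BEATS_PER_BAR = 4          # 1 whole note = 4 beats = 1 bar
STEPS_PER_BEAT = 4         # sixteenth-note quantization: 4 steps per beat

steps_per_bar = BEATS_PER_BAR * STEPS_PER_BEAT           # 16 sixteenth notes fill one bar
num_bars = 16                                            # length of one melody sample
max_sequence_length = num_bars * steps_per_bar           # 16 * 16 = 256 time steps

sixteenth_note_in_beats = BEATS_PER_BAR / steps_per_bar  # 4 / 16 = 0.25 beat

print(steps_per_bar, max_sequence_length, sixteenth_note_in_beats)  # 16 256 0.25
```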

Encoder:

Input to Encoder:

The 16-bar music samples fed to the encoder can be represented as a 3-dimensional tensor:

[ batch_size, max_sequence_length, input_depth ]

Here,

batch_size: the number of samples per training batch, which is 512.

max_sequence_length: the maximum possible sequence length, which is 16 X 16 = 256.

input_depth: the dimension of each note event. For example, in a monophonic piano sequence there can be 90 types of events at each time step (88 key presses, 1 note release, 1 rest).

So from a 16 bar melody we can generate 90²⁵⁶ possible sequences.

In the case of MIDI files, the input depth per note is 4. For example, each note consists of the following parameters:

notes {pitch: 69, velocity: 80, start_time: 1.25, end_time: 1.5}

notes {pitch: 66, velocity: 80, start_time: 1.5, end_time: 1.75}

Here, the pitch and velocity encode the note played on an instrument, say a piano.

This [512 X 256 X 4] input is fed into a Bidirectional LSTM encoder.
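As a rough illustration of that shape (not the actual Magenta preprocessing, which works on NoteSequence protos), the per-note parameters can be packed into a batch tensor like this:

```python
import numpy as np

batch_size, max_sequence_length, input_depth = 512, 256, 4

# One toy note per time step: (pitch, velocity, start_time, end_time).
example_note = (69, 80, 1.25, 1.5)

# Dummy batch in which every time step holds the same toy note.
inputs = np.tile(np.array(example_note, dtype=np.float32),
                 (batch_size, max_sequence_length, 1))

print(inputs.shape)  # (512, 256, 4) -> [batch_size, max_sequence_length, input_depth]
```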

Encoder Configurations and Details:

The encoder RNN is a bidirectional LSTM with 2 layers. Each layer has a state size of 2048, which means there are 2048 hidden units in the fully connected layers that form each LSTM cell.

The input tensor of dimensions [512 X 256 X 4] is first converted to time-major format (important before feeding it to the LSTM), giving dimensions [256 X 512 X 4], and is then fed to the first layer of the encoder RNN. The output from the first layer is passed to the second layer, which gives us the hidden-state outputs in both directions, (Ht-fwd) and (Ht-bkwd).

Each time step of the encoder receives one note from the sequence, so the LSTM is unrolled over 256 time steps for a 16-bar melody input. The input at the first time step therefore has dimensions [1 X 512 X 4], and so on.

Below is the code snippet:

Model Configuration for 16 bar melody
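The embedded snippet is not reproduced here; as a hedged sketch, the hyperparameters it specifies for the 16-bar melody model (values taken from this post, stored in a plain dict rather than the actual Magenta Config object from configs.py) look roughly like:

```python
# Hedged sketch of the 16-bar melody hyperparameters quoted in this post.
hierdec_mel_16bar_hparams = {
    "batch_size": 512,             # samples per training batch
    "max_seq_len": 256,            # 16 bars * 16 sixteenth-note steps
    "z_size": 512,                 # dimensionality of the latent vector
    "enc_rnn_size": [2048, 2048],  # 2 bidirectional encoder layers, 2048 units each
    "dec_rnn_size": [1024, 1024],  # 2 stacked LSTMs per decoder block, 1024 units each
    "level_lengths": [16, 16],     # hierarchical decoder: 16 conductor steps x 16 sub-steps
}
```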

Now we take only the final state vectors (HT-fwd) and (HT-bkwd) from the forward and backward direction cells and concatenate them. The reason for concatenating the two hidden states is that the latent feature should describe the sequence in both directions; if only one of them were used, information from only one side would be captured, which is not what we want for music interpolation.

Each of (HT-fwd) and (HT-bkwd) has dimensions [512 X 2048] (batch size X state size), so the concatenated encoder output has dimensions [512 X 4096]. These concatenated states are then passed through two different fully connected layers: a linear layer that gives mu and a layer with a softplus activation that gives sigma. These are the parameters of the multivariate Gaussian distribution that represents the posterior distribution of each sequence. The mu and sigma can be written as:

mu = W_mu · h_T + b_mu

sigma = softplus(W_sigma · h_T + b_sigma) = log(exp(W_sigma · h_T + b_sigma) + 1)

where the W's are weight matrices, the b's are bias vectors, and h_T is the concatenated final hidden state. The resulting distribution is a 512-dimensional multivariate Gaussian.
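A minimal NumPy sketch of these two projections and of sampling z with the reparameterization trick follows; the dimensions match the text, but the weights are random placeholders rather than learned layers:

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, enc_state_size, z_size = 512, 4096, 512   # 4096 = 2048 fwd + 2048 bwd

h_T = rng.standard_normal((batch_size, enc_state_size))   # concatenated final states

# Placeholder weights; in the real model these are learned dense layers.
W_mu, b_mu = 0.01 * rng.standard_normal((enc_state_size, z_size)), np.zeros(z_size)
W_sigma, b_sigma = 0.01 * rng.standard_normal((enc_state_size, z_size)), np.zeros(z_size)

mu = h_T @ W_mu + b_mu                               # linear layer for the mean
sigma = np.log1p(np.exp(h_T @ W_sigma + b_sigma))    # softplus keeps sigma positive

# Reparameterization trick: z = mu + sigma * epsilon, with epsilon ~ N(0, I).
epsilon = rng.standard_normal((batch_size, z_size))
z = mu + sigma * epsilon

print(mu.shape, sigma.shape, z.shape)   # (512, 512) for all three
```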

Unidirectional and Bidirectional LSTM Explained :

Unidirectional LSTM cell.

First we understand how an LSTM cell looks in a unidirectional LSTM network. Then we can expand to a Bidirectional LSTM network.

Generally an LSTM cell has the following gates:

  1. Forget gate
  2. Input gate
  3. Selection (output) gate
  4. Attention gate (optional; not part of the standard LSTM cell)

Forget Gate: The very first gate in an LSTM cell; it decides which part of the accumulated previous information held in the cell state to forget. Its equation is:

f_t = sigmoid(W_f · [h_(t-1), x_t] + b_f)

Input Gate: This gate decides how much of the current input is added to update the cell state:

i_t = sigmoid(W_i · [h_(t-1), x_t] + b_i),  C~_t = tanh(W_C · [h_(t-1), x_t] + b_C)

The cell state after the forget and input gates is given as:

C_t = f_t * C_(t-1) + i_t * C~_t

Selection Gate: This gate selects which part of the current cell state to output:

o_t = sigmoid(W_o · [h_(t-1), x_t] + b_o),  h_t = o_t * tanh(C_t)
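A toy NumPy sketch of one LSTM cell step implementing the gate equations above (tiny sizes and random placeholder weights, not the model's 2048-unit cells):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: W maps [h_prev, x_t] to the four gate pre-activations."""
    concat = np.concatenate([h_prev, x_t], axis=-1)
    f, i, o, g = np.split(concat @ W + b, 4, axis=-1)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, selection (output) gates
    g = np.tanh(g)                                 # candidate cell state
    c_t = f * c_prev + i * g                       # updated cell state
    h_t = o * np.tanh(c_t)                         # new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_depth, hidden = 4, 8                         # toy sizes for illustration
W = 0.1 * rng.standard_normal((hidden + input_depth, 4 * hidden))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)

for t in range(3):                                 # unroll a few time steps
    h, c = lstm_step(rng.standard_normal(input_depth), h, c, W, b)
print(h.shape, c.shape)                            # (8,) (8,)
```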

Bidirectional LSTM cell:

Bidirectional LSTMs are just two unidirectional LSTMs that process the input sequence in opposite directions.

Bidirectional LSTM

Encoder Flow Explained:

Encoder Flow Diagram

Encoder Code Explanation:

The user can run the command ‘music_vae_generate’ as described on the Magenta MusicVAE page (https://github.com/tensorflow/magenta/tree/master/magenta/models/music_vae):

music_vae_generate.py is a Python script that defines various parameters such as:

  1. mode: sample or interpolate.
  2. num_outputs: the number of samples, or the number of interpolation steps (including the endpoints).
  3. checkpoint_file: the location of the pre-trained model.
  4. output_dir: the location to save the generated output.
  5. input_midi_1, input_midi_2: the input files to interpolate between (only for mode=interpolate).

The Magenta team has built a ‘music’ library that converts the input MIDI files supplied by the user into NoteSequences, as in the code snippet below, which can be found in the run method of configs.py:

Convert MIDI file to Note Sequence

lstm_utils.py:

This script has a method, cudnn_lstm_layer, which builds an LSTM layer according to the parameters passed in.

Build LSTM Network for Encoder.

layer_sizes: a list of the number of units in each LSTM layer. In our case there are 2 LSTM layers, each with 2048 units. However, if you look at the BidirectionalLstmEncoder class in lstm_models.py (explained in the next section), cudnn_lstm_layer is called once per layer.

Another important method worth mentioning is get_final:

Gets the Final Hidden States to be concatenated

This method returns the hidden state at the final valid index of each sequence. The final forward state and the final backward state (which, because the backward pass consumes the reversed input, corresponds to the first note of the original sequence) are then concatenated to form the encoder output.
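A hedged NumPy sketch of what a get_final-style gather does, i.e. picking the state at the last valid time step of each (possibly padded) sequence; the real helper in lstm_utils.py operates on TensorFlow tensors:

```python
import numpy as np

rng = np.random.default_rng(0)

max_time, batch_size, state_size = 256, 3, 2048             # tiny batch for illustration
outputs = rng.standard_normal((max_time, batch_size, state_size))  # time-major states
sequence_lengths = np.array([256, 100, 180])                # valid length of each sequence

# For each batch item, gather the state at index (length - 1).
final_states = outputs[sequence_lengths - 1, np.arange(batch_size)]

print(final_states.shape)   # (3, 2048): one final hidden state per sequence
```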

lstm_models.py:

The main class of interest is the BidirectionalLstmEncoder.

The first method in this class is build which creates the bidirectional LSTM network:

cells_fw: a list containing the 2 unidirectional LSTM layers of the forward network.

cells_bw: a list containing the 2 unidirectional LSTM layers of the backward network.

Note that we call cudnn_lstm_layer from lstm_utils.py here, as described above.

Next we discuss the encode method, which does the actual encoding:

Encode:

Encode method
  1. As per the code, the input tensor is first transposed to time-major format, giving dimensions [256 X 512 X 4], and is fed to the forward LSTM cells. For the backward cells, the sequence is first reversed and then fed in.
  2. We then take the final hidden state from the forward LSTM cells and the final hidden state from the backward LSTM cells (which, because the input was reversed, corresponds to the first time step of the original sequence).
  3. Finally, both hidden states, each of dimensions [512 X 2048], are concatenated to form the encoder output of dimensions [512 X 4096], which is passed to two different fully connected layers. This gives us the parameters of the normal distribution, as shown in the code below.
Generating mu and sigma for the Normal Distribution

  4. Softplus is used as the activation function for the dense layer that gives us the sigma values of the distribution.
  5. This gives us the latent distribution parameters mu and sigma of the multivariate normal distribution, which represents the posterior distribution of each sequence.

Hierarchical Decoder:

Overview :

The paper specifies the use of an additional layer, called the conductor, to learn longer sequences from the latent space. This conductor is simply an LSTM layer. The number of LSTM blocks in the layer is specified by a hyperparameter and depends on the type of music used. The latent vector obtained from the encoder is passed through a fully connected dense layer with a tanh activation. The result is then packed into the states of two LSTM block cells, which are used to initialize the first level of decoder LSTMs. Depth-wise decoding is then done to get the final outputs.

Architecture

LSTM Layer :

The length of an LSTM level specifies the number of LSTM blocks present in that level. This is set as a hyperparameter and differs from model to model. A single LSTM block in one level consists of stacked LSTM cells, each with 1024 units. Consider the given figure, with level_lengths = [16, 16]: there are two levels. In the first level there are 16 LSTM blocks, and in the second level there are 16 LSTM blocks for each block of level 1. Total blocks in the second level = number of blocks in level 1 * number of blocks in level 2 per block of level 1 = 16 * 16 = 256. The number of cells inside one LSTM block depends on the hyperparameter dec_rnn_size. In the figure shown below there are two stacked 1024-unit cells inside one block; in other words, a MultiRNNCell composed sequentially of 2 LSTM block cells of size 1024 each. The second level is then connected to the output LSTM layer.
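A small Python sketch of the block counting described above (the names are illustrative):

```python
# Sketch of the hierarchy sizes described above.
level_lengths = [16, 16]      # 16 conductor blocks, 16 decoder blocks per conductor block
dec_rnn_size = [1024, 1024]   # each block is a stack of two 1024-unit LSTM cells

blocks_level_1 = level_lengths[0]                       # 16 conductor blocks
blocks_level_2 = level_lengths[0] * level_lengths[1]    # 16 * 16 = 256 bottom-level blocks
cells_per_block = len(dec_rnn_size)                     # 2 stacked cells per block

print(blocks_level_1, blocks_level_2, cells_per_block)  # 16 256 2
```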

Code Snippet :

Hierarchical LSTM creation in model build (lstm_models.py )
LSTM block cell creation (lstm_utils.py)

Conductor :

The first level of unidirectional LSTMs is what is called the conductor. Only the first block cell in the conductor is initialized with the embedding from the latent space; this sets its hidden and cell states. The input is initialized to zeros based on the batch size, which is set as a hyperparameter. When the batch size is 512, a tensor of shape 512 * 1 filled with zeros is given as input to the first block cell. Given the input, cell state, and hidden state, the cell produces the next cell state, hidden state, and output. The state values are then passed on to the next LSTM block.

Code Snippet:

Initial state of the conductor initialized from z (lstm_models.py)
Getting the initial state from z (lstm_utils.py)

Decoder:

Each of the conductor's outputs becomes the input to the next-level decoder, which again consists of unidirectional LSTM blocks as discussed above. Based on the level length specified (as a hyperparameter), that many blocks are created for each conductor output. The states of the first block cell become the initial states of the next block, but only within the same conductor segment; there is no communication between the LSTM blocks of different conductor segments.

Code snippet :

Recursive decoding is done depth first. Inside the recursive decode function, num_steps = level_lengths[level], e.g. num_steps = 16 for level 1. The function recursively calls the lower levels; once it returns from them, the loop repeats for num_steps iterations (lstm_models.py).
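A toy Python sketch of this depth-first recursion; it only tracks how the work fans out, whereas the real recursion in lstm_models.py carries LSTM states and TensorFlow tensors:

```python
level_lengths = [16, 16]

def recursive_decode(embedding, level=0):
    """Decode one embedding at `level`, recursing depth first to the bottom level."""
    if level == len(level_lengths):
        # Bottom of the hierarchy: hand the embedding to the core decoder.
        return [f"core_decode({embedding})"]

    outputs = []
    num_steps = level_lengths[level]      # e.g. 16 steps at level 0 (the conductor)
    for step in range(num_steps):
        # In the real model an LSTM at this level emits one embedding per step;
        # here we simply tag the parent embedding with the step index.
        outputs.extend(recursive_decode(f"{embedding}.{step}", level + 1))
    return outputs

decoded = recursive_decode("z")
print(len(decoded))   # 256 core-decoder calls = 16 conductor steps * 16 sub-steps each
```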

Core decoder :

The outputs from the last level are given as input to the core decoder. The number of core decoder units depends on the number of units in the level above, since each LSTM block in the previous level is connected to one core decoder. The type of core decoder varies from model to model. The decoded outputs are then merged together to give the final output of interpolated music.

In the last level, the core decoder is called with the output from the previous level as its initial input (lstm_models.py)

Working :

This is based on the model “hierarchical trio”

  1. z is passed to a dense layer with tanh as the activation function. The output size is the flattened state size of the first LSTM block in the first level (the first conductor block). If the block has 1024 + 1024 LSTM units as mentioned previously, the state size is c = 1024, h = 1024 for each of the two stacked LSTMs, so the flattened state size is [1024, 1024, 1024, 1024].
  2. The output size of the dense layer is the sum of the flattened state sizes (4096), so the dense layer output has shape 512 * 4096.
  3. This is then split into a sequence of four parts (len(flattened state size)), each of shape 512 * 1024.
  4. The four splits are packed into two LSTM state tuples, each with c = 512 * 1024 and h = 512 * 1024, where c is the cell state and h is the hidden state of the LSTM.
  5. This becomes the initial state of the first conductor block (see the sketch after the figure below).
  6. Based on the initial state, depth-wise decoding is done, passing outputs down to the layers below and finally to the core decoder.
  7. The output from conductor block 1 becomes the input to the next-level decoder (the core decoder).
  8. The decoder is again a unidirectional LSTM, like the conductor.
  9. The states from decoder 1 are passed as initial states to decoder 2, which produces its outputs and states, which are passed on to the next decoder, and so on.
  10. Once conductor block 1 has completed its decoding, the same procedure starts for conductor blocks 2, 3 and so on, up to the level length (16 in this case). Thus each of the conductor blocks decodes recursively, going depth first, and the conductor is independent of the core decoder outputs, forcing the model to utilize the latent vector and thereby helping it learn longer sequences.
  11. The outputs from all the decoders are then merged to give the final output. The figure below shows the decoding process.
  12. The reconstruction loss is calculated by comparing the final output with the actual output; cross-entropy is used.
  13. The Adam optimizer is used for training.
Conductor and Decoder Architecture
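A NumPy sketch of steps 1 to 5 above, with random placeholder weights; the real code uses a learned dense layer and TensorFlow LSTM state tuples:

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, z_size = 512, 512
state_sizes = [1024, 1024, 1024, 1024]   # flattened (c, h) sizes of two stacked 1024-unit LSTMs
total_size = sum(state_sizes)            # 4096

z = rng.standard_normal((batch_size, z_size))

# Steps 1-2: dense layer with tanh whose output width is the flattened state size.
W, b = 0.01 * rng.standard_normal((z_size, total_size)), np.zeros(total_size)
initial_flat = np.tanh(z @ W + b)        # shape (512, 4096)

# Step 3: split into len(state_sizes) = 4 pieces of shape (512, 1024) each.
splits = np.split(initial_flat, np.cumsum(state_sizes)[:-1], axis=1)

# Steps 4-5: pack the splits into two (c, h) state tuples for the first conductor block.
conductor_initial_state = [(splits[0], splits[1]), (splits[2], splits[3])]

for c, h in conductor_initial_state:
    print(c.shape, h.shape)              # (512, 1024) (512, 1024), twice
```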

HVAE for Text generation:

Given the same hierarchical VAE model, we were interested in how it would perform for text generation. We create an architecture similar to the one above, where the inputs are text sequences instead of musical ones.

References:

Roberts, A., Engel, J., Raffel, C., Hawthorne, C., & Eck, D. (2018). A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music. https://arxiv.org/pdf/1803.05428.pdf
