Artificial Intelligence Music Generation: Melody Harmonization

Part 2: Feedforward neural network architecture and processing

Kevin Summerian
7 min read · Sep 15, 2022

This is part 2 of a series on how to generate music with neural networks using Bach chorale melody harmonization. If you haven’t read part 1 of this series, I highly recommend reading it first.

In this part of the series, we will look in more detail at how the neural network is structured and at its inner workings.

Neural network training objective

The objective of this network is to take a soprano melody and generate the corresponding harmonization for the alto, tenor, and bass (ATB) voices. To accomplish this task, we will use neural network classification, or more specifically a variant of it that I like to call time-dependent multiple-output classification.

You can think of classification as follows: given some input data, the neural network outputs a predicted class. More specifically, it gives a score for each possible class, and the class with the highest score is the predicted class for that input. For example, with the MNIST dataset, when given an image of a 4, the network outputs a score for each of the classes (0 to 9), and hopefully the score attributed to the 4 class is the highest.
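To make this concrete, here is a tiny sketch of the "pick the class with the highest score" step; the score values are made up for illustration.

```python
import torch

# Hypothetical class scores for one MNIST-style input (classes 0 to 9).
scores = torch.tensor([0.1, 0.0, 0.2, 0.1, 3.5, 0.3, 0.2, 0.1, 0.0, 0.4])

# The predicted class is simply the index of the highest score.
predicted = scores.argmax().item()
print(predicted)  # 4
```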

In the case of our harmonization neural network, we will be taking the following input:

Matrix one hot encoding of jsb choral music sequence

Where each row corresponds to the note played at a given time step. For each of these time steps, the goal is to output scores for the possible notes of the ATB voices. These multiple outputs are calculated at the same time and form the following matrix for each of the ATB voices:

Neural network output for a single ATB voice

Where the rows and columns have been transposed relative to the input. So now each column contains a score for each of the possible notes of a specific voice, and each row corresponds to one possible note across the time steps.
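As a concrete illustration of the input side, a time-distributed one-hot matrix like the one pictured can be built as follows. The pitch-index encoding and the range of 128 MIDI pitches are assumptions for this sketch, not necessarily the exact encoding used in the series.

```python
import torch

def one_hot_sequence(notes, num_pitches):
    """Build a time-distributed one-hot matrix:
    one row per time step, one column per possible pitch."""
    matrix = torch.zeros(len(notes), num_pitches)
    for t, note in enumerate(notes):
        matrix[t, note] = 1.0
    return matrix

# Hypothetical soprano fragment encoded as MIDI pitch indices.
soprano = [60, 62, 64, 62]
x = one_hot_sequence(soprano, num_pitches=128)
print(x.shape)  # torch.Size([4, 128])
```

Each row has exactly one 1.0, marking the note sounding at that time step.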

Feedforward neural network

Feedforward network

Before we get into the details of the feedforward neural network, if you are new to machine learning and AI, I highly suggest reading this article first to understand the core concepts and terms used for neural networks.

The feedforward network presented in the previous diagram has the following characteristics:

  • The data flows in the network from left to right.
  • The network has multiple layers each grouped by a different color.
  • All neurons contained in a certain layer are fully connected to the neurons of the following layer.
  • The data provided in the preprocessing phase can be of multiple forms but has to end up as a vector with as many elements as there are input nodes in the network.
  • Multiple hidden layers are employed and those layers do not have to contain exactly the same number of neurons.
  • The results of this particular feedforward network are output to multiple output layers, as is the case for the model used for harmony generation.

Current model network architecture

Pytorch code for the neural network

Now that we have explored the high-level view of a feedforward network, the previous code is our implementation of that network using PyTorch. Here is how the different elements fit into the feedforward neural network architecture.

For the preprocessing phase, the time-distributed one-hot encoding matrix containing the notes of the soprano voice is passed as the x variable of this model. Since it is in matrix form, the torch.flatten method transforms the matrix into a vector. More on that process in the following section. This vector is then passed to the input layer.

The hidden layers can be divided into two phases. In the first phase, the data of the soprano voice is passed through two consecutive hidden layers. Each of these consists of a fully connected layer, a ReLU activation function, and a dropout function that randomly disables certain connections during training.

The second phase is the pre-output section for the different ATB voices. Each voice has its own connection to the previous hidden layers, with a separate fully connected layer followed by a reshaping operation and a log softmax activation. The results of those final activations are the model outputs.
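Since the original PyTorch code appears above only as an image, here is a minimal sketch of what a model with this structure could look like. The class name, layer sizes, dropout rate, and sequence length are my own assumptions for illustration, not the author's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HarmonizationModel(nn.Module):
    """Feedforward net: flattened soprano one-hot input, two shared
    hidden layers, and a separate output head per ATB voice."""

    def __init__(self, time_steps=64, num_pitches=128, hidden=512, dropout=0.3):
        super().__init__()
        self.time_steps = time_steps
        self.num_pitches = num_pitches
        in_features = time_steps * num_pitches
        self.hidden1 = nn.Linear(in_features, hidden)
        self.hidden2 = nn.Linear(hidden, hidden)
        self.dropout = nn.Dropout(dropout)
        # One pre-output layer per voice (alto, tenor, bass).
        self.alto = nn.Linear(hidden, in_features)
        self.tenor = nn.Linear(hidden, in_features)
        self.bass = nn.Linear(hidden, in_features)

    def forward(self, x):
        # x: (batch, time_steps, num_pitches) one-hot matrix.
        x = torch.flatten(x, start_dim=1)
        x = self.dropout(F.relu(self.hidden1(x)))
        x = self.dropout(F.relu(self.hidden2(x)))
        outputs = []
        for head in (self.alto, self.tenor, self.bass):
            out = head(x)
            # Reshape to (batch, num_pitches, time_steps), the
            # transposed form NLLLoss expects, then score the notes.
            out = out.view(-1, self.time_steps, self.num_pitches).transpose(1, 2)
            outputs.append(F.log_softmax(out, dim=1))
        return outputs
```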

Data processing and tensor shapes

In order to better understand how the model works, I will now show how the data moves through the model. Most of the time, people use mathematical notation to illustrate this, but I’m not good at writing mathematical notation, so I will use diagrams instead. Please note that I’m purposely excluding the batch dimension in my examples for simplicity.

Given the initial time distributed one hot encoding matrix of the soprano voice below:

Matrix one hot encoding of jsb choral music sequence

This matrix is transformed into the following vector form by the torch.flatten function:

torch.flatten vector form

This is then the input vector of the nn.Linear input layer.
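A small sketch of that flattening step, using a toy matrix of 3 time steps and 5 possible pitches (the sizes are arbitrary for illustration):

```python
import torch

# Toy one-hot matrix: 3 time steps x 5 possible pitches.
x = torch.tensor([[0., 1., 0., 0., 0.],
                  [0., 0., 0., 1., 0.],
                  [1., 0., 0., 0., 0.]])

# torch.flatten concatenates the rows end to end into one vector,
# which becomes the input vector of the nn.Linear input layer.
flat = torch.flatten(x)
print(flat.shape)  # torch.Size([15])
```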

The hidden layer, ReLU, and dropout functions can be represented by the following diagram:

Hidden layers

Here the output of the previous nn.Linear input layer is passed to another nn.Linear hidden layer, and the weighted sum of each neuron’s inputs passes through the F.relu activation function. Each neuron’s activation is then passed through the nn.Dropout function, which determines whether that neuron is temporarily disconnected from the following layer during training. This process is repeated twice.

For the ATB voice pre-output layers, the vectors from the final dropout function are then reshaped as matrices to take the following form:

ATB matrix reshape form

Where the P and N dimensions are inverted to satisfy the shape requirements of the nn.NLLLoss loss function discussed in the next section.

Finally, the F.log_softmax function is applied to the previous matrix so that each column holds the log-probability of each possible note being the one to play.
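A quick sketch of that scoring step, with a made-up pre-output matrix of 4 possible notes over 3 time steps:

```python
import torch
import torch.nn.functional as F

# Hypothetical pre-output scores: 4 possible notes x 3 time steps.
scores = torch.tensor([[2.0, 0.1, 0.3],
                       [0.5, 3.0, 0.2],
                       [0.1, 0.2, 0.1],
                       [0.3, 0.1, 4.0]])

# log_softmax over dim=0 normalizes each column (each time step),
# turning the raw scores into log-probabilities over the notes.
log_probs = F.log_softmax(scores, dim=0)
print(log_probs.exp().sum(dim=0))  # each column sums to ~1
```

The highest-scoring row in each column is the note the model considers most likely at that time step.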

Loss function and network learning

In order to properly understand how the network calculates the loss used to adjust the network weights, I find it useful to look at the shape documentation provided by the PyTorch NLLLoss function:

PyTorch NLLLoss function shape documentation

At first, this looks quite intimidating, but with examples, it’s actually not that complicated. Simply stated, this says that for an input of the following form:

Neural network output for a single ATB voice

And a training dataset target of the following form:

Target training data

The NLLLoss function seeks to maximize the score of the element in the row specified by the class index of the target training data.

So in this specific example, for the first value of the training data, which is 9, the goal is for the value in the first column at row index 9 of the model output matrix to be the highest in that entire column. The other values in that column should ideally be very low. The same process is then repeated for each item in the training data and each column of the model output.
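Here is a minimal sketch of computing this loss for one voice. The class count, sequence length, and target values are invented for the example; the key point is the (batch, classes, time steps) input shape against a (batch, time steps) target.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Model output for one voice: (batch, num_classes, time_steps)
# of log-probabilities, as produced by log_softmax over dim=1.
num_classes, time_steps = 12, 5
log_probs = F.log_softmax(torch.randn(1, num_classes, time_steps), dim=1)

# Target: the correct note index at each time step.
target = torch.tensor([[9, 4, 4, 2, 7]])

# NLLLoss picks out the log-probability of the target class in each
# column and averages the negated values; lower is better.
loss = nn.NLLLoss()(log_probs, target)
print(loss.item())
```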

As for how this loss is calculated, I refer you to the following article which does a great job explaining the negative log-likelihood function and also covers the softmax function.

Conclusion

In this article, we have seen in greater detail how neural networks can be used to generate harmony for a melody, following the Bach chorale harmonization technique. I hope that after reading this article you have a better understanding of neural networks and their possible uses for music generation. If you have other questions or related subjects that you would like me to write about, don’t hesitate to post in the comment section!

About Me

I am the owner and founder of Kafka Studios where we build music plugins and tools powered by Artificial Intelligence.

I also compose video game-styled music in my free time.


Kevin Summerian

I’m a musician and an A.I. enthusiast. At my day job, I work as a software engineer for our high-performance computing platform at our local university.