Artificial Intelligence Music Generation: Melody Harmonization

Part 1: Data processing and overall architecture

Kevin Summerian
6 min read · Sep 15, 2022

Are you interested in neural networks and music generation but don’t know where to start? In this article, we will explore the “hello world” of neural network melody harmonization, where you will take the following melody:

And use a neural network to harmonize it and obtain the following result:

This article explains the mental models as well as the data processing used in the following GitHub repository.

Please note that the code used for this article is in the release/1.0.0-medium-article-pt-1-2 branch and not in the main branch.

What is Bach chorale music harmonization?

Harmonization primarily means adding notes that sound at the same time as the melody. These extra notes usually give the melody more character and can greatly affect its mood.

What is specific about Bach chorale harmonization is that it refers to a set of compositions by Johann Sebastian Bach that follow these principles:

  1. Each song consists of four voices (four simultaneous parts, each sung or played by a different performer).
  2. These voices follow the SATB voicing (Soprano, Alto, Tenor, and Bass).
  3. The soprano voice’s part is composed first and has the highest-pitched notes.
  4. The alto, tenor, and bass parts complement the soprano voice by harmonizing it.
  5. The voices are ordered from highest to lowest: reading SATB from left to right, each voice contains higher notes than the voices to its right.
Example of a Bach chorale with SATB voices

There are many more rules and much more music theory used for composing Bach chorales; if you’d like to know more, you can consult the Rules of Chorale Harmony.

Dataset

For this machine learning model, I’m using the JSB Chorales dataset, which has already been preprocessed and stored in a Python pickle file, available from the following GitHub repository. I have been using the “jsb-chorales-16th.pkl” file as my training dataset.

The pickle file contains three lists of tuples named train, test, and valid, which contain 229, 77, and 76 Bach chorales respectively. For simplicity, the contents of all three lists have been merged together as the training data. The content of one Bach chorale looks like this:

[(74, 70, 65, 58), (74, 70, 65, 58), (74, 70, 65, 58), …, (77, 70, 62, 55)]

Each entry in a tuple corresponds to one voice’s note in SATB order, and each tuple represents the notes that sound during one sixteenth-note time step.
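
To make this concrete, here is a minimal loading sketch. It assumes the pickle stores the three splits in a dict keyed by “train”, “test”, and “valid”, as described above; the repository’s own loading code may differ:

    import pickle

    # Load the preprocessed JSB Chorales dataset (assumed to be a dict of splits).
    with open("jsb-chorales-16th.pkl", "rb") as f:
        dataset = pickle.load(f)

    # Merge all three splits into a single training set, as described above.
    chorales = dataset["train"] + dataset["test"] + dataset["valid"]

    print(len(chorales))    # 382 chorales (229 + 77 + 76)
    print(chorales[0][:3])  # e.g. [(74, 70, 65, 58), (74, 70, 65, 58), ...]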

Music Encoding Strategy

Matrix one-hot encoding of a JSB chorale music sequence

After loading the data from the pickle dataset, it is further processed into matrix form. Each SATB voice is placed into its own matrix using time-distributed one-hot encoding. Each matrix has P columns, one per possible value (silence or a note), and N rows, one per sixteenth-note time step in the musical piece. The value of P depends on each voice’s note range (22, 22, 22, and 28 for S, A, T, and B), and the value of N is 64, representing 4 measures divided into sixteenth-note segments.

The content of one row follows these rules (a small encoding sketch follows the list):

  • If the first value of a row is equal to 1, then no note is played by that voice at this position in the song.
  • Any other position in the row where a 1 appears corresponds to the note that will be played, going from the lowest possible note of the voice’s range to the highest.
  • Notes are mapped to indexes via a range class: for example, soprano notes 61 to 81 correspond to column indexes 1 to 21.
  • Exactly one position in each row is a 1; all others are 0.
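
As an illustration, here is a minimal sketch of this encoding for one voice. The helper name encode_voice and the use of None for silence are assumptions made for the example; the repository’s actual implementation may differ:

    import numpy as np

    N_STEPS = 64  # 4 measures of 16th-note time steps

    def encode_voice(notes, low, high):
        """One-hot encode one voice: rows are 16th-note time steps, column 0
        is silence, and columns 1..P-1 map to MIDI notes low..high."""
        p = (high - low + 1) + 1  # note range plus the silence column
        matrix = np.zeros((N_STEPS, p), dtype=np.float32)
        for step in range(N_STEPS):
            note = notes[step] if step < len(notes) else None  # pad with silence
            matrix[step, 0 if note is None else note - low + 1] = 1.0
        return matrix

    # Soprano range 61-81 gives P = 22 columns (index 0 = silence, 1-21 = notes).
    soprano = [74, 74, 74, None]  # a hypothetical 4-step excerpt
    m = encode_voice(soprano, low=61, high=81)  # shape (64, 22)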

It’s worth noting that only the model input, the S voice, is represented in this matrix form. For the training targets, the ATB voices use the following vector representation:

ATB vector form

Here, each number is an index identifying either silence or a note, instead of having a 1 at the corresponding row position.

This form is used for the training targets and is not to be confused with the model’s output form, which will be discussed further in part 2 of this article.
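
Converting a one-hot matrix into this index-vector form is a one-liner; a small sketch, reusing the hypothetical encode_voice output from above:

    import numpy as np

    def to_index_vector(one_hot_matrix):
        """Collapse an (N, P) one-hot matrix into N indices: 0 means silence,
        1..P-1 identify the note within the voice's range."""
        return np.argmax(one_hot_matrix, axis=1)

    # With the soprano matrix m from the previous sketch:
    # to_index_vector(m)[:4] -> array([14, 14, 14, 0])  (74 maps to 74-61+1 = 14)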

Machine Learning Model

Feedforward neural network input and outputs

The neural network model used to generate the harmony is a simple feedforward neural network made up of multiple fully connected layers. The input of this network is the soprano voice, and the alto, tenor, and bass voices are generated from this input. For a deeper explanation of the inner workings of the feedforward neural network, please consult the second part of this tutorial.
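
For illustration, here is a minimal sketch of this kind of model written with Keras. The framework, layer sizes, and output heads are my assumptions, not the repository’s exact architecture (see part 2 for that); note how a sparse categorical loss matches the ATB index-vector targets described above:

    import tensorflow as tf

    N_STEPS = 64
    P_S, P_A, P_T, P_B = 22, 22, 22, 28  # columns per voice (incl. silence)

    def voice_head(x, p, name):
        # One probability distribution over p note indices per 16th-note step.
        logits = tf.keras.layers.Dense(N_STEPS * p)(x)
        logits = tf.keras.layers.Reshape((N_STEPS, p))(logits)
        return tf.keras.layers.Softmax(name=name)(logits)

    # Input: the flattened one-hot soprano matrix.
    inputs = tf.keras.Input(shape=(N_STEPS * P_S,))
    x = tf.keras.layers.Dense(512, activation="relu")(inputs)
    x = tf.keras.layers.Dense(512, activation="relu")(x)

    # Outputs: one head per accompanying voice.
    model = tf.keras.Model(inputs, [voice_head(x, P_A, "alto"),
                                    voice_head(x, P_T, "tenor"),
                                    voice_head(x, P_B, "bass")])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")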

Generating music from the model

Once the model has finished training, new data can be generated from it. To generate a new Bach chorale, these steps are followed:

  1. A random soprano melody from the dataset, or a new soprano melody, is provided as the input to the model.
  2. The ATB voices are gathered from the model outputs and grouped with the soprano voice.
  3. All the SATB matrix data is compiled back into the list-of-tuples form (a decoding sketch follows this list). Ex: [(74, 70, 65, 58), (74, 70, 65, 58), (74, 70, 65, 58), …, (77, 70, 62, 55)]
  4. The tuple data is then transformed into a series of NoteInfo objects, which are written to a MIDI file via a custom MIDI generator that accepts lists of NoteInfo.
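
Here is a sketch of steps 1 to 3, under the same assumptions as the earlier snippets: the model from the model sketch, the soprano matrix m, and hypothetical range starts LOW_A, LOW_T, and LOW_B for the lower voices (the article only specifies the soprano range):

    import numpy as np

    LOW_A, LOW_T, LOW_B = 53, 45, 33  # hypothetical lower-voice range starts

    def indices_to_notes(idx_vector, low):
        """Map index 0 back to silence (None) and 1..P-1 back to MIDI notes."""
        return [None if i == 0 else int(i) + low - 1 for i in idx_vector]

    # Step 1: feed the soprano matrix to the model; step 2: gather ATB outputs.
    a_prob, t_prob, b_prob = model.predict(m.reshape(1, -1))
    a_idx = np.argmax(a_prob[0], axis=1)  # most likely note index per step
    t_idx = np.argmax(t_prob[0], axis=1)
    b_idx = np.argmax(b_prob[0], axis=1)

    # Step 3: zip the four voices back into the list-of-tuples form.
    satb = list(zip(indices_to_notes(np.argmax(m, axis=1), 61),
                    indices_to_notes(a_idx, LOW_A),
                    indices_to_notes(t_idx, LOW_T),
                    indices_to_notes(b_idx, LOW_B)))
    # satb -> [(74, 70, 65, 58), (74, 70, 65, 58), ...]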
NoteInfo class

The process of transforming the SATB tuples into lists of NoteInfo is mostly straightforward, except for the length property. How this property is calculated can be illustrated with the following example:

[(74, 70, 65, 58), (74, 70, 65, 58), (74, 70, 60, 50)]

When the same note number appears in the same tuple position across consecutive time steps, those steps are combined into one longer note instead of several short ones. The previous example would yield the following NoteInfo(starting_beat, pitch, length) objects:

NoteInfo(0, 74, 0.75), NoteInfo(0, 70, 0.75), NoteInfo(0, 65, 0.5), NoteInfo(0, 58, 0.5) and

NoteInfo(0.5, 60, 0.25), NoteInfo(0.5, 50, 0.25)
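
Here is a sketch of that merge logic. The NoteInfo fields follow the article’s (starting_beat, pitch, length) signature, with one 16th-note step lasting 0.25 quarter-note beats, which is why three held steps yield a length of 0.75; the repository’s actual implementation may differ:

    from dataclasses import dataclass

    STEP = 0.25  # one 16th-note time step, in quarter-note beats

    @dataclass
    class NoteInfo:
        starting_beat: float
        pitch: int
        length: float

    def tuples_to_note_infos(satb):
        notes = []
        for voice in range(4):
            start = 0
            while start < len(satb):
                pitch = satb[start][voice]
                end = start
                # Extend the note while this voice repeats the same pitch.
                while end + 1 < len(satb) and satb[end + 1][voice] == pitch:
                    end += 1
                if pitch is not None:  # skip silences
                    notes.append(NoteInfo(start * STEP, pitch, (end - start + 1) * STEP))
                start = end + 1
        return notes

    print(tuples_to_note_infos([(74, 70, 65, 58), (74, 70, 65, 58), (74, 70, 60, 50)]))
    # [NoteInfo(0.0, 74, 0.75), NoteInfo(0.0, 70, 0.75), NoteInfo(0.0, 65, 0.5),
    #  NoteInfo(0.5, 60, 0.25), NoteInfo(0.0, 58, 0.5), NoteInfo(0.5, 50, 0.25)]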

Conclusion

I hope this article gave you a better understanding of how to use neural networks to harmonize a melody using the Bach chorale approach. If you have further questions or would like some concepts explained further, don’t hesitate to post in the comment section.

If you want to dig deeper into the neural network model’s architecture and get a better intuition about the data manipulation performed inside the feedforward neural network, you can read part 2 of this article here.

About Me

I am the owner and founder of Kafka Studios where we build music plugins and tools powered by Artificial Intelligence.

I also compose video game-styled music in my free time.

