By Cecil Fernandez Briche and Raquel Bujalance
In the same way we all consume products or services produced by engines invented to automate processes or replace human workload, could you imagine listening to music created through AI? Even better, what if you could generate a new melody in one click, just for you, inspired by your current mood or based on your preferences, with no need for the creativity or compositional knowledge of a musician? Some geniuses said that music is like mathematics, and the resemblance is striking if you think of time series: instruments are like different classes, notes are values on the vertical axis, and the observations over time (on the horizontal axis) mark the rhythm of each track. Then, from a solfège (commonly called “music theory”) point of view, we can talk about patterns, rules and constraints.
That’s what the BrainMusic experiment is about: creating music from emotions.
Project set-up: Objectives & Scope definition
Before starting anything we needed to determine the scope and list all the requirements that would ensure the project’s feasibility. Moreover, it was crucial to delimit each step so we could build a first version of our model as soon as possible, like an MVP (Minimum Viable Product), and then iterate on it to fine-tune the parameters and improve the results.
Regarding the “music” scope, at the beginning it was tempting to think we could work on a wide scope, with different genres, creating songs that combined the stylish touch of famous artists. But we had heard that deep learning models work better when focused on a single genre, and after reading many papers about state-of-the-art research in music AI, we realized that we should limit our scope even further to be able to create something “nice to listen to”. That’s why we finally went with a single style and instrument, classical piano, which also seemed to be the easiest choice for finding a large dataset.
Now, regarding “emotions”, even if we initially wanted to play with many of them, we decided to start with a few basic, opposite ones: “sad” and “happy” in a first instance, later adding “calm” and “nervous”. We’ll talk later about this tricky part and how we can define and label these emotions, because we all understand that emotions are as subjective as musical taste. For example, a person (A) can hate a type of music (e.g. reggaeton) that makes another person (B) feel like dancing, whereas (A) would probably need to escape from that kind of noise. Conversely, (A) can enjoy rock music whilst (B) may feel bored.
Data Collection: Extraction, Cleaning & Visualization
To build our sample, we gathered files in MIDI format directly from websites with open MIDI databases. Data cleansing is an essential process in any AI project, and, as expected, we had to spend a lot of time cleaning and analyzing the data. We had compiled MIDIs with piano, but some tracks were corrupt, duplicated or contained None values, and many others included instruments besides the piano in the performance.
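As an illustration, duplicates can be caught at byte level before any musical parsing, since identical downloads from different sites are common. This is a minimal sketch of that step (the function names are our own, not from the project code):

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return the MD5 digest of a file's raw bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def find_duplicates(folder: str, pattern: str = "*.mid"):
    """Group MIDI files by content hash; any group with more than
    one file is a set of byte-identical duplicates."""
    groups = {}
    for path in sorted(Path(folder).glob(pattern)):
        groups.setdefault(file_digest(path), []).append(path.name)
    return [names for names in groups.values() if len(names) > 1]
```

Files that then fail to parse with a MIDI library can be discarded as corrupt.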
Preprocessing: Features Understanding, Extraction and Classification
We wanted to understand what the main characteristics of our melodies were, and how to identify emotions. To help us label and classify the themes across emotions within a whole audio composition, we assigned unique tags to each feature. This is no simple task. Bear in mind that music is a time series, and a file can have several tracks, each of them with several parts in different tempos, mixing many chords and harmonies throughout the same composition. Moreover, a lot of classical pieces have sections in contrasting keys or in tempos different from the main ones. This means that sometimes even music experts can have doubts about the main mode or tempo of a piece, depending on the quality or precision, or even the instrument, of the performance they are listening to.
We would like to recognize the amazing work behind the music21 and Magenta libraries, which offer Python-based toolkits for computer-aided musicology. They helped us a lot to extract all the features we needed for this project.
We used several methods to detect all the instruments, keys and tempos of the themes. Then we implemented and applied some additional functions to extract the main mode and tempo as unique values, and finally created bins based on typical annotations in piano sheet music.
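For example, the tempo binning can be sketched as a simple lookup over approximate BPM ranges for the Italian tempo markings; the exact cut-offs below are illustrative, not the project’s actual values:

```python
# Approximate upper BPM boundaries for common Italian tempo markings;
# these cut-offs are our own illustrative choice, not a standard.
TEMPO_BINS = [
    (60, "largo"),
    (76, "adagio"),
    (108, "andante"),
    (120, "moderato"),
    (168, "allegro"),
]

def bin_tempo(bpm: float) -> str:
    """Map a raw BPM value to a single tempo-bin label."""
    for upper, label in TEMPO_BINS:
        if bpm < upper:
            return label
    return "presto"
```

A piece whose extracted main tempo is 112 BPM, for instance, would fall in the “moderato” bin under these boundaries.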
With all these ingredients we were ready to cook our recipe and classify the different melodies into emotions, following a common interpretation of the feelings associated with classical music in musicology:
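One way to sketch such a classification, consistent with the usual musicological association (major and fast leaning happy, minor and slow leaning sad), is a simple lookup over mode and tempo bin; the exact assignments below are illustrative, not the project’s actual table:

```python
# Illustrative mode/speed-to-emotion table, following the common
# musicological association of major/minor and fast/slow with moods.
EMOTION_TABLE = {
    ("major", "fast"): "happy",
    ("major", "slow"): "calm",
    ("minor", "fast"): "nervous",
    ("minor", "slow"): "sad",
}

def classify_emotion(mode: str, tempo_bin: str) -> str:
    """Collapse tempo bins into fast/slow, then look up the emotion label."""
    speed = "fast" if tempo_bin in ("allegro", "presto") else "slow"
    return EMOTION_TABLE[(mode, speed)]
```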
We had previously selected a balanced sample of minor and major themes, and inside each mode we observed a normal-like distribution of the tempo bins, with a greater proportion of “moderato”. That actually made sense and corroborated that its central point was also the statistical mode, mean and median of both subsamples.
As we wanted to work with a more robust sample, we applied two different techniques to augment our dataset:
- Many of the MIDIs collected contained more instruments than just piano, or a different instrument altogether (violin, flute, etc.). Given that MIDI files describe all the notes of a composition, converting the General MIDI messages from other instruments (single and multiple) to piano was the best option to get more diversity of melodies and patterns, and also a good way to get more polyphonic data as input and richer, more complex melodies as output:
- Transposition of each track’s main key to different notes (respecting its main mode: major or minor)
This technique was useful not only to increase the sample size but also to reduce sparsity in the training data, making the examples more generic by providing key (tonality) invariance across all of them.
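Both augmentation steps can be sketched in a few lines. The snippet below illustrates the idea on plain Python structures (dictionaries for MIDI messages, lists of pitch numbers for melodies) rather than on a real MIDI library; the names are our own:

```python
PIANO_PROGRAM = 0  # General MIDI program 0 = Acoustic Grand Piano

def to_piano(messages):
    """Rewrite every program_change event to the piano program,
    leaving note events (pitch and timing) untouched."""
    return [{**m, "program": PIANO_PROGRAM} if m.get("type") == "program_change" else m
            for m in messages]

def transpose(pitches, semitones):
    """Shift every MIDI pitch by a number of semitones; reject shifts
    that would leave the valid 0-127 pitch range."""
    shifted = [p + semitones for p in pitches]
    return shifted if all(0 <= p <= 127 for p in shifted) else None

def augment_by_transposition(pitches, shifts=range(-6, 6)):
    """One copy per key shift: a tritone down to a tritone up covers
    each of the 12 tonal centres exactly once (shift 0 keeps the original)."""
    return [c for c in (transpose(pitches, s) for s in shifts) if c is not None]
```

Since only the program (instrument) is rewritten, all the notes of the original performance survive the conversion to piano.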
To go more in-depth on that part, you can see the code available on GitHub, which gives step-by-step details about our data cleaning and feature engineering.
Now that we had all the data ready, we needed to create a model that would generate new melodies. As we mentioned at the beginning of the post, music can be considered a time series, subject to patterns and rules.
In fact, a MIDI file has a graphical representation with the pitch on the y-axis and the time of each note on the x-axis, as you can see in the following visualizations. The first image below corresponds to a monophonic melody (only one note at a time), while the second to a polyphonic one (more than one note at the same time).
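A piano roll like those images can even be rendered as text from (pitch, start, duration) triples, which makes the monophonic/polyphonic distinction easy to see; this toy renderer is just for illustration:

```python
def piano_roll(notes, steps):
    """Render (pitch, start, duration) triples as a text piano roll:
    one row per pitch (high to low), '#' where the note sounds."""
    pitches = sorted({p for p, _, _ in notes}, reverse=True)
    rows = []
    for pitch in pitches:
        row = ["." for _ in range(steps)]
        for p, start, dur in notes:
            if p == pitch:
                for t in range(start, min(start + dur, steps)):
                    row[t] = "#"
        rows.append(f"{pitch:3d} " + "".join(row))
    return "\n".join(rows)
```

A monophonic melody shows at most one “#” per column; a chord stacks several in the same column.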
What kind of models could we use to create music? A GAN or an LSTM/GRU could be great options. GAN models were good candidates to generate distortions of a specific piece, to produce interpolations between two different themes/genres, or to add instruments; but to create long sequences of notes while maintaining a correct structure, an LSTM network or a GRU RNN seemed a priori more suitable. This type of model is more appropriate for processing sequential information and understanding its inherent patterns.
“Generating long pieces of music is a challenging problem, as music contains structure at multiple timescales, from millisecond timings to motifs to phrases to repetition of entire sections” (Huang, Simon & Dinculescu, Magenta)
After investigating the latest AI trends and algorithms applied to music, like DeepJazz, BachBot, GRUV and Magenta, we decided to base our model on the last one. In the words of its creators, “Magenta is an open-source research project exploring the role of machine learning as a tool in the creative process”. This Google project contains several pre-trained models to create music, from GANs to LSTMs, some of them considered today as the state of the art in music generation. Another great advantage of this open-source library was that it also offered the possibility to train our own models by adapting the structure.
We tried three different model classes to create music, all based on LSTM layers but incorporating different architectures, each one usually adding new features to the previous model.
After testing these models on our training sample, we both agreed that our favorite performances were the ones obtained through Polyphonic RNN (later in this post we’ll share some results as examples). In our GitHub repositories you can find the notebook template we used in Colab to train it.
Polyphonic RNN is a recurrent neural network that applies language modelling to polyphonic music generation using an LSTM. The model took inspiration from BachBot, a research project to build artificial intelligence that can generate and harmonize chorales in a way that’s indistinguishable from Bach’s own work. The pre-trained model available within Magenta had also been trained on several datasets, such as the Bach chorales. In contrast, we trained the model on our own sample, which we suspected to be less structured, as the collected MIDIs could contain both classical and modern music and we had no way to know their exact distribution. That fact increased the difficulty of the challenge, since it made it harder to identify what we could consider “nice melodies”: contemporary piano uses more dissonant components in its harmonies, with complex rhythms that can sound “atypical”, “unexpected” or “strange” to many people.
Using an LSTM model with sequential layers is common practice in text analysis problems, and in the music field it works exactly the same way. As there are thousands of amazing posts explaining how recurrent neural networks, and specifically LSTMs, work, we will just focus on how this is implemented in Magenta.
Taking a look inside the model, the polyphony model takes as input a single stream of note events with special START, STEP_END and END symbols. Within a step, notes are sorted by pitch in descending order, as tuples.
The first value of each tuple represents the condition (new note or continued note), distinguishing whether a note is a continuation of the same pitch from a previous step or a new incorporation. The second value is the pitch (between 0 and 127), the numeric MIDI pitch of the note. For each chord, “start” and “end” symbols are added, and additionally “step-end” marks the end of each step.
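A simplified re-implementation of this encoding (our own sketch, not Magenta’s actual code) makes the stream structure concrete:

```python
NEW, CONTINUED = "new_note", "continued_note"

def encode(steps):
    """Encode a list of steps (each a list of (is_new, pitch) pairs)
    into a single event stream with START / STEP_END / END symbols.
    Within each step, notes are sorted by pitch in descending order."""
    events = ["START"]
    for step in steps:
        for is_new, pitch in sorted(step, key=lambda n: -n[1]):
            events.append((NEW if is_new else CONTINUED, pitch))
        events.append("STEP_END")
    events.append("END")
    return events
```

For a C major chord held over two steps, `encode([[(True, 60), (True, 64), (True, 67)], [(False, 60), (False, 64), (False, 67)]])` yields START, three new-note events (highest pitch first), STEP_END, three continued-note events, STEP_END, END.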
In the Magenta library it is possible to specify the number of layers, the units inside each layer, and the batch size for batch normalization. The implementation also allows selecting the dropout level, which probabilistically excludes a percentage of neurons in the LSTM units from activation and weight updates while training the model. The outputs are followed by a fully-connected layer which is passed through a softmax to yield a predictive distribution. The model is trained using an Adam optimizer, and the parameterization in Magenta enables defining, a priori, the learning rate and the maximum norm for gradient clipping during backpropagation through time. The original BachBot paper, “Automatic Stylistic Composition of Bach Chorales with Deep LSTM”, tested a grid of hyperparameters for a model similar to the Magenta implementation. Our selection is based on this study, but was also restricted by the memory available in the training environment.
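The softmax step, together with a sampling temperature (the knob generation tools typically expose to trade safety for surprise), can be sketched in plain Python; this is an illustration of the idea, not Magenta’s code:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution; temperature > 1
    flattens it (more surprising notes), < 1 sharpens it (safer notes)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_event(logits, temperature=1.0, rng=random):
    """Draw the index of the next event from the softmax distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```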
Ready to listen to some of the BrainMusic creations?
But it also produces some cacophonies… or maybe we’re just not ready yet for the music of the 21st century.
Conclusions and possible evaluations of the models
Evaluating model performance is more problematic here, although we checked accuracy metrics both in training and testing, mainly to avoid overfitting. In this case we don’t want the model to learn the melodies exactly or to perform only minimal alterations on them. Accuracy here represents the capability to predict the following notes with their associated rhythm, tempo, harmony, volume and all the other composition components at each step and batch. We are not predicting prices or events we need to know in advance; we are using this network to make the AI learn to play, taking all the composition components into account. In the same way, when musicians want to create or listeners want to hear new music, they want something “special”, not a copy or a repetition of the same events.
To allow the model to improvise to a greater extent, we implemented two complementary techniques: first increasing the sample by key transposition to provide key (tonality) invariance across the whole sample, and then applying a non-negligible dropout rate during training.
This means that predictability should not be the most relevant metric here, as we would like to listen to a creation that sounds nice as well as striking or original.
We could evaluate the creations from a human perspective, but that would be subjective as well as a long, manual process. For example, we could listen to the new creations and assess the following criteria to give some feedback:
- if it sounds generally consonant (perfect intervals in polyphony) or dissonant (similar to notes played wrong).
- if it respects the Circle of Fifths, i.e. the relationship across chords along the whole composition: whether the scale and harmonic progression make sense
- if the melody sounds nice as well as striking or original
- if we can appreciate some dynamics, phrasing, and if we can easily recognize a beginning, different sections and ending.
- if the articulations sound natural within the tempos along the whole piece
- if it has its own musical self-expression
- if we can feel the emotion corresponding to the main timbre
- its overall performance and complexity
As in real life, music is evaluated and appreciated through human feelings and needs; sometimes, if we analyze the most popular songs, people may prefer noisy, repetitive or simple compositions over beautiful, complex music.
These algorithms could be useful in different contexts and easily implemented in production in digital products, with many business applications such as:
- help artists create their next album with their own artistic signature, or generate new fusions or styles of music.
- help producers, sound engineers and studios offer the best software plugins for generating or optimising background music, and inspire artists in their compositions and recordings.
- help brands select a meaningful song for their advertisements, aligned with their positioning, so they transmit their advertising message better.
- help producers/screenwriters choose the best theme for specific scenes.
- help digital products and/or distributors in the music industry offer innovative mood-based playlists or create new songs without paying so much in royalties (maybe Spotify could be more profitable).
- but also help people with emotional disorders such as depression, stress, bipolar disorder, anxiety, panic or phobias, through music therapy.
This project is the result of our participation in the deep learning study group at @SaturdaysAI Madrid. Saturdays AI is a non-profit on a mission to empower diverse individuals to learn Artificial Intelligence in a collaborative, project-based way, beyond the conventional education path.