Hum2Song Multi-track Polyphonic Music Generation from Voice Melody Transcription with Neural Networks

Hum2Song! is an AI-powered web application that composes a musical accompaniment for a melody sung by a human voice.

System overview

These are the components of Hum2Song!

The steps taken to create this solution are listed below:

  • Learn how MIDI files are structured
  • Scrape the website (16k files)
  • Decide which features to use
  • Preprocess the data
  • Perform stratified sampling
  • Evaluate several NN architecture combinations (325 per condition)
  • Fine-tune the best options
  • Convert the best model to TensorFlow.js
  • Implement an HTTPS site that allows voice recording
  • Integrate my model and the Google Magenta models
  • Clean the noisy transcribed data
  • Get the genre, a drum track, a bass line, a tonal scale, and a chord progression from the melody
  • Create a song from the progressions
  • Adapt a web music editor (GridSound)
  • Publish the website
  • Promote the online demo

Each step and module posed a significant challenge, which is why different techniques and AI technologies were used. In this article, we focus on explaining the genre prediction module in detail. For documentation about the rest of the modules, check out the Google Magenta project.

Predicting genre from the melody

Predicting genre is always challenging; these are some of the reasons:

  • Genre is an ambiguous concept (e.g. “Pop” stands for “popular”, regardless of style)
  • Many songs combine different genres
  • Genre prediction requires multi-track analysis
  • The same melody can be used in multiple genres

Accuracy is usually low, as previous work shows:

Cory McKay, Automatic Genre Classification of MIDI Recordings

Hum2Song! transcribes the voice to notes using the Onsets and Frames model created by the Google Magenta team. That model was trained on piano sounds, which is why human voice transcription is deficient: noise and overtones are added to the output score, and some cleaning work is needed before running the prediction.
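The cleaning step can be sketched roughly as follows. This is a simplified illustration, not the project's actual code; it assumes a hypothetical note representation as `(pitch, start, end)` tuples with times in seconds, and targets two typical transcription artifacts: very short spurious notes and octave overtones.

```python
def clean_transcription(notes, min_duration=0.05):
    """Drop very short notes (noise) and notes that look like overtones.

    `notes` is a list of (midi_pitch, start_s, end_s) tuples
    (a hypothetical data model for this sketch).
    """
    # 1. Remove notes too short to be intentional.
    notes = [n for n in notes if n[2] - n[1] >= min_duration]

    # 2. Remove a note if a note exactly one octave below (12 semitones)
    #    overlaps it in time -- a common overtone artifact in
    #    piano-trained transcription of voice.
    def is_overtone(note):
        pitch, start, end = note
        return any(
            other[0] == pitch - 12 and other[1] < end and other[2] > start
            for other in notes
        )

    return [n for n in notes if not is_overtone(n)]
```

A real cleaner would likely also merge fragmented notes and quantize onsets, but the filter-by-duration and drop-overtones passes capture the core idea.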

We used a scraped MIDI dataset; the scraper can be found here. Then we extracted features and preprocessed the data. We created a tutorial in Google Colaboratory that describes all the steps needed to extract features from a single track, find the melody using string algorithms such as LRS (Longest Repeated Subsequence) and LCS (Longest Common Subsequence), and finally train a model by testing every combination from 1 to 5 layers of a Multi-Layer Perceptron architecture.
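To make the string-algorithm idea concrete, here is a minimal LRS sketch applied to a sequence of note pitches: the intuition is that the melody is a phrase that repeats, so the longest subsequence occurring at least twice is a good melody candidate. This is the textbook dynamic program (LCS of the sequence against itself, forbidding identical indices), not the project's exact implementation.

```python
def longest_repeated_subsequence(seq):
    """Length of the longest subsequence that appears at least twice.

    Standard DP: LCS of `seq` with itself, where matching an element
    against itself (i == j) is disallowed.
    """
    n = len(seq)
    dp = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if seq[i - 1] == seq[j - 1] and i != j:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][n]
```

For a pitch sequence like `[60, 62, 64, 60, 62, 64]` the repeated three-note phrase gives an LRS length of 3; recovering the subsequence itself is a standard backtrack over the same table.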

We performed 4 experiments to determine how many, and which, features from a single-channel stream of notes are needed to best predict the genre. The results were the following:

The best result was obtained from the drums track, which makes sense since rhythm describes the genre better than melody. We trained the Neural Network on a 128-dimensional vector representing the drums: 8 seconds of notes sampled into 128 values, i.e. 16 samples per second, so every 62.5 milliseconds we checked whether a drum instrument was playing.

This is the drums Neural Network Architecture:

  • Layers: 128, 64, 32, 3
  • Input: 1D Vector 128 features from drums
  • Output: 3 classes (Jazz, Electronic, Rock)
  • Activation functions: ReLU & Softmax
  • Optimizer: RMSprop
  • Loss function: categorical cross-entropy
  • Validation accuracy: 55.8%
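In Keras, the architecture listed above can be sketched like this (assumption: the first entry of "128, 64, 32, 3" is the input dimension and the remaining entries are Dense layers, with ReLU on the hidden layers and softmax on the output):

```python
from tensorflow import keras

# Drums genre classifier: 128 binary drum samples in, 3 genres out.
model = keras.Sequential([
    keras.Input(shape=(128,)),                    # 8 s of drums, 62.5 ms bins
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),  # Jazz / Electronic / Rock
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Training is then a standard `model.fit(X, y_onehot, validation_split=...)` call on the stratified dataset.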

Finally, we implemented a model based on the melody, because the input we expect is a melody sung by a human voice. In this case, we used only a 64-feature vector (4 seconds), since melodies are sometimes short and that short period of time is the most descriptive.

This is the melody Neural Network Architecture:

  • Layers: [64, 128, 16, 64, 256, 32, 3]
  • Input: 1D Vector 64 features from melody
  • Output: 3 classes (Jazz, Electronic, Rock)
  • Activation functions: ReLU & Softmax
  • Optimizer: RMSprop
  • Loss function: categorical cross-entropy
  • Validation accuracy: 48.6%
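Since both networks are plain MLPs described by a layer-size list, they can be built from that list directly. The sketch below (same assumption as before: first entry is the input dimension, last is the softmax output, ReLU in between) builds the melody network:

```python
from tensorflow import keras

def build_mlp(layer_sizes):
    """Build an MLP classifier from a layer-size list.

    layer_sizes[0] is the input dimension, layer_sizes[-1] the number
    of output classes; intermediate entries are ReLU Dense layers.
    """
    inputs = keras.Input(shape=(layer_sizes[0],))
    x = inputs
    for size in layer_sizes[1:-1]:
        x = keras.layers.Dense(size, activation="relu")(x)
    outputs = keras.layers.Dense(layer_sizes[-1], activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# The melody architecture from the list above.
melody_model = build_mlp([64, 128, 16, 64, 256, 32, 3])
```

The same helper with `[128, 64, 32, 3]` reproduces the drums network, which is convenient when sweeping layer combinations as described in the tutorial.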


This project is the result of developing new models and integrating existing ones. I integrated several of the Google Magenta RNN-based models.

The model was trained using Keras. The generated model was ported to TensorFlow.js by following this tutorial. The MIDI tools, the external model implementations, and the player were taken from the Magenta.js library.


Music generation is a complex task that has been explored from different angles, such as GANs (e.g. MuseGAN), RNNs (e.g. Magenta), and many others. There is no single best technology for the job, since each has its own trade-offs.

In this work, we limited ourselves to assembling part of the existing efforts into a web application and to developing the missing connectors that fit our methodology. We envision that tools like this can inspire anyone by turning a musical idea into a song just by humming or singing it.

The code is available in this GitHub repository; feel free to implement your own improved version and let us know about it.

The application is available online; any feedback is welcome.

Hum2Song User Interface