Week 6 — Tune It Up

Oğuz Bakır
BBM406 Spring 2021 Projects
6 min read · May 23, 2021

Hello world,
We are Fidan Samet, Oğuz Bakır and Adnan Fidan. In the scope of the Fundamentals of Machine Learning course project, we are working on music genre transfer and prediction. We are writing blog posts about our progress throughout the project, and this is the sixth one in our series. In this post, we will cover the latest results on the music genre prediction and transfer tasks, as well as our contributions to the CycleGAN model for music style transfer. So let's get started!

Previously on Tune It Up…

Timeline of Tune It Up

In the previous weeks, we used Naive Bayes, k-Nearest Neighbors (k-NN), Random Forest and Multi-layer Perceptron (MLP) models to predict music genres. We achieved a best test accuracy of 85.40%. However, the class-specific test accuracies were low. For instance, we obtained 53% test accuracy on the Jazz genre with the model trained on 3 genres. Since we aim to use this classifier to evaluate our style transfer model, we need higher accuracies.

On top of that, training a style transfer model takes ~12 hours. To perform style transfer across 3 genres, we would need to train 6 models, and we have neither the time nor the computing power for that. Therefore, we decrease the number of classes in the dataset. Since our dataset contains only piano tracks, keeping the Jazz and Classic genres makes the most sense, so we eliminate the Pop tracks to solve both problems and retrain the prediction models on 2 genres.

Last week, we talked about the initial results on the music genre prediction and transfer tasks. You can find last week's blog here. This week, we will talk about the latest results of our prediction models, the transfer results of the CycleGAN model, and our contributions to the CycleGAN model.

Music Genre Prediction

We retrain the k-NN model over the same range of neighbor counts using the reduced dataset; below is the corresponding accuracy plot.

Distribution of Accuracy Over Number of Neighbors for k-NN Model

We retrain the Random Forest model over the same max depth range using the reduced dataset; below is the corresponding accuracy plot.

Distribution of Accuracy Over Max Depth for Random Forest Model
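Both sweeps follow the same pattern. Below is a minimal, self-contained sketch with scikit-learn; the synthetic data and the exact parameter ranges are placeholders, not our actual features or search ranges.

```python
# Hypothetical sweep over k-NN neighbor counts and Random Forest max depths;
# synthetic data stands in for our extracted audio features and genre labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k-NN: record the test accuracy for each neighbor count in the range.
knn_acc = {}
for k in range(1, 31):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    knn_acc[k] = model.score(X_test, y_test)

# Random Forest: the same pattern over the max depth range.
rf_acc = {}
for depth in range(1, 21):
    forest = RandomForestClassifier(max_depth=depth, random_state=0)
    rf_acc[depth] = forest.fit(X_train, y_train).score(X_test, y_test)
```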

For the MLP model, since there are several hyperparameters to tune, we perform an exhaustive search over the parameters and obtain the best ones. After retraining all the models on the dataset containing 2 genres, we obtain the best test accuracies shown in the table below. The MLP model gives the best accuracy: it achieves 90% accuracy on the Classic genre and 83% accuracy on the Jazz genre.

Best Test Accuracies Obtained with Different Algorithms
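Such an exhaustive search can be done with scikit-learn's GridSearchCV; the parameter grid below is illustrative, not our exact search space, and X_train/y_train are the same placeholders as in the sketch above.

```python
# Hedged sketch of an exhaustive MLP hyperparameter search.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "hidden_layer_sizes": [(64,), (128,), (64, 64)],
    "activation": ["relu", "tanh"],
    "alpha": [1e-4, 1e-3],          # L2 regularization strength
    "learning_rate_init": [1e-3, 1e-2],
}
search = GridSearchCV(MLPClassifier(max_iter=500), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```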

Music Genre Transfer

To perform style transfer, we use the CycleGAN model as our baseline. For the initial training, we use the vanilla CycleGAN model, editing only its data loader because the model accepts images. Since our dataset contains 64x84 NumPy arrays, we perform NumPy-to-PIL image conversion in the data loader, obtaining 64x84 images as a result. For data pre-processing, we only perform normalization, which maps the values to the range between -1 and 1. We obtained the following image and music results for this experiment (a sketch of the data-loader change follows the results).

Input (Left/Jazz) and Output (Right/Classic) Image of Vanilla CycleGAN Model
Audio Form of the Input MIDI File — Jazz
Audio Form of the Output MIDI File — Classic
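As mentioned above, here is a rough sketch of the data-loader change, assuming the piano rolls are stored as 64x84 NumPy arrays; the file name and transform details are illustrative.

```python
# Sketch: convert a 64x84 boolean piano roll to a PIL image, then normalize
# to [-1, 1] as the vanilla CycleGAN pipeline expects.
import numpy as np
from PIL import Image
import torchvision.transforms as transforms

roll = np.load("track.npy")                           # 64x84 array of 0/1 values
img = Image.fromarray((roll * 255).astype(np.uint8))  # NumPy -> PIL image

preprocess = transforms.Compose([
    transforms.ToTensor(),                 # scales pixel values to [0, 1]
    transforms.Normalize((0.5,), (0.5,)),  # maps [0, 1] to [-1, 1]
])
x = preprocess(img)                        # 1x64x84 tensor in [-1, 1]
```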

From the vanilla CycleGAN results, we see that some changes are needed to perform style transfer on MIDI representations, because the outputs contain random generations. The CycleGAN model uses ResNet blocks in its generator. We replace this generator with a U-Net network to see whether the ResNet generator plays any role in the random outputs. To use the U-Net network, we need to resize our 64x84 MIDI representations to 128x128 images, since the U-Net 128 network only accepts square inputs. We again use normalization as pre-processing. After the model operations, these images are resized back into 64x84 NumPy arrays so that they can be successfully mapped to MIDI files. We obtained the following image and music results for this experiment (a sketch of the resizing step follows the results).

Input (Left/Jazz) and Output (Right/Classic) Image of CycleGAN with U-Net Network
Audio Form of the Input MIDI File — Jazz
Audio Form of the Output MIDI File — Classic
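A sketch of the resizing step described above; the interpolation mode and the binarization threshold are our assumptions.

```python
# Sketch: resize the 64x84 roll to the 128x128 square input U-Net 128 expects,
# then resize the model output back so it can be mapped to a MIDI file.
import numpy as np
from PIL import Image

roll = np.load("track.npy")                           # 64x84 piano roll
img = Image.fromarray((roll * 255).astype(np.uint8))  # PIL size is (W=84, H=64)
square = img.resize((128, 128), Image.NEAREST)        # square model input
# ... run CycleGAN with the U-Net generator on `square` ...
out = square.resize((84, 64), Image.NEAREST)          # back to (W=84, H=64)
roll_out = (np.asarray(out) > 127).astype(np.uint8)   # 64x84 array again
```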

From the outputs, we see that the U-Net network creates too many artifacts, so we decide to keep the ResNet network. In the previous experiments, we use the normalization that gives values between -1 and 1, as in the vanilla CycleGAN model. However, our MIDI representations consist of True and False values, i.e. 1s and 0s. Therefore, we do not normalize our data in the next experiment. To obtain results in the 0 to 1 range, we use the Sigmoid activation function instead of the Tanh activation function of vanilla CycleGAN. We obtained the following image results for this experiment.

Input (Left/Jazz) and Output (Right/Classic) Image of CycleGAN with Sigmoid Activation Function
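In PyTorch terms, the change described above amounts to swapping the generator's output activation; the channel counts below are illustrative, not the exact architecture.

```python
# Sketch: the generator's final layer with Tanh (vanilla) vs. Sigmoid (ours).
import torch.nn as nn

vanilla_head = nn.Sequential(
    nn.Conv2d(64, 1, kernel_size=7, padding=3), nn.Tanh())     # outputs in [-1, 1]
sigmoid_head = nn.Sequential(
    nn.Conv2d(64, 1, kernel_size=7, padding=3), nn.Sigmoid())  # outputs in [0, 1]
```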

From the outputs, we observe that the Sigmoid activation function makes visible changes to the output notes. Then, to help the model learn better note representations, we add additional discriminators as proposed in [1]. These additional discriminators use the following loss formulas, which enable the generators to learn better high-level features.

Additional Discriminator Loss Formulas

Here, Xm is a random sample drawn from the mix of the 2 genre domains, and Xa_hat and Xb_hat are the transferred samples. By doing so, we retain the existing structure of the input.
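In LaTeX form, our reading of these additional discriminator losses from [1] (writing Xm as $x_m$ and the transferred samples as $\hat{x}_A$, $\hat{x}_B$) is:

$$
\mathcal{L}_{D_{A,m}} = \big\| D_{A,m}(x_m) - 1 \big\|_2^2 + \big\| D_{A,m}(\hat{x}_A) \big\|_2^2, \qquad
\mathcal{L}_{D_{B,m}} = \big\| D_{B,m}(x_m) - 1 \big\|_2^2 + \big\| D_{B,m}(\hat{x}_B) \big\|_2^2
$$

We obtained the following image results for this experiment.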

Input (Left/Jazz) and Output (Right/Classic) Image of CycleGAN with Sigmoid Activation Function and Additional Discriminator Losses

From the outputs, we observe that the additional discriminator losses yield better note preservation while still making changes to the notes. As our last contribution, we add a new loss, the Triplet Loss, to obtain cleaner translations. Below is the formula of this Triplet Loss.

Function of Triplet Loss

This loss function takes an anchor point and optimizes for similarity with a positive sample and dissimilarity with a negative sample. If we map these variables onto our problem, the anchor point is selected from domain A, the positive sample is selected from the same domain, and the negative sample is selected from the other domain. While transferring from domain A to domain B, we aim to pull B and fakeB (the data transferred from domain A to B) closer together while pushing A and fakeB apart.
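For reference, the standard triplet loss has the form below, where $a$, $p$ and $n$ are the anchor, positive and negative samples described above, $f$ is the embedding and $\alpha$ is a margin:

$$
\mathcal{L}_{\text{triplet}}(a, p, n) = \max\big( \| f(a) - f(p) \|_2^2 - \| f(a) - f(n) \|_2^2 + \alpha,\; 0 \big)
$$

We obtained the following image results for this experiment.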

Input (Left/Jazz) and Output (Right/Classic) Image of CycleGAN with Sigmoid Activation Function, Additional Discriminator Losses and Triplet Loss

From the outputs, we observe that the triplet loss keeps the model from simply removing chunks of notes; instead, it changes the durations of notes while preserving the pitches of the input notes.

Note that in the last 3 experiments, we could not create audio files due to an incorrect MIDI conversion. We are currently rerunning these experiments with the correct MIDI conversion.

That is all for this week. Thank you for reading and we hope to see you next week!

Bob Ross Says Goodbye

References

[1] Brunner, G., Wang, Y., Wattenhofer, R., & Zhao, S. (2018, November). Symbolic Music Genre Transfer with CycleGAN. In 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI) (pp. 786–793). IEEE.

Past Blogs
