Week 4 - Artificial Pianist
How to Evaluate the Model?
Artificial intelligence and artificial creativity are approaching human level day by day, but evaluating generative art projects, especially generative music, is still a challenging and precarious task because of how art and creativity are defined. Subjective evaluation, meaning reviews by listeners, seems to be the only possible methodology of choice, much like a Turing test. However, we want to implement both an objective and a subjective evaluation of the model.
Subjective Approach
We decided to evaluate the generated music pieces using listener feedback, collected through a Google Forms survey. We are designing an experiment that measures listeners' reviews by mixing original works by composers with pieces generated by The Artificial Pianist. We plan to recruit at least 50 participants. The survey uses a 10-point rating scale ranging from trash to masterpiece. We will evaluate the music pieces using the survey results and compare the scores of our artificial music with the scores of the real artworks.
Objective Approach
We found that there are different kinds of evaluation metrics: metrics based on probabilistic measures, metrics that are specific to the task, and metrics that require general knowledge of the musical domain.
As an example of the probabilistic measures, Huang et al. propose an evaluation that computes the negative log-likelihood.
The negative log-likelihood loss achieved on the validation set is the most common metric for measuring the predictive capability of a generative model and its overall efficiency.
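As a rough illustration, here is a minimal sketch of how a validation NLL could be computed, assuming a PyTorch model that outputs logits over a note vocabulary; the model, data loader, and batch layout are hypothetical placeholders, not our actual implementation:

```python
import torch
import torch.nn.functional as F

def validation_nll(model, val_loader, device="cpu"):
    """Average negative log-likelihood of next-note predictions
    over a validation set (lower is better)."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)                       # (batch, seq_len, vocab)
            log_probs = F.log_softmax(logits, dim=-1)
            nll = F.nll_loss(log_probs.view(-1, log_probs.size(-1)),
                             targets.view(-1),
                             reduction="sum")
            total_nll += nll.item()
            total_tokens += targets.numel()
    return total_nll / total_tokens
```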
Another metric used in the evaluation of generative music is the MGEval (Music Generation Evaluation) framework designed by Yang and Lerch in 2018. The MGEval toolbox extracts musical features and uses KL-divergence to compute the distance between the probability distributions of those features in the training data and in the generated notes, in order to evaluate the performance of the model.
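To make the idea concrete, here is a minimal sketch of the underlying comparison (this is an illustration of the principle, not the actual MGEval toolbox API): we take one musical feature per piece, estimate its distribution in the training set and in the generated set, and compute the KL-divergence between the two. The feature values below are placeholders.

```python
import numpy as np
from scipy.stats import entropy

def feature_kl_divergence(train_features, generated_features, bins=30):
    """KL divergence between the distributions of one musical feature
    (e.g. pitch range per piece) in the training data and in the
    generated data. Smaller values mean the generated music matches
    the training distribution more closely."""
    lo = min(np.min(train_features), np.min(generated_features))
    hi = max(np.max(train_features), np.max(generated_features))
    p, edges = np.histogram(train_features, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(generated_features, bins=edges, density=True)
    eps = 1e-10  # avoids division by zero inside the KL computation
    return entropy(p + eps, q + eps)

# Example feature: pitch range (highest minus lowest MIDI pitch) of each piece.
train_pitch_ranges = [34, 40, 29, 38, 45]       # placeholder values
generated_pitch_ranges = [22, 25, 30, 27, 24]   # placeholder values
print(feature_kl_divergence(train_pitch_ranges, generated_pitch_ranges))
```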
Another metric is the BLEU score, which stands for Bilingual Evaluation Understudy. It was designed to automatically measure the performance of machine translation models and is also commonly used for sequence generation models.
The BLEU score computes a geometric mean of the counts of matching N-grams (specifically 1-, 2-, 3-, and 4-grams) between a set of reference data and the generated data. A BLEU score lies between 0 and 1: a score of 0 means there is nothing in common, and a score of 1 means an exact match. For tasks like music generation we do not expect to reach a score of 1; a very high score may indicate that the model is overfitting.
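Here is a minimal sketch of how a BLEU score could be computed on note sequences with NLTK, treating each note (represented here simply as a MIDI pitch) as a token; the sequences are made up for illustration and do not come from our model:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Reference pieces from the dataset (each hypothesis can have several references).
references = [
    [[60, 62, 64, 65, 67, 69, 71, 72]],
]
# A generated piece to score against the references.
hypotheses = [
    [60, 62, 64, 65, 67, 69, 71, 74],
]

# Geometric mean of 1- to 4-gram precisions, with smoothing for short sequences.
score = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU score: {score:.3f}")
```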
Stay Tuned :)
References
Huang, C. Z. A., Cooijmans, T., Roberts, A., Courville, A., Eck, D.: Counterpoint by Convolution. In: International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China (2017)