A Night at the JADS Opera

On October 30, 2018, a unique musical experiment took place in the Chapel of JADS, the Jheronimus Academy of Data Science in Den Bosch. The event was organized during the first Den Bosch Data Week, in collaboration with the International Vocal Competition (IVC). My recommender lab took on the challenge of tailoring a program of six opera arias to the Spotify profiles of the audience. This was a particular challenge, as most of the audience was not familiar with opera, or with classical music in general. During the concert we asked the audience for feedback to check whether our predictions and tailoring worked well. In this story I will report how Yu Liang, my PhD student, and I approached this problem and what results came out of the event.

Nina van Essen, accompanied by Marta Liebana during the concert (photo: bymarjo.nl)

Requirements
To tailor a concert program, we need two things. First, we need a candidate set of songs that the musical performers (in this case, the mezzo-soprano Nina van Essen, accompanied on grand piano by Marta Liebana) can perform. Second, we need information about the music preferences of the audience, so we can select the best possible items from the candidate set. Nina van Essen provided us with a diverse set of 32 candidate songs: opera arias spanning several musical eras, from Monteverdi (~1600) to Bernstein (~1950). Everyone registering for the event was asked to log in with their Spotify account, which allowed us to collect their top tracks and favorite artists through the Spotify API. We received 47 registrations for the concert, with over 5,000 top tracks in total.
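
For readers who want to replicate the data collection, a minimal sketch using the community spotipy client could look like this; the scope, variable names, and flow are our illustration, not the exact code behind the registration page:

```python
# Sketch: collecting a listener's Spotify profile with spotipy.
# Client credentials are read from environment variables by SpotifyOAuth.
import spotipy
from spotipy.oauth2 import SpotifyOAuth

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope="user-top-read"))

# Fetch the listener's top tracks and favorite artists via the Web API.
top_tracks = sp.current_user_top_tracks(limit=50, time_range="medium_term")
top_artists = sp.current_user_top_artists(limit=50)

track_ids = [t["id"] for t in top_tracks["items"]]
```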

How to tailor?
For this concert we focused on the musical features provided by the Spotify API, which describe the energy, valence, key, liveness, tempo, and other audio qualities of a track. For our matching we focused on the following key features (a retrieval sketch follows the list):

  • Acousticness: a confidence measure of whether the track is acoustic
  • Danceability: how suitable a track is for dancing, based on tempo, rhythm stability, beat strength, and overall regularity
  • Energy: a perceptual measure of intensity and activity
  • Valence: the musical positiveness conveyed by a track
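
A minimal sketch of the retrieval step, assuming the spotipy client and the `track_ids` list from the sketch above:

```python
# Sketch: fetching the audio features we matched on.
# The audio_features endpoint accepts up to 100 track IDs per call.
features = sp.audio_features(track_ids)

# Keep only the features used for matching; skip tracks without features.
audience_features = [
    {k: f[k] for k in ("acousticness", "danceability", "energy", "valence")}
    for f in features
    if f is not None
]
```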

So how did the user profiles of our audience compare to the features of the 32 candidate songs we got from our performer? Figure 1 shows the feature values of the candidate set (left) and of the top tracks of our audience (right). As one can see, opera songs are quite different from the music our audience regularly listens to. Like most classical music, opera arias are high in acousticness and low in energy, valence, and danceability, much lower than the tracks of our audience, which score high on all three.

Figure 1: Boxplots representing the values of the songs on their musical features in the candidate set (left) and the top tracks of our audience (right)
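
A Figure 1-style comparison is straightforward to reproduce; here `candidate_features` and `audience_features` are assumed to be lists of feature dicts, as in the sketches above:

```python
# Sketch: side-by-side boxplots of the candidate arias vs. audience tracks.
import pandas as pd
import matplotlib.pyplot as plt

candidates = pd.DataFrame(candidate_features)   # the 32 arias
audience = pd.DataFrame(audience_features)      # the audience's top tracks

cols = ["acousticness", "danceability", "energy", "valence"]
fig, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
candidates[cols].boxplot(ax=axes[0])
axes[0].set_title("Candidate arias")
audience[cols].boxplot(ax=axes[1])
axes[1].set_title("Audience top tracks")
plt.show()
```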

Nevertheless, we should be able to select the best-matching tracks from the candidate set for our audience. For this we took a group recommendation approach. First, we matched the 32 candidate songs against each individual user profile: the more a track matches that user's typical values of danceability, energy, and valence, the better the track matches that person. So for each person we calculated a ranking of the 32 songs. Then we took those individual rankings and checked which tracks had the highest overall rank for the entire audience. From the 32 candidates, we selected a Top 10 and gave the performers the freedom to pick 6 tracks from this Top 10 to perform during the concert. Figure 2 shows that the matching on the features seems to work: our recommender algorithm picks the candidates that are highest in valence, danceability, and energy.

Figure 2: Showing how the final list (middle) picks the best songs from candidate list (left) that match best the top tracks of the audience (right)
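
The group recommendation step can be sketched as follows, under one plausible matching rule: a track matches a user better the closer it lies to the user's average feature values, and individual rankings are aggregated by average rank (a Borda-like rule). The exact matching function we used is not spelled out here, so treat `user_profiles` and `candidates` as illustrative pandas structures:

```python
# Sketch: rank the 32 candidates per user by distance to that user's mean
# feature profile, then aggregate the individual rankings by average rank.
import numpy as np

cols = ["danceability", "energy", "valence"]

def rank_for_user(user_tracks, candidates):
    center = user_tracks[cols].mean().values                # the user's taste point
    dists = np.linalg.norm(candidates[cols].values - center, axis=1)
    return dists.argsort().argsort() + 1                    # rank 1 = best match

ranks = np.array([rank_for_user(u, candidates) for u in user_profiles])
group_order = ranks.mean(axis=0).argsort()                  # best average rank first
top10 = candidates.iloc[group_order[:10]]
```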

Below is the final playlist, which can be played directly on Spotify.

Figure 3: The final playlist of the JADS opera event

How did the people in our audience differ? 
Though the set was tailored to the entire audience, different types of people were present. We retrieved genres from the users' top tracks and clustered the audience on these genres, resulting in four distinct groups of listeners. In the onboarding questionnaire that people completed when registering for the concert, we also asked their age and measured their musical sophistication (MS).
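
One way to obtain such groups is to cluster listeners on genre counts; the genres come from the artists behind each user's top tracks. The data structure and the use of k-means here are our illustration, not necessarily the exact method:

```python
# Sketch: clustering listeners on genre counts with k-means (k = 4).
import pandas as pd
from sklearn.cluster import KMeans

# user_genres: {listener_id: {genre: count, ...}, ...} (hypothetical structure)
X = pd.DataFrame(user_genres).T.fillna(0)

clusters = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)
```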

genre cloud of classic-rock-pop group
genre cloud of rock-pop-folk group
genre cloud of pop-rock group
genre cloud of pop-hiphop-house group

The first group, which we call classic-rock-pop, consists of a few listeners who listen to both classical music and some rock and pop, among other diverse genres (see genre cloud). This was the oldest group (mean age 43), with the highest MS: 4.91. As you can see in Figure 4, they score highest on acousticness and lowest on the other three features; in other words, they are closest to the opera profile of our concert program. The second group, which we label rock-pop-folk, is slightly younger (mean age 35) and also high on MS (4.82). They score higher than the first group on those features, but not as high as the last two groups, which have more danceable and energetic songs in their top tracks. These two groups are younger, with lower MS scores: the pop-rock people are around 32 years old with an MS of 4.76, whereas the pop-hiphop-house people are on average 28, with the lowest MS of 4.52.

Figure 4: Profiles of the musical features of each group/cluster in our data

Live data!
During the concert, we also took a data-driven approach. People in the audience got access to a web app (see Figure 5) in which they could see their personal genre cluster and their MS score. After each song we asked the audience to rate the song in terms of liking and in terms of personalization. After every two songs we also asked each person which of the two songs they liked most. As our recommender algorithm produced a predicted ranking for each individual, we could also state which song we expected that person to like more. So after indicating their own preference between the two songs, the app would tell them what we had predicted.
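
For these pairwise predictions, the app simply compared the two songs' positions in that listener's individual ranking; a minimal sketch, with hypothetical names:

```python
# Sketch: the predicted winner of a pair is the song with the better
# (lower) rank in that listener's individual ranking.
def predicted_preference(ranks, user, song1, song2):
    return song1 if ranks[user][song1] < ranks[user][song2] else song2
```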

Audience providing ratings during the concert using the web app as shown on the right (photo: bymarjo.nl)

During the concert we gave live feedback after every two songs, using boxplots to show how each group/cluster in the audience rated each song and what their preference was between the two songs (see Figure 6). We noticed that, in general, people liked the slightly more energetic/danceable songs better. For example, in Figure 6 we see that they rated the second song, which was more energetic than the first, as better and more personalized. This was indeed in line with our predictions, but we have the data to test this more thoroughly!

Figure 6: Feedback given about the ratings provided by the audience during the concert
Discussing the results during the concert (photo: bymarjo.nl)

The results: could we predict what people liked?
Using the ratings provided during the concert, we can test whether the predicted ranking of the songs produced by our recommender algorithm makes any sense. As we based the ranking on only three musical features, there are of course limits to how much we can predict. Moreover, we have data from 40 listeners in the audience, which is a limited sample. We therefore do not analyse the differences between the four groups but only report the overall results.

In the first analysis, we test whether songs that are ranked better (by our algorithm) for a user also get higher liking ratings and are perceived to be more personalized. Figure 7 shows the ratings as a function of how each song was ranked for each user. Ideally, the song with rank 1 (our best prediction) should also have the highest liking and the highest level of personalization. We observe that on average people liked the songs a lot: the median rating is 4 out of 5 stars. The average level of personalization is slightly lower, as we expected given that there were hardly any opera listeners in the audience. We also observe that on average the three highest-ranked songs are liked slightly better (and perceived to be more personalized) than the lowest-ranked songs (4–6), with the exception of the song with rank 6. For most users the lowest-ranked song was the fifth song performed during the concert, by Gounod, which made a strong impression during the concert, probably due to its very virtuosic ending (listen for yourself!). We ran multilevel linear regressions predicting liking and personalization from whether a song was ranked high (1–3) or low (4–6) for that user, and found that in both models the ranking was a significant predictor, supporting the pattern observed in Figure 7. In other words, our predicted ranking based on the recommendation algorithm was related to the actual liking and perceived personalization of the songs.

Figure 7: Ratings of the audience for each song, as a function of the predicted rank (individually for each user)
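
The multilevel regressions can be sketched with statsmodels, assuming a long-format table with one row per (listener, song) rating; the column names are illustrative:

```python
# Sketch: random-intercept model per listener; refit with "personalization"
# as the outcome variable for the second analysis.
import statsmodels.formula.api as smf

ratings["high_rank"] = (ratings["predicted_rank"] <= 3).astype(int)

model = smf.mixedlm("liking ~ high_rank", data=ratings, groups=ratings["user"])
print(model.fit().summary())
```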

We also asked people, after every two songs, which one they liked best, on a 4-point scale ranging from "much more song 1" to "much more song 2". Figure 8 shows the histogram of responses on this scale, as a function of our predicted preference. Most people liked the second song of each pair better than the first, consistent with the fact that our algorithm also predicted the second song to be the better one 81% of the time. We actually constructed the concert program such that this was the case (for simplicity, and to make sure we had a good ending…). Note, however, that we did this based on the predicted rankings, which apparently had some real value.

Figure 8: Actual preference for the first and second song of a pair as a function of the predicted preference based on the rankings in our algorithm.

However, our algorithm makes both correct and incorrect predictions, so let's have a look at the actual numbers for each combination. In only 18 of 99 cases did our algorithm predict song 1 to be the better one, and in 9 of these it was right: a 50% rate, no better than chance. Fortunately, the algorithm correctly predicted that song 2 was better in 56 out of 81 cases, or 69% of the time. Looking at the overall confusion matrix, the algorithm achieves an accuracy of 0.66 (95% CI .55–.75), so we are able to predict above chance. Sensitivity (recall, or hit rate) was 0.86, but specificity (the true negative rate) is low: 0.26.
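
For the curious, these metrics follow directly from the counts above:

```python
# Recomputing the reported metrics from the counts in the text
# (rows = predicted winner, columns = actual winner).
import numpy as np

#                actual song 1   actual song 2
cm = np.array([[ 9,              9],    # predicted song 1 (18 cases)
               [25,             56]])   # predicted song 2 (81 cases)

accuracy = np.trace(cm) / cm.sum()   # (9 + 56) / 99 ≈ 0.66
sensitivity = 56 / (56 + 9)          # "song 2" wins correctly caught ≈ 0.86
specificity = 9 / (9 + 25)           # "song 1" wins correctly caught ≈ 0.26
```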

Conclusion
Despite the fact that we could only tailor the concert program on three Spotify musical features (energy, valence, and danceability), our results show that the songs we predicted to be ranked higher were indeed, to some extent, liked more, perceived to be more personalized, and more likely to win in a direct comparison.

Our preliminary results show that, using our recommendation strategy, we are able to find songs from a new genre that fit listeners' current preferences and that help people explore new tastes. We are currently developing a taste exploration app for this purpose. Moreover, the results show that we can use this in a group recommendation approach to compose a concert program that fits the general taste of the audience. Our lab is working on group recommendations for Spotify to help groups find music playlists that match their common tastes.

More importantly, the concert was very much enjoyed by the audience, despite the fact that most listeners had never experienced a live opera singer before. We hope that our tailored playlist contributed somewhat to making this new experience as enjoyable as possible!