Singing with Data
Can we use recommender technologies to help people acquire new (music) preferences? We took up that challenge again during the Den Bosch Dataweek. In 2018, we tuned the concert program of an opera singer to the Spotify profiles of the audience. This year, we invited two choirs: one (Pop-soul choir Rosmalen) with a repertoire of mostly popular music, and one (Strijps Kamerkoor) a classical chamber choir. We tuned the program to the musical profiles of the audience, but also shaped it so that we could predict, for each visitor, which song of each pair performed by the choirs they would like most. We provided the visitors with an app to measure and show their preferences and to test whether our predictions were correct. In this blog we discuss the setup and results, showing that we can indeed predict, to some extent, which songs our visitors will like best during a live concert based on their Spotify profiles.
How can we tune a concert to the audience?
To construct a concert program we first need to know the music our audience likes. To register for the concert, visitors logged in with their Spotify account, which allowed us to retrieve their top tracks and top artists. They also indicated their general musical preference (popular or classical), and we measured their musical sophistication index, a measure of their emotional experience and active engagement with music.
In total, 41 people registered for the concert, of whom 10 indicated having a more classical taste. Their Spotify profiles indeed showed marked differences. Below are two word clouds representing the top artists of the two groups: those with a classical taste favor classical composers, whereas those with a popular taste favor contemporary artists.
In terms of demographics, people with classical preference were older and had a lower Musical Sophistication Index (MSI), indicating less musical engagement or expertise, as shown in the boxplots below.
How did we do the match between visitors and choirs?
Now that we know the music our visitors like, we need to match that with the songs the choirs can sing. To do the matching we use the musical features of songs as they are provided by the Spotify API. These features describe the energy, valence, key, liveness, tempo and other audio features in a track. For our matching we focused on the following four key features:
Acousticness: a confidence measure of whether the track is acoustic
Danceability: how suitable the track is for dancing, based on tempo, rhythm stability, beat strength, and overall regularity
Energy: a perceptual measure of intensity and activity
Valence: the musical positiveness conveyed by the track
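Each visitor's taste can then be summarized by averaging these four features over their top tracks. A minimal sketch of that idea (the helper name and feature values below are illustrative, not our actual pipeline):

```python
# Sketch: summarize a visitor's taste as the mean of four Spotify
# audio features over their top tracks (values here are made up).
FEATURES = ["acousticness", "danceability", "energy", "valence"]

def visitor_profile(top_tracks):
    """Average each audio feature over a visitor's top tracks."""
    return {f: sum(t[f] for t in top_tracks) / len(top_tracks)
            for f in FEATURES}

top_tracks = [
    {"acousticness": 0.10, "danceability": 0.80, "energy": 0.70, "valence": 0.60},
    {"acousticness": 0.20, "danceability": 0.60, "energy": 0.90, "valence": 0.40},
]
profile = visitor_profile(top_tracks)
# profile["energy"] is the mean energy of the two tracks, 0.8
```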
We asked both choirs for a list of songs they would be able to perform. For this candidate set of 16 songs we extracted the musical features from Spotify (see the profile on the left side of the figure below). Comparing this to the profile of the top tracks of our visitors (right side of the figure below), we find that visitors liked songs that are more energetic and danceable, and with higher valence, than most songs in our candidate set. So we needed to select the songs that best match the visitors.
For this we used a group recommender approach. We took the profile of each visitor and matched it against each candidate track: the closer a candidate track was to the visitor profile on these features, the higher that track was ranked for that visitor. This gives a ranking of the candidate tracks for each visitor. We then combined these individual rankings to compute which tracks had the highest overall rank for the entire audience. In principle, this is what we would use to select the tracks for the concert program.
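The two steps above, ranking candidates per visitor and aggregating those rankings into a group ranking, can be sketched as follows. The Euclidean distance and average-rank aggregation here are assumptions for illustration; the actual scoring may have differed in detail, and the candidate and visitor data are made up:

```python
# Sketch of the group-recommender idea: rank candidate tracks per
# visitor by feature distance, then aggregate by average rank.
FEATURES = ["acousticness", "danceability", "energy", "valence"]

def distance(profile, track):
    # Euclidean distance over the four features (an illustrative choice).
    return sum((profile[f] - track[f]) ** 2 for f in FEATURES) ** 0.5

def individual_ranking(profile, candidates):
    # Candidate names sorted from best (closest) to worst match.
    return sorted(candidates, key=lambda name: distance(profile, candidates[name]))

def group_ranking(profiles, candidates):
    # Average each track's rank position over all visitors; lower wins.
    positions = {name: [] for name in candidates}
    for p in profiles:
        for pos, name in enumerate(individual_ranking(p, candidates), start=1):
            positions[name].append(pos)
    return sorted(candidates, key=lambda n: sum(positions[n]) / len(positions[n]))

# Hypothetical candidate tracks and a single visitor profile:
candidates = {
    "classical piece": {"acousticness": 0.9, "danceability": 0.2,
                        "energy": 0.2, "valence": 0.3},
    "pop song": {"acousticness": 0.1, "danceability": 0.8,
                 "energy": 0.8, "valence": 0.7},
}
visitors = [{"acousticness": 0.2, "danceability": 0.7,
             "energy": 0.8, "valence": 0.6}]
program = group_ranking(visitors, candidates)
# program → ["pop song", "classical piece"]
```

With more visitors, tracks that rank moderately well for everyone can beat tracks that one subgroup loves and the other dislikes, which is exactly the tension the next section addresses.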
However, our audience consisted of two distinct groups (classical or popular music taste), so we also looked at the differences in ranking for these two groups, as their feature profiles are very distinct (see profiles in figure below). Naturally, we find that for those preferring classical the classical songs from the classical choir are ranked higher, and the songs of the popular choir are ranked higher for the other group.
Interestingly, we find that the rankings of visitors with either taste also differed within the set of classical or popular songs. In other words, looking at pairs of songs within one category, visitors with a popular taste would like one song, while visitors with a classical taste would like the other. We therefore decided to lay out the concert program such that each choir would sing sets of two songs, of which we predicted one taste group would prefer one song and the other taste group the other. We programmed 5 sets of 2 songs, with the last set having the two choirs ‘compete’. The final setlist is below.
Comparing the classical rank to the popular rank, we see that for each set, visitors with classical preference would prefer one song, whereas people with popular taste would prefer the other song. For example, in set 3, those with a classical taste would like song 5 (rank 4) over 6 (rank 9), whereas those with a popular taste would like song 6 (rank 5) over song 5 (rank 8).
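Constructing such "disagreement" pairs can be sketched as: given the rank each taste group assigns to each song, keep the pairs on which the two groups prefer opposite songs. The function below is an illustration; the ranks mimic set 3 from the text, but the pairing logic is our reconstruction, not the actual program-building code:

```python
from itertools import combinations

def disagreement_pairs(classical_rank, popular_rank):
    """Pairs of songs on which the two taste groups' rankings disagree.

    Both arguments map song name -> rank (lower rank = more preferred).
    """
    pairs = []
    for a, b in combinations(classical_rank, 2):
        classical_prefers_a = classical_rank[a] < classical_rank[b]
        popular_prefers_a = popular_rank[a] < popular_rank[b]
        if classical_prefers_a != popular_prefers_a:
            pairs.append((a, b))
    return pairs

# Ranks mimicking set 3 from the text: classical taste ranks song 5
# at 4 and song 6 at 9; popular taste ranks them 8 and 5.
classical = {"song 5": 4, "song 6": 9}
popular = {"song 5": 8, "song 6": 5}
# disagreement_pairs(classical, popular) → [("song 5", "song 6")]
```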
Let the concert begin!
On October 29, the concert was held. We communicated the final program to the choirs a few days before the concert. About 70 people visited the concert.
During the concert, the audience was provided with an app that allowed them to log in and view their personal MSI score and some information about each song. They rated each song in terms of quality and level of personalization. After each set of two songs, they indicated whether they liked the first or the second song a (little) bit more, on a 4-point scale without midpoint. After their response, the app showed how we had predicted their ranking of the two songs.
Results from the app were processed on the fly and shown directly on screen using R Markdown, separating the ratings of the classical and popular tastes. We also allowed participants without prior registration to use the app, showing their results as ‘unknown’.
For example, on the left you see a slide, shown during the concert, of how personalized each group found songs 3 and 4.
The concert itself was very successful: the audience (and the choirs) appreciated the two very different musical performances and choral genres, as well as the data-driven nature and the direct feedback on the results. But of course we also want to know to what extent our predictions and matching worked out, and for this we need to do some data science!
Analyzing the data of the audience
Remember that we constructed the program in a specific way, such that we would be able to predict how each taste group would like each song in the sets. However, our matching and predictions were only based on a handful of Spotify features, whereas people’s responses in the audience are also strongly affected by the momentary experience of a live concert and the quality of the performance. As data scientists we like to understand better to what extent our predicted rankings were correct, given these constraints.
Our data allows us to test two things. First, visitors rated each song in terms of quality and personalization on a 5-point rating scale. As we have a predicted rank for each song, we can check how well the ratings follow the ranking. For both taste groups, and both questions (quality and personalization), the ratings drop with rank, showing that visitors indeed rated tracks higher when we predicted them to be preferred (ranked better). For the popular taste group, this pertains mostly to ranks above 4, for which the ratings drop with higher rank. For the classical group we should be careful not to read too much into the limited data (N=10) we have for that group. Multilevel regressions predicting each visitor's 10 personalization and quality scores from the predicted rank showed significant effects (p<.001) of rank as a predictor, as well as a smaller effect of position (songs later in the program got slightly higher ratings).
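The first check, that ratings drop with predicted rank, amounts to computing the mean rating per rank and eyeballing the trend. A minimal sketch with made-up responses (not our recorded data):

```python
from collections import defaultdict

def mean_rating_by_rank(responses):
    """Mean rating per predicted rank.

    responses: list of (predicted_rank, rating) pairs, one per
    song-by-visitor observation (ratings on a 5-point scale).
    """
    buckets = defaultdict(list)
    for rank, rating in responses:
        buckets[rank].append(rating)
    return {rank: sum(r) / len(r) for rank, r in sorted(buckets.items())}

# Illustrative (rank, rating) observations, not the real data:
responses = [(1, 5), (1, 4), (2, 4), (2, 4), (3, 3), (3, 4)]
means = mean_rating_by_rank(responses)
# means → {1: 4.5, 2: 4.0, 3: 3.5}: ratings decline with rank
```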
The most interesting question is whether our data allows us to predict the pairwise ranking between the two songs of a set, as given by the visitors in the app during the concert. Below are our predicted differences for each of the 5 comparisons during the concert (c12-c910), next to the actual rated differences we got from the visitors via the app. A positive score means they liked song 2 better than song 1 of that set. We see that, apart from the first set (c12), our predictions follow a similar pattern as the actual rated differences. For all sets except c12, visitors with a classical taste liked one song more, whereas those with a popular taste liked the other song more. This is of course how we designed the program based on the predictions, so the predicted differences hold by design, but the results show that we find similar effects in the direct comparisons (rated differences) the visitors made in the app during the concert.
To further test our predictions, we checked the agreement between the predicted ranking for each set of songs for each visitor and the actual ranking they gave during the concert. Simply put, if we predicted that they would like song 2 over song 1, did they actually like song 2 over song 1, or the other way around? The confusion matrix below shows the overall result. If we predicted song 1 to be liked over song 2, 66% indeed liked it more; if we predicted song 2 to be liked over song 1, 61% indeed liked it more.
The average accuracy of our predictions is 63.6%, significantly above chance (50%) and above the no-information rate (p<.02; on average 53.5% chose song 1 over song 2). This shows that our simple predictions, based on just the Spotify musical features, can to some extent predict what visitors will like most during the concert!
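This kind of accuracy check can be sketched in a few lines: count how often the predicted pairwise choice matches the visitor's choice, and compare against chance with a one-sided binomial test. The choices below are made up for illustration, and this pure-Python test is a stand-in for whatever statistical test was actually used:

```python
from math import comb

def accuracy(predicted, actual):
    """Fraction of pairwise choices where the prediction matched."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(predicted)

def binom_p_one_sided(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): is accuracy above chance?"""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Illustrative pairwise choices: 1 = first song preferred, 2 = second.
predicted = [1, 2, 2, 1, 1, 2, 1, 2, 2, 1]
actual    = [1, 2, 1, 1, 1, 2, 2, 2, 2, 1]
acc = accuracy(predicted, actual)  # 8 of 10 correct → 0.8
pval = binom_p_one_sided(8, 10)    # 56/1024 ≈ 0.0547
```

With the real data (many more responses), the same computation yields the 63.6% accuracy and p<.02 reported above.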
Conclusion
Our results show that it is possible to tune a concert program to the Spotify profiles of the visitors in the audience. Based on the musical features of the songs, we were able to predict which songs the audience would like most, and our pairwise comparisons show we could do this quite accurately for the different tastes of listeners (classical or popular). Moreover, visitors not only got a chance to better understand how recommender systems work and how they can be used to make predictions and recommendations, but also got to enjoy a music genre they had not (often) heard before!