A Closer Look at the Million Song Dataset

Published in

Modeling Music

3 min read · May 26, 2016

The intersection of music and computation, commonly referred to as computational music analysis, is a growing field. Over the last few weeks, I was fortunate enough to be part of a team that immersed itself in the field. Specifically, the team chose to look at a subset of the Million Song Dataset (MSD) and see what musical or non-musical traits make up genre.

First, we needed a way to fetch data from the MSD dynamically. I felt strongly about building such a tool so that non-programmers (e.g., music theorists) have the opportunity to work with different data attributes. You can find the code here.

The Data

Next comes the analysis. On the coding side, the data crunching involved Python and R. Why Python and R? Both have native tools or extensive libraries to handle nearly any computational need imaginable. Below is an example of the power of R using just one line of code:

> summary(year)
   data.year
 Min.   :1962
 1st Qu.:1994
 Median :2002
 Mean   :1999
 3rd Qu.:2005
 Max.   :2010

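The variable ‘year’ holds the year in which each song was published, as extracted from the MSD. The summary shows that the songs in our subset range from 1962 to 2010, with half of them published in 2002 or later.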
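For readers following along in Python rather than R, here is a minimal sketch of what that one-liner computes. The function name `summarize` and the sample years are my own placeholders, not part of the project code, and the years are made up for illustration. The quartiles use the "inclusive" method, which matches the default quantile rule R's summary relies on.

```python
import statistics


def summarize(values):
    """Return the six statistics R's summary() prints for a numeric vector."""
    ordered = sorted(values)
    # method="inclusive" interpolates between data points the same way
    # R's default quantile rule (type 7) does.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "Min.": ordered[0],
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.mean(ordered),
        "3rd Qu.": q3,
        "Max.": ordered[-1],
    }


# Made-up release years for illustration only; NOT taken from the MSD.
sample_years = [1965, 1978, 1989, 1996, 2001, 2004, 2006, 2009, 2010]

year_summary = summarize(sample_years)
for label, value in year_summary.items():
    print(f"{label:>8}: {value:.0f}")
```

R still wins on brevity here, but the sketch makes explicit which six numbers are being reported.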

Let’s look further into specific attributes from the MSD. Two values that have an interesting relationship are year and loudness. The R session below computes their correlation and draws two plots that illustrate how loudness changes over time.

> cor(dmod2$loudness,dmod2$year)
[1] 0.3876528
> boxplot(loudness~year,data=dmod2,col=kclust$cluster)
> plot(loudness~year,data=dmod2,col=kclust$cluster)
> abline(-348.8335,0.1705) #(Intercept,slope)

The cor function takes two numeric vectors as its arguments and returns their correlation coefficient, which always lies in the range [-1, 1]. The result of about 0.39 shows that there is a moderate positive correlation between loudness and year. The boxplot and plot functions produce a box plot and a scatter plot of loudness by year, and abline draws a fitted line over the scatter plot. The colors represent the cluster to which each data point belongs. Ideally, the colors should not overlap much; here they do, so this particular k-means clustering is not very useful. Both plots show loudness trending gradually upward over time. One caveat: the majority of the songs in the dataset are relatively recent, with the median song being from 2002.
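To make the numbers above more concrete, here is a small Python sketch of the two calculations involved: a Pearson correlation (the default that R's cor computes) and a least-squares line of the kind passed to abline. The helper names and the toy data are my own assumptions for illustration and are not from the MSD. The only values taken from this post are the intercept and slope quoted in the abline call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return mean_y - slope * mean_x, slope


def predict(intercept, slope, year):
    """Loudness predicted by the line for a given year."""
    return intercept + slope * year


# Toy data, invented for illustration only; NOT taken from the MSD.
toy_years = [1970, 1980, 1990, 2000, 2010]
toy_loudness = [-14.0, -11.0, -12.0, -8.0, -7.0]

r = pearson(toy_years, toy_loudness)
toy_intercept, toy_slope = fit_line(toy_years, toy_loudness)
print(f"toy correlation: {r:.3f}")
print(f"toy line: loudness = {toy_intercept:.2f} + {toy_slope:.3f} * year")

# Intercept and slope quoted in the abline() call in this post.
ARTICLE_INTERCEPT = -348.8335
ARTICLE_SLOPE = 0.1705

loudness_1962 = predict(ARTICLE_INTERCEPT, ARTICLE_SLOPE, 1962)
loudness_2010 = predict(ARTICLE_INTERCEPT, ARTICLE_SLOPE, 2010)
print(f"line at 1962: {loudness_1962:.2f}")
print(f"line at 2010: {loudness_2010:.2f}")
print(f"rise over the period: {loudness_2010 - loudness_1962:.2f}")
```

Evaluating the quoted line at the earliest and latest years in our subset gives roughly -14.3 and -6.1, a rise of about 8.2 over 48 years, or about 0.17 per year. Assuming loudness is measured in decibels, as the MSD loudness field is, that is a little under 2 dB per decade.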

Music Analysis

Musically, there could be many reasons for such a trend. One major reason could be the advancement of technology. In particular, amplifiers did not become ubiquitous until the 1970s. This not only had a huge effect on the genres that existed at the time, but also made way for new genres and sub-genres. Metal and punk are good examples. The loudness of this type of music is potentially a by-product of the new amplifier technology, as those genres could now surpass certain decibel thresholds. Another cause could be the move from AM to FM for radio broadcasts of music. Because AM offers a narrower frequency response, a smaller dynamic range, and other limitations that affect overall audio quality, producers, artists, and engineers were limited, to say the least. Improved broadcast quality brought a wider dynamic range, and with it more variation in the loudness that a song can have.

Written by Jeremy