# A Closer Look at the Million Song Dataset

May 26, 2016

The intersection of music and computation, commonly referred to as computational music analysis, is a growing field. Over the last few weeks, I was fortunate enough to be part of a team that immersed itself in the field. Specifically, the team chose to look at a subset of the Million Song Dataset (MSD) and see what musical or non-musical traits make up genre.

Initially, we had to find a way to fetch the data from the MSD dynamically. I personally felt strongly about creating such a tool so that non-programmers (i.e., music theorists) have the opportunity to work with different data attributes. You can find the code here.

## The Data

```
> summary(year)
   data.year
 Min.   :1962
 1st Qu.:1994
 Median :2002
 Mean   :1999
 3rd Qu.:2005
 Max.   :2010
```

The ‘year’ attribute records the year in which each song was released, as extracted from the MSD. In this subset, the years range from 1962 to 2010, with a median of 2002.
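For readers less familiar with R, here is a minimal sketch of how a summary with a `data.year` header can come about. It assumes that `year` was built by wrapping a single column of a data frame named `data`; that is a guess based on the printed header, not something confirmed by the original code, and the years used here are made up for illustration.

```r
# Hypothetical stand-in for the extracted data; these years are invented.
data <- data.frame(year = c(1962, 1985, 1994, 1999, 2002, 2004, 2005, 2008, 2010))

# Wrapping one column in its own data frame names that column "data.year",
# which is the header summary() then prints above the statistics.
year <- data.frame(data$year)
print(names(year))
print(summary(year))

# On a plain numeric vector, summary() returns named values instead,
# which can be picked out individually.
s <- summary(data$year)
print(s[["Median"]])
```

Comparing the median with the mean is a quick way to spot skew: in the real output above, the mean (1999) sits below the median (2002), which suggests a tail of older songs.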

Let’s look further into specific attributes from the MSD. Two attributes with an interesting relationship are year and loudness. The output below shows how they relate.

```
> cor(dmod2$loudness, dmod2$year)
0.3876528
> boxplot(loudness ~ year, data = dmod2, col = kclust$cluster)
> plot(loudness ~ year, data = dmod2, col = kclust$cluster)
> abline(-348.8335, 0.1705) # (intercept, slope)
```

The cor function takes two numeric vectors and returns their correlation coefficient (Pearson by default), which lies in the range [-1, 1]. A value of about 0.39 indicates a moderate positive correlation between loudness and year. The boxplot and plot functions produce the two graphs, and abline overlays a straight line with the given intercept and slope. The colors represent the cluster to which each data point belongs. Ideally, the clusters would separate cleanly with little color overlap; here they do not, so this particular k-means clustering is not very useful. Both plots show loudness trending gradually upward over time. One caveat: the majority of the songs in the dataset are relatively recent, with the median song being from 2002.
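To make the steps above easier to reproduce, here is a minimal sketch of the whole pipeline on synthetic data. Everything in it is an assumption for illustration: the data frame is randomly generated rather than taken from the MSD, the choice of three clusters is arbitrary, and using `lm` to obtain the intercept and slope is one plausible way to arrive at the values passed to `abline`, not necessarily how the original numbers were produced.

```r
set.seed(42)  # make the synthetic data reproducible

# Synthetic stand-in for dmod2: NOT real MSD values.
n <- 500
year <- sample(1962:2010, n, replace = TRUE)
loudness <- -348.8 + 0.17 * year + rnorm(n, sd = 3)
dmod2 <- data.frame(year = year, loudness = loudness)

# Pearson correlation between the two columns; always within [-1, 1].
r <- cor(dmod2$loudness, dmod2$year)
print(r)

# Least-squares line; coef() returns the intercept first, then the slope.
fit <- lm(loudness ~ year, data = dmod2)
coefs <- coef(fit)
print(coefs)

# k-means on the standardized columns. Three clusters is an arbitrary
# choice, and nstart = 10 retries from several random starting points.
kclust <- kmeans(scale(dmod2), centers = 3, nstart = 10)

# Scatter plot colored by cluster, with the fitted line on top.
plot(loudness ~ year, data = dmod2, col = kclust$cluster)
abline(a = coefs[["(Intercept)"]], b = coefs[["year"]])
```

Standardizing with `scale` before clustering keeps the year column, which spans a much wider numeric range than loudness, from dominating the distance calculation.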
