Spotify Data Exploration with Python
How we can use data science to discover the latest trends in the music industry
Introduction
Spotify is the world’s largest music streaming service provider, and its data can provide valuable insight into past, current, and possibly even future trends.
In this article, I’ll be analyzing Spotify data on 160k+ tracks released in the past century.
To do this, I’ll be using Pandas, matplotlib, and NumPy.
Data Source
This is the dataset that I’ll be using:
Note: the original source is now a dead link; the above link is the new source. As the dataset is updated over time, there may be differences between the dataset used in this article and the one available at the above link.
This dataset contains data on tracks released from 1921 to 2020. The data entries include the following features:
id: ID of track generated by Spotify
id_artists: ID of artist generated by Spotify
acousticness: ranges from 0 to 1
danceability: ranges from 0 to 1
energy: ranges from 0 to 1
duration_ms: duration of track in milliseconds; integer typically ranging from 200k to 300k
instrumentalness: ranges from 0 to 1
valence: how happy the song is; ranges from 0 to 1
popularity: ranges from 0 to 100
tempo: float typically ranging from 50 to 150
liveness: ranges from 0 to 1
loudness: float typically ranging from -60 to 0
speechiness: ranges from 0 to 1
year: ranges from 1921 to 2020
mode: 0 = minor, 1 = major
explicit: 0 = no explicit content, 1 = explicit content
key: all keys in an octave encoded as values ranging from 0 to 11, starting with C as 0, C# as 1, etc.
artists: artist’s name
release_date: date of release
name: name of song
Setup
First things first, I’ll import the Python libraries that we’ll need throughout this project and load the dataset into a Pandas DataFrame:
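A minimal sketch of this setup step, assuming the dataset has been downloaded as a CSV file. The helper name load_tracks and the file name data.csv used below are illustrative assumptions, not names taken from the dataset page.

```python
import io

import matplotlib.pyplot as plt  # used for the plots later in the article
import numpy as np               # used for the lines of best fit later on
import pandas as pd


def load_tracks(csv_source):
    """Read the Spotify tracks CSV (a path or file-like object) into a DataFrame."""
    return pd.read_csv(csv_source)
```

With the file saved locally, something like df = load_tracks("data.csv") would give the DataFrame used in the rest of the analysis.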
Exploratory Data Analysis
Now we’re ready to dive into data analysis!
Let’s start by looking at some quick characteristics:
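One way to collect these quick characteristics is sketched below. It assumes a DataFrame df loaded as in the setup step; quick_summary is a hypothetical helper name, and the exact calls used for the results in this article may differ.

```python
import pandas as pd


def quick_summary(df):
    """Collect basic characteristics of a DataFrame in one dictionary."""
    return {
        "shape": df.shape,            # (number of rows, number of columns)
        "dtypes": df.dtypes,          # data type of each column
        "missing": df.isna().sum(),   # count of missing values per column
        "stats": df.describe(),       # count, mean, std, min, quartiles, max
    }
```

Calling quick_summary(df)["stats"] shows the descriptive statistics for the numeric columns, which is a quick check that features such as danceability really do stay between 0 and 1.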
Results:
Now, we’ll take a look at how all these different features correlate with each other:
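The correlations can be computed with pandas and drawn with matplotlib, as in the sketch below. The helper names feature_correlations and plot_correlations are assumptions for illustration, and the colour map is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd


def feature_correlations(df):
    """Pairwise Pearson correlations between the numeric columns only."""
    return df.select_dtypes(include="number").corr()


def plot_correlations(corr):
    """Draw a correlation matrix as a colour-coded grid."""
    fig, ax = plt.subplots(figsize=(8, 8))
    # Fix the colour scale to [-1, 1] so that 0 sits in the middle.
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax)
    return fig, ax
```

Text columns such as name and artists are dropped before computing the correlations, since a Pearson correlation is only defined for numeric features.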
We can see that acousticness and energy are strongly negatively correlated, while year and popularity are strongly positively correlated. Let’s plot these two pairs of features along with lines of best fit to better show their relationships:
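A scatter plot with a line of best fit can be sketched as follows, using a degree-1 least-squares fit from NumPy. The helper name plot_with_best_fit is hypothetical, and the sketch assumes the two columns contain no missing values.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_with_best_fit(df, x_col, y_col, ax=None):
    """Scatter y_col against x_col and overlay a least-squares line.

    Returns the fitted (slope, intercept).
    """
    if ax is None:
        _, ax = plt.subplots()
    x = df[x_col].to_numpy(dtype=float)
    y = df[y_col].to_numpy(dtype=float)
    # polyfit returns coefficients from the highest degree down: [slope, intercept]
    slope, intercept = np.polyfit(x, y, deg=1)
    # Small, translucent markers keep a plot of many tracks readable.
    ax.scatter(x, y, s=2, alpha=0.1)
    x_line = np.array([x.min(), x.max()])
    ax.plot(x_line, slope * x_line + intercept, color="red")
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    return slope, intercept
```

For the two pairs discussed here, the calls would look like plot_with_best_fit(df, "acousticness", "energy") and plot_with_best_fit(df, "year", "popularity"), followed by plt.show(). A negative slope corresponds to a negative correlation and a positive slope to a positive one.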
Now we can visually see the negative correlation between the energy and acousticness of songs. Acoustic music lacks electrical amplification, and we generally relate this to more gentle and slow music. On the other hand, we can think of energy as a measure of the intensity and activity of a track. Based on these definitions, a negative correlation checks out.
There is a strong positive correlation between popularity and year in this dataset. This makes sense: when a song has just been released, we would expect many people to listen to it right away on Spotify. Older songs, on the other hand, may fall out of relevance (so people listen to them less and less), or people may listen to them by other means, such as record players and CDs.
Finally, we’ll take a look at a phenomenon called the loudness war, a term for the trend of ever-increasing loudness in recorded music.
Note: loudness is measured in decibels (dB). To give a better idea of average overall loudness, 0 is the limit at which music starts to clip because the sound is maxed out, so values closer to 0 are louder.
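One way to look at this trend is to average loudness per release year and then correlate those yearly averages with the year itself. The sketch below takes that approach; averaging by year first is an assumption about the method, and both helper names are hypothetical.

```python
import pandas as pd


def yearly_mean_loudness(df):
    """Average loudness (in dB) of all tracks released in each year."""
    return df.groupby("year")["loudness"].mean()


def loudness_year_correlation(df):
    """Pearson correlation between the year and that year's mean loudness."""
    yearly = yearly_mean_loudness(df)
    # The index holds the years; turn it into a Series so it can be correlated.
    years = yearly.index.to_series()
    return years.corr(yearly)
```

Plotting the trend is then a one-liner, for example yearly_mean_loudness(df).plot() followed by plt.show(). Note that correlating yearly averages usually gives a higher value than correlating the raw per-track loudness with year, because averaging removes the spread between individual tracks.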
As expected, the correlation between loudness and year is close to 1. On average, 2020 has been the loudest year in the past century!
Conclusion
That’s all for my Spotify data analysis.
If you want to learn more about what you can do with this dataset, check out my article on predicting Spotify song popularity here: