Spotify Data Exploration with Python

How we can use data science to discover the latest trends in the music industry

Code AI Blogs

Published in CodeAI

4 min read · Aug 24, 2021

Introduction
Data Source
Setup
Exploratory Data Analysis
Conclusion

Introduction

Spotify is the world’s largest music streaming service provider, so its data can provide valuable insight into past, current, and possibly even future trends.

In this article, I’ll be analyzing Spotify data on 160k+ tracks released in the past century.

To do this, I’ll be using Pandas, matplotlib, and NumPy.

Data Source

This is the dataset that I’ll be using:

Spotify Tracks data

www.kaggle.com

Note: the original source is now a dead link; the above link is the new source. As the dataset is updated over time, there may be differences between the dataset used in this article and the one available at the above link.

This dataset contains data on tracks released from 1921 to 2020. The data entries include the following features:

id: ID of track generated by Spotify
id_artists: ID of artist generated by Spotify
acousticness: ranges from 0 to 1
danceability: ranges from 0 to 1
energy: ranges from 0 to 1
duration_ms: duration of track in milliseconds; integer typically ranges from 200k to 300k
instrumentalness: ranges from 0 to 1
valence: how happy the song is; ranges from 0 to 1
popularity: ranges from 0 to 100
tempo: float typically ranging from 50 to 150
liveness: ranges from 0 to 1
loudness: float typically ranging from -60 to 0
speechiness: ranges from 0 to 1
year: ranges from 1921 to 2020
mode: 0 = minor, 1 = major
explicit: 0 = no explicit content, 1 = explicit content
key: all keys in an octave, encoded as values ranging from 0 to 11, starting with C as 0, C# as 1, etc.
artists: artist’s name
release_date: date of release
name: name of song
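
To make the encoded columns easier to read, here is a minimal sketch that turns key and mode into a label and converts duration_ms to minutes. The helper describe_key, the lookup tables, and the sample rows are made up for illustration; only the encodings themselves come from the feature list above.

```python
import pandas as pd

# Pitch-class names for the `key` column: 0 = C, 1 = C#, ..., 11 = B
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# `mode` column: 0 = minor, 1 = major
MODES = {0: "minor", 1: "major"}


def describe_key(key: int, mode: int) -> str:
    """Turn the encoded key and mode into a readable label such as 'C# minor'."""
    return f"{PITCH_CLASSES[key]} {MODES[mode]}"


# Made-up rows for illustration only (not taken from the dataset)
sample = pd.DataFrame({
    "key": [0, 1, 9],
    "mode": [1, 0, 0],
    "duration_ms": [210000, 185500, 300000],
})

sample["key_name"] = [
    describe_key(k, m) for k, m in zip(sample["key"], sample["mode"])
]
sample["duration_min"] = sample["duration_ms"] / 60000  # milliseconds to minutes
print(sample)
```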

Setup

First things first, I’ll import the Python libraries that we’ll need throughout this project and load the dataset into a Pandas DataFrame:
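
A minimal sketch of this setup might look like the following. The file name tracks.csv and the helper load_tracks are assumptions, so adjust them to match your download. The small in-memory CSV is made-up data that only exists so the snippet runs without the file.

```python
import io

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def load_tracks(source) -> pd.DataFrame:
    """Read the tracks CSV into a DataFrame (path or file-like object)."""
    return pd.read_csv(source)


# With the Kaggle file downloaded, the call would look like this
# ("tracks.csv" is an assumed file name):
# df = load_tracks("tracks.csv")

# Tiny made-up stand-in so the snippet runs without the download
sample_csv = io.StringIO(
    "name,year,popularity,energy,acousticness,loudness\n"
    "Song A,1950,5,0.20,0.95,-18.0\n"
    "Song B,1985,35,0.60,0.40,-11.5\n"
    "Song C,2020,70,0.85,0.05,-6.0\n"
)
df = load_tracks(sample_csv)
print(df.shape)
```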

Exploratory Data Analysis

Now we’re ready to dive into data analysis!

Let’s start by looking at some quick characteristics:
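
One way to get these quick characteristics is with the standard pandas inspection methods. This is a sketch on a few made-up rows, not the real dataset, so the numbers it prints are not the article's results.

```python
import pandas as pd

# Made-up rows standing in for the real dataset
df = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D"],
    "year": [1921, 1960, 1990, 2020],
    "popularity": [0, 20, 40, 60],
    "energy": [0.2, 0.4, 0.6, 0.8],
})

print(df.head())        # first few rows
print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # missing values per column

summary = df.describe()  # count, mean, std, min, quartiles, max (numeric columns)
print(summary)
```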


Now, we’ll take a look at how all these different features correlate with each other:
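
A correlation matrix like this can be computed with DataFrame.corr and drawn as a heatmap with matplotlib. The sketch below runs on randomly generated stand-in data (an assumption, built so that energy and acousticness move in opposite directions), so treat the picture as a demonstration of the technique only.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data, not the real dataset
rng = np.random.default_rng(0)
n = 200
energy = rng.uniform(0, 1, n)
df = pd.DataFrame({
    "name": [f"track {i}" for i in range(n)],
    "energy": energy,
    "acousticness": np.clip(1 - energy + rng.normal(0, 0.1, n), 0, 1),
    "tempo": rng.uniform(50, 150, n),
})

# Pearson correlation between every pair of numeric columns
corr = df.select_dtypes(include="number").corr()
print(corr)

# Draw the matrix as a heatmap
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

Call plt.show() afterwards to display the figure, or fig.savefig to write it to disk.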

We can see that acousticness and energy are strongly negatively correlated, while year and popularity are strongly positively correlated. Let’s plot these two sets of features along with lines of best fit to better show their relationship:
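
A scatter plot with a line of best fit can be sketched with np.polyfit, which fits a degree-1 polynomial by least squares. The helper plot_with_best_fit and the randomly generated data are assumptions for illustration and stand in for the real columns.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_with_best_fit(ax, x, y, xlabel, ylabel):
    """Scatter y against x and overlay the least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    ax.scatter(x, y, s=8, alpha=0.4)
    xs = np.array([x.min(), x.max()])
    ax.plot(xs, slope * xs + intercept, color="red")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    return slope, intercept


# Randomly generated stand-in data, not the real dataset
rng = np.random.default_rng(1)
n = 300
energy = rng.uniform(0, 1, n)
year = rng.integers(1921, 2021, n)
df = pd.DataFrame({
    "energy": energy,
    "acousticness": np.clip(1 - energy + rng.normal(0, 0.15, n), 0, 1),
    "year": year,
    "popularity": np.clip(0.7 * (year - 1921) + rng.normal(0, 8, n), 0, 100),
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
slope_ea, _ = plot_with_best_fit(
    ax1, df["energy"].to_numpy(), df["acousticness"].to_numpy(),
    "energy", "acousticness",
)
slope_yp, _ = plot_with_best_fit(
    ax2, df["year"].to_numpy(), df["popularity"].to_numpy(),
    "year", "popularity",
)
fig.tight_layout()
print(slope_ea, slope_yp)
```

A negative slope in the first panel and a positive slope in the second correspond to the negative and positive correlations described above.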

Now we can visually see the negative correlation between the energy and acousticness of songs. Acoustic music lacks electrical amplification, and we generally relate this to more gentle and slow music. On the other hand, we can think of energy as a measure of the intensity and activity of a track. Based on these definitions, a negative correlation checks out.

There is a strong positive correlation between popularity and year in this dataset. This makes sense: when a song has just been released, we would expect many people to listen to it right away on Spotify. Older songs, on the other hand, may fall out of relevance (so people listen to them less and less), or people may listen to them by other means, such as record players and DVDs.

Finally, we’ll take a look at a phenomenon called the loudness war, a term coined to describe the ever-increasing loudness of recorded music.

Note: loudness is measured in decibels (dB). To give a better idea of average overall loudness, 0 dB is the limit at which music starts to clip because the signal is maxed out.
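
One way to look at this trend is to average loudness per year and correlate the yearly averages with the year. The sketch below uses randomly generated stand-in data with an upward drift built in (an assumption), so its output illustrates the method and is not the article's result.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data: 5 made-up tracks per year, 1921 to 2020
rng = np.random.default_rng(2)
year = np.repeat(np.arange(1921, 2021), 5)
loudness = -20 + 0.12 * (year - 1921) + rng.normal(0, 2, year.size)
df = pd.DataFrame({"year": year, "loudness": np.minimum(loudness, 0)})

# Average loudness of the tracks released in each year
yearly = df.groupby("year")["loudness"].mean()

# Pearson correlation between year and average loudness
r = np.corrcoef(yearly.index.to_numpy(), yearly.to_numpy())[0, 1]
loudest_year = yearly.idxmax()
print(r, loudest_year)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(yearly.index, yearly.to_numpy())
ax.set_xlabel("year")
ax.set_ylabel("average loudness (dB)")
ax.set_title("Average loudness by release year")
fig.tight_layout()
```

Averaging per year first smooths out track-to-track variation, which is why the correlation of the yearly averages tends to be higher than the correlation computed over individual tracks.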

As expected, the correlation between loudness and year is close to 1. On average, 2020 has been the loudest year in the past century!

Conclusion

That’s all for my Spotify data analysis.

If you want to learn more about what you can do with this dataset, check out my article on predicting Spotify song popularity here:

Predicting Spotify Song Popularity with Machine Learning

What AI has to say about what makes a song a smash hit

medium.com