Predicting Mandarin Song Popularity on Spotify — Part 1

Ryan Lu
5 min read · Nov 21, 2022


Introduction

Spotify, one of the most well-known music streaming platforms, is the app I use most in my daily life. As a musicophile, I have created more than ten playlists in genres such as pop, hip hop, rock, and jazz. However, I always ended up with my Mandarin playlist when I needed music the most. One day I came across the Spotify Web API documentation by accident and was fascinated by the audio features that Spotify tracks. The following image shows the audio features response when calling GET /audio-features.

For details on each parameter, please refer to the API reference
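To make the call concrete, here is a minimal sketch of the request behind that response (the track ID and token below are placeholders, not values from this project):

import requests

TRACK_ID = '...'    # placeholder: any track's Spotify ID
AUTH_TOKEN = '...'  # placeholder: token from the Web API Console

# GET /audio-features/{id} returns a dict of the fields shown above
resp = requests.get(
    'https://api.spotify.com/v1/audio-features/' + TRACK_ID,
    headers={'Authorization': 'Bearer ' + AUTH_TOKEN}
)
print(resp.json())  # {'danceability': ..., 'energy': ..., 'key': ..., ...}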

Another parameter caught my attention: popularity.

To me, popularity is comparative. However, Spotify quantifies each track’s popularity as a number. Seeing how Spotify captures every aspect of music as data, from an artist’s popularity to a track’s loudness, gave me a new perspective on music. It made me wonder: is it possible to create golden tracks solely by following the features of other popular tracks? To find out, I decided to dive deeper into how music and data interplay. But first, we need to find out what makes a track popular. In this article, I focus only on how audio features affect a song’s popularity.

Before starting this experiment, I browsed the Internet to see if anyone else had done a similar exercise. There are other experiments on predicting a song’s popularity from audio features, such as this article, which used a dataset from Kaggle. However, I’d like to focus on Mandarin songs, and Kaggle doesn’t have a Mandarin song popularity dataset. Therefore, I extracted the dataset myself from the Spotify Web API.

Data Extraction/Exploration

I collected the attributes of 3,000 songs in the genre ‘mando pop’ from 2019 to 2022 as the dataset. First, I got a Bearer token from the Spotify Web API Console for authentication purposes.

auth_token = "BQAbNoApNP48x6r_2G5qfAUFXzgEUIqpUxhYZx7WpETfpg77wXdzpxgkXYdQaIx9MqmnfKzGq3o0Om3O4F2JwBRORPJK-GwxbswcKg66Wf-7x5blrDWKy8CUmC-_eDZ4BjQTFZDV2Z9W1rsNbj1zlmK179d8SyuyXzkyAMFP96n3J2Nc"
headers = {
'Authorization': 'Bearer {token}'.format(token=auth_token)
}

Then, I built a list of 3,000 song IDs, 1,000 per year, using the Spotify Web API, and extracted the songs’ attributes with that list.
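The collection code isn’t shown above, so here is a minimal sketch of how it might look using the search and audio-features endpoints; the query string, pagination, and per-year split are my assumptions, and headers is the dict defined earlier:

import requests

BASE = 'https://api.spotify.com/v1'

track_ids = []
for year in [2019, 2020, 2021]:        # assumed years; adjust to match the span
    for offset in range(0, 1000, 50):  # search pages return at most 50 items
        params = {'q': 'genre:"mando pop" year:{}'.format(year),
                  'type': 'track', 'limit': 50, 'offset': offset}
        items = requests.get(BASE + '/search', headers=headers,
                             params=params).json()['tracks']['items']
        track_ids += [t['id'] for t in items]

# Audio features can be fetched in batches of up to 100 ids
features = []
for i in range(0, len(track_ids), 100):
    ids = ','.join(track_ids[i:i + 100])
    features += requests.get(BASE + '/audio-features', headers=headers,
                             params={'ids': ids}).json()['audio_features']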

Let's take a deep dive into the data I’ve just extracted.

For song info data I extracted:

  • Song ID (Track’s Spotify ID)
  • Track Name (The name of the track)
  • Main Artist ID (Main artist’s Spotify ID)
  • Artist Name (Main artist’s name)
  • Album Name (The album on which the track appears)

For song attributes data I extracted:

  • Track Popularity (The popularity of the track. The value will be between 0 and 100, with 100 being the most popular)
  • danceability (A value of 0.0 is the least danceable and 1.0 is the most danceable)
  • energy (Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity)
  • key (The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1)
  • loudness (The overall loudness of a track in decibels (dB). Values typically range between -60 and 0 dB)
  • mode (Mode indicates the modality (major or minor) of a track. Major is represented by 1 and minor is 0)
  • speechiness (Speechiness detects the presence of spoken words in a track. 1.0 is the most speech-like and 0.0 is the most non-speech-like)
  • acousticness (1.0 represents high confidence the track is acoustic and 0.0 represents high confidence the track is non-acoustic)
  • instrumentalness (Predicts whether a track contains no vocals. 1.0 represents the track contains no vocals)
  • liveness (Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live)
  • valence (A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry))
  • tempo (The overall estimated tempo of a track in beats per minute (BPM))
  • duration_ms (The duration of the track in milliseconds)
  • time_signature (An estimated time signature. The time signature ranges from 3 to 7, indicating time signatures from “3/4” to “7/4”)

As I mentioned earlier, we are only focusing on how the song attributes (generated by the Spotify algorithm) interplay with the song’s popularity. Therefore, the song info columns will be dropped for this project.

df = df.drop(columns=['track_id','track_name','artist_id','artist_name','album'])

Although the remaining columns are all numerical, some are continuous variables, such as energy, a perceptual measure of intensity and activity on a 0.0 to 1.0 scale. Some are discrete variables, such as key, whose integers map to pitches using standard Pitch Class notation. There is one categorical variable, mode. For this project, I chose to use regression models, so the continuous and discrete variables need no extra pre-processing. As for mode, which only contains 0 and 1, it does not need to be dummified; a quick check is shown below.
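For example, a one-line check confirms that mode is already binary:

# mode should contain only 1 (major) and 0 (minor)
print(df['mode'].unique())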

Because the features vary widely in scale, I used StandardScaler to remove the mean and scale each of them to unit variance, excluding track_popularity (the target), key, and mode.

from sklearn.preprocessing import StandardScaler

need_scale_df = df.drop(columns=['track_popularity', 'key', 'mode'])
scale = StandardScaler()
df_scaled = pd.DataFrame(scale.fit_transform(need_scale_df),
                         columns=need_scale_df.columns)
# Reattach the unscaled columns (assumed; final_data is used below)
final_data = pd.concat([df_scaled, df[['track_popularity', 'key', 'mode']]], axis=1)

Let’s do some data exploration! First, I graphed and described the distribution of our target variable, track_popularity.

import seaborn as sns

sns.displot(data=final_data['track_popularity'], kind='kde',
            palette='cool', height=5, aspect=1.4).set(title='Track Popularity Distribution')

The distribution of track_popularity peaks around 20 and is right-skewed. In other words, the dataset is imbalanced, with most tracks sitting at low popularity scores.

final_data['track_popularity'].describe()

Next, I plotted the heatmap of the correlations between variables.

import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))
sns.heatmap(final_data.corr(), annot=True)

From this heatmap, we do not see any strong correlation between track_popularity and the other variables. Let’s take a closer look at each variable’s correlation value with track_popularity.

abs(final_data.corr()['track_popularity']).sort_values()

I decided to drop variables whose absolute correlation with track_popularity is below 0.05, to remove noise and improve the models’ accuracy.
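The filtering step itself isn’t shown above; here is a minimal sketch, assuming the 0.05 cutoff is applied directly to the correlation series:

# Keep only features with |correlation| >= 0.05 against the target
corr = final_data.corr()['track_popularity'].abs()
weak_features = corr[corr < 0.05].index.tolist()
final_data = final_data.drop(columns=weak_features)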

Now that data exploration and feature engineering are complete, I will move on to model training in Part 2.

Here is the Python Notebook for this project.
