Do The Data Dance? (pt. 1)

followCrom
Nov 15, 2022


Part 1: What Makes a Song Popular?

Doing the Data Dance

Having spent over a decade as a nightclub DJ, I’ve often wondered 🤔: What makes a song a hit? How much do certain audio features - like tempo, key, or loudness - contribute to a song’s popularity?

To help me answer this question, I’m going to use Spotify’s 2021 dataset. It contains information for almost 600k tracks, including numerical values for several audio features, like danceability, energy, key, loudness, tempo, valence, and more.

I’ll combine these with one further feature from the dataset - a popularity score between 0 and 100 (100 being the most popular). According to Spotify, “The popularity score is based on the total number of plays the track has had and how recent those plays are.”

First, let’s see what the popularity distribution is across the dataset.

The above plot is right-skewed, with a long tail stretching towards the high scores. This shows that a large number of songs have low popularity, while only a very small number achieve high popularity.
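One way to put a number on that shape is the sample skewness, where a positive value means a long tail towards the high scores. The sketch below uses made-up scores rather than the Spotify data, and the `popularity` Series in it is only an illustrative stand-in.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the popularity column (NOT the real data):
# an exponential shape gives many low scores and only a few high ones.
rng = np.random.default_rng(42)
scores = rng.exponential(scale=15, size=10_000).clip(0, 100).round()
popularity = pd.Series(scores, name="popularity")

# Positive skewness means the long tail is on the right (high popularity).
skewness = popularity.skew()

# In a right-skewed distribution the mean sits above the median.
print(f"skew={skewness:.2f}, mean={popularity.mean():.1f}, median={popularity.median():.1f}")
```

On the real dataset the same two lines (`.skew()`, then comparing mean and median) would give a quick sanity check of what the histogram shows.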

Spotify have said, “Songs being played a lot now have a higher popularity than songs that were played a lot in the past”. Let’s see how closely popularity and release date correlate.

As we might have guessed, there is a clear positive correlation between a song’s release date and its popularity: newer tracks tend to score higher. But what about the audio features? The heatmap below reveals that, of these, loudness has the strongest correlation with popularity, yet with a coefficient of just 0.3, this is still relatively weak.
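To show how a ranking like this can be computed, here is a minimal sketch using pandas' `DataFrame.corr()`. The data is synthetic (not the Spotify dataset), and popularity is deliberately built to depend weakly on loudness only, so the numbers say nothing about real tracks.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tracks table (NOT the real Spotify data).
rng = np.random.default_rng(0)
n = 5_000
loudness = rng.normal(size=n)
df = pd.DataFrame({
    "loudness": loudness,
    "tempo": rng.normal(size=n),
    "valence": rng.normal(size=n),
    # Popularity is constructed to depend weakly on loudness and nothing else.
    "popularity": 0.3 * loudness + rng.normal(size=n),
})

# Pearson correlation of each feature with popularity, strongest first.
corr_with_popularity = (
    df.corr()["popularity"]
      .drop("popularity")
      .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_popularity.round(2))
```

For scale: squaring a Pearson coefficient gives the share of variance a straight-line fit would explain, so a coefficient of 0.3 corresponds to roughly 9%.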

Let’s create three separate DataFrames for comparison:
1. The 100 most popular tracks
2. The 100 least popular tracks
3. All tracks
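The post doesn't show how these DataFrames were built, so the following is only one plausible way to do it, using `nlargest` and `nsmallest` on a synthetic table. The names `numerics_df`, `top_100_df` and `bottom_100_df` match the ones used later on, but the contents here are made up.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for numerics_df: popularity first, then two features.
rng = np.random.default_rng(1)
numerics_df = pd.DataFrame({
    "popularity": rng.integers(0, 101, size=1_000),
    "danceability": rng.random(1_000),
    "energy": rng.random(1_000),
})

# The 100 highest- and 100 lowest-scoring rows by popularity.
top_100_df = numerics_df.nlargest(100, "popularity")
bottom_100_df = numerics_df.nsmallest(100, "popularity")

print(len(top_100_df), len(bottom_100_df), len(numerics_df))
```

One caveat with this approach: when many tracks share the lowest score, `nsmallest` keeps the first 100 of them in row order, so the "bottom 100" is a somewhat arbitrary slice of the tied tracks.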

The values I’ll plot will be the mean measurements of the numeric columns. I’ll leave out `popularity`, as that value has not been scaled and so isn’t comparable with the other measurements.

# Column means as plain lists; [1:] skips the first column (assumed to be popularity)
top_features = top_100_df.mean().tolist()[1:]
bottom_features = bottom_100_df.mean().tolist()[1:]
all_features = numerics_df.mean().tolist()[1:]
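A note on the scaling mentioned above: the post doesn't say which method was used, so the sketch below assumes min-max scaling, which maps each column onto the 0 to 1 range. The `features` table and its values are invented for illustration.

```python
import pandas as pd

# Tiny made-up sample; raw tempo and loudness sit on very different scales.
features = pd.DataFrame({
    "tempo": [60.0, 120.0, 180.0],     # beats per minute
    "loudness": [-30.0, -15.0, 0.0],   # decibels
})

# Min-max scaling: (x - min) / (max - min), applied column by column.
scaled = (features - features.min()) / (features.max() - features.min())
print(scaled)
```

After scaling, both columns run from 0 to 1, which is what lets features with different units share one radial axis.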

Now I’ll combine the three DataFrames into one polar plot.

import numpy as np
import matplotlib.pyplot as plt

# Feature names for the axes; [1:] skips the first column, as in the means above
labels = list(bottom_100_df)[1:]

# One evenly spaced angle (in radians) per feature
angles = np.linspace(0, 2*np.pi, len(labels), endpoint=False)
fig = plt.figure(figsize=(12, 12))

ax = fig.add_subplot(221, polar=True)

# Top 100 tracks
ax.plot(angles, top_features, 'o-', linewidth=2, label="Top 100 tracks", color='blue')
ax.fill(angles, top_features, alpha=0.25, facecolor='blue')
ax.set_thetagrids(angles * 180/np.pi, labels, fontsize=13, fontstyle='italic')

# Radial axis: label position, tick marks and visible range
ax.set_rlabel_position(275)
plt.yticks([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], ["0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"], size=12)
plt.ylim(0, 0.8)

# Bottom 100 tracks (fill colour matched to the orange line)
ax.plot(angles, bottom_features, 'o-', linewidth=2, label="Bottom 100 tracks", color='orange')
ax.fill(angles, bottom_features, alpha=0.25, facecolor='orange')

# All tracks
ax.plot(angles, all_features, 'o-', linewidth=2, label="All tracks", color='green')
ax.fill(angles, all_features, alpha=0.15, facecolor='green')

ax.set_title("How do a track's audio features affect popularity?", fontsize=15, fontweight='bold')
ax.grid(True)

plt.legend(loc='best', bbox_to_anchor=(0.0, 0.0))
plt.show()

Explicitness seems to be more common among popular tracks (the blue plot), but we should be cautious before concluding that explicit songs are more popular. A closer look at the 100 most popular tracks reveals they were all released between 2013 and 2021. This aligns with our understanding of Spotify’s popularity algorithm, which favours recent releases. Newer tracks also tend to contain more explicit content, so release date may account for much of the difference.
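To illustrate why release date muddies the picture, here is a toy example with invented numbers (not the Spotify data). Explicit tracks look far more popular overall, yet within each release year the gap all but disappears. The column names `year`, `explicit` and `popularity` are assumptions about how such a table might be laid out.

```python
import pandas as pd

# Made-up rows (NOT the real data): release year, explicit flag (0/1), popularity.
tracks = pd.DataFrame({
    "year":       [1980, 1980, 1980, 2020, 2020, 2020],
    "explicit":   [0,    0,    1,    0,    1,    1],
    "popularity": [20,   25,   22,   70,   72,   68],
})

# Share of explicit tracks per release year (higher in the newer year).
explicit_share = tracks.groupby("year")["explicit"].mean()

# Popularity gap between explicit and clean tracks across the whole table.
overall = tracks.groupby("explicit")["popularity"].mean()
overall_gap = overall.loc[1] - overall.loc[0]

# The same gap, but computed separately within each release year.
by_year = tracks.groupby(["year", "explicit"])["popularity"].mean().unstack("explicit")
gap_within_year = by_year[1] - by_year[0]

print(explicit_share, overall_gap, gap_within_year, sep="\n")
```

Comparing like with like (tracks from the same year) is a simple way to check whether an apparent effect survives once release date is held fixed.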

Animating a Polar Plot with Plotly Express

To visualize the changes over time, I’ll plot the average values for each year. This time I’ll use Plotly Express, so we can animate the plot.
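The yearly averages can be produced with a `groupby`. The post doesn't show this step, so the sketch below is an assumption about how a table like `top_100_2_df` might be built, using a handful of made-up rows.

```python
import pandas as pd

# Made-up rows (NOT the real data); feature values assumed already scaled to 0-1.
tracks = pd.DataFrame({
    "year":         [2019, 2019, 2020, 2020],
    "danceability": [0.60, 0.80, 0.70, 0.90],
    "energy":       [0.50, 0.70, 0.40, 0.60],
})

# One row per year, one column per feature, holding that year's mean.
yearly_means = tracks.groupby("year", as_index=False).mean()
print(yearly_means)
```

Passing `as_index=False` keeps `year` as an ordinary column, which is convenient if it is later used as an identifier column when reshaping.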

One more step before we animate

Animation requires converting the data from its current wide format to a long format, allowing us to use one column as an animation frame, and the other columns as measured variables. This can be done using pandas.melt().
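As a small illustration of that reshaping, here is `pd.melt` applied to a two-row table. The values are made up, and `wide` and `long_df` are hypothetical names used only for this example.

```python
import pandas as pd

# Wide format: one row per year, one column per feature (made-up values).
wide = pd.DataFrame({
    "year":         [2019, 2020],
    "danceability": [0.70, 0.80],
    "energy":       [0.60, 0.50],
})

# Long format: one row per (year, attribute) pair.
long_df = pd.melt(wide, id_vars="year", var_name="attribute", value_name="attribute_value")
print(long_df)
```

Two years and two features become four rows, with `year` repeated for each attribute. That repeated column is what an animation frame can be keyed on.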

import pandas as pd
import plotly.express as px

# Reshape from wide to long: one row per (year, attribute) pair
plot_data = pd.melt(top_100_2_df, id_vars='year', var_name='attribute', value_name='attribute_value')

# Animated polar line chart with one frame per year
fig_pol_2 = px.line_polar(plot_data,
                          r='attribute_value',
                          theta='attribute',
                          line_close=True,
                          animation_frame='year',
                          title="How do a track's audio features affect popularity?",
                          template="plotly_dark"
                          )
fig_pol_2.update_layout(
    autosize=True,
    polar=dict(radialaxis_angle=-45),
    font=dict(family="Verdana", size=14)
)
# Fix the radial axis at 0-1 so every frame shares the same scale
fig_pol_2.update_polars(radialaxis=dict(visible=True, range=[0, 1]))

fig_pol_2.show()

Despite quite dramatic variation from year to year, it seems that in general people prefer danceable, high-energy songs in a major key. After all my years DJing, this tallies with my experience. However, let’s not forget that people’s musical tastes are wide and varied, and Spotify’s popularity algorithm should not be considered the ultimate arbiter of taste!

Find out more:
Part 2: Training a Model to Predict Popularity
