Are Music Preferences of Neighboring Countries Similar? Network Analysis on Spotify Data

Melih Ekinci
9 min read · Feb 23, 2022


Intuitively, one might think that neighbouring countries have a similar taste in music. How can we show that this is indeed the case? In this analysis, we will apply network analysis to a dataset scraped from Spotify.

We will use the Top 200 chart for every country listed on Spotify. After gathering several features for these songs, a bipartite network of songs and countries will be constructed with the NetworkX library. The unipartite projection of that network then gives the number of songs two countries share on their Top 200 charts, weighted by each song's rank. Thus, we will be able to identify whether song preferences are similar between countries that are geographically close.

Analysis is composed of two steps:

  1. Gathering Dataset From Spotify
  2. Network Analysis

First Part: Data Collection

Spotify provides an API for those who wish to gather data from its charts and songs. You can register and get API credentials at this link:

Then, we have to scrape the URL of each song listed on the Top 200 charts using BeautifulSoup. There are several guides for BeautifulSoup and Selenium, so I'm going straight to the code.

pip install selenium
pip install requests

import requests
from bs4 import BeautifulSoup as bs
# Start the browser
from selenium import webdriver
# Locate elements on the page
from selenium.webdriver.common.by import By
# Wait for the page to load
from selenium.webdriver.support.ui import WebDriverWait
# Expected conditions for waits
from selenium.webdriver.support import expected_conditions as EC
# Time constraints
import time
# Keyboard keys
from selenium.webdriver.common.keys import Keys

I will use Chrome and set the driver path.

DRIVER_PATH = '/Users/adria/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://google.com')

Examining one of the daily top-songs charts.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://spotifycharts.com/regional/ar/daily/2021-05-07")
print(driver.page_source)

The chart-table-image cell contains the URL for each song.

find_href = driver.find_elements_by_xpath('//td[@class="chart-table-image"]/a')
for my_href in find_href:
    print(my_href.get_attribute("href"))

Creating lists of countries and dates. In this analysis, we will use the Top 200 charts from the end of each month between January 2017 and April 2021.

countrylist=["global","us" ,"gb","ae","ar","at","au","be","bg","bo","br","ca","ch","cl","co","cr","cy","cz","de","dk","do","ec","ee","eg","es","fi","fr","gr","gt","hk","hn","hu","id","ie","il","in","is","it","jp","kr","lt","lu","lv","ma","mx","my","ni","nl","no","nz","pa","pe","ph","pl","pt","py","ro","ru","sa","se","sg","sk","sv","th","tr","tw","ua","uy","vn","za"]
datelist=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31','2017-06-30','2017-07-31','2017-08-31','2017-09-30','2017-10-31','2017-11-30','2017-12-31','2018-01-31','2018-02-28','2018-03-31','2018-04-30','2018-05-31','2018-06-30','2018-07-31','2018-08-31','2018-09-30','2018-10-31','2018-11-30','2018-12-31','2019-01-31','2019-02-28','2019-03-31','2019-04-30','2019-05-31','2019-06-30','2019-07-31','2019-08-31','2019-09-30','2019-10-31','2019-11-30','2019-12-31','2020-01-31','2020-02-29','2020-03-31','2020-04-30','2020-05-31','2020-06-30','2020-07-31','2020-08-31','2020-09-30','2020-10-31','2020-11-30','2020-12-31','2021-01-31','2021-02-28','2021-03-31','2021-04-30']

Downloading the URLs in a for loop.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

track = []
streams = []
position = []
date = []
country = []
url = []
for i in range(len(countrylist)):
    for j in range(len(datelist)):
        link = "https://spotifycharts.com/regional/" + countrylist[i] + "/daily/" + datelist[j]
        print(link)
        options = Options()
        options.headless = True
        options.add_argument("--window-size=1920,1200")
        driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
        driver.get(link)
        h1 = driver.find_elements_by_class_name('chart-table-track')
        for a in range(1, len(h1)):
            track.append(h1[a].text)
        h2 = driver.find_elements_by_class_name('chart-table-streams')
        for a in range(1, len(h2)):
            streams.append(h2[a].text)
        h3 = driver.find_elements_by_class_name('chart-table-position')
        for a in h3:
            position.append(a.text)
        find_href = driver.find_elements_by_xpath('//td[@class="chart-table-image"]/a')
        for my_href in find_href:
            url.append(my_href.get_attribute("href"))
        for a in range(1, len(h2)):
            country.append(countrylist[i])
        for a in range(1, len(h2)):
            date.append(datelist[j])
        driver.quit()

Creating the dataframe and saving it as a CSV file.

import pandas as pd

df2 = pd.DataFrame(list(zip(date, country, track, streams, position, url)),
                   columns=['Date', 'Country', 'Track Name', 'Streams', 'Position', 'URL'])
df2
df2.to_csv('hamdata.csv', index=False)

Now we will use a Python library called Spotipy to gather attributes of each song, such as danceability, tempo, and artist popularity.

pip install spotipy --upgrade

We will import it and use the credentials we obtained from Spotify Developers.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

df2 = pd.read_csv("hamdata.csv")
urls = df2["URL"].unique().tolist()
danceability = {}
energy = {}
key = {}
loudness = {}
mode = {}
speechiness = {}
acousticness = {}
instrumentalness = {}
liveness = {}
valence = {}
tempo = {}
artistid = {}
artistgenre = {}
artistpopularity = {}

# Spotify object to access the API
client_id = "Enter Client ID"
client_secret = "Enter Client Secret"
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

for url in urls:
    print(url)
    # audio features
    audiofeatures = sp.audio_features(url)
    danceability[url] = audiofeatures[0]["danceability"]
    energy[url] = audiofeatures[0]["energy"]
    key[url] = audiofeatures[0]["key"]
    loudness[url] = audiofeatures[0]["loudness"]
    mode[url] = audiofeatures[0]["mode"]
    speechiness[url] = audiofeatures[0]["speechiness"]
    acousticness[url] = audiofeatures[0]["acousticness"]
    instrumentalness[url] = audiofeatures[0]["instrumentalness"]
    liveness[url] = audiofeatures[0]["liveness"]
    valence[url] = audiofeatures[0]["valence"]
    tempo[url] = audiofeatures[0]["tempo"]
    # artist IDs (a track may have several artists)
    artistid[url] = [artist["id"] for artist in sp.track(url)["artists"]]

# genre and popularity per artist
for ids in artistid.values():
    for id in ids:
        artistgenre[id] = sp.artist(id)["genres"]
        artistpopularity[id] = sp.artist(id)["popularity"]

Gathering the dictionaries to build the final dataframe.

Spotify = pd.DataFrame(data=urls, columns=['URL'])
Spotify["ArtistID"] = ""
for k, v in artistid.items():
    idx = Spotify[Spotify.URL == k].index[0]
    Spotify.iloc[idx, 1] = v[0]
Spotify["ArtistGenre"] = ""
for k, v in artistgenre.items():
    idx = Spotify[Spotify.ArtistID == k].index
    for i in idx:
        Spotify.iloc[i, 2] = str(v)

Spotify2 = Spotify.copy()
features = [danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo]
featuresColumns = ['danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']
c = 2
d = 0
for feature in features:
    print(featuresColumns[d])
    c += 1
    Spotify2[featuresColumns[d]] = ""
    for k, v in feature.items():
        idx = Spotify[Spotify.URL == k].index[0]
        Spotify2.iloc[idx, c] = str(v)
    d += 1

Spotify2["ArtistPopularity"] = ""
for k, v in artistpopularity.items():
    idx = Spotify2[Spotify2.ArtistID == k].index
    for i in idx:
        Spotify2.iloc[i, 14] = str(v)

dataset = pd.merge(df2, Spotify2, how="inner", on='URL')

Second Part: Network Analysis

Let’s install and import the important libraries. We will use the NetworkX and node2vec libraries for the network-related parts.

!pip install node2vec
!pip install plotly-express
!pip install birankpy
import pandas as pd
import numpy as np
import birankpy
import matplotlib.pyplot as plt
from random import sample
import random
import networkx as nx
from node2vec import Node2Vec
import seaborn as sns
from node2vec.edges import HadamardEmbedder
from node2vec.edges import WeightedL1Embedder
from node2vec.edges import WeightedL2Embedder
from node2vec.edges import AverageEmbedder
from networkx.algorithms import community as nxcomm
from networkx.algorithms import bipartite
bn = birankpy.BipartiteNetwork()

After reading the data, I dropped null values, if any.

df1=pd.read_csv('dataset3.csv')
# dropping null values if any
df2=df1.dropna()

The artist name, track name, and year of each song need to be derived in the dataset.

#Creating track and artist names
df2["Track"]=df2["Track Name"].apply(lambda x:x.split(" by ")[0]).str.strip()
df2["Artists"]=df2["Track Name"].apply(lambda x:x.split(" by ")[1]).str.strip()
# Year
df2["Year"]=df2["Date"].apply(lambda x:x.split("-")[0]).str.strip()

Some songs are collaborations between several artists, so I created three separate artist-name features.

# Artist names for collaborations
df2["Artist1"]=df2["Artists"].apply(lambda x:x.split(",")[0]).str.strip()
df2["Artist2"]=df2["Artists"].apply(lambda x: x.split(",")[1] if len(x.split(","))>1 else False).str.strip()
df2["Artist3"]=df2["Artists"].apply(lambda x: x.split(",")[2] if len(x.split(","))>2 else False).str.strip()

A few other column transformations are needed, such as cleaning the stream counts and fixing feature types.

#Others
df2["Streams"]=df2["Streams"].apply(lambda x:x.replace(",", ""))
df2["Streams"]=df2['Streams'].astype('float')
df2["Track"]=df2["Track"].apply(lambda x:x.strip())
df2.Year = df2.Year.astype('int64')

So, here is a view of the dataset.

Constructing Dataset for Network

We need to create a dataset that includes the song name, artist name, and country name. If the song is listed on that country's chart, 1 will be assigned to the column, and 0 otherwise.

Example: Shape of You, USA, 1

However, the rank of the song in the chart matters too. To weight each entry, we multiply by 200 and divide by the song's position. So, the closer a song is to the top of the chart, the larger its weight, as it is more popular.
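As a minimal sanity check of this weighting scheme (toy tracks and positions, not the real chart data):

```python
import pandas as pd

# weight = 200 / position: rank 1 earns 200, rank 200 earns 1
chart = pd.DataFrame({
    "Track": ["Song A", "Song B", "Song C"],  # hypothetical tracks
    "Position": [1, 4, 200],
})
chart["WeightedOne"] = 200 / chart["Position"]
print(chart["WeightedOne"].tolist())  # [200.0, 50.0, 1.0]
```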

data = pd.DataFrame(columns=['Track', 'Artists', 'WeightedOne', 'Country', 'Year'])
for year in df2["Year"].unique().tolist():
    for country in df2["Country"].unique().tolist():
        firstdf = df2[(df2["Country"] == country) & (df2["Year"] == year)]
        firstdf["One"] = 1
        firstdf["WeightedOne"] = (firstdf["One"] * 200) / firstdf["Position"]
        seconddf = firstdf[["Track", "Artists", "WeightedOne"]].groupby(by=["Track", "Artists"]).sum().reset_index().sort_values(by=["WeightedOne"], ascending=False)[:50]
        seconddf["Country"] = country
        seconddf["Year"] = year
        data = data.append(seconddf)

Creating a subset of the dataset with only the song name, the country, and its weight.

data["Sum"]=np.sum(data.WeightedOne)
data["NormalizedWeight"]=data["WeightedOne"]/data["Sum"]
data["WeightedOne2"]=data["NormalizedWeight"]*100
datadeneme=data[["Track","Country","WeightedOne2"]]

So here is a view of the dataset.

NetworkX can work with string labels, but numeric node IDs keep the songs and countries in disjoint ranges, and the names of nodes can then be supplied in dictionaries for labeling. So, the last step before constructing the network is to prepare the appropriate dictionaries and dataset.

Mapping songs and countries to numeric IDs.

nodenumber = pd.DataFrame(datadeneme["Track"].unique().tolist(), columns=['Track'])
nodenumber["TrackNumber"] = nodenumber.index + 1
nodenumber2 = pd.DataFrame(datadeneme["Country"].unique().tolist(), columns=['Country'])
# Country IDs continue right after the track IDs (which run from 1 upward)
nodenumber2["CountryNumber"] = range(4248, 4316)

Creating the dataset in terms of the mapped numbers.

datadeneme["Track2"] = 0
for i in range(len(datadeneme)):
    datadeneme.iloc[i, 3] = nodenumber[nodenumber.Track == datadeneme.Track.values[i]]["TrackNumber"].values[0]
datadeneme["Country2"] = 0
for i in range(len(datadeneme)):
    datadeneme.iloc[i, 4] = nodenumber2[nodenumber2.Country == datadeneme.Country.values[i]]["CountryNumber"].values[0]
dataorig = datadeneme[["Track2", "Country2", "WeightedOne2"]]
dataorig = dataorig.rename(columns={"WeightedOne2": "weight"})
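The row-by-row lookups above are slow on a frame of this size; the same mapping can be expressed with pandas `map` over a name-to-ID lookup, which I'd expect to produce identical results (a sketch on toy data, not the original frames):

```python
import pandas as pd

tracks = pd.DataFrame({"Track": ["A", "B", "C"]})
tracks["TrackNumber"] = tracks.index + 1  # IDs 1, 2, 3

edges = pd.DataFrame({"Track": ["B", "A", "C", "B"]})
# Build a name -> ID lookup once, then map all rows in one pass
lookup = tracks.set_index("Track")["TrackNumber"]
edges["Track2"] = edges["Track"].map(lookup)
print(edges["Track2"].tolist())  # [2, 1, 3, 2]
```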

Here is a picture of the mapped dataset.

Now, node types should be given as a list, so the algorithm can tell which node is a song and which is a country.

nodetypes = pd.DataFrame(range(1, 4316), columns=['node'])
nodetypes["is_country"] = 0
for i in range(len(nodetypes)):
    if nodetypes.node[i] >= 4248:
        nodetypes["is_country"][i] = 1

Preparing the dictionaries of node names, so that when the network is visualized, the names of nodes can be shown.

d = {}
for i in range(len(nodenumber2)):
    d[nodenumber2["CountryNumber"][i]] = nodenumber2["Country"][i]
d2 = {}
for i in range(len(nodenumber)):
    d2[nodenumber["TrackNumber"][i]] = nodenumber["Track"][i]

Here is a picture of the dictionary that contains country names.

So, here comes the big part. Let’s construct our graph by feeding the dataset and dictionaries into NetworkX.

G_orig = nx.from_pandas_edgelist(dataorig, source = 'Country2', target = 'Track2',edge_attr=True,  create_using=nx.Graph())
print(nx.info(G_orig))
print("Is bipartite: ", nx.is_bipartite(G_orig))
print("Is connected: ", nx.is_connected(G_orig))
print("Is weighted: ", nx.is_weighted(G_orig))

Setting the node-type attributes using the nodetypes list.

nodetype_dict = nodetypes.set_index('node').to_dict('index')
nodetype_dict
nx.set_node_attributes(G_orig, nodetype_dict)
countries = [x for x, y in G_orig.nodes(data=True) if y['is_country'] == 1]
tracks = [x for x, y in G_orig.nodes(data=True) if y['is_country'] == 0]

Visualizing our bipartite network. The G_orig graph is fed into the layout, labels are taken from d, our dictionary of country names, and edge widths are scaled by each link's weight.

plt.rcParams['figure.figsize'] = [25, 15]
pos = nx.bipartite_layout(G_orig, countries, align='vertical')
nx.draw(G_orig, pos, width=[e["weight"] for (u, v, e) in G_orig.edges(data=True)], node_size=0.5, labels=d, font_size=15, font_color='r')

There are important network statistics, such as node centrality scores, modularity, and network density. Let’s look at them.

dgc = nx.degree_centrality(G_orig)
cls = nx.closeness_centrality(G_orig)
btw = nx.betweenness_centrality(G_orig, weight='weight')
cr = pd.DataFrame(index=G_orig.nodes())
cr['dgc'] = cr.index.map(dgc)
cr['cls'] = cr.index.map(cls)
cr['btw'] = cr.index.map(btw)
cr["name"] = ""
for i in cr.index:
    if i >= 4248:
        cr["name"][i] = d[i]
    else:
        cr["name"][i] = d2[i]
nx.bipartite.density(G_orig, G_orig.nodes())
# Modularity of a community partition kl_res, computed separately
# (e.g. by Kernighan-Lin bisection of an undirected projection G_und)
nxcomm.quality.modularity(G_und, communities=kl_res)

Now, to measure how alike countries are in their music preferences, we need the unipartite projection of the bipartite graph. Let’s do that. We can project onto either songs or countries; I will project onto countries. The songs that countries share on their Top 200 lists are counted, and this serves as an indicator of music-preference similarity.
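Before running it on the full graph, here is how NetworkX's collaboration-weighted projection behaves on a hypothetical three-country toy graph: each shared song contributes 1/(d − 1) to a pair's weight, where d is the number of countries charting that song, and countries with no songs in common get no edge.

```python
import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
# Two countries share song1; a third country charts only song2
B.add_edges_from([("DE", "song1"), ("AT", "song1"), ("FR", "song2")])

G = bipartite.collaboration_weighted_projected_graph(B, ["DE", "AT", "FR"])
print(G["DE"]["AT"]["weight"])  # 1.0: song1 is charted by 2 countries -> 1/(2-1)
print(G.has_edge("DE", "FR"))   # False: no shared songs
```

Note that this projection counts shared neighbors discounted by their degree; the chart-rank weights on the bipartite edges are not used by this particular projection (NetworkX's `generic_weighted_projected_graph` accepts a custom weight function if they should be).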

country_net = bipartite.collaboration_weighted_projected_graph(G_orig, countries)
print(nx.info(country_net))
print("Is bipartite: ", nx.is_bipartite(country_net))
print("Is connected: ", nx.is_connected(country_net))
print("Is weighted: ", nx.is_weighted(country_net))
track_net = bipartite.collaboration_weighted_projected_graph(G_orig, tracks)
print(nx.info(track_net))
print("Is bipartite: ", nx.is_bipartite(track_net))
print("Is connected: ", nx.is_connected(track_net))
print("Is weighted: ", nx.is_weighted(track_net))

Getting edges and nodes from unipartite projection.

a = list(country_net.edges(data=True))
country1 = []
country2 = []
weight = []
for i in range(len(a)):
    country1.append(a[i][0])
    country2.append(a[i][1])
    weight.append(a[i][2]['weight'])

Preparing pandas datasets of the edges, nodes, and their similarity scores.

similarcountries = pd.DataFrame({'country1': country1, 'country2': country2, 'weight': weight})
similarcountries["country1_name"] = ""
for i in similarcountries.country1.values:
    idx = similarcountries[similarcountries.country1 == i].index
    similarcountries.iloc[idx, 3] = d[i]
similarcountries["country2_name"] = ""
for i in similarcountries.country2.values:
    idx = similarcountries[similarcountries.country2 == i].index
    similarcountries.iloc[idx, 4] = d[i]

Here’s a preview of the projected graph. The highest similarity between two countries is Germany and Austria; the second is Russia and Ukraine. This matches our intuition.
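To read off the strongest pairs programmatically, the edge table can simply be sorted by weight (a sketch with hypothetical scores; the column names follow the `similarcountries` frame built above):

```python
import pandas as pd

pairs = pd.DataFrame({
    "country1_name": ["Germany", "Russia", "Spain"],
    "country2_name": ["Austria", "Ukraine", "France"],
    "weight": [9.1, 8.7, 3.2],  # hypothetical similarity scores
})
top = pairs.sort_values("weight", ascending=False).head(2)
print(list(zip(top["country1_name"], top["country2_name"])))
# [('Germany', 'Austria'), ('Russia', 'Ukraine')]
```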

Let’s visualize it.

core_nodes = max(nx.connected_components(country_net), key=len)
core = country_net.subgraph(core_nodes)
nodes = core.nodes()
degree = core.degree()
colors = [degree[n] for n in nodes]
pos = nx.spring_layout(core, weight='weight')
cmap = plt.cm.Greys
vmin = min(colors)
vmax = max(colors)
fig = plt.figure(figsize=(25, 20), dpi=50)
nx.draw(core, alpha=0.8, pos=pos, nodelist=nodes, node_color='w', with_labels=True, font_size=25, width=[e["weight"] for (u, v, e) in core.edges(data=True)], cmap=cmap, edge_color='yellow', labels=d)
fig.set_facecolor('#0B243B')

plt.show()

Apparently, there are two communities in music taste. The left one consists of South American countries. The second sphere is mainly European countries, with some countries such as Brazil, Turkey, and Russia in the outer periphery. The width of a link indicates the strength of the music-preference similarity between two countries.


Melih Ekinci

Data Analyst | MSc Candidate in Artificial Intelligence