What kind of music do you listen to? — Exploring the network of Spotify’s genres
Spotify is one of the most commonly used music streaming platforms in the world today. It has one of the largest digital collections of music in the world, and employs a variety of algorithms to tag, quantify and classify its data for use in its recommendation algorithms. In other words, it’s a gold mine for interesting data. Spotify has its own developer API and a corresponding Python library, which is really easy to use and well documented. I was especially interested in the genre tags given to each artist in its database and wanted to use this data to explore how different genres are related to each other and to try to detect underlying communities of genres.
For this analysis, I use data from the Spotify API and use the iGraph package in Python to extract and analyze the data. I also use Gephi, an open-source graph visualization tool, to visualize and analyze the network of genres. This post gives a high-level overview of the analysis and includes the most interesting findings. For a more detailed look at the code and the network measures, refer to the code here: https://github.com/manasahariharan/Genres-Network
The final interactive network is here (Best viewed on a Laptop Browser): https://manasahariharan.github.io/Spotify-Artists-Network/
The Data
So the first question is: how do I get a large enough sample of artists from the API? Spotify doesn’t give out all its artist data in one command; I need to know the name or ID of an artist to retrieve their information. So to get the data, I need a list of artist names or IDs, and the list should be unbiased: it cannot cover just one particular genre or era. There are many ways to get such a sample. I took the songs included in any of Spotify’s own playlists (playlists created by Spotify and not by individual users) and gathered data on the artists of all these songs.
I also used the related artists function on Spotify. This function returns up to 20 related artists for any artist provided as input. This way, I was able to increase the size of my sample to 85283 unique artists. For each of these artists, the dataset includes their ID, name, number of followers, popularity (a metric calculated by Spotify’s algorithm), and a list of tags that indicate which genres they fall under.
This genres column is what we need for this analysis. Since each artist usually falls under multiple genres, we can use this column to build a co-occurrence matrix, which tells us how often any two genres occur together. Co-occurrence matrices are usually very sparse. Our dataset has 4174 genre tags (can you believe there are that many genres out in the world!), so our co-occurrence matrix has 4174 rows and columns. Keep in mind that since this is only a sample of Spotify’s data, these aren’t all the genre tags that exist in Spotify. Every Noise at Once, a project that collects and samples data on genres, estimates 4422 tags in Spotify so far.
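The expansion step described above can be sketched as a breadth-first crawl over related artists. This is only an illustration, not the code from the repository: the function name, the `max_artists` cap and the `FakeClient` stand-in are all made up here. The one assumption about the real API is that the client exposes `artist_related_artists(artist_id)` and returns a dict with an "artists" list, as the spotipy library does; whether that endpoint is available to a given app depends on Spotify’s current API terms.

```python
from collections import deque


def expand_with_related(client, seed_artists, max_artists=1000):
    """Grow a seed set of artists by following related-artist links.

    `seed_artists` maps artist id -> artist record. `client` is assumed to
    offer `artist_related_artists(artist_id)` returning {"artists": [...]},
    where each entry is a dict with at least an "id" key.
    """
    artists = dict(seed_artists)   # everything collected so far
    queue = deque(seed_artists)    # ids whose related artists are still to be fetched
    while queue and len(artists) < max_artists:
        current_id = queue.popleft()
        related = client.artist_related_artists(current_id)["artists"]
        for artist in related:
            if artist["id"] not in artists and len(artists) < max_artists:
                artists[artist["id"]] = artist
                queue.append(artist["id"])
    return artists


class FakeClient:
    """Hypothetical offline stand-in for an API client, for demonstration only."""

    def __init__(self, related):
        self._related = related

    def artist_related_artists(self, artist_id):
        ids = self._related.get(artist_id, [])
        return {"artists": [{"id": i, "genres": []} for i in ids]}


fake = FakeClient({"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["e"]})
seeds = {"a": {"id": "a", "genres": []}}
print(sorted(expand_with_related(fake, seeds)))  # ['a', 'b', 'c', 'd', 'e']
```

Swapping `FakeClient` for a real, authenticated client would be the only change needed to run this against live data, at the cost of one request per artist.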
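As a rough sketch of the co-occurrence idea, the counts can be kept in a dictionary keyed by genre pair, so only pairs that actually occur are stored (which suits a sparse matrix). The function name and the four sample artists below are invented for illustration and are not taken from the dataset.

```python
from collections import Counter
from itertools import combinations


def genre_cooccurrence(artist_genres):
    """Count how many artists carry each unordered pair of genre tags."""
    counts = Counter()
    for genres in artist_genres:
        # Sort and de-duplicate so each pair has one canonical (a, b) key.
        unique = sorted(set(genres))
        counts.update(combinations(unique, 2))
    return counts


# Made-up genre lists, one list per artist.
sample = [
    ["hip hop", "rap"],
    ["hip hop", "rap", "trap"],
    ["rock", "classic rock"],
    ["rock"],
]
pairs = genre_cooccurrence(sample)
print(pairs[("hip hop", "rap")])  # 2
```

Each key is a pair of genres in alphabetical order, and an artist with a single tag contributes nothing, since there is no pair to count.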
Understanding and Visualizing the Network
Using the iGraph package in Python, I built a network where each genre is a vertex and two genres are connected by an edge if they were assigned to the same artist. I also assigned edge weights equal to the number of artists sharing both tags, so if “Hip-Hop” and “Rap” were both used to describe 100 artists, the edge connecting “Hip-Hop” and “Rap” has a weight of 100. A higher edge weight can be read as the two genres being more closely related. Below is the distribution of degrees in my genres network, along with other basic stats about the network. This is a pretty standard degree distribution for real-world networks: most vertices have a low degree, and a few vertices, known as hubs, have a very high degree. Other measures like betweenness centrality and PageRank are investigated further in the Jupyter notebook.
Due to its large size, the network is difficult to visualize with just iGraph and Python, so I used Gephi, an open-source graph visualization platform that can handle really big graphs and has plug-ins to create an interactive website from the network. I sized the nodes by weighted degree and applied a community detection algorithm to try to detect the underlying community structure of the graph. The nodes are colored according to the communities they were assigned to.
The layout of the graph is ForceAtlas2, a Gephi layout in which nodes repel each other like charged particles while edges pull connected nodes together, more strongly the more often the two genres occur together. So if 2 nodes are very far away from each other, we can assume that they have very little in common, and vice versa. This graph has limited resolution; to view the full interactive network, use the link at the top of this post.
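To make the degree and weight ideas concrete, here is a small library-free sketch that derives each genre’s degree, its weighted degree and the degree distribution from a dictionary of weighted edges. It deliberately avoids iGraph so it runs anywhere; the function name is hypothetical, and apart from the 100 used in the example above, the weights are invented.

```python
from collections import Counter, defaultdict


def degree_and_strength(weighted_edges):
    """Return (degree, weighted degree) per node for an undirected graph.

    `weighted_edges` maps an unordered genre pair (u, v) to its edge weight.
    """
    degree = defaultdict(int)
    strength = defaultdict(int)
    for (u, v), weight in weighted_edges.items():
        for node in (u, v):
            degree[node] += 1           # number of distinct neighbours
            strength[node] += weight    # sum of weights on incident edges
    return dict(degree), dict(strength)


# Invented weights for illustration.
edges = {
    ("hip hop", "rap"): 100,
    ("hip hop", "trap"): 40,
    ("rap", "trap"): 35,
    ("pop rap", "rap"): 10,
}
degree, strength = degree_and_strength(edges)
distribution = Counter(degree.values())  # degree -> number of nodes with that degree
print(sorted(distribution.items()))  # [(1, 1), (2, 2), (3, 1)]
```

Here "rap" has the highest degree (3) and the highest weighted degree (145), which is the kind of node that would show up as a hub.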
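The post does not say which community detection algorithm was applied, so the sketch below does not try to reproduce it. It only shows how a given assignment of nodes to communities can be scored with weighted modularity, the quantity that modularity-based methods try to maximize. The function name and the toy graph (two triangles joined by one edge) are made up, and the code assumes an undirected graph with at least one edge and no self-loops.

```python
from collections import defaultdict


def modularity(weighted_edges, community_of):
    """Weighted modularity Q of a partition of an undirected graph.

    Q = sum over communities c of  W_in(c) / m  -  (S(c) / (2 * m)) ** 2
    where m is the total edge weight, W_in(c) the weight of edges inside c,
    and S(c) the summed weighted degree of the nodes in c.
    """
    total_weight = sum(weighted_edges.values())
    internal = defaultdict(float)
    strength = defaultdict(float)
    for (u, v), weight in weighted_edges.items():
        strength[community_of[u]] += weight
        strength[community_of[v]] += weight
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += weight
    return sum(
        internal[c] / total_weight - (strength[c] / (2 * total_weight)) ** 2
        for c in strength
    )


# Two triangles (a-b-c and d-e-f) joined by the single edge c-d.
toy_edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
    ("c", "d"): 1,
}
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
print(round(modularity(toy_edges, split), 3))   # 0.357
print(round(modularity(toy_edges, merged), 3))  # 0.0
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the intuition behind coloring nodes by community.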
Insights from the Graph
At first glance, it seems that the community detection algorithm has done a pretty decent job. If I ever want a simpler classification system for further analysis of artists, I could use the community IDs instead of the individual genre tags. This would reduce the number of unique categories in my dataset, making further analysis easier.
There’s Rock and all its derivatives in one cluster, with the largest node sizes, indicating that artists from a variety of genres are placed under the umbrella of rock. This was a bit surprising for me considering that Rock is dead. But it makes sense: this isn’t a database of recent artists but a collection of artists stretching back a century, and rock has been the most sustained musical movement.
Next we have Pop, Indie music, alternative rock and other popular artists who wouldn’t easily fall into the main genres, all in one bucket. Hip-hop, rap and their derivatives are in their own cluster, along with some of their early influences like funk and soul. It’s surprising how small their nodes are considering the impact they have had on our culture over the last few decades. It could be because Rap and Hip-Hop are newer than the rock genres and hence have less data, or it could be a sampling bias: Spotify’s playlists, which are the source of this data sample, could be focusing more on Rock and Pop than on Rap and Hip-Hop. This could be worth investigating, and if true, Spotify would need to shift focus so that artists of all communities and genres get equal spotlight.
We also have metal in its own cluster, while punk and alternative metal share one cluster. Electronic music is split into 2 clusters. As I understand it, one has the rave kind of electronic genres and the other the elevator music kind. Latin music also has its own cluster, as do jazz and its derivatives. These are just the main clusters though. Each country’s main genres seem to have their own small clusters, like French hip-hop, k-pop, j-pop, Belgian indie rock etc. One such cluster I was specifically interested in was Indian/South Asian music of course, and there are some really weird genres in that one.
Pictured are the main nodes of the cluster with Indian music, but also grouped in the same cluster are new-agey genres like Healing, psychchill, full on, shamanic, progressive psytrance (yes, these are actual genres according to Spotify). Of course. Why am I even surprised.
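One possible way to swap genre tags for community IDs, sketched under my own assumptions rather than taken from the original analysis, is to let each artist’s tags vote for the community they belong to. The function name, the tag-to-community mapping and the sample tags are all hypothetical.

```python
from collections import Counter


def dominant_community(genres, community_of):
    """Pick the community that most of an artist's genre tags belong to.

    Tags missing from `community_of` are ignored. Returns None when no tag
    is known; ties are broken arbitrarily.
    """
    votes = Counter(community_of[g] for g in genres if g in community_of)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical mapping from genre tag to community ID.
community_of = {"classic rock": 0, "hard rock": 0, "rap": 1, "trap": 1}
print(dominant_community(["classic rock", "hard rock", "rap"], community_of))  # 0
```

With a mapping like this, thousands of tags would collapse into a handful of categories per artist, at the cost of hiding artists who genuinely straddle two communities.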
In that vein, here is just a sample of the weird genres in Spotify: bubblegum dance, escape room, wonky, ninja, deep psychobilly, cowpunk (?!)
Artist Networks
For a deeper look into the dataset, I also wanted to create a network of artists, with the artists as nodes and Spotify’s aforementioned related artists feature providing the edges (if artist A turns up as a related artist for artist B, there is an edge). However, even Gephi couldn’t handle a network with 80k nodes, so instead I decided to look at a subset of the artists in the context of the genres network.
I picked 2 example early-ish genres that are known to have influenced many newer artists, looked at all artists that were given that genre’s tag, and examined how they were related to each other. In both networks, the artists’ node sizes are set by the number of their Spotify followers, and the color comes from the same community detection algorithm used for the genre network (the modularity of all 3 networks is around 0.7). These networks are directed, unlike the genres network, meaning each edge has a direction. If artist A is in the top 20 related artists of artist B, then the edge goes from B to A, and vice versa.
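The edge rule above can be sketched as follows. This is an illustration only: the function name and the placeholder artists A to D are invented, and it assumes that an edge is kept only when both artists carry the chosen genre tag, which the post implies but does not spell out.

```python
def related_edges(related, genres_of, genre):
    """Directed edges (source, target) among artists tagged with `genre`.

    `related[b]` lists the related artists of b, so each entry a in that
    list yields the edge b -> a.
    """
    members = {artist for artist, tags in genres_of.items() if genre in tags}
    return sorted(
        (source, target)
        for source in members
        for target in related.get(source, [])
        if target in members and target != source
    )


# Placeholder artists and tags, not real data.
genres_of = {"A": ["funk"], "B": ["funk", "soul"], "C": ["funk"], "D": ["rock"]}
related = {"A": ["B", "D"], "B": ["A"], "C": ["A"]}
print(related_edges(related, genres_of, "funk"))
# [('A', 'B'), ('B', 'A'), ('C', 'A')]
```

Artist D is dropped because it lacks the tag, and A ends up with two incoming edges, so a directed measure such as in-degree would treat it as the most central of the three.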
Funk
Funk originated in the 60s as a blend of genres like blues, soul, and rhythm and blues, and would itself go on to influence many of today’s artists. Classics like Prince, Stevie Wonder and Earth, Wind and Fire are pretty central to the genre, but we also have some newer artists like Khruangbin and Gary Clark Jr., who are not strictly funk but are heavily influenced by artists in the genre.
Classic Rock
The community structure seems to be pretty loose here, but we see that the orange cluster is the more experimental (cool) guys of the 60s and 70s, the blue cluster is the more blues-oriented artists, the green is essentially Beatles and Co., the pink ones are the quieter side of classic rock, and the olive green ones are the noisier side. Not sure what Pink Floyd is doing in its neighborhood though.
Final Notes
Strictly speaking, this is an analysis no one really needs and something I worked on simply because I was bored. But this was so much fun, especially amidst all the chaos surrounding us right now, going down all these music-related rabbit holes and learning about networks along the way. Hope this was useful/fun for you as well. I actually ended up mining more data than I used in this analysis, so if you think of any other analysis I can do with data on music tracks and artists, let me know!