Profiling Songs on Spotify Using Cluster Analysis

Published in Analytics Vidhya · 5 min read · Jan 15, 2020

The Top Ten Streamed Artists of the Decade

Music is often called our universal language. Whether you are relaxing on the porch or going out for a jog, music sets the mood for the task at hand.

While there are many playlists for different moods and activities, I was interested in which elements could be used to sort songs into those moods. After reading John Kohs’ article, I was inspired to use an unsupervised model for my project. Using the Spotify API, I extracted songs by the recently named top 10 streamed artists of the decade (Drake, Eminem, The Weeknd, Ed Sheeran, Post Malone, Sia, Beyonce, Rihanna, Taylor Swift, and Ariana Grande) and clustered them based on their audio features.

Let’s get started!

Extracting Data Procedure:

Setting up Spotify API

https://developer.spotify.com/
Log in / Sign up
Go to Dashboard -> Create Client ID -> Create App

Extracting Tracks from Playlist using Python

Importing libraries

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
import time

Inputting credentials: client_id and client_secret

spotify_client_id = ''
spotify_client_secret  = ''
client_credentials_manager = SpotifyClientCredentials(client_id=spotify_client_id, client_secret=spotify_client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

Retrieving each track from the playlist

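def getPlaylistTrackIDs(user, playlist_id):
    # Returns the IDs of the tracks on the first page of the playlist.
    ids = []
    playlist = sp.user_playlist(user, playlist_id)
    for item in playlist['tracks']['items']:
        track = item['track']
        ids.append(track['id'])
    return ids

# The first argument is the playlist owner's username, the second is the playlist ID.
ids = getPlaylistTrackIDs('username', 'playlist id')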
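The Spotify Web API returns playlist tracks in pages of at most 100, so the function above only sees the first page of a longer playlist. A minimal sketch of reading every page follows. It assumes `sp` is a spotipy client recent enough to expose `playlist_items` and `next`; the function name `get_all_playlist_track_ids` is hypothetical and not part of the original project.

```python
def get_all_playlist_track_ids(sp, playlist_id):
    """Collect track IDs from every page of a playlist.

    `sp` is assumed to be a spotipy.Spotify client (or any object that
    exposes `playlist_items` and `next` with the same behavior).
    """
    ids = []
    page = sp.playlist_items(playlist_id, limit=100)
    while page:
        for item in page['items']:
            track = item.get('track')
            # Skip removed or local tracks, which carry no Spotify ID.
            if track and track.get('id'):
                ids.append(track['id'])
        # The 'next' field is None on the last page, which ends the loop.
        page = sp.next(page) if page.get('next') else None
    return ids
```

The client is passed in as an argument rather than read from a global, which makes the function easy to try against a stub before spending real API calls.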

Retrieving additional information and features of each track

def getTrackFeatures(track_id):
    # Track metadata and audio features come from two separate endpoints.
    meta = sp.track(track_id)
    features = sp.audio_features(track_id)
    name = meta['name']
    album = meta['album']['name']
    artist = meta['album']['artists'][0]['name']
    release_date = meta['album']['release_date']
    length = meta['duration_ms']
    popularity = meta['popularity']
    acousticness = features[0]['acousticness']
    danceability = features[0]['danceability']
    energy = features[0]['energy']
    instrumentalness = features[0]['instrumentalness']
    liveness = features[0]['liveness']
    loudness = features[0]['loudness']
    speechiness = features[0]['speechiness']
    tempo = features[0]['tempo']
    time_signature = features[0]['time_signature']
    # One value per column, in the same order as the DataFrame columns below.
    track = [name, album, artist, release_date, length, popularity, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, time_signature]
    return track

Importing tracks into a dataset

# loop over track ids to create dataset
tracks = []
for track_id in ids:
    time.sleep(.5)  # pause between requests to avoid hitting the API rate limit
    tracks.append(getTrackFeatures(track_id))

df = pd.DataFrame(tracks, columns = ['name', 'album', 'artist', 'release_date', 'length', 'popularity', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature'])
df.to_csv("artist1.csv", sep = ',')
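The loop above makes two requests per track. Since spotipy's `audio_features` also accepts a list of track IDs (the Web API allows up to 100 per call), the audio features can be fetched in batches instead. The sketch below is one way to do that, not the approach used in the original project; the helper name `get_audio_features_batched` is hypothetical, and `sp` is assumed to be the spotipy client created earlier.

```python
def get_audio_features_batched(sp, track_ids, batch_size=100):
    """Fetch audio features in batches instead of one request per track."""
    features = []
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        # audio_features returns a list aligned with `batch`;
        # an entry can be None when no features are available for a track.
        features.extend(sp.audio_features(batch))
    return features
```

For a playlist of a few hundred songs this reduces the audio-feature requests from one per track to a handful in total.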

Below is a sample of what the dataset looks like:

Audio Features Description:

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
Danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.
Instrumentalness: Predicts whether a track contains no vocals.
Energy: A measure from 0.0 to 1.0 that represents the perceived intensity and activity of a track. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.
Speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audiobook, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words.

Distribution of audio features:

What category does each song belong to?

Challenge: Perform K-means Clustering analysis to group songs into categories based on the audio features that they share. The goal is to have the points in the same cluster very close to one another.
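K-means measures closeness with Euclidean distance, so features on larger scales (loudness is in decibels and tempo in beats per minute, while most other features run from 0.0 to 1.0) can dominate the result. The article does not say whether the features were scaled, so the following is only an assumption about a reasonable preprocessing step: a plain-Python z-score standardization. The function name `standardize` and the sample rows are hypothetical.

```python
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column of a list of equal-length numeric rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical rows: [danceability, energy, loudness (dB), tempo (BPM)]
songs = [
    [0.80, 0.70, -5.0, 120.0],
    [0.40, 0.30, -12.0, 80.0],
    [0.60, 0.50, -8.5, 100.0],
]
scaled = standardize(songs)
```

After this step every column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale to the distances.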

Procedure:

Number of clusters/categories = 5
Elbow method: An alternative technique for determining how many clusters to use
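To make the procedure concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python, together with the quantity the elbow method inspects: the inertia, i.e. the sum of squared distances from each point to its assigned centroid. This is an illustration, not the code behind the article's results; in practice a library such as scikit-learn would typically be used, and its `KMeans` exposes the same quantity as `inertia_`. The sample points are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(squared_distance(p, centroids[lab])
                  for p, lab in zip(points, labels))
    return centroids, labels, inertia

# Hypothetical scaled feature rows, e.g. [danceability, energy]
points = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25], [0.8, 0.9],
          [0.9, 0.8], [0.85, 0.95], [0.5, 0.1], [0.55, 0.15]]

# Elbow method: print inertia for each k and look for the bend.
for k in range(1, 6):
    _, _, inertia = kmeans(points, k)
    print(k, round(inertia, 4))
```

Inertia generally shrinks as k grows; the "elbow" is the value of k after which adding clusters yields only small further reductions. Because the result depends on the random starting centroids, running several seeds and keeping the lowest inertia is a common safeguard.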

Visualizing our Clusters

Using t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize clusters.

Distribution of each cluster

Labeling Clusters

Top 10 Streamed Artists of the Decade Playlist
Cluster 1: Roadtrip Playlist: High in danceability, energy, and speechiness (lyrical)
Cluster 2: Time to Relax: High in acousticness, danceability, and energy
Cluster 3: Let’s Party: High in both danceability and energy
Cluster 4: Energy Booster: Highest in energy
Cluster 5: Good Morning: Great songs to wake up and get you motivated (High instrumentalness and energy)
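Labels like these come from comparing the average audio features of each cluster. A minimal sketch of that comparison is shown below, assuming each song is a row of feature values with a matching cluster label; the function names and sample values are hypothetical, and the comparison is only meaningful for features on a shared scale (such as the 0.0 to 1.0 features).

```python
from collections import defaultdict

def cluster_feature_means(rows, labels, feature_names):
    """Average each audio feature within each cluster."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {
            name: sum(col) / len(col)
            for name, col in zip(feature_names, zip(*members))
        }
        for label, members in grouped.items()
    }

def top_features(profile, n=2):
    """Names of the n features with the highest mean in one cluster."""
    return sorted(profile, key=profile.get, reverse=True)[:n]

# Hypothetical data: three songs, two clusters.
feature_names = ['danceability', 'acousticness', 'energy']
rows = [[0.9, 0.1, 0.5], [0.7, 0.3, 0.5], [0.2, 0.8, 0.1]]
labels = [0, 0, 1]

profiles = cluster_feature_means(rows, labels, feature_names)
for label, profile in sorted(profiles.items()):
    print(label, top_features(profile))
```

Reading off the top features per cluster gives a starting point for a name; the final label is still a judgment call.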

Conclusion:

Source Code; Tableau Dashboard

Using the Spotify API, we extracted the data needed to categorize music based on shared audio features. We then performed a clustering analysis and visualized the distribution and number of songs in each cluster, labeling each cluster by its most prevalent audio features. This gave us insight into profiling songs by mood and task.

Written by Drucila LeFevre