ML with Spotif(p)y

Python machine learning with audio features

Leyli Ramazanova
Analytics Vidhya
5 min read · Nov 16, 2019


Introduction

I started this project by googling “what is machine learning”. Then I moved on to the traditional ML tutorials with the Iris dataset. Now, here I am, trying to predict whether I will like a song based on its audio features.

Data acquisition

Machine learning obviously depends on data, so let us begin by retrieving some from Spotify.

The first step to get data is to register for an access token with Spotify.

So far, there are three crucial components: the Client ID, the Client Secret Key, and the username. The first two are provided when you register your application with Spotify; the last is your user ID, which you can find in your profile settings. Once that is out of the way, we have access to Spotify’s data about albums, tracks, and artists directly from the Spotify Data Catalogue.

import spotipy
import spotipy.util as util
from spotipy.oauth2 import SpotifyClientCredentials
# initialise a client credentials manager
cid = ""
secret = ""
username = ""
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
playlists = sp.user_playlists(username)

At this point, we have a Spotify object to collect data online. The next step is to create a data frame with track URIs. URIs are Uniform Resource Identifiers; treat them as your magic gateway to detailed information about a playlist or a song.

def get_playlist_tracks(username, playlist_id):
    tracks_list = []
    results = sp.user_playlist(username, playlist_id, fields="tracks,next")
    tracks = results['tracks']
    while tracks:
        tracks_list += [item['track'] for item in tracks['items']]
        tracks = sp.next(tracks)
    return tracks_list

def get_playlist_URIs(username, playlist_id):
    return [t["uri"] for t in get_playlist_tracks(username, playlist_id)]

Making data usable

Using a list of URIs, we can acquire the audio features of songs. In this case, I created two playlists — songs I like and songs I don’t like — and therefore two URI lists. Because each list is longer than 50 elements, and sp.audio_features accepts at most 50 tracks at a time, we have to create a splitlist function that splits a list into a list of 50-element lists. Once done, I traverse this list and save the audio features into a data frame.

import pandas as pd

# split a list into chunks of at most 50 elements,
# since sp.audio_features accepts at most 50 tracks per call
def splitlist(track_list, chunk_size):
    return [track_list[i:i + chunk_size]
            for i in range(0, len(track_list), chunk_size)]

# modified get features function
def get_audio_features(track_URIs):
    features = []
    for pack in splitlist(track_URIs, 50):
        features += sp.audio_features(pack)
    df = pd.DataFrame.from_dict(features)
    df["uri"] = track_URIs
    return df

It’s important to note that this function returns a range of audio features, of which I only choose the ones I think are relevant to my project. A more detailed description of each feature can be found at:

https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/

['danceability', 'acousticness', 'energy', 'instrumentalness', 'speechiness', 'tempo', 'valence']

Another crucial component is tags. Once I have two data frames — one with audio features of songs I like, and the other with audio features of songs I don’t like — I add a column ‘target’ to both and fill it with 1s for songs I like and 0s for songs I don’t like. This way, once the two data frames are concatenated, there is a distinction between two categories of songs.
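The labelling step above can be sketched as follows; the toy frames here stand in for the real audio-feature tables:

```python
import pandas as pd

# Hypothetical stand-ins for the real audio-feature frames
good_features_df = pd.DataFrame({"tempo": [120.0, 98.5], "energy": [0.8, 0.6]})
bad_features_df = pd.DataFrame({"tempo": [140.0], "energy": [0.3]})

# Tag each frame before concatenating: 1 = liked, 0 = not liked
good_features_df["target"] = 1
bad_features_df["target"] = 0

# Stack the two frames into one training table
training_data = pd.concat([good_features_df, bad_features_df],
                          axis=0, join='outer', ignore_index=True)
```

After the concatenation, the 'target' column is the only thing distinguishing the liked songs from the disliked ones.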

training_data = pd.concat([good_features_x,bad_features_x], axis=0, join='outer', ignore_index=True)

Visualising data

Previously, I looked at the features and took the ones I thought were relevant. Now, I want to verify which features actually show significant differences between the two playlists. So, I plot both distributions on the same graph: if I can’t visually see much difference, I do not use the feature in my ML models. For example, having plotted Tempo, I see a significant difference between the two distributions.

sns.distplot(good_features_df[['tempo']],color='indianred',axlabel='Tempo')
sns.distplot(bad_features_df[['tempo']],color='mediumslateblue')
plt.show()

On the Danceability plot, on the other hand, I deem the difference negligible.

In the end, I settled on the following:

features = ['tempo','acousticness','energy','instrumentalness','speechiness']

Splitting data into train and test

I chose to split my data 80/20: 80% to train the model, and 20% to test it.

from sklearn.model_selection import train_test_split

train, test = train_test_split(training_data, test_size=0.2)
x_train = train[features]
y_train = train['target']
x_test = test[features]
y_test = test['target']

Decision Tree Classifier

This model is appropriate for my purpose because I am trying to classify songs into two categories. It is essentially a two-step process, learning and predicting: there has to be data to train the model and data to test it. In slightly more detail, each internal node represents an audio feature with a “decision rule”, and each leaf holds the decision reached by following those rules. A decision rule here is a threshold on a feature’s value that is used to split the songs into categories.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dtc = DecisionTreeClassifier()
dtc.fit(x_train, y_train)
y_pred = dtc.predict(x_test)
score = accuracy_score(y_test, y_pred) * 100
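The learned decision rules can actually be printed with scikit-learn's export_text, which makes the node/leaf structure described above concrete. A sketch on toy data (the feature names and the "tempo drives liking" rule are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
features = ['tempo', 'acousticness', 'energy', 'instrumentalness', 'speechiness']

# Toy stand-in for the audio-feature matrix; pretend "tempo" drives liking
x_train = rng.random((100, 5))
y_train = (x_train[:, 0] > 0.5).astype(int)

dtc = DecisionTreeClassifier(max_depth=2, random_state=0)
dtc.fit(x_train, y_train)

# Each indented line is a node's decision rule; leaves show the predicted class
rules = export_text(dtc, feature_names=features)
print(rules)
```

Since the toy labels depend only on tempo, the printed tree splits on tempo near 0.5, which is exactly the "threshold per feature" picture from the paragraph above.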

K Neighbors Classifier

Despite the name, there is no clustering here: given a point A, the model finds its nearest neighbours in the training data, and each neighbour “votes” on how to classify A. In the simplest scenario there is a single neighbour; with more neighbours, the number of voters k should be odd so that the vote between two classes cannot end in a tie.

from sklearn.neighbors import KNeighborsClassifier

knc = KNeighborsClassifier(n_neighbors=5)
knc.fit(x_train, y_train)
knn_pred = knc.predict(x_test)
score = accuracy_score(y_test, knn_pred) * 100
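The choice of k = 5 above is itself tunable. One way to pick it is to cross-validate over the odd values only, for the tie-breaking reason mentioned above; a sketch on synthetic data standing in for the five audio features:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((150, 5))                 # stand-in for the five audio features
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)  # toy liked / not-liked labels

# Try only odd k to avoid tied votes between the two classes
scores = {}
for k in range(1, 16, 2):
    knc = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knc, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

With only ~300 real data points, cross-validation like this is a cheaper way to compare k values than repeatedly re-splitting train and test sets.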

Principal Component Analysis (PCA)

This unsupervised learning technique converts multi-dimensional data to low-dimensional data by constructing new components — linear combinations of the original features — ordered by how much of the variance they capture. Earlier, I filtered features manually by looking at individual plots; this analysis does something similar for me automatically. Hence, one of the inputs to tune is the number of components used to train the model. After some experimenting, I found that the ideal number for my scenario is 3.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

sc = StandardScaler()
X_train = sc.fit_transform(x_train)
X_test = sc.transform(x_test)
pca = PCA(n_components=3)
X_train = pca.fit_transform(X_train)  # project onto the 3 principal components
X_test = pca.transform(X_test)
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Accuracy using the PCA model is:", accuracy_score(y_test, y_pred) * 100, "%")
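One way to sanity-check a choice like n_components=3 is to fit a full PCA and look at explained_variance_ratio_, which reports the share of variance each component captures. A sketch on toy data standing in for the five scaled features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.random((150, 5))  # stand-in for the five audio features

X_std = StandardScaler().fit_transform(X)
pca = PCA().fit(X_std)    # keep all components to inspect the full spectrum

# Cumulative share of variance captured by the first n components;
# pick the smallest n that retains "enough" of it
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)
```

If the first three entries of the cumulative curve are already close to 1, three components retain most of the information and the choice is justified.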

Challenges

One of the first challenges was sitting down to compile a playlist of songs I like. Even then, my playlist was only 150 songs, and the more the better. Similarly, one should put effort into creating a playlist of disliked songs, as opposed to adding random songs, which is what I did.

The other challenge was extracting the data from my account and then from my playlists, and deciding which features were worth using.

It took me a while to realise that I needed to somehow label the songs I like and the songs I don’t like before concatenating the two data frames.

Surprisingly, running the various models on my data was one of the easiest parts of the project. However, I do acknowledge that my models are poorly tuned because (1) I do not have a good understanding of machine learning in general, and (2) I do not have a grasp of the components that go into the models.

Conclusion

I was able to achieve the highest accuracy with the PCA model: 86.6%. The Decision Tree Classifier yielded an average of 82% accuracy, and the KNN model yielded an average of 73% accuracy.

The accuracy can be improved by (1) spending more time tuning the models, (2) putting effort into creating both playlists, and (3) having more than a total of 300 data points.
