WSDM — KKBox’s Music Recommendation Challenge

Anjar Aquil
10 min read · Sep 26, 2020


“Without music, life would be a mistake” ― Friedrich Nietzsche.

Overview:

  • KKBox is a music streaming platform, like gaana.com or saavn.com. They have provided their dataset to the ML community through kaggle.com and want the community to build a better music recommendation system using newer algorithms; currently they use collaborative filtering based algorithms with matrix factorization and word embeddings in their recommendation system.
  • https://www.kaggle.com/c/kkbox-music-recommendation-challenge/overview

Table of contents:

  1. ML problem statement
  2. Data Discussion
  3. EDA
  4. Feature Engineering
  5. Data Preprocessing
  6. Models
  7. Comparison
  8. Summary & Future work
  9. References

1. ML problem statement:

We have to build a model which predicts whether a user will re-listen to a song after listening to it once. Our main goal is to suggest songs of the user's choice; we can pose this as a classification problem and also as a deep learning problem.

2. Data discussion:

Dataset source: https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data

The problem has the following data files:

1. train.csv: This file includes

user_id (msno), song_id, source_system_tab (where the event was triggered),
source_type (the entry point from which a user first plays music), source_screen_name (name of the layout the user sees) and target (1 means a recurring listening event was triggered within a month after the user’s very first observable listening event, target = 0 otherwise).

2. test.csv: This file includes

user_id (msno), song_id, source_system_tab (where the event was triggered),
source_type (the entry point from which a user first plays music) and source_screen_name (name of the layout the user sees).

3. songs.csv: This file has features like

song_id, song_length, genre_id, artist_name, composer, lyricist and language.

4. members.csv: This file has msno (user_id), city, bd (age; may contain outliers), gender, registered_via (registration method), registration_init_time (date) and expiration_date (date).

5. song_extra_info.csv: This file has features like song_id, song_name and
ISRC (International Standard Recording Code) used to identify songs.
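
Since an ISRC has a fixed layout (country code, registrant code, two-digit reference year, designation code), a recording year can be pulled out of it. Below is a minimal sketch of how a song_year feature (the name used later in this post) could be derived; it assumes song_extra_info.csv is loaded into a DataFrame called song_extra, and the pivot-year rule is my assumption, not part of the dataset description.

import pandas as pd

song_extra = pd.read_csv("song_extra_info.csv")

def isrc_to_year(isrc):
    # characters 6-7 of an ISRC encode the two-digit reference year
    if pd.isnull(isrc):
        return None
    year = int(isrc[5:7])
    # assumed pivot: two-digit years above 17 are treated as 19xx, the rest as 20xx
    return 1900 + year if year > 17 else 2000 + year

song_extra['song_year'] = song_extra['isrc'].apply(isrc_to_year)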

3. EDA :

The purpose of EDA is to understand our dataset, figure out which features will be important for building ML models, and find errors in our dataset using different types of visualization techniques, which we are going to explore in this blog post.

  • members.csv: for this csv file, members.head() will display the first five rows of the file.
We can see that there are NaN values present in gender; in the same way we can check whether a particular row contains NaN values or not.
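
A minimal sketch of this first look, assuming members.csv has been loaded with pandas:

import pandas as pd

members = pd.read_csv("members.csv")
print(members.head())                    # first five rows
print(members.isnull().sum())            # NaN count per column
print(members["gender"].isnull().any())  # does gender contain NaN values?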

Finding outliers: any extreme value in our dataset which is not relevant can be called an outlier. The data description says the bd column might contain outliers, so let’s find out.

bd column:

Box plot: as we know, any value/point outside the box is considered an outlier.

Box plot for bd column

Count plot:

count plot for bd column

We can see that the bd feature contains many zero values, some extreme values and some negative ones, but common sense tells us no user can have an age of zero or below, so we can say these are outlier points.

count = 0
for i in members["bd"]:
    if i <= 50 and i >= 10:
        count += 1
percentage = (count / len(members["bd"])) * 100
percentage = ("{:.2f}".format(percentage))
print(percentage, "% of user is between 10 and 50")
# output: 40.74 % of user is between 10 and 50

We can see that only around 41% of users have an age between 10 and 50, which is a plausible age range for this problem.

Gender column:

Pie chart for gender

From the above pie chart we can say that the proportion of male and female users is almost balanced.

City column:

From the above count plot we can say that city 1 has the highest number of users.
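
A minimal sketch of such a count plot with seaborn (the figure size is just an assumption):

import seaborn as sns
import matplotlib.pyplot as plt

# number of users per city
plt.figure(figsize=(12, 5))
sns.countplot(x='city', data=members)
plt.title('Count plot for the city column')
plt.show()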

registration_init:

From the histogram plot we can say that most users registered after 2016.

expiration_date:

From the above histogram plot for expiration date we can say most user accounts expire just before 2020, which is expected because most users registered after 2016.

Similarly, we can use such plotting techniques to find insights in the rest of the dataset.

Note: we have song_extra_info.csv and songs.csv, and both contain information about songs, so we can merge these two csv files into one DataFrame, song_info.

song_info = song_extra.merge(songs, on='song_id', how='left') #merging song_extra.csv and songs.csv

Let’s find the percentage of NaN values in each feature:

Percentage of NaN values per feature

We can see that composer and lyricist have the highest percentage of NaN values.
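
A minimal sketch of how these percentages can be computed, assuming the merged song table from above is called song_info:

# percentage of NaN values per column, sorted in descending order
nan_percentage = song_info.isnull().sum() * 100 / len(song_info)
print(nan_percentage.sort_values(ascending=False))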

train.csv:

count plot system_tab of train.csv

From the above plot we can see that most users prefer to listen from my library, followed by discover, which means users love to listen to songs from their library, while there are also people who love to listen to newly discovered songs.

count plot of source screen_name of train.csv

From the above plot we can see that most users prefer to listen from a local playlist.

Note: as we know, train.csv contains msno and song_id, which means both song information and member information can be merged into it.

train_and_members = pd.merge(train, members, on='msno', how='left')
final_train = pd.merge(train_and_members, song_info, on='song_id', how='left')

Heat map of final_train:

heat map of final_train

> From the above heat map we can say that composer, lyricist, isrc and name have the highest number of missing values

> Composer and lyricist have a strong correlation between them, and source_system_tab and source_type also have a strong correlation between them
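
A minimal sketch of how such a missing-value heat map can be drawn with seaborn, assuming the merged final_train DataFrame from the step above:

import seaborn as sns
import matplotlib.pyplot as plt

# True/False mask of missing values rendered as a heat map
plt.figure(figsize=(12, 6))
sns.heatmap(final_train.isnull(), cbar=False)
plt.title('Missing values in final_train')
plt.show()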

4. Feature Engineering:

So far we have explored our csv files; we found most features are balanced, except that the bd feature has outliers and the gender column has NaN values. So in feature engineering we will fill missing values, try to extract new features from existing ones, and try to create new features by grouping two or more features.

— Luca Massaron

Basically, for any ML problem we need data to feed our ML algorithm, and to produce proper output it needs appropriate data as input; the process of generating or finding useful features from existing data is called feature engineering.

Numerical Imputation:

It means filling missing values with an appropriate value; as we have seen in our dataset, the bd and gender columns of members.csv have many missing values.

Gender column:

As we know, we have 58% NaN values in the gender feature, and since it is a textual/categorical feature we can impute the NaN values with gender_not_available.

from numpy import nan
# fill missing gender values with a placeholder category
final_train["gender"].fillna("gender_not_available", inplace=True)
# count the number of NaN values left in the column
print(final_train["gender"].isnull().sum())

We can perform imputation like this for every textual/categorical feature.

bd feature:

We know that in the bd feature around 50% of the values are zero and there are some extreme points, so we can fill these values with the mean of the bd feature.

final_train["bd"].fillna(members["bd"].mean(), inplace=True)# count the number of NaN values in each columnprint(members["bd"].isnull().sum())

We will try to extract year, month and day from registration_init_time and expiration_date; for that, we first have to convert these features to datetime format.

import pandas as pd
import numpy as np
import datetime

members["expiration_date"] = pd.to_datetime(members["expiration_date"], format='%Y%m%d')
members["registration_init_time"] = pd.to_datetime(members["registration_init_time"], format='%Y%m%d')

Extracting year, month and day from registration_init_time; we can do the same with expiration_date.

code used from : https://medium.com/@swethalakshmanan14/simple-ways-to-extract-features-from-date-variable-using-python-60c33e3b0501

final_train['year'] = final_train['registration_init_time'].dt.year
final_train['month'] = final_train['registration_init_time'].dt.month
final_train['day'] = final_train['registration_init_time'].dt.day

Grouping multiple features to get more context out of our data and to come up with new features:

As we have seen, there are many repeated values of song_id and msno, which means many users listen to the same songs and prefer certain artists (for example, I like Arijit Singh).

https://github.com/khushi810/KKBOX_Music_Recommendation_Challenge/blob/master/Music_Recommendation_(EDA%2BFE).ipynb

# number of songs listened to by each member
member_song_count = final_train.groupby('msno').count()['song_id'].to_dict()
final_train['member_song_count'] = final_train['msno'].apply(lambda x: member_song_count[x])

# song count for each artist
artist_song_count = final_train.groupby('artist_name').count()['song_id'].to_dict()
final_train['artist_song_count'] = final_train['artist_name'].apply(lambda x: artist_song_count[x])

# song count for each genre
first_genre_id_song_count = final_train.groupby('genre_id').count()['song_id'].to_dict()
final_train['genre_id'] = final_train['genre_id'].apply(lambda x: first_genre_id_song_count[x])

# song count for each language
lang_song_count = final_train.groupby('language').count()['song_id'].to_dict()
final_train['lang_song_count'] = final_train['language'].apply(lambda x: lang_song_count[x])

# user count for each song
song_member_count = final_train.groupby('song_id').count()['msno'].to_dict()
final_train['song_member_count'] = final_train['song_id'].apply(lambda x: song_member_count[x])

# song count for each age
age_song_count = final_train.groupby('bd').count()['song_id'].to_dict()
final_train['age_song_count'] = final_train['bd'].apply(lambda x: age_song_count[x])

So now we have final_train as our final dataset, on top of which we will try to build different ML models.

5. Data Preprocessing:

Before feeding our dataset to any ML or DL model we have to convert our features to numerical format, for which we can use label encoding. Even before that, we first have to split our dataset into train, test and validation sets; for this we can use sklearn.

Y = final_train['target'].values
X = final_train.drop(['target'], axis=1)

# splitting data into train, test and cross validation
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.40, random_state=42)
x_train, x_cv, y_train, y_cv = train_test_split(x_train, y_train, test_size=0.33, random_state=42)

Note: because of RAM limitations I am using a smaller amount of data, but with more RAM we could use the whole dataset, which would give better results.

x_train = x_train[:2965721]
y_train = y_train[:2965721]
x_cv = x_cv[:100000]
y_cv = y_cv[:100000]
x_test = x_test[:2556790]
y_test = y_test[:2556790]

Standardizing the numerical features:

from sklearn.preprocessing import StandardScaler

numeric_features = ['bd', 'city', 'language', 'registered_via', 'song_duration_minutes',
                    'genre_id', 'song_year', 'member_song_count', 'artist_song_count',
                    'lang_song_count', 'song_member_count', 'age_song_count']

# transform numeric values
pd.set_option('mode.chained_assignment', None)
for feature in numeric_features:
    scaler = StandardScaler()
    x_train[feature] = scaler.fit_transform(x_train[feature].values.reshape(-1, 1))
    x_cv[feature] = scaler.transform(x_cv[feature].values.reshape(-1, 1))
    x_test[feature] = scaler.transform(x_test[feature].values.reshape(-1, 1))

We can do the same with our categorical features.
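
A minimal sketch of that label encoding, assuming a list of categorical column names (the cat_features list below is just an illustration, not the exact set used):

from sklearn.preprocessing import LabelEncoder

# hypothetical list of categorical columns; adjust to the actual dataset
cat_features = ['gender', 'source_system_tab', 'source_screen_name', 'source_type', 'artist_name']

for feature in cat_features:
    le = LabelEncoder()
    # fit on the combined values so categories seen only in cv/test do not break transform
    le.fit(pd.concat([x_train[feature], x_cv[feature], x_test[feature]]).astype(str))
    x_train[feature] = le.transform(x_train[feature].astype(str))
    x_cv[feature] = le.transform(x_cv[feature].astype(str))
    x_test[feature] = le.transform(x_test[feature].astype(str))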

6. Models:

Logistic regression: for our problem it is very important to do hyperparameter tuning.

# Hyperparameter tuning using GridSearchCV for LR
import time
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

start = time.time()
parameters = {'penalty': ['l2', 'l1'], 'alpha': [10 ** x for x in range(0, 1)]}
clf = SGDClassifier(loss='log', n_jobs=-1, random_state=23, class_weight='balanced')
model = GridSearchCV(clf, parameters, scoring='roc_auc', n_jobs=-1, verbose=2, cv=3)
model.fit(x_train, y_train)
print(model.best_estimator_)
print('train AUC = ', model.score(x_train, y_train))
print('val AUC = ', model.score(x_cv, y_cv))
print('Time taken for hyper parameter tuning is : ', (time.time() - start))
print('Done!')

Training our logistic regression with the best parameters:

# train LR with best parameters
lr = SGDClassifier(alpha=1, average=False, class_weight='balanced',
                   early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
                   l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
                   n_iter_no_change=5, n_jobs=-1, penalty='l2', power_t=0.5,
                   random_state=23, shuffle=True, tol=0.001, validation_fraction=0.1,
                   verbose=0, warm_start=False)
lr.fit(x_train, y_train)

Finding the important features:

# create a dataframe of features and their importance
sorted_indices = lr.coef_[0].argsort()
features = x_train.columns[sorted_indices]
lr_fea_imp = pd.DataFrame({'features': features, 'importance': lr.coef_[0][sorted_indices]})

Plotting the important features:

Feature importance plot for logistic regression

We can see that month and language have the highest feature importance.
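
A minimal sketch of how this bar plot can be produced from the lr_fea_imp DataFrame built above:

import matplotlib.pyplot as plt

# horizontal bar plot of coefficient values, already sorted in the previous step
lr_fea_imp.plot(x='features', y='importance', kind='barh', figsize=(8, 10), legend=False)
plt.xlabel('importance (coefficient value)')
plt.title('Feature importance for logistic regression')
plt.show()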

Confusion matrix :

confusion matrix for logistic regression

Sensitivity :

It answers the question, “How sensitive is the classifier in detecting positive instances?”

Specificity :

It answers question, “How specific or selective is the classifier in predicting positive instances?”

Sensitivity & Specificity plot :

sensitivity and specificity plot
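
A minimal sketch of how sensitivity and specificity can be computed from the confusion matrix, assuming the trained lr model and the test split from earlier:

from sklearn.metrics import confusion_matrix

y_pred = lr.predict(x_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)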

SUMMARY for Logistic regression :

  • For logistic regression, accuracy is a reasonable metric because it shows how accurate our model is; in our case the model has about 50% accuracy because we are using less data due to the RAM limitation, and the confusion matrix, sensitivity and specificity show that for most points our model is predicting correctly.

We can do the same with all the other models like decision tree, random forest classifier, XGBoost classifier, AdaBoost classifier, and LightGBM.
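
As an illustration, here is a minimal sketch of training one of them, LightGBM, on the same split; the hyperparameter values below are placeholders, not the tuned ones:

import lightgbm as lgb
from sklearn.metrics import roc_auc_score

# placeholder hyperparameters; in practice these would come from tuning
lgb_model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.1, num_leaves=127,
                               n_jobs=-1, random_state=23)
lgb_model.fit(x_train, y_train)
print('train AUC = ', roc_auc_score(y_train, lgb_model.predict_proba(x_train)[:, 1]))
print('val AUC = ', roc_auc_score(y_cv, lgb_model.predict_proba(x_cv)[:, 1]))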

Deep learning model (LSTM):

Let’s give deep learning a try. For DL we will use an LSTM model; before building it we should understand the LSTM architecture. This model takes 3-dimensional input, but we have 2D features, so we can use np.reshape() to convert 2D to 3D.

import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

data_xcv = x_cv.to_numpy()
x_cv_new = data_xcv.reshape(100000, 1, 31)
data_xte = x_test.to_numpy()
x_test_new = data_xte.reshape(2556790, 1, 31)
data_xtr = x_train.to_numpy()
x_train_new = data_xtr.reshape(2965721, 1, 31)

BATCH_SIZE = 64
FEATURES = 31

def define_model(BATCH_SIZE, FEATURES):
    '''Function to define the model'''
    tf.keras.backend.clear_session()
    tf.random.set_seed(1234)
    input = Input(shape=(None, FEATURES), name='Input_layer')
    hidden1 = LSTM(256)(input)
    hidden2 = Dense(256, activation='relu')(hidden1)
    output = Dense(1, activation='sigmoid')(hidden2)
    model = Model(inputs=input, outputs=output)
    model.summary()
    return model

model = define_model(BATCH_SIZE, FEATURES)

Plotting LSTM architecture :

LSTM architecture

Training our LSTM model :
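
The fit call below uses an early-stopping and a TensorBoard callback; a minimal sketch of how they could be defined (the patience value and log directory are assumptions):

from tensorflow.keras.callbacks import EarlyStopping, TensorBoard

# assumed settings; adjust patience and log_dir as needed
early_stoppings = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
tensorboard_callback = TensorBoard(log_dir='./logs')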

EPOCHS = 10
print("Fit model on training data")
history = model.fit(x_train_new, y_train, batch_size=64, epochs=EPOCHS,
                    validation_data=(x_cv_new, y_cv),
                    callbacks=[tensorboard_callback, early_stoppings])

7. Comparison: models and their respective train score, validation score and Kaggle score:

comparison of all models

8. Summary & Future work

  • From the image above we can see that LogisticRegression got the highest Kaggle score, so we can use our trained LogisticRegression for prediction on unseen data.
  • EDA: we analyzed the whole dataset using different types of visualization techniques such as bar plots, box plots, PDFs etc., and found that a lot of missing, NaN and incorrect data was present.
  • Feature Engineering: in the feature engineering step we replaced all missing and NaN values with appropriate values (for example the column mean for bd and a placeholder category for gender) and corrected incorrect data. We then extracted new features from existing ones, such as month, year and day from the registration_init_time and expiration_date columns, and also used group-by operations to come up with more new features.
  • Data preparation: we divided our dataset in a 60-40 proportion using sklearn. Because of the RAM limitation we did not use the whole dataset, which led to lower accuracy and score.
  • Data preprocessing: we transformed all our numerical features using standardization, and used label encoding for the categorical features.
  • Models applied: before applying any ML algorithm we did hyperparameter tuning to get the best parameters, then applied various machine learning algorithms such as LR, DT, RF, XGBoost, AdaBoost and LightGBM. We also tried a deep learning model, an LSTM.
  • Future work: because we used a smaller amount of data due to the RAM limitation we are not getting the desired results, but using the whole dataset should give better results. We can also come up with more features in the future, and with more hyperparameter tuning we can definitely get better results.

9. References :

  1. https://medium.com/@briansrebrenik/introduction-to-music-recommendation-and-machine-learning
  2. https://www.kaggle.com/rohandx1996/recommendation-system-with-83-accuracy-lgbm
  3. https://github.com/llSourcell/recommender_live/blob/master/Song%20Recommender_Python.ipynb
  4. https://towardsdatascience.com/the-keys-building-collaborative-filtering-music-recommender-65ec3900d19f
  5. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course

link to github : https://github.com/anzaraquil/KKBOX-music-recommendation-system

LinkedIn profile : https://www.linkedin.com/in/md-anzar-aquil-ansari-46431617b/

Feel free to connect if you have any doubt.

Thanks for your precious time
