WSDM — KKBox’s Music Recommendation Challenge

Subodh Lonkar
Published in The Startup · Jan 28, 2021

A Recommender System seeks to predict user preferences. These systems are designed to recommend items to a user based on many different factors: they aim to predict users’ interests and suggest items that are quite likely to interest them.

Recommendation systems deal with a large volume of information by filtering out the most relevant items based on the data provided by a user and other signals that capture the user’s preferences and interests. They find the right match between users and items and estimate user-item similarities to drive recommendations.

Contents Summary

  1. What is a Music Recommendation System and what is its significance?
  2. Business Problem
  3. Problem Statement
  4. Source of Data
  5. Machine Learning Problem Formulation
  6. Exploratory Data Analysis
  7. Data Preprocessing & Feature Engineering
  8. Machine Learning Models for Recommendation Engines
  9. Comparison of Results
  10. Deployment
  11. Future Work & Scope of Improvement
  12. References

Let’s get started!

1. What is a Music Recommendation System and what is its significance?

Music Recommendation Systems are a specific type of Recommendation System that predicts user preferences and recommends songs to the user based on multiple factors. The number of songs available already exceeds the listening capacity of an individual in their lifetime, and it keeps growing day by day. Choosing from millions of songs can be tedious, and there is also a good chance of missing out on songs that could have become favorites.

Thus, music service providers need an efficient way to manage songs and help their customers discover music through quality recommendations. To build such a recommendation system, they deploy machine learning algorithms that process data from a multitude of sources and present the listener with the most relevant songs.

There are mainly three types of recommendation systems: content-based, collaborative, and popularity-based. A content-based system predicts what a user will like based on what that user liked in the past. A collaborative system predicts what a particular user will like based on what other, similar users like. The drawback of a popularity-based recommendation system is that it offers no personalization: even if the behavior of the user is known, a personalized recommendation cannot be made.

In this case study, we will look at some effective techniques for recommending music to brand new as well as existing users. By building this system, we aim to provide a better experience for the app’s users.

2. Business Problem

The 11th ACM International Conference on Web Search and Data Mining (WSDM 2018) challenged participants to build a better music recommendation system using a dataset donated by KKBOX. WSDM (pronounced “wisdom”) is one of the premier conferences on web-inspired research involving search and data mining.

The glory days of radio DJs have passed, and musical gatekeepers have been replaced with personalization algorithms and unlimited streaming services. With easy access to music from across the globe, the public now listens to all kinds of music. Existing algorithms, however, struggle in key areas. Without enough historical data, how would an algorithm know whether listeners will like a new song or a new artist? And how would it know which songs to recommend to brand new users?

The dataset is from KKBOX, Asia’s leading music streaming service, which holds the world’s most comprehensive Asia-Pop music library with over 30 million tracks. The input contains text data only, with no audio features. KKBOX currently uses a collaborative filtering algorithm based on matrix factorization and word embeddings in its recommendation system, but believes new techniques could lead to better results.

3. Problem Statement

We are asked to predict the chances that a user will listen to a song again after the first observable listening event within a given time window. The objective is to predict whether a user will re-listen to a song or not. Broadly, it is a music recommendation problem (with ROC-AUC score as the evaluation metric of the Kaggle challenge).

4. Source of Data

Get the data from: https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data

Data files:

members.csv -

msno : user id
city
bd: age. Note: this column has outlier values, please use your judgement.
gender
registered_via: registration method
registration_init_time: format %Y%m%d
expiration_date: format %Y%m%d
‘gender’ is the only attribute in this dataframe that has some null values.

Shape: (34403, 7)

sample_submission.csv

song_extra_info.csv

song_id
name — the name of the song.
isrc — International Standard Recording Code, which can theoretically be used to identify a song. However, it is worth noting that ISRCs generated by providers have not been officially verified; therefore the information encoded in an ISRC, such as the country code and reference year, can be misleading or incorrect. Multiple songs may also share one ISRC, since a single recording can be re-published several times.

Shape: (2295971, 3)

songs.csv

song_id
song_length: in ms
genre_ids: genre category. Some songs have multiple genres and they are separated by |
artist_name
composer
lyricist
language

Note that the data is in Unicode.

Shape: (2296320, 7)

test.csv

id: row id (will be used for submission)
msno: user id
song_id: song id
source_system_tab: the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile apps
functions. For example, tab my library contains functions to manipulate the local storage, and tab search contains functions relating
to search.
source_screen_name: name of the layout a user sees.
source_type: an entry point a user first plays music on mobile apps. An entry point could be album, online-playlist, song… etc.

Shape: (2556790, 6)

train.csv

msno: user id
song_id: song id
source_system_tab: the name of the tab where the event was triggered. System tabs are used to categorize KKBOX mobile apps functions. For example, tab my library contains functions to manipulate the local storage, and tab search contains functions relating to search.
source_screen_name: name of the layout a user sees.
source_type: an entry point a user first plays music on mobile apps. An entry point could be album, online-playlist, song, etc.
target: this is the target variable. target=1 means there are recurring listening event(s) triggered within a month after the user’s very first observable listening event, target=0 otherwise.

Shape: (7377418, 6)

5. Machine Learning Problem Formulation

We are asked to predict the chances that a user will listen to a song again after the first observable listening event within a given time window. The objective is to predict whether a user will re-listen to a song or not. Broadly, it is a music recommendation problem.

This can also be framed as a binary classification problem: will the user listen to the recommended song again or not?

Performance metric(s):

  1. Primary metric is Area under the ROC curve between the predicted probability and the observed target.
  2. Reducing False Positives can be thought of as the secondary metric.

Machine Learning Objective and Constraints:

  1. Maximize Area Under Curve.
  2. Try to provide some interpretability.
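
As a quick illustration of the primary metric, here is a minimal sketch of how ROC-AUC can be computed with scikit-learn; the arrays below are placeholders standing in for a real validation split.

```python
from sklearn.metrics import roc_auc_score

# Placeholder values: in practice y_valid comes from a held-out split and
# y_prob from a trained model's predict_proba output.
y_valid = [0, 1, 1, 0, 1, 0]
y_prob = [0.2, 0.8, 0.6, 0.4, 0.9, 0.3]

print("ROC-AUC:", roc_auc_score(y_valid, y_prob))
```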

6. Exploratory Data Analysis

As the name suggests, the purpose of EDA is to explore, analyze & understand the data. This is an important stage: diving deep into the data here yields insights that prove vital during feature engineering and model building.

A brief overview of the EDA can be found at https://share.streamlit.io/learner-subodh/streamlit-example/kkbox.py

The target variable ‘target’ is present only in train.csv, so we will merge train.csv, members.csv, songs.csv & song_extra_info.csv and analyze the combined data.
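
A minimal pandas sketch of this merge, assuming the four CSV files from the Kaggle listing above sit in the working directory:

```python
import pandas as pd

train = pd.read_csv("train.csv")
members = pd.read_csv("members.csv")
songs = pd.read_csv("songs.csv")
song_extra = pd.read_csv("song_extra_info.csv")

# Left-join everything onto train so every labelled listening event keeps its row.
df = (
    train
    .merge(members, on="msno", how="left")
    .merge(songs, on="song_id", how="left")
    .merge(song_extra, on="song_id", how="left")
)
print(df.shape)  # roughly (7377418, number of combined columns)
```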

Distribution of the ‘target’ variable:

We can observe that the dataset is balanced around the ‘target’ attribute.

Age Distribution of users:

Most users are less than 60 years of age. However, some users have age less than or equal to 0 years while there are a few with age greater than 100 years, some even with age greater than 1000 years, which is practically impossible. Around 35 percent of the age values are less than or equal to 0 years, while around 66 percent of them lie between 0 & 54 years of age, both inclusive. More than 99 percent of the users have their age registered between 0 & 54 years.

Source System Tab vs target:

The largest number of listening events occurs in the ‘my library’ & ‘discover’ tabs, which is quite logical: the former tends to hold songs a user listens to frequently, while the latter acts as a platform to explore more songs the user might come to like. 60 percent of the events generated through the ‘my library’ tab have a target value of 1, while in the ‘discover’ tab events with ‘target’ = 0 slightly outnumber those with ‘target’ = 1.

Source Type vs target:

The source type ‘local library’ generates the largest number of events, of which around two-thirds have a target value of 1. It is followed by the source types ‘online playlist’ and ‘local playlist’, which contribute more towards ‘target’ = 0 and ‘target’ = 1 respectively.

Source Screen Name vs target:

Of the total events triggered through the given source screens, more than half are triggered on the ‘local playlist more’ screen, and around two-thirds of those have ‘target’ = 1. The ‘online playlist more’ screen also makes a sizeable contribution to ‘target’ = 1 events.

Thus, it is quite evident that there is some overlap in the tabs or screens provided under ‘Source System Tab’, ‘Source Type’ & ‘Source Screen Name’.

Distribution of Genders:

The dataset is balanced around the ‘gender’ attribute.

Gender vs target:

Both genders are split roughly evenly across the two ‘target’ values.

Song Duration vs target:

Most songs, around 50 percent to be precise, have a duration of about 4 minutes, and around 95 percent of the songs last roughly 3 to 5 minutes. Most users also prefer songs that are not too long, somewhere around 3 to 5 minutes in length. Both ‘target’ values occur in roughly equal numbers across songs of different durations.

Distribution of Listening events:

As can be observed, many songs have been listened to only a handful of times, while very few songs have been listened to a great many times. The songs that are played many times are likely very popular, either in the region where the data was collected or even globally.

City vs target:

Most users in the given dataset belong to the city with id 1 (or the provided data was largely collected in that city), followed by ids 13 & 5. We can therefore expect some local trends in the type of songs generally liked by people living in these particular cities. Both ‘target’ values occur in almost equal numbers across all cities.

Language vs target:

Most users prefer the language with id 3, followed by id 52. There isn’t any significant difference in the ‘target’ values they contribute to. Combined with the previous plot of city data, this suggests that users living in cities 1 & 13 prefer languages 3 & 52 respectively.

Registration via vs target:

Many users register via modes 9 & 7, which are probably easy ways to register, for example using an existing Google or social media account. These modes, along with the others, trigger roughly the same number of events for both ‘target’ values.

Heatmap for missing numbers:

We can see there is a strong correlation in the trends of missing values for some features.

Dendrogram for missing numbers:

A missing-value dendrogram clusters features that show a strong correlation in their patterns of missing values. Features clustered early have strongly correlated missingness, while those joined later are less strongly correlated.

7. Data Preprocessing & Feature Engineering

In this stage, we will be handling missing &/or null values & will engineer some new features which we think might prove vital for making recommendations.

On analyzing the data, we notice that the lyricist attribute has around 43% of its values missing, followed by gender with around 40% and composer with around 23%. Other given features containing missing values are source_system_tab, source_screen_name, source_type, genre_ids, artist_name, name & isrc.

These missing values need to be handled carefully, in a way that does not distort the final results.

Missing values in ordinal integer or floating-point features can be filled with the mean or median of that attribute. For categorical features, we fill missing values with a placeholder string such as ‘attribute_value_missing’, as sketched below.
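
A minimal sketch of this filling strategy, assuming df is the merged dataframe from the EDA step (the exact column lists are illustrative):

```python
# Numeric columns: the median is robust to the outliers seen in 'bd' (age).
for col in ["song_length", "bd"]:
    df[col] = df[col].fillna(df[col].median())

# Categorical columns: keep missingness visible with an explicit placeholder.
categorical_cols = [
    "gender", "source_system_tab", "source_screen_name", "source_type",
    "genre_ids", "artist_name", "composer", "lyricist", "name", "isrc",
]
for col in categorical_cols:
    df[col] = df[col].fillna("attribute_value_missing")
```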

Song duration is provided in milliseconds, which we convert to minutes for easier interpretation. The information encoded in the ISRC, such as the country code, registrant code & year of reference, needs to be extracted and should add value while building models: the first two characters of the International Standard Recording Code (ISRC) are the country code, and the sixth and seventh characters give the last two digits of the reference year. We also compute count features for genre, artist, composer & lyricist: because some songs list more than one genre, artist, composer or lyricist, we count how many of each a song has (in some songs, the artist, composer & lyricist are the same person). In addition, we will use context, SVD-generated, timestamp & dot-product features for the given dataset. A sketch of a few of these derived features follows.
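
The snippet below sketches a few of these derived features, again assuming the merged and filled dataframe df from above; the year cut-off of 17 is an assumption that matches the dataset’s 2017 time frame, and the count features split only on ‘|’ for simplicity.

```python
import numpy as np
import pandas as pd

# Song length in minutes instead of milliseconds.
df["song_length_min"] = df["song_length"] / 60000.0

# ISRC: characters 1-2 are the country code, characters 6-7 the two-digit year.
isrc = df["isrc"].where(df["isrc"] != "attribute_value_missing")
df["isrc_country"] = isrc.str[:2]
year2 = pd.to_numeric(isrc.str[5:7], errors="coerce")
df["isrc_year"] = np.where(year2 > 17, 1900 + year2, 2000 + year2)

# Count features: how many genres / composers / lyricists a song lists.
def count_items(value, sep="|"):
    if pd.isna(value) or value == "attribute_value_missing":
        return 0
    return str(value).count(sep) + 1

for col in ["genre_ids", "composer", "lyricist"]:
    df[f"{col}_count"] = df[col].apply(count_items)
```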

8. Machine Learning Models for Recommendation Engines

Some potential models which can probably make an impact in our case are as follows:

1. Logistic Regression

I chose logistic regression as my baseline model, since it is a simple model which doesn’t make any strong assumptions about the distribution of the data. I also experimented with L1 & L2 regularization.

Logistic Regression uses the cross-entropy (log) loss as its cost function, which for binary targets is J(w) = −(1/N) Σᵢ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ], where pᵢ is the predicted probability that example i belongs to the positive class.

I have implemented logistic regression & have achieved a Kaggle score of 0.57383.
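
A minimal sketch of such a baseline with scikit-learn, assuming X is a numerically encoded feature matrix derived from df and y is the ‘target’ column (the exact encoding and hyperparameters behind the 0.57383 score are not reproduced here):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# penalty switches between "l2" and "l1"; C is the inverse regularization strength.
clf = LogisticRegression(penalty="l2", C=1.0, solver="liblinear", max_iter=1000)
clf.fit(X_train, y_train)

print("Validation AUC:", roc_auc_score(y_valid, clf.predict_proba(X_valid)[:, 1]))
```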

Importance or contribution of top 25 features can be found as follows:

2. Decision Tree and Random Forest

A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after evaluating all attributes). The paths from root to leaf represent classification rules. A random forest ensembles hundreds of decision trees to provide a more accurate estimate: each decision tree in the forest considers a random subset of features when forming its splits and only has access to a random sample of the training data points.
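
A minimal scikit-learn sketch, reusing the X_train / X_valid split from the baseline above (parameters are illustrative, not tuned):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Each tree is trained on a bootstrap sample of the rows and considers a random
# subset of the features (max_features) at every split.
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    max_features="sqrt",
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)
print("Validation AUC:", roc_auc_score(y_valid, rf.predict_proba(X_valid)[:, 1]))
```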

3. Gradient Tree Boosting

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

In pseudocode, the generic gradient boosting method is:
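
(paraphrased from the Wikipedia article cited below)

  1. Initialize the model with a constant prediction: F_0(x) = argmin over γ of Σ L(y_i, γ).
  2. For m = 1 to M:
     a. Compute the pseudo-residuals r_im = −∂L(y_i, F(x_i)) / ∂F(x_i), evaluated at F = F_{m−1}.
     b. Fit a weak learner h_m(x), typically a shallow regression tree, to the pseudo-residuals.
     c. Find the step size γ_m = argmin over γ of Σ L(y_i, F_{m−1}(x_i) + γ·h_m(x_i)).
     d. Update the model: F_m(x) = F_{m−1}(x) + γ_m·h_m(x).
  3. Output F_M(x) as the final model.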

Source: https://en.wikipedia.org/wiki/Gradient_boosting

3. 1. XGBoost

XGBoost is a scalable end-to-end tree boosting system; its sparsity-aware split-finding algorithm helps a lot with the sparse dataset we have. The main idea is to collect statistics only over non-missing entries and to route missing values in a learned default direction. XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.
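
A minimal sketch of an XGBoost classifier on the same split, with illustrative (untuned) hyperparameters:

```python
import xgboost as xgb
from sklearn.metrics import roc_auc_score

# Missing values are routed in a learned default direction at each split,
# which is what makes XGBoost convenient for this sparse feature set.
model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="auc",
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)
print("Validation AUC:", roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]))
```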

3. 2. LightGBM

LightGBM is a gradient boosting framework that uses tree based learning algorithms. LightGBM extends the gradient boosting algorithm by adding a type of automatic feature selection as well as focusing on boosting examples with larger gradients. This can result in a dramatic speedup of training and improved predictive performance. It is designed to be distributed and efficient with the following advantages:

  1. Faster training speed and higher efficiency.
  2. Lower memory usage.
  3. Better accuracy.
  4. Support of parallel and GPU learning.
  5. Capable of handling large-scale data.
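
A minimal sketch of a LightGBM classifier on the same split; the parameters here are illustrative and not the tuned configuration behind the final submission.

```python
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

model = lgb.LGBMClassifier(
    n_estimators=1000,
    num_leaves=256,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)],
)
print("Validation AUC:", roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]))
```

LightGBM can also treat pandas ‘category’ columns natively, which avoids one-hot encoding the high-cardinality id columns.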

I have implemented LightGBM & have achieved a Kaggle score of 0.72755 with the best model, which places it in the top 0.7% on the Kaggle leaderboard.

Importance or contribution of top 25 features can be found as follows:

Classification Report is as follows:

As seen in the metrics, along with the AUC score, the primary metric provided by Kaggle, our goal was also to decrease false positives, which in turn increases precision. The classification report suggests this goal has been handled quite decently.

9. Comparison of Results

The best technique in my case turns out to be gradient tree boosting, with LightGBM as the winning model, achieving a score of 0.72755 on submission to Kaggle.

10. Deployment

I have hosted/deployed the Exploratory Data Analysis part of this case study. It can be found at https://share.streamlit.io/learner-subodh/streamlit-example/kkbox.py.

I have used Streamlit and GitHub for this. Links for documentation can be found under references.

We can similarly extend this idea to deploy the best model using Flask or Streamlit along with GitHub, or with any cloud platform such as AWS, GCP or Azure.
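
As a rough illustration, a minimal Streamlit app for serving the trained model might look like the sketch below; the file name lgbm_model.pkl and the expectation of an already-encoded feature CSV are assumptions, not part of the deployed EDA app linked above.

```python
# app.py - a minimal sketch for serving predictions with Streamlit.
import pickle

import pandas as pd
import streamlit as st

st.title("KKBox Music Recommendation: re-listen prediction")

# Assumes the trained model was pickled beforehand, e.g. pickle.dump(model, f).
with open("lgbm_model.pkl", "rb") as f:
    model = pickle.load(f)

uploaded = st.file_uploader("Upload a CSV of already-encoded feature rows", type="csv")
if uploaded is not None:
    features = pd.read_csv(uploaded)
    probs = model.predict_proba(features)[:, 1]
    st.dataframe(pd.DataFrame({"re_listen_probability": probs}))
```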

11. Future Work & Scope of Improvement

I have used a maximum of 5 million data points due to limitation of resources. Using all the data will surely lead to better performance of models.

More aggressive data preprocessing & feature engineering, along with better feature selection, might yield better results.

Although I have done hyperparameter tuning to a decent extent, we can surely dive into it much deeper & hunt for better results.

For now, I have deployed my EDA, which can be complemented with the deployment of the final model; I plan to take that up in the near future.

12. References

  1. http://cs229.stanford.edu/proj2019spr/report/4.pdf
  2. https://www.kaggle.com/rohandx1996/recommendation-system-with-83-accuracy-lgbm
  3. https://www.kaggle.com/vinnsvinay/introduction-to-boosting-using-lgbm-lb-0-68357
  4. https://medium.com/@anjar.aquil123/wsdm-kkboxs-music-recommendation-challenge-87ca72c41593
  5. https://www.kaggle.com/asmitavikas/feature-engineered-0-68310
  6. https://wsdm-cup-2018.kkbox.events/pdf/WSDM_KKboxs_Music_Recommendation_Challenge_6th_Solution.pdf
  7. https://docs.streamlit.io/en/stable/

If you like this article, don’t forget to leave a “clap”!

Thank you for your time :)
