Sparkify user churn prediction

Roman Ivashkin
4 min read · Aug 15, 2020


Spotify is a Swedish music streaming and media services provider. This project's main aim is to predict user churn for Sparkify, a fictional Spotify-style music service. Like every Internet service, Spotify makes money from user activity, and Sparkify's users perform different actions on the service's pages. The project's data is a log file, which means there is more than one record per user. The full dataset is 12 GB; a 128 MB mini subset is available for analysis.

Problem Statement

Solving the case consists of two tasks:

  • determine the target to predict based on the information in the 18 feature columns;
  • predict user churn.

The shape of the dataset is 286,500 rows and 18 columns. A single record shows, for example, that the user Colin Freeman listened to the song Rockpools by Martha Tilston.

The data is not a ready-made dataset; it is part of a log file. It does not contain any features that could be used for prediction directly: all of the columns describe either a user or a user action. There are 18 columns in the dataframe, and we have to determine which column contains the information about user churn.

We have to explore the dataframe columns, and the first task is to determine the target to predict. The ‘page’ column holds this information.

The visit counts of the ‘Cancel’ and ‘Cancellation Confirmation’ pages are both 52. These rows contain the information about churned users.
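A minimal sketch of this step, assuming the log is already loaded into a Spark dataframe named df (the name is mine, not from the project code):

```python
from pyspark.sql import functions as F

# Count how often each page is visited; 'Cancellation Confirmation'
# marks the users who actually churned.
df.groupBy("page").count().orderBy(F.desc("count")).show(25, truncate=False)

# Label every record of a churned user with a 'churn' flag.
churned_ids = [row.userId for row in
               df.filter(df.page == "Cancellation Confirmation")
                 .select("userId").distinct().collect()]
df = df.withColumn("churn",
                   F.when(F.col("userId").isin(churned_ids), 1).otherwise(0))
```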

Handling null values

The initial dataset has null values in some columns. The ‘firstName’ and ‘lastName’ columns each contain 8,346 null values, and the ‘userAgent’ column contains 8,346 null values as well.

Nevertheless, the ‘userId’ column has no null values. We still have to check it for invalid data, such as empty strings.

Where the ‘userId’ value is an empty string, the ‘auth’ column shows ‘Logged Out’. What values does the ‘auth’ column include besides ‘Logged Out’?

Plotting the ‘auth’ value counts shows that these rows contain only the ‘Logged Out’ and ‘Guest’ values. We can drop all rows with these values, because they are audit records and hold no information about user churn or any user activity.
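A sketch of this cleaning step, again assuming the dataframe is named df:

```python
from pyspark.sql import functions as F

# Inspect the 'auth' values that accompany empty userId strings.
df.filter(F.col("userId") == "").groupBy("auth").count().show()

# Drop the audit records: they carry no user identity or activity.
df = df.filter(~F.col("auth").isin("Logged Out", "Guest"))
```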

The remaining rows with null values cannot be dropped, because they carry information needed to predict user churn.

Handling categorical features

Some columns in the dataset have categorical values. We have to convert them to 0/1 values using ‘dummy’ columns.
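For binary categorical columns such as ‘gender’ (‘M’/‘F’) and ‘level’ (‘free’/‘paid’), building dummy columns in PySpark can look like this (a sketch, not the project's exact code):

```python
from pyspark.sql import functions as F

# Turn binary categorical columns into 0/1 dummy columns.
df = df.withColumn("gender_male",
                   F.when(F.col("gender") == "M", 1).otherwise(0))
df = df.withColumn("level_paid",
                   F.when(F.col("level") == "paid", 1).otherwise(0))
```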

Numerical features

Each user has a different count of listened songs and completed sessions. We should create columns containing each user's counts of listened songs and completed sessions. These columns describe user activity; we can call them the first part of the user activity features.
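These per-user counts can be aggregated like this (a sketch; the column names songs_played and sessions are mine):

```python
from pyspark.sql import functions as F

# Per-user activity: number of songs played and number of distinct sessions.
user_activity = (df.groupBy("userId")
                   .agg(F.count(F.when(F.col("page") == "NextSong", True))
                          .alias("songs_played"),
                        F.countDistinct("sessionId").alias("sessions")))
```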

The second part of the user activity features is the visited pages. As we saw earlier, users who did not cancel their subscription visited the pages ‘Add Friend’, ‘Add to Playlist’, ‘Save Settings’, ‘Submit Upgrade’, and ‘Thumbs Up’ more often. We can create columns to store information about user activity on these pages.

Firstly, we create several pandas dataframes with counts of visited pages. Secondly, we create columns with the aggregated data in the Spark dataframe. Thirdly, we recreate the Spark dataframe and drop the columns with the raw categorical features.
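The same result can also be obtained without the pandas detour, directly in Spark with a pivot (a sketch; page_counts and features are names I chose):

```python
from pyspark.sql import functions as F

pages = ["Add Friend", "Add to Playlist", "Save Settings",
         "Submit Upgrade", "Thumbs Up"]

# One column per page of interest, holding each user's visit count.
page_counts = (df.filter(F.col("page").isin(pages))
                 .groupBy("userId")
                 .pivot("page", pages)
                 .count()
                 .na.fill(0))

features = user_activity.join(page_counts, on="userId", how="left").na.fill(0)
```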

Model implementation

PySpark’s MLlib provides the most common machine learning classification algorithms. In this project we use three of them, as sketched below:

  • Logistic Regression;
  • Random Forest;
  • Gradient-boosted trees.
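A sketch of how the three models can be set up with MLlib, assuming the engineered features live in a dataframe data with a column list feature_cols (both names are mine):

```python
from pyspark.ml.classification import (LogisticRegression,
                                       RandomForestClassifier,
                                       GBTClassifier)
from pyspark.ml.feature import VectorAssembler

# Assemble the engineered numeric columns into a single feature vector.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train, test = assembler.transform(data).randomSplit([0.8, 0.2], seed=42)

models = {
    "logistic_regression": LogisticRegression(labelCol="churn"),
    "random_forest": RandomForestClassifier(labelCol="churn"),
    "gbt": GBTClassifier(labelCol="churn"),
}
fitted = {name: model.fit(train) for name, model in models.items()}
```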

Resulting metrics

The Random Forest model gives the best F1-score: 0.87.

The Logistic Regression model takes second place with an F1-score of 0.8.

The Gradient-boosted trees model performs worst in this project, with an F1-score of 0.7.
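The F1-scores above can be computed with MLlib's evaluator on the held-out split, continuing the sketch from the previous section:

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# F1-score on the test split for each fitted model.
evaluator = MulticlassClassificationEvaluator(labelCol="churn",
                                              metricName="f1")
for name, model in fitted.items():
    print(name, evaluator.evaluate(model.transform(test)))
```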

Conclusion

The hardest part of the project was preprocessing the dataset. I was surprised that the initial data was not really a ‘dataset’: it was not a table of ready-made features. It was part of a log file that recorded user activity.

I had to transform the project's data into a dataset with one unique record per user.

The second problem was that most of the dataset columns were not feature columns: they contained records describing user sessions and actions. So I wrangled and aggregated the data.

You can find my full solution here.
