Predicting Churn of App Users

A Case Study Using PySpark on Big Data from Udacity

Image by Sara Kurfeß on Unsplash

Introduction

Data Understanding: A typical user story on Sparkify

https://gist.github.com/MarkusG-DS/18c6f0addf5fff7859ef1938922b41f0.js
# Show the first 50 events of user 51 in chronological order
churn_df.select(['userId', 'page', 'timestamp', 'level', 'song', 'sessionId', 'length', 'churn']) \
    .where(churn_df.userId == '51') \
    .sort('timestamp') \
    .show(50)
(Image by author)

Data Preparation: Identifying features

(Image by author)

Modeling

Baseline Model

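from pyspark.sql.functions import lit
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Baseline: predict "no churn" (0.0) for every row of the test set
baseline_model = test.withColumn('pred', lit(0.0))
# The evaluator compares 'pred' against the default label column, 'label'
evaluator = MulticlassClassificationEvaluator(predictionCol='pred')
print('Accuracy:{}'.format(evaluator.evaluate(baseline_model, {evaluator.metricName: 'accuracy'})))
print('F1 score:{}'.format(evaluator.evaluate(baseline_model, {evaluator.metricName: 'f1'})))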
Accuracy:0.8405455390090193
F1 score:0.7677254250697467
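
The gap between the two baseline numbers follows from how the F1 score is weighted. The sketch below is a plain-Python check, not part of the original notebook. It assumes that Spark's 'f1' metric is the support-weighted average of the per-class F1 scores, and that the never-predicted churn class contributes an F1 of 0. The function name majority_baseline_scores is made up for illustration.

```python
def majority_baseline_scores(p_majority):
    """Accuracy and support-weighted F1 when every row is predicted as the majority class.

    p_majority is the share of majority-class rows (here: users who did not churn).
    """
    # Every majority-class row is correct, every minority-class row is wrong
    accuracy = p_majority
    # Majority class: precision equals its share, recall is 1.0
    precision, recall = p_majority, 1.0
    f1_majority = 2 * precision * recall / (precision + recall)
    # Minority class is never predicted, so its F1 is taken as 0 (assumption)
    f1_minority = 0.0
    weighted_f1 = p_majority * f1_majority + (1 - p_majority) * f1_minority
    return accuracy, weighted_f1


# Share of non-churned rows, taken from the reported baseline accuracy
accuracy, weighted_f1 = majority_baseline_scores(0.8405455390090193)
print('Accuracy:{}'.format(accuracy))
print('F1 score:{:.6f}'.format(weighted_f1))  # F1 score:0.767725
```

Under these assumptions the computed value agrees with the reported baseline F1 score to six decimal places, which suggests the baseline's lower F1 comes entirely from the churn class never being predicted.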

Logistic Regression (lr)

Training Time: 247.50 seconds
Results for training data:
F-score: 0.8398973
Results for test data:
F-score: 0.8405455

Random Forest Classifier (rf)

Training Time: 297.40 seconds
Results for training data:
F-score: 0.9222908
Results for test data:
F-score: 0.9232617

Gradient Boosted Tree Classifier (gbt)

Training Time: 314.21 seconds
Results for training data:
F-score: 0.9866215
Results for test data:
F-score: 0.9867559

Improvements

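from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# build the parameter grid: 3 tree depths x 3 iteration counts
paramGrid_gbt = ParamGridBuilder() \
    .addGrid(gbt.maxDepth, [4, 8, 12]) \
    .addGrid(gbt.maxIter, [8, 12, 16]) \
    .build()
# set the evaluator (F1 score)
f1_eval = MulticlassClassificationEvaluator(metricName='f1')
# set up 3-fold cross validation over the GBT pipeline
crossval_gbt = CrossValidator(estimator=gbt_pipeline,
                              estimatorParamMaps=paramGrid_gbt,
                              evaluator=f1_eval,
                              numFolds=3)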
Results for validation data:
F1 score: 0.9985879546391391
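
To get a feeling for the cost of this grid search, the following plain-Python sketch counts how many times the pipeline is trained. It is an illustration rather than part of the original notebook. It assumes the grid and fold count shown above and relies on CrossValidator refitting the best parameter combination once on the full training data after the search. The variable names are chosen for this example only.

```python
from itertools import product

# Values taken from the parameter grid above
max_depth_values = [4, 8, 12]
max_iter_values = [8, 12, 16]
num_folds = 3

# Every pairing of maxDepth and maxIter is one candidate model
param_combinations = list(product(max_depth_values, max_iter_values))

# Each candidate is trained once per fold during the search
fits_during_search = len(param_combinations) * num_folds

# Plus one final refit of the best candidate on the full training data
total_fits = fits_during_search + 1

print('Candidates: {}'.format(len(param_combinations)))   # Candidates: 9
print('Fits during search: {}'.format(fits_during_search))  # Fits during search: 27
print('Total fits: {}'.format(total_fits))                  # Total fits: 28
```

With a single untuned gradient boosted tree model taking roughly five minutes to train, 28 fits explain why the tuning step dominates the total runtime, and why adding a third parameter to the grid should be weighed carefully.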

Conclusion

Summary

Reflection

Improvements
