Safe driver prediction using PySpark and Logistic Regression
When it comes to data science and machine learning resources and competitions, Kaggle is a great place. One of the competitions hosted there was the Safe Driver Prediction challenge from Porto Seguro, a large insurance company from Brazil. The challenge is to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year.
In this tutorial we will not build a complex model that achieves a very good score; instead, we will focus on the steps to build a distributed machine learning pipeline using Apache Spark, Logistic Regression and the Jupyter Notebook as an environment.
First, let’s talk about the prerequisites you need to have installed:
- Python 2.7
- Apache Spark 2+
- Jupyter Notebook
- pyspark_dist_explore (a Python package that makes it easier to plot histograms of PySpark DataFrames)
- spark_stratifier (a Python package for stratified cross-validation in Spark ML)
If you need a resource on how to make PySpark available in the Jupyter Notebook, please have a look at one of my previous articles.
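As a quick sketch (one common approach, not necessarily the one from the article linked above), you can tell PySpark to use Jupyter as its driver by setting two environment variables before launching:

```shell
# Make the pyspark command start a Jupyter Notebook instead of a plain
# Python shell. Assumes Spark's bin directory is already on your PATH.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
# Then launch it with:
#   pyspark
```

With these set, running `pyspark` opens a notebook server where the `spark` session and `sc` context are already available in every notebook.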