Safe driver prediction using PySpark and Logistic Regression
When it comes to data science and machine learning resources and competitions, Kaggle is a great place. One of the competitions hosted there was the Safe Driver Prediction challenge from Porto Seguro, a large insurance company from Brazil. The challenge is to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year.
In this tutorial we will not build a complex model that achieves a very good score; instead, we will focus on the steps to build a distributed machine learning pipeline using Apache Spark, Logistic Regression and the Jupyter Notebook as an environment.
First, let’s talk about the prerequisites you need to have installed:
- Python 2.7
- Apache Spark 2+
- Jupyter Notebook
- pyspark_dist_explore (a Python package that makes it easier to plot histograms of PySpark DataFrames)
- spark_stratifier (a Python package for stratified cross-validation in Spark ML)
If you need a resource on how to make PySpark available in the Jupyter Notebook, please have a look at one of my previous articles.
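As a quick sketch (one common approach, not necessarily the one from the article linked above), you can tell PySpark to use Jupyter as its driver by setting two environment variables before launching:

```shell
# Make the pyspark command start a Jupyter Notebook instead of a plain
# Python shell. Assumes Spark's bin directory is already on your PATH.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
# Then launch it with:
#   pyspark
```

With these set, running `pyspark` opens a notebook server where the `spark` session and `sc` context are already available in every notebook.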