Safe driver prediction using PySpark and Logistic Regression

Bogdan Cojocar
5 min read · Mar 19, 2018

Kaggle is a great place for data science and machine learning resources and competitions. One of the competitions hosted there was the safe driver prediction challenge from Porto Seguro, a large insurance company from Brazil. The challenge is to build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year.

In this tutorial we will not try to build a complex model that achieves a very good score. Instead, we will focus on the steps needed to build a distributed machine learning pipeline using Apache Spark and Logistic Regression, with the Jupyter Notebook as the environment.
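To make the idea of "predicting a probability" concrete, here is a minimal sketch in plain Python of what a logistic regression model computes for a single driver: a weighted sum of the features passed through the logistic (sigmoid) function. The function names, feature values, weights, and intercept below are made up for illustration; they are not taken from the competition data or from Spark's API.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a value strictly between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def claim_probability(features, weights, intercept):
    """Logistic regression prediction for one row of features."""
    # Linear score: intercept + w1*x1 + w2*x2 + ...
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical values, for illustration only.
example_features = [1.0, 0.0, 3.5]
example_weights = [0.4, -1.2, 0.1]
example_intercept = -3.0

print(claim_probability(example_features, example_weights, example_intercept))
```

Training a model means finding the weights and the intercept from the data; the pipeline in this tutorial leaves that part to Spark.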

First, let’s go over the prerequisites that you need to have installed:

  • Python 2.7
  • Apache Spark 2+
  • Jupyter Notebook
  • pyspark_dist_explore (Python package that makes it easier to plot histograms of PySpark dataframes)
  • spark_stratifier (Python package for stratified cross validation in Spark ML)
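
A quick way to confirm that the packages above are visible to the Python interpreter behind your notebook is to try importing each one. The sketch below is only a convenience check, not part of the original workflow. It assumes the import names match the package names listed above (with `notebook` as the import name for the Jupyter Notebook), and `missing_modules` is a helper name invented here.

```python
import importlib

# Import names assumed to match the prerequisites listed above.
REQUIRED_MODULES = [
    "pyspark",
    "notebook",
    "pyspark_dist_explore",
    "spark_stratifier",
]


def missing_modules(names):
    """Return the entries of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All prerequisites are importable.")
```

Running this in a notebook cell prints the packages that still need to be installed, if any.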

If you need a resource on how to make PySpark available in the Jupyter Notebook, please have a look at one of my previous articles.

Reading the data

