Machine Learning to Kaggle Caravan Insurance Challenge on R

Kieran Tan Kah Wang
Published in The Startup
11 min read · Sep 26, 2020


Photo by Scott Graham on Unsplash

Following on from the previous two posts, this post uses machine learning algorithms to predict which customers are most likely to purchase a caravan insurance policy, based on 85 historic socio-demographic and product-ownership attributes. In the previous post, we used several feature selection methods, such as forward/backward stepwise selection and lasso regularisation, to reduce the number of attributes to fit into our ML algorithms. Here we discuss which set of attributes, as identified by the different feature selection methods, gives the best prediction results.

Logistic Regression model development 1

Since the target attribute (i.e. whether or not the customer purchases a caravan policy) is binary, we can start with logistic regression, the simplest suitable ML algorithm, using the glm function from base R's stats package (glm is not part of glmnet). We first fit all the variables to see how the model performs, using the formula V86 ~ . in which the dot stands for every attribute other than V86, our target variable, as a predictor.
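As a rough illustration of the full-model fit described above, here is a minimal sketch. The data frame is synthetic: the names train, V1, V2 and V3 and all the values are made up for illustration, and only the V86 ~ . formula and the use of glm come from the post. On the real data, train would be the caravan training set with all 85 predictors.

```r
# Synthetic stand-in for the caravan data (assumption: not the real dataset).
# Column names mimic the post's V-numbering; values are simulated.
set.seed(42)
n <- 500
train <- data.frame(
  V1 = rnorm(n),
  V2 = rnorm(n),
  V3 = rnorm(n)
)
# Simulate a fairly rare positive class whose odds depend on V1 only.
train$V86 <- rbinom(n, size = 1, prob = plogis(-2 + 1.2 * train$V1))

# glm() ships with base R (stats package); family = binomial fits a
# logistic regression. "V86 ~ ." uses every other column as a predictor.
fit_all <- glm(V86 ~ ., data = train, family = binomial)

# Coefficient table: estimate, standard error, z value and p-value per term.
coefs <- summary(fit_all)$coefficients
print(coefs)

# Names of predictors whose p-value is below 0.05 (intercept excluded).
signif_vars <- setdiff(
  rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05],
  "(Intercept)"
)
print(signif_vars)
```

Reading the p-value column of the coefficient table is one common way to judge which attributes matter at a given significance level, though with 85 correlated predictors those p-values should be interpreted with some caution.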

Observations from this model:

  • At 5% significance…
