Modeling Home Insurance Quote Conversion

Varsha N Bhat
3 min read · Jul 18, 2019


I recently came across a data science problem from a take-home assignment in which the conversion rate of an online customer is predicted from a number of features, including demographic information, the total number of visits to the site, and the source (how the person arrived at the site). I wanted to find similar problems/datasets and stumbled across a three-year-old Kaggle competition in which the conversion of a prospective customer's home insurance quote is predicted from certain anonymized features.

Most Kaggle competitions provide datasets with anonymized features in order to protect information. This itself adds to the challenge because we can't begin selecting features based on our own biases. This particular dataset has about 300 columns, and in order to build a kernel that runs in, say, under 20 minutes, we need to weed out the columns that don't look important.

The first step is to identify what kind of data these columns contain. A simple .head() operation shows that almost all columns except the quote date hold numeric data. The quote date column should be split into month, day-of-week and year features, because there could be trends or cycles tied to the time of the week or the month in which a prospective customer is more likely to purchase the insurance.
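Here is a minimal sketch of that date split with pandas. The file name train.csv and the column name Original_Quote_Date are assumptions based on the Kaggle Homesite data, not something fixed by this post:

```python
import pandas as pd

# Load the Kaggle Homesite training data (file name is an assumption).
train = pd.read_csv("train.csv")

# Parse the quote date, derive year, month and day-of-week features,
# then drop the original date column.
train["Original_Quote_Date"] = pd.to_datetime(train["Original_Quote_Date"])
train["Quote_Year"] = train["Original_Quote_Date"].dt.year
train["Quote_Month"] = train["Original_Quote_Date"].dt.month
train["Quote_DayOfWeek"] = train["Original_Quote_Date"].dt.dayofweek
train = train.drop(columns=["Original_Quote_Date"])
```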

The next step is to select features. Since the data is anonymized, we don't know much about what each column represents, so a simple check for correlation between all features is a reasonable starting point.

Heat-map of Homesite Insurance Quote Features

The above picture shows the heat-map of correlations between features. The lighter parts of the map represent a strong correlation between two features, in which case we can eliminate one of the pair. We can pick a correlation threshold such as 0.9 and keep only those features that are not strongly correlated with each other. Doing so reduces the total number of features from about 300 columns down to 176.
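A minimal sketch of that threshold filter, assuming the target column is QuoteConversion_Flag and the row id is QuoteNumber (both names come from the Kaggle dataset, and only numeric columns are considered):

```python
import numpy as np

# Keep numeric feature columns only; drop the assumed target and id columns.
features = (
    train.drop(columns=["QuoteConversion_Flag", "QuoteNumber"], errors="ignore")
    .select_dtypes(include=[np.number])
)

# Absolute correlation matrix; keep only the upper triangle so each pair is checked once.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one column from every pair whose correlation exceeds the 0.9 threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X = features.drop(columns=to_drop)
print(f"kept {X.shape[1]} of {features.shape[1]} features")
```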

I chose AdaBoost with a Decision Tree Classifier because it's the most commonly used and easiest-to-understand boosting algorithm. In very simple terms, a boosting algorithm takes multiple weak classifiers (classifiers that only do slightly better than random guessing) and combines them into one strong classifier. One of the biggest advantages of doing so is that boosting reduces overfitting. In AdaBoost, short for Adaptive Boosting, the weak classifiers are combined by giving each of them a coefficient while simultaneously updating the weights on the training examples: the examples that are hardest to fit receive larger weights than those the weak classifiers handle easily. The resulting model is much more expressive than the weak classifiers we begin with. My favorite source for understanding boosting is the free Udacity course: Machine Learning.
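A minimal sketch of fitting AdaBoost with scikit-learn on the reduced feature set from above. The hyperparameters and the -1 fill for missing values are placeholder choices for illustration, not the exact setup used in the kernel:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Crude placeholder handling of missing values so the trees can be fit.
X_filled = X.fillna(-1)
y = train["QuoteConversion_Flag"]

X_train, X_val, y_train, y_val = train_test_split(
    X_filled, y, test_size=0.2, random_state=42, stratify=y
)

# AdaBoost's default weak learner in scikit-learn is a depth-1 decision tree (a stump);
# n_estimators and learning_rate here are illustrative, not tuned.
model = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```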

Using AdaBoost, we can predict quote conversion for prospective customers with approximately 93% accuracy. You can find my code on my GitHub page.
