Ruishi Tao

May 7, 2021

Machine Learning Model Project

Data Source: https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction

Link to Drive: https://drive.google.com/drive/folders/1lSEtjG3HD1FR9TtE2PY_SHgrVQnK_HoK?usp=sharing

I use customer data from an insurance company that has provided health insurance to its customers. The goal is to build an ML model that predicts whether last year's policyholders (customers) will also be interested in the vehicle insurance offered by the company. The training data consists of more than 380,000 rows, while the test data consists of more than 120,000 rows without the target variable. The training data has 11 variables, including Response, the variable we want to predict.

Importing needed packages

In [35]:

import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import sklearn.tree as tree
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
# Note: plot_confusion_matrix was removed in scikit-learn 1.2, so this import needs an older version.
from sklearn.metrics import classification_report, plot_confusion_matrix, accuracy_score, roc_auc_score, make_scorer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

warnings.filterwarnings('ignore')

Preparing Data

First, let's take a quick look at the data.

In [3]:

data = pd.read_csv('train.csv', index_col=0)
data.head()

Out[3]:

id  Gender  Age  Driving_License  Region_Code  Previously_Insured  Vehicle_Age  Vehicle_Damage  Annual_Premium  Policy_Sales_Channel  Vintage  Response
 1    Male   44                1         28.0                   0    > 2 Years             Yes         40454.0                  26.0      217         1
 2    Male   76                1          3.0                   0     1-2 Year              No         33536.0                  26.0      183         0
 3    Male   47                1         28.0                   0    > 2 Years             Yes         38294.0                  26.0       27         1
 4    Male   21                1         11.0                   1     < 1 Year              No         28619.0                 152.0      203         0
 5  Female   29                1         41.0                   1     < 1 Year              No         27496.0                 152.0       39         0

In [4]:

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 381109 entries, 1 to 381109
Data columns (total 11 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Gender                381109 non-null  object 
 1   Age                   381109 non-null  int64  
 2   Driving_License       381109 non-null  int64  
 3   Region_Code           381109 non-null  float64
 4   Previously_Insured    381109 non-null  int64  
 5   Vehicle_Age           381109 non-null  object 
 6   Vehicle_Damage        381109 non-null  object 
 7   Annual_Premium        381109 non-null  float64
 8   Policy_Sales_Channel  381109 non-null  float64
 9   Vintage               381109 non-null  int64  
 10  Response              381109 non-null  int64  
dtypes: float64(3), int64(5), object(3)
memory usage: 34.9+ MB

We can see that Gender, Vehicle_Age, and Vehicle_Damage have the object data type. We need to convert them to numeric types.

In [5]:

data['Gender'].unique()

Out[5]:

array(['Male', 'Female'], dtype=object)

In [6]:

gender_dict = {'Male': 1, 'Female': 0}
data['Gender'] = data['Gender'].map(gender_dict)

In [7]:

data['Vehicle_Age'].unique()

Out[7]:

array(['> 2 Years', '1-2 Year', '< 1 Year'], dtype=object)

In [8]:

vehicle_age_dict = {'> 2 Years':2,'1-2 Year':1,'< 1 Year':0}
data['Vehicle_Age']=data['Vehicle_Age'].map(vehicle_age_dict)

In [9]:

data['Vehicle_Damage'].unique()

Out[9]:

array(['Yes', 'No'], dtype=object)

In [10]:

vehicle_damage_dict = {"Yes":1, "No":0}
data["Vehicle_Damage"]=data["Vehicle_Damage"].map(vehicle_damage_dict)
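The three conversions above follow the same pattern, so they can be written as one loop. The sketch below is an illustration on a small hand-made sample, not the real train.csv; the names sample, encodings, and encoded are hypothetical. It also checks that no category was left unmapped, since Series.map returns NaN for values missing from the dictionary.

```python
import pandas as pd

# Tiny hypothetical sample with the same three text columns as train.csv.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["> 2 Years", "1-2 Year", "< 1 Year"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

# One mapping per column, mirroring the dictionaries used above.
encodings = {
    "Gender": {"Male": 1, "Female": 0},
    "Vehicle_Age": {"> 2 Years": 2, "1-2 Year": 1, "< 1 Year": 0},
    "Vehicle_Damage": {"Yes": 1, "No": 0},
}

encoded = sample.copy()
for column, mapping in encodings.items():
    encoded[column] = encoded[column].map(mapping)
    # Series.map returns NaN for any value missing from the mapping,
    # so a NaN here means a category was not covered.
    assert encoded[column].notna().all(), f"unmapped value in {column}"

print(encoded)
```

Keeping the mappings in one dictionary makes it easier to apply exactly the same encoding to the separate test file later.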

In [11]:

# Check whether the data have proper data types.
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 381109 entries, 1 to 381109
Data columns (total 11 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Gender                381109 non-null  int64  
 1   Age                   381109 non-null  int64  
 2   Driving_License       381109 non-null  int64  
 3   Region_Code           381109 non-null  float64
 4   Previously_Insured    381109 non-null  int64  
 5   Vehicle_Age           381109 non-null  int64  
 6   Vehicle_Damage        381109 non-null  int64  
 7   Annual_Premium        381109 non-null  float64
 8   Policy_Sales_Channel  381109 non-null  float64
 9   Vintage               381109 non-null  int64  
 10  Response              381109 non-null  int64  
dtypes: float64(3), int64(8)
memory usage: 34.9 MB

In [12]:

# Split the training dataset into independent variables X and dependent variable y.
X, y = data.iloc[:, :-1], data.iloc[:, -1]
X.info()
y

<class 'pandas.core.frame.DataFrame'>
Int64Index: 381109 entries, 1 to 381109
Data columns (total 10 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Gender                381109 non-null  int64  
 1   Age                   381109 non-null  int64  
 2   Driving_License       381109 non-null  int64  
 3   Region_Code           381109 non-null  float64
 4   Previously_Insured    381109 non-null  int64  
 5   Vehicle_Age           381109 non-null  int64  
 6   Vehicle_Damage        381109 non-null  int64  
 7   Annual_Premium        381109 non-null  float64
 8   Policy_Sales_Channel  381109 non-null  float64
 9   Vintage               381109 non-null  int64  
dtypes: float64(3), int64(7)
memory usage: 32.0 MB

Out[12]:

id
1         1
2         0
3         1
4         0
5         0
         ..
381105    0
381106    0
381107    0
381108    0
381109    0
Name: Response, Length: 381109, dtype: int64

Data Visualization and Preprocessing

Let’s use data visualization to see and understand hidden trends, outliers, and patterns in data.

In [13]:

sn.pairplot(data)
plt.show()

In [15]:

plt.figure(figsize=(10, 10))
sn.heatmap(data.corr(),annot=True)
plt.show()

The pair plot and the correlation heatmap show no simple linear relationship between any single variable and the response, so there is no obvious subset of variables to select. I therefore decided to use all the variables to train the model. In addition, as the count plot below shows, the numbers of 0 and 1 responses are not balanced, and what we really care about is correctly classifying the 1s (interested customers). The evaluation metric for the models is therefore the ROC AUC score rather than accuracy.

In [ ]:

sn.countplot(x=y)
plt.show()
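The imbalance can also be expressed as numbers rather than read off the plot. The sketch below uses a small hypothetical target column (y_demo) in place of the real y. It computes the share of each class and the accuracy a model would get by always predicting the majority class, which is a useful baseline when reading accuracy figures.

```python
import pandas as pd

# Hypothetical target column: 7 negatives and 1 positive.
y_demo = pd.Series([0, 0, 0, 1, 0, 0, 0, 0], name="Response")

# Share of each class; normalize=True returns fractions instead of counts.
class_share = y_demo.value_counts(normalize=True)
print(class_share)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_share.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Running the same two lines on the real y would give the baseline to compare each model's accuracy against.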

Preprocessing

The features of the dataset have different ranges. To address this, I rescale every feature to a common range, in this case 0–1.

In [16]:

scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)

Split the data into training and test sets.

In [17]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
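The scaler above is fitted on the full dataset before the split, so the minimum and maximum of the test rows influence the scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; X_demo, y_demo, and the other names are hypothetical stand-ins, and this is not the procedure used for the results in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3)) * [1.0, 50.0, 1000.0]
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler_demo = MinMaxScaler(feature_range=(0, 1))
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Training features span exactly [0, 1]; test features may fall slightly outside.
print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```

With a dataset this large the difference is probably small, but fitting on the training rows keeps the test set untouched until evaluation.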

Next, I train four different classification models, select the best one, and fine-tune it.

Logistic Regression

In [24]:

logreg = LogisticRegression(max_iter=10000)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)
print('The accuracy of logistic regression is %.2f%%' % (accuracy_score(y_test, logreg_pred) * 100))
print('The ROC AUC of logistic regression is %.2f%%' % (roc_auc_score(y_test, logreg_pred) * 100))
plot_confusion_matrix(logreg, X_test, y_test)
plt.show()

The accuracy of logistic regression is 87.67%
The ROC AUC of logistic regression is 50.08%
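The ROC AUC above is computed from the hard 0/1 output of predict(). roc_auc_score also accepts continuous scores, such as the class-1 probability from predict_proba, which keeps the full ranking of the test rows. The sketch below compares the two on synthetic imbalanced data generated with make_classification. It is only an illustration under that assumption, and the numbers it prints say nothing about the insurance dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 90% negatives), not the insurance dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Hard 0/1 labels collapse the ranking to a single threshold.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# Probability of class 1 keeps the full ranking of the test rows.
auc_from_proba = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"ROC AUC from predict():       {auc_from_labels:.3f}")
print(f"ROC AUC from predict_proba(): {auc_from_proba:.3f}")
```

A model that predicts 0 for almost every row gets a label-based ROC AUC close to 50% even if its probabilities rank the customers well, so the probability-based score is usually the more informative one on imbalanced data.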

Decision Tree Classifier

In [25]:

dt = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
print('The accuracy of decision tree classifier is %.2f%%' % (accuracy_score(y_test, dt_pred) * 100) )
print('The ROC AUC of decision tree is %.2f%%' % (roc_auc_score(y_test, dt_pred) * 100) )
plot_confusion_matrix(dt, X_test, y_test)
plt.show()

The accuracy of decision tree classifier is 87.67%
The ROC AUC of decision tree is 50.00%

In [26]:

# Plot the fitted tree; feature_names is passed as a plain list of column names.
plt.figure(figsize=(20, 20))
visual_tree = tree.plot_tree(dt, class_names=['0', '1'], feature_names=list(data.columns[:-1]), filled=True, fontsize=10)
plt.show()

Random Forest

In [31]:

rfc = RandomForestClassifier(n_estimators=10, max_features=1, random_state=0)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
print('The accuracy of random forest is %.2f%%' % (accuracy_score(y_test, rfc_pred) * 100) )
print('The ROC AUC of random forest is %.2f%%' % (roc_auc_score(y_test, rfc_pred) * 100) )
plot_confusion_matrix(rfc, X_test, y_test)
plt.show()

The accuracy of random forest is 86.30%
The ROC AUC of random forest is 54.47%

AdaBoost Classifier

In [28]:

ada_boost = AdaBoostClassifier(dt, n_estimators=200, random_state=0, learning_rate=0.05)
ada_boost.fit(X_train, y_train)
ada_pred = ada_boost.predict(X_test)
print('The accuracy of ada boost is %.2f%%' % (accuracy_score(y_test, ada_pred) * 100) )
print('The ROC AUC of ada boost is %.2f%%' % (roc_auc_score(y_test, ada_pred) * 100) )
plot_confusion_matrix(ada_boost, X_test, y_test)
plt.show()

The accuracy of ada boost is 87.68%
The ROC AUC of ada boost is 50.00%

Fine-tuning the random forest classifier

We can see that all the models above have similar accuracy, but the random forest is better than the others at classifying the 1s. That is, the ROC AUC of the random forest is the best among the four models. I therefore fine-tune the n_estimators parameter of the random forest classifier using cross-validation to get better results.

In [40]:

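# Mean 5-fold cross-validated ROC AUC for n_estimators = 1, ..., 19.
accs = []
for i in range(1, 20):
    rfc = RandomForestClassifier(n_estimators=i, max_features=1, random_state=0)
    # cross_val_score clones and fits the model on each fold, so no separate fit is needed.
    scores = cross_val_score(rfc, X_train, y_train, cv=5, scoring=make_scorer(roc_auc_score))
    accs.append(scores.mean())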

In [47]:

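# Use the actual n_estimators values (1 to 19) on the x-axis instead of the list index.
plt.plot(range(1, 20), accs)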
plt.xlabel('n_estimators')
plt.ylabel('ROC AUC')
plt.show()
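The loop above scores each fold with make_scorer(roc_auc_score), which evaluates the hard 0/1 predictions. scikit-learn also provides the built-in scoring string "roc_auc", which scores each fold from predicted probabilities instead. The sketch below runs the same kind of search with that scorer on synthetic data; the data, the grid (1, 5, 10, 20), and the names are assumptions made for illustration, so the result need not match the curve above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data standing in for X_train and y_train.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

mean_auc = {}
for n in (1, 5, 10, 20):
    forest = RandomForestClassifier(n_estimators=n, max_features=1, random_state=0)
    # scoring="roc_auc" scores each fold from predicted probabilities.
    scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="roc_auc")
    mean_auc[n] = scores.mean()

for n, auc in mean_auc.items():
    print(f"n_estimators={n:>2}: mean ROC AUC = {auc:.3f}")
```

Because the two scorers measure different things, the value of n_estimators that looks best can differ between them.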

We can see that the random forest achieves its best ROC AUC when n_estimators = 1, so I use that setting for the final model. Its ROC AUC on the test data is 59.50%. Note that scikit-learn does not prune trees by default (ccp_alpha=0.0 and max_depth=None), so the tree is grown in full; I do not apply any pruning here.

In [51]:

rfc = RandomForestClassifier(n_estimators=1, max_features=1, random_state=0)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
print('The accuracy of random forest is %.2f%%' % (accuracy_score(y_test, rfc_pred) * 100) )
print('The ROC AUC of random forest is %.2f%%' % (roc_auc_score(y_test, rfc_pred) * 100) )
plot_confusion_matrix(rfc, X_test, y_test)
plt.show()

The accuracy of random forest is 82.34%
The ROC AUC of random forest is 59.50%
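The share of actual 1s that a model labels as 1 is usually called the recall for class 1. It can be read directly from the confusion matrix or from classification_report, which is imported at the top but not used. The sketch below uses small hand-made label lists (y_true and y_pred are hypothetical) to show how these numbers are obtained. For the model above, the same calls would take y_test and rfc_pred.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Hypothetical true labels and predictions, chosen by hand for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))

# Recall for class 1 = true positives / all actual positives.
recall_pos = recall_score(y_true, y_pred, pos_label=1)
print(f"Recall for class 1: {recall_pos:.2f}")

# Precision, recall and F1 for both classes in one table.
print(classification_report(y_true, y_pred, digits=3))
```

In this toy example only one of the four actual 1s is found, so the recall for class 1 is 0.25 even though 6 of the 10 predictions are correct.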