Ruishi Tao
7 min read · May 7, 2021

Machine Learning Model Project

Data Source: https://www.kaggle.com/anmolkumar/health-insurance-cross-sell-prediction

Link to Drive: https://drive.google.com/drive/folders/1lSEtjG3HD1FR9TtE2PY_SHgrVQnK_HoK?usp=sharing

I use customer data from an insurance company that has provided Health Insurance to its customers. The goal is to build an ML model that predicts whether the policyholders (customers) from the past year will also be interested in the Vehicle Insurance offered by the company. The training data consists of more than 380,000 rows, while the test data consists of more than 120,000 rows without the target variable. The training data has 11 variables, including Response, the variable we want to predict.

Importing needed packages

In [35]:

import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import sklearn.tree as tree
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, plot_confusion_matrix, accuracy_score, roc_auc_score, make_scorer
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
warnings.filterwarnings('ignore')

Preparing Data

First of all, let’s take a quick look at the data.

In [3]:

data = pd.read_csv('train.csv', index_col=0)
data.head()

Out[3]:

id  Gender  Age  Driving_License  Region_Code  Previously_Insured  Vehicle_Age  Vehicle_Damage  Annual_Premium  Policy_Sales_Channel  Vintage  Response
1   Male    44   1                28.0         0                   > 2 Years    Yes             40454.0         26.0                  217      1
2   Male    76   1                3.0          0                   1-2 Year     No              33536.0         26.0                  183      0
3   Male    47   1                28.0         0                   > 2 Years    Yes             38294.0         26.0                  27       1
4   Male    21   1                11.0         1                   < 1 Year     No              28619.0         152.0                 203      0
5   Female  29   1                41.0         1                   < 1 Year     No              27496.0         152.0                 39       0

In [4]:

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 381109 entries, 1 to 381109
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 381109 non-null object
1 Age 381109 non-null int64
2 Driving_License 381109 non-null int64
3 Region_Code 381109 non-null float64
4 Previously_Insured 381109 non-null int64
5 Vehicle_Age 381109 non-null object
6 Vehicle_Damage 381109 non-null object
7 Annual_Premium 381109 non-null float64
8 Policy_Sales_Channel 381109 non-null float64
9 Vintage 381109 non-null int64
10 Response 381109 non-null int64
dtypes: float64(3), int64(5), object(3)
memory usage: 34.9+ MB

We can see that Gender, Vehicle_Age, and Vehicle_Damage have the object data type. We need to convert them to a numeric type.

In [5]:

data['Gender'].unique()

Out[5]:

array(['Male', 'Female'], dtype=object)

In [6]:

gender_dict = {'Male': 1, 'Female': 0}
data['Gender'] = data['Gender'].map(gender_dict)

In [7]:

data['Vehicle_Age'].unique()

Out[7]:

array(['> 2 Years', '1-2 Year', '< 1 Year'], dtype=object)

In [8]:

vehicle_age_dict = {'> 2 Years':2,'1-2 Year':1,'< 1 Year':0}
data['Vehicle_Age']=data['Vehicle_Age'].map(vehicle_age_dict)

In [9]:

data['Vehicle_Damage'].unique()

Out[9]:

array(['Yes', 'No'], dtype=object)

In [10]:

vehicle_damage_dict = {"Yes":1, "No":0}
data["Vehicle_Damage"]=data["Vehicle_Damage"].map(vehicle_damage_dict)

In [11]:

# Check whether the data have proper data types.
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 381109 entries, 1 to 381109
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 381109 non-null int64
1 Age 381109 non-null int64
2 Driving_License 381109 non-null int64
3 Region_Code 381109 non-null float64
4 Previously_Insured 381109 non-null int64
5 Vehicle_Age 381109 non-null int64
6 Vehicle_Damage 381109 non-null int64
7 Annual_Premium 381109 non-null float64
8 Policy_Sales_Channel 381109 non-null float64
9 Vintage 381109 non-null int64
10 Response 381109 non-null int64
dtypes: float64(3), int64(8)
memory usage: 34.9 MB
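
The three conversions above can also be collected into a single helper. The following is only a minimal sketch on a made-up three-row frame: the `toy` data and the `encode` function are hypothetical names introduced for illustration, while the mappings themselves are the ones used above. The extra check is there because `Series.map` silently produces NaN for any value that is missing from the dictionary.

```python
import pandas as pd

# Hypothetical miniature frame with the same three text columns as train.csv.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Vehicle_Age": ["> 2 Years", "1-2 Year", "< 1 Year"],
    "Vehicle_Damage": ["Yes", "No", "Yes"],
})

# Same category-to-integer mappings as used above, collected in one place.
mappings = {
    "Gender": {"Male": 1, "Female": 0},
    "Vehicle_Age": {"> 2 Years": 2, "1-2 Year": 1, "< 1 Year": 0},
    "Vehicle_Damage": {"Yes": 1, "No": 0},
}


def encode(frame, mappings):
    """Return a copy of frame with each listed column mapped to integers."""
    out = frame.copy()
    for column, mapping in mappings.items():
        encoded = out[column].map(mapping)
        # Series.map yields NaN for values that are not keys of the mapping.
        if encoded.isna().any():
            seen = set(out[column].dropna().astype(str))
            unknown = sorted(seen - set(mapping))
            raise ValueError(f"Unmapped values in {column}: {unknown}")
        out[column] = encoded.astype("int64")
    return out


encoded_toy = encode(toy, mappings)
print(encoded_toy)
```

Failing loudly on an unmapped category is a design choice; it can help when the same mapping is later applied to the separate test file.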

In [12]:

# Split the training dataset into independent variables X and dependent variable y.
X, y = data.iloc[:, :-1], data.iloc[:, -1]
X.info()
y
<class 'pandas.core.frame.DataFrame'>
Int64Index: 381109 entries, 1 to 381109
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Gender 381109 non-null int64
1 Age 381109 non-null int64
2 Driving_License 381109 non-null int64
3 Region_Code 381109 non-null float64
4 Previously_Insured 381109 non-null int64
5 Vehicle_Age 381109 non-null int64
6 Vehicle_Damage 381109 non-null int64
7 Annual_Premium 381109 non-null float64
8 Policy_Sales_Channel 381109 non-null float64
9 Vintage 381109 non-null int64
dtypes: float64(3), int64(7)
memory usage: 32.0 MB

Out[12]:

id
1 1
2 0
3 1
4 0
5 0
..
381105 0
381106 0
381107 0
381108 0
381109 0
Name: Response, Length: 381109, dtype: int64

Data Visualization and Preprocessing

Let’s use data visualization to see and understand hidden trends, outliers, and patterns in data.

In [13]:

sn.pairplot(data)
plt.show()

In [15]:

plt.figure(figsize=(10, 10))
sn.heatmap(data.corr(),annot=True)
plt.show()

The pair plot and the heatmap of the correlation matrix show no simple linear relationship in the data, so there is no obvious subset of variables to select for training. I decided to use all the variables to train the model. In particular, as shown below, the numbers of 0 and 1 responses are not balanced, and what we really need is to classify the 1s correctly, so we will focus on how well the model identifies the 1s. The evaluation metric for the model is therefore the ROC AUC score rather than accuracy.

In [ ]:

sn.countplot(x=y)
plt.show()
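
To put a number on the imbalance, it can help to compute the class shares and the accuracy of a trivial model that always predicts the majority class. This is a small sketch on made-up labels: the 88/12 split in `y_toy` is an assumption chosen for illustration, not the real distribution of Response.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: 88 negatives and 12 positives (illustrative only).
y_toy = pd.Series([0] * 88 + [1] * 12, name="Response")

# Share of each class in the labels.
shares = y_toy.value_counts(normalize=True)
print(shares)

# A "model" that always predicts the majority class.
majority_class = shares.idxmax()
always_majority = np.full(len(y_toy), majority_class)
baseline_accuracy = (always_majority == y_toy.to_numpy()).mean()
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
```

On the real data, `y.value_counts(normalize=True)` would give the corresponding shares. Any model whose accuracy is close to this baseline may simply be predicting the majority class.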

Preprocessing

The features of the dataset do not share the same range. To address this, I normalize every feature to a uniform range, in this case 0–1.

In [16]:

scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)

Split the data into train and test.

In [17]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
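
The notebook scales the full dataset first and splits afterwards. A common variant, sketched below on random toy data, is to split first and fit the scaler on the training part only, so that the test rows do not influence the scaling. The arrays `X_toy` and `y_toy` are hypothetical stand-ins, and this is not the procedure that produced the numbers reported in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data standing in for the real features and labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_toy = rng.integers(0, 2, size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

# ... then learn the min/max from the training rows only.
scaler = MinMaxScaler(feature_range=(0, 1))
X_tr_scaled = scaler.fit_transform(X_tr)
# Reuse the training min/max for the test rows; values may fall
# slightly outside [0, 1] when a test value exceeds the training range.
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```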

Next, I will train four different classification models, select the best one, and fine-tune it.

Logistic Regression

In [24]:

logreg = LogisticRegression(max_iter=10000)
logreg.fit(X_train, y_train)
logreg_pred = logreg.predict(X_test)
print('The accuracy of logistic regression is %.2f%%' % (accuracy_score(y_test, logreg_pred) * 100) )
print('The ROC AUC of logistic regression is %.2f%%' % (roc_auc_score(y_test, logreg_pred) * 100) )
plot_confusion_matrix(logreg, X_test, y_test)
plt.show()
The accuracy of logistic regression is 87.67%
The ROC AUC of logistic regression is 50.08%
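
The code above passes hard 0/1 predictions to roc_auc_score. With hard labels the ROC curve has a single operating point, and the score equals the balanced accuracy (the mean of the recall on each class). ROC AUC is usually computed from continuous scores, such as the predicted probability of class 1. The sketch below shows both on synthetic data; the dataset from make_classification and all variable names are assumptions for illustration, and its numbers say nothing about the insurance data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 88% zeros), for illustration only.
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = model.predict(X_te)                 # 0/1 predictions
positive_scores = model.predict_proba(X_te)[:, 1]  # P(class 1)

# AUC from hard labels: a single threshold, same value as balanced accuracy.
auc_from_labels = roc_auc_score(y_te, hard_labels)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

# AUC from probabilities: ranks the samples across all thresholds.
auc_from_scores = roc_auc_score(y_te, positive_scores)

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {balanced_acc:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

On the real data the equivalent call would be `roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1])`; the result would differ from the 50.08% printed above, which is based on hard labels.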

Decision Tree Classifier

In [25]:

dt = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
print('The accuracy of decision tree classifier is %.2f%%' % (accuracy_score(y_test, dt_pred) * 100) )
print('The ROC AUC of decision tree is %.2f%%' % (roc_auc_score(y_test, dt_pred) * 100) )
plot_confusion_matrix(dt, X_test, y_test)
plt.show()
The accuracy of decision tree classifier is 87.67%
The ROC AUC of decision tree is 50.00%
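
A ROC AUC of exactly 50.00% from hard labels is what a model gets when it predicts the same class for every row. The sketch below uses small made-up label lists (not the notebook's predictions) to show how the recall on class 1 exposes this, using confusion_matrix, recall_score, and classification_report, the last of which is imported at the top of this post.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Hypothetical labels and predictions (illustrative only).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_all_zero = [0] * 10                          # never predicts class 1
y_some_ones = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]   # finds one of the two positives

results = {}
for name, y_pred in [("all zeros", y_all_zero), ("some ones", y_some_ones)]:
    # With labels=[0, 1], ravel() returns tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / len(y_true)
    recall_1 = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
    results[name] = (accuracy, recall_1)
    print(f"{name}: accuracy={accuracy:.2f}, recall on class 1={recall_1:.2f}")

# Per-class precision, recall and F1 in one table.
print(classification_report(y_true, y_some_ones, zero_division=0))
```

In this toy case the all-zeros predictor has the higher accuracy but never identifies a positive, which is why accuracy alone is not used to choose the model here.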

In [26]:

## plot tree
plt.figure(figsize=(20, 20))
visual_tree = tree.plot_tree(dt, class_names=['0', '1'], feature_names=list(data.columns[:-1]), filled=True, fontsize=10)
plt.show()

Random Forest

In [31]:

rfc = RandomForestClassifier(n_estimators=10, max_features=1, random_state=0)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
print('The accuracy of random forest is %.2f%%' % (accuracy_score(y_test, rfc_pred) * 100) )
print('The ROC AUC of random forest is %.2f%%' % (roc_auc_score(y_test, rfc_pred) * 100) )
plot_confusion_matrix(rfc, X_test, y_test)
plt.show()
The accuracy of random forest is 86.30%
The ROC AUC of random forest is 54.47%

AdaBoost Classifier

In [28]:

ada_boost = AdaBoostClassifier(dt, n_estimators=200, random_state=0, learning_rate=0.05)
ada_boost.fit(X_train, y_train)
ada_pred = ada_boost.predict(X_test)
print('The accuracy of ada boost is %.2f%%' % (accuracy_score(y_test, ada_pred) * 100) )
print('The ROC AUC of ada boost is %.2f%%' % (roc_auc_score(y_test, ada_pred) * 100) )
plot_confusion_matrix(ada_boost, X_test, y_test)
plt.show()
The accuracy of ada boost is 87.68%
The ROC AUC of ada boost is 50.00%

Fine-tuning the Random Forest Classifier

All the models above have similar accuracy, but the random forest classifies the 1s more accurately than the others. That is to say, the ROC AUC of the random forest is the best among the four models. So I decided to tune the n_estimators parameter of the random forest classifier using cross-validation to get better results.

In [40]:

accs = []
for i in range(1, 20):
    # cross_val_score clones and fits the estimator on each fold itself,
    # so no separate fit call is needed here.
    rfc = RandomForestClassifier(n_estimators=i, max_features=1, random_state=0)
    scores = cross_val_score(rfc, X_train, y_train, cv=5, scoring=make_scorer(roc_auc_score))
    accs.append(scores.mean())

In [47]:

plt.plot(range(1, 20), accs)  # x-axis: the n_estimators values tried in the loop
plt.xlabel('n_estimators')
plt.ylabel('ROC AUC')
plt.show()
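
The loop above scores each forest with make_scorer(roc_auc_score), which evaluates hard 0/1 predictions. scikit-learn also ships a built-in "roc_auc" scoring string that scores the model's continuous outputs instead. The following is a sketch of the same kind of search using that scorer on synthetic data; the dataset, the candidate values in `candidate_sizes`, and the variable names are assumptions, and the resulting numbers are unrelated to the results in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced data, for illustration only.
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.88, 0.12], random_state=0
)

candidate_sizes = [1, 5, 10, 20]  # hypothetical values of n_estimators to try
mean_auc = {}
for n in candidate_sizes:
    forest = RandomForestClassifier(n_estimators=n, max_features=1, random_state=0)
    # scoring="roc_auc" scores predicted probabilities rather than hard labels.
    scores = cross_val_score(forest, X_toy, y_toy, cv=5, scoring="roc_auc")
    mean_auc[n] = scores.mean()

for n, auc in mean_auc.items():
    print(f"n_estimators={n:>2}: mean ROC AUC = {auc:.4f}")
```

Because the two scorers measure different things, the value of n_estimators that looks best under one may not look best under the other.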

We can see that the random forest gets its best ROC AUC when n_estimators = 1, so I use that setting for the final model. Its ROC AUC on the test data is 59.50%. I did not apply any additional pruning to the model; note that scikit-learn does not prune trees by default, so the tree is grown with the default settings.

In [51]:

rfc = RandomForestClassifier(n_estimators=1, max_features=1, random_state=0)
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
print('The accuracy of random forest is %.2f%%' % (accuracy_score(y_test, rfc_pred) * 100) )
print('The ROC AUC of random forest is %.2f%%' % (roc_auc_score(y_test, rfc_pred) * 100) )
plot_confusion_matrix(rfc, X_test, y_test)
plt.show()
The accuracy of random forest is 82.34%
The ROC AUC of random forest is 59.50%
