Travel Insurance Prediction: Journey from dataset selection to UI Based Prediction

Ankit Goyal
Published in AI Skunks
11 min read · Dec 16, 2022

In this article I want to walk through the whole process of creating a simple data science project, from selecting a dataset, to building a model, to finally deploying it on a server so that it can predict from user-provided feature inputs. To keep this article short, I have covered all the steps briefly, but feel free to refer to my Jupyter Notebook here to follow along and dive deeper into every step of the process. Hope you enjoy reading this article!

1. About the Dataset

A tour & travels company is offering a travel insurance package to its customers. The new insurance package also includes COVID cover. The company wants to know which customers would be interested in buying it based on its database history. The insurance was offered to some of the customers in 2019, and the given data has been extracted from the performance/sales of the package during that period. The data covers almost 2000 previous customers, and we are required to build an intelligent model that can predict whether a customer will be interested in buying the travel insurance package.

Kaggle Dataset Link: https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data

Column Description for our Dataset

Target Variable/Dependent Variable

TravelInsurance — Did the customer buy the travel insurance package during the introductory offering held in 2019? This is the variable we have to predict.

Predictor Variables/Independent Variables

  1. Age — Age of the customer
  2. Employment Type — The sector in which the customer is employed
  3. GraduateOrNot — Whether the customer is a college graduate or not
  4. AnnualIncome — The yearly income of the customer in Indian rupees [rounded to the nearest 50 thousand rupees]
  5. FamilyMembers — Number of members in the customer’s family
  6. ChronicDiseases — Whether the customer suffers from any major disease or condition like diabetes, high BP or asthma, etc.
  7. FrequentFlyer — Derived data based on the customer’s history of booking air tickets on at least 4 different instances in the last 2 years [2017–2019]
  8. EverTravelledAbroad — Has the customer ever travelled to a foreign country
# Loading the file
import pandas as pd

df_train = pd.read_csv('TravelInsurancePrediction.csv')

# Defining the response and the predictors
y_total = df_train['TravelInsurance']
data = df_train.drop(['Unnamed: 0'], axis=1)
df_train = df_train.drop(['TravelInsurance', 'Unnamed: 0'], axis=1)
df_train.head()

2. Exploratory Data Analysis

Q. Why do we really need this step? What if we skip this and jump directly to creating models?

A dataset can have any number of problems: null or blank values, outliers, an imbalanced target, and many more! If we create models without EDA (Exploratory Data Analysis), there is a high chance our end results will be erroneous. So let's briefly discuss some EDA techniques —

2.1 What are the Datatypes of the variables in our dataset?

# Getting the list of categorical columns
categorical_cols = data.select_dtypes(include=['object']).columns.tolist()

# Getting the list of numerical columns
numerical_cols = data.select_dtypes(exclude=['object']).columns.tolist()

# Printing the list of categorical and numerical columns
print("--------------------------------------------------------")
print(" Categorical Variables ")
print("--------------------------------------------------------")
print(f'Total number of categorical variables in our dataset: {len(categorical_cols)}')
for row, col in enumerate(categorical_cols):
    print(f'{row+1}. {col}')
print("\n")
print("--------------------------------------------------------")
print(" Numerical Variables ")
print("--------------------------------------------------------")
print(f'Total number of num variables in our dataset: {len(numerical_cols)}')
for row, col in enumerate(numerical_cols):
    print(f'{row+1}. {col}')

--------------------------------------------------------
Categorical Variables
--------------------------------------------------------
Total number of categorical variables in our dataset: 6
1. Employment Type
2. GraduateOrNot
3. ChronicDiseases
4. FrequentFlyer
5. EverTravelledAbroad
6. TravelInsurance


--------------------------------------------------------
Numerical Variables
--------------------------------------------------------
Total number of num variables in our dataset: 3
1. Age
2. AnnualIncome
3. FamilyMembers

2.2 Are there missing values? Which independent variables have missing data? How much?

#Checking missing values in our data
data.isnull().sum()

Age 0
Employment Type 0
GraduateOrNot 0
AnnualIncome 0
FamilyMembers 0
ChronicDiseases 0
FrequentFlyer 0
EverTravelledAbroad 0
TravelInsurance 0
dtype: int64

Observations

  • We have 0% missing values in both our independent variables and our dependent variable

2.3 What are the likely distributions of the numeric variables?

Let's check the distribution of AnnualIncome as an example; the same can be done for the other independent variables as well! (Refer to the notebook here.)

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15,7))
sns.distplot(data['AnnualIncome'], color="darkgreen")
print("The mean income is ", round(data['AnnualIncome'].mean(), 2))
print("The median income is ", data['AnnualIncome'].median())
plt.xlabel("Annual Income (in Millions)", size=14)
plt.ylabel("Density", size=14)
plt.title('Distribution curve for Annual Income', size=20)

The mean income is 932762.96
The median income is 900000.0

Observations:

  1. As per the graph, the income distribution roughly follows a normal distribution
  2. There is a significant drop in the number of people with incomes between 1.6M and 1.8M
  3. Here too, the mean and median are almost the same

2.4 Does the range of our independent variables make sense? Potential Outliers?

# Checking the ranges of the predictor variables together after normalization of the numerical variables
# (x is the predictor DataFrame after the Min-Max scaling shown in Section 2.7; see the notebook for how it is prepared)
plt.figure(figsize=(20,7))
sns.boxplot(data=x, palette="Set3")
plt.title("Box plot of predictor variables of the dataset", size=14)

Observations:

  • The majority of the travellers are older than the median age of ~29 years
  • The number of family members above and below the median is almost the same
  • The income appears to be normally distributed, as the 25th–50th and 50th–75th percentile ranges cover almost similar areas
  • The number of travellers without a graduate degree is low
  • The number of travellers who are frequent flyers or have travelled abroad is also low
  • There are no outliers in our dataset

2.5 How do we check if one or more of our independent variables are correlated?

This is one of the more important steps in EDA. It matters because if two feature columns are very similar, the model is effectively learning from redundant information, which will not give us great results! So how do we check for correlation?

  1. Using a Correlation Matrix
# Let's check the correlation among the predictor variables using a correlation matrix
x.corr()

2. Using a Heatmap

# The heatmap of the correlation matrix
plt.figure(figsize=(20,7))
sns.heatmap(x.corr(), annot=True, cmap='RdYlGn')

Observations:

  • It is very clear from the heatmap that most of the variables do not depend on each other
  • The degree of collinearity is significantly less than 0.1 for most variables
  • AnnualIncome and EverTravelledAbroad have a degree of collinearity of 0.49, which is still not a significant dependency between the variables

2.6 Do the training and test sets have the same data?

In this step, we want to check whether the distribution of values in our training and testing datasets is very different or similar. For example, take the AnnualIncome variable in our dataset: there is a possibility that the test dataset has income values below 20K while all the values in our train set are above 100K. In that case our predictions might not be very accurate.
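A minimal sketch of one way to run this check, assuming the train/validation split produced later in Section 3.1 (the helper name compare_splits is just for illustration): compare summary statistics of a numeric column across the two splits and confirm the ranges are similar.

import pandas as pd

def compare_splits(train_df, test_df, column):
    """Side-by-side summary statistics for one column of the two splits."""
    return pd.concat(
        [train_df[column].describe(), test_df[column].describe()],
        axis=1, keys=["train", "test"]
    )

# Example usage once train_X and val_X exist (they are created by the
# train_test_split call in Section 3.1):
# print(compare_splits(train_X, val_X, 'AnnualIncome'))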

2.7 Normalizing our dataset as our features are not on the same scale

# Normalizing our data for proper analysis

# list of numerical columns which require normalization
num_cols=['AnnualIncome','Age', 'FamilyMembers']

# Importing required library from sklearn for normalization
from sklearn import preprocessing
feature_to_scale = num_cols

# Preparing for normalizing
min_max_scaler = preprocessing.MinMaxScaler()

# Transform the data to fit minmax processor
df_train[feature_to_scale] = min_max_scaler.fit_transform(df_train[feature_to_scale])

df_train.head()

3. Creating a Bunch of Models!

3.1 Fitting a Logistic Regression Model

import statsmodels.api as sd
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(df_train, y_total, random_state=1)
log_reg = sd.Logit(train_y, train_X).fit()
print(log_reg.summary())

Observations of Coefficients

  1. From the table we can clearly see that the p-values of Age, Employment Type, AnnualIncome, FamilyMembers, FrequentFlyer and EverTravelledAbroad are less than 5%, so they are important features
  2. Features such as ChronicDiseases and GraduateOrNot have p-values greater than 5%, so they are not important features
  3. Though the p-values differ across features, we can't actually rank the features at this point. What can we do next? Run a SHAP analysis (we will do this under model interpretability)

3.2 Fitting a Tree-Based Model and Interpreting Feature Importance

from sklearn.tree import DecisionTreeClassifier

# Creating the decision tree classifier model
my_model = DecisionTreeClassifier(random_state=42).fit(train_X, train_y)

# Checking the feature importance using the decision tree model
plt.figure(figsize=(15,6))
importances = my_model.feature_importances_
features = list(train_X.columns)
plt.title("Decision Tree Model : Feature Importance")
plt.xlabel('Feature importance')
plt.ylabel('Features')
plt.barh(features, importances, color='g')

Observations

  1. It is clear from the above plot that AnnualIncome is the most important feature for predicting the decision to buy travel insurance
  2. Features such as FamilyMembers and Age are the next most important features

Plotting the Decision Tree


from sklearn import tree

fig = plt.figure(figsize=(20, 14))
# class_names must be listed in ascending order of the target values: 0 = not taken, 1 = taken
vis = tree.plot_tree(my_model, feature_names=features, class_names=['Insurance Not Taken', 'Insurance Taken'], max_depth=4, fontsize=9, proportion=True, filled=True, rounded=True)

Observations

  1. It is clear from the decision tree chart that AnnualIncome is the first node on which the samples are split
  2. Roughly 83% of the samples have a relatively low salary compared to a few wealthy ones

3.3 Using an AutoML Model

What is AutoML? AutoML, or Automated Machine Learning, is a machine learning approach that automates the training, tuning, and deployment of machine learning models. AutoML can be used to automatically discover the best model for a given dataset and task without any human intervention.

# Defining the environment variables for h2o
import psutil
import h2o
from h2o.automl import H2OAutoML
import random, os, sys
from datetime import datetime
import logging
import optparse
import time
import json

min_mem_size = 6   # default minimum memory for the h2o cluster (GB); recomputed below
run_time = 60      # maximum AutoML run time in seconds
pct_memory = 0.5   # fraction of available memory to give to h2o

# Size the h2o cluster to half of the available memory (in GB)
virtual_memory = psutil.virtual_memory()
min_mem_size = int(round(int(pct_memory*virtual_memory.available)/1073741824, 0))
print(min_mem_size)

# Pick a random port for the h2o server
port_no = random.randint(5555, 55555)

try:
    h2o.init(strict_version_check=False, min_mem_size_GB=min_mem_size, port=port_no)  # start the h2o cluster
except:
    logging.critical('h2o.init')
    h2o.download_all_logs(dirname=logs_path, filename=logfile)  # logs_path and logfile are defined in the notebook
    h2o.cluster().shutdown()
    sys.exit(2)

# Importing our file
df_h = h2o.import_file('TravelInsurancePrediction.csv')
df_h = df_h.drop(['C1'],axis=1)

df_h.head()

# Doing the train-test split
pct_rows=0.80
df_train, df_test = df_h.split_frame([pct_rows])
print('Train dataframe size')
print(df_train.shape)
print('Test dataframe size')
print(df_test.shape)
#defining the predictor and response variable for our model
X=df_h.columns
y ='TravelInsurance'
# df_h['TravelInsurance'] = df_h['TravelInsurance'].apply(convert_binary_to_yesno)
X.remove(y)
print('The predictor variables are as follows')
print(X)
print('The response variable is')
print(y)

# Output
# Train dataframe size
# (1569, 9)
# Test dataframe size
# (418, 9)
# The predictor variables are as follows
# ['Age', 'Employment Type', 'GraduateOrNot', 'AnnualIncome', 'FamilyMembers', 'ChronicDiseases', 'FrequentFlyer', 'EverTravelledAbroad']
# The response variable is
# TravelInsurance

# Starting our model training
aml = H2OAutoML(max_runtime_secs=run_time, seed=1)
aml.train(x=X,y=y,training_frame=df_train)
print('Training Successful....')
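Once training completes, we can inspect what AutoML actually built. Below is a minimal sketch using H2O's standard leaderboard API; the .asfactor() note is an assumption about how the target should be prepared for classification, and the notebook may handle it differently.

# A quick look at the AutoML results. If the target column is left numeric,
# H2O treats this as regression; converting it to a factor with
# df_train[y] = df_train[y].asfactor() before training makes it a classification task.
lb = aml.leaderboard
print(lb.head(rows=10))                       # candidate models ranked by the default metric

best_model = aml.leader                       # the top model on the leaderboard
perf = best_model.model_performance(df_test)  # evaluate it on the held-out test frame
print(perf)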

4. Understanding our Models and Evaluating them

First, let's do a SHAP analysis, followed by evaluation metrics such as log loss, confusion matrix, accuracy, etc. for the models created in Step 3. But before we proceed, what is SHAP?

SHAP is a mathematical method to explain the predictions of machine learning models. It is based on the concepts of game theory and can be used to explain the predictions of any machine learning model by calculating the contribution of each feature to the prediction.

4.1.1 SHAP for the Logistic Regression Model

import shap
from sklearn.linear_model import LogisticRegression

clf_model = LogisticRegression(random_state=0).fit(train_X, train_y)
explainer = shap.LinearExplainer(clf_model, train_X, feature_perturbation="interventional")
shap_values = explainer.shap_values(val_X)

shap.summary_plot(shap_values, val_X, feature_names=list(train_X.columns))

Observations from the SHAP summary

  1. People who have a high income, have more family members, are older, and have travelled abroad seem more likely to buy travel insurance
  2. People who fly less, have a significantly lower annual income, and are younger seem not to buy travel insurance. This makes sense because younger people usually don't earn much, so they fly less and therefore might not buy travel insurance

4.1.2 Evaluating our Logistic Regression Model

import sklearn.metrics as sm

# log_reg.predict() from statsmodels returns predicted probabilities
# acc = sm.accuracy_score(val_y, log_reg.predict(val_X))
log_loss = sm.log_loss(val_y, log_reg.predict(val_X))
auc = sm.roc_auc_score(val_y, log_reg.predict(val_X))
# confusion_matrix = sm.confusion_matrix(val_y, log_reg.predict(val_X))
print("-------------------------------")
print(f'AUC: {auc:.2f}')
print(f'Log Loss: {log_loss:.2f}')
print("-------------------------------")
-------------------------------
AUC: 0.68
Log Loss: 0.56
-------------------------------

Observations:

  1. Our AUC value is 0.68, which is greater than 0.5, so we can say that our logistic regression model is doing fairly well!

We can perform the same steps for our other two models, that is, the decision tree classifier model and the AutoML model (refer to the notebook); a quick sketch for the decision tree model is shown below.
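A minimal, hedged sketch (not the notebook's exact code) of that evaluation for the decision tree:

# Applying the same metrics to the decision tree classifier from Section 3.2
import sklearn.metrics as sm

tree_probs = my_model.predict_proba(val_X)[:, 1]   # predicted probability of buying insurance
tree_preds = my_model.predict(val_X)               # hard 0/1 predictions

print(f'AUC: {sm.roc_auc_score(val_y, tree_probs):.2f}')
print(f'Log Loss: {sm.log_loss(val_y, tree_probs):.2f}')
print(f'Accuracy: {sm.accuracy_score(val_y, tree_preds):.2f}')
print('Confusion Matrix:')
print(sm.confusion_matrix(val_y, tree_preds))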

5. Model Selection

Now, to give a quick recap: we did some initial exploratory data analysis and created a few models, namely the logistic regression model, the decision tree classifier model and finally the AutoML model. The AutoML model serves its purpose of giving us the best model, as it automatically discovers the best model for a given dataset and task without any human intervention.

So why did we even create the other two models? My goal was to explain how you can try individual models, interpret them and finally evaluate them. For the purpose of simplicity, we will move forward with deploying the decision tree classifier model!

6. Creating a Pickle File

Pickle in Python is primarily used in serializing and deserializing a Python object structure. In other words, it’s the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.

### Create a Pickle file using serialization 
import pickle
pickle_out = open("tress_model.pkl","wb")
pickle.dump(my_model, pickle_out)
pickle_out.close()

Once you execute this step, you will have a .pkl file in your directory, which we will use to make predictions based on inputs given by the end-user.
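As a quick sanity check, the file can be loaded back and used to predict on a single preprocessed row. A minimal sketch, assuming val_X from the earlier split is still available:

import pickle

# Load the serialized model back from disk
with open("tress_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

# Predict for one already-preprocessed row of features (here, the first validation row)
sample = val_X.iloc[[0]]
print(loaded_model.predict(sample))   # 1 = will buy travel insurance, 0 = will not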

7. Creating our App and Deploying it on a Server using Streamlit & Heroku

What is Streamlit?

Streamlit is an open-source app framework in Python. It helps us create web apps for data science and machine learning in a short time. It is compatible with major Python libraries such as scikit-learn, Keras, PyTorch, SymPy (LaTeX), NumPy, pandas, Matplotlib, etc.

You can follow the steps below to create a Streamlit application, or feel free to refer to this video by Krish Naik.

  1. In your command prompt, install Streamlit — pip install streamlit
  2. Create a .py file with your code for prediction (please see this for reference) and make sure the Streamlit library is imported in the code — import streamlit as st
  3. Next, open the .pkl file we created earlier in read mode
pickle_in = open("tress_model.pkl","rb")
classifier = pickle.load(pickle_in)

4. Working on the app UI — write a title for your application and a set of input widgets that collect the feature values from the end-user, which the model uses to predict whether the decision to buy travel insurance is Yes or No (see the sketch after this list)

5. At this point, the application we created is running on your local machine. If you now wish to deploy it online on a cloud-based web server, you can use Heroku. Please feel free to refer to this amazing video by Krish Naik for reference.
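For reference, here is a minimal, illustrative sketch of such a Streamlit app. The file name app.py, the widget labels, and the simplified feature encoding are assumptions for illustration; the real model expects the full feature set with the same encoding and scaling used during training, so follow the repository code for the exact preprocessing.

# app.py — a minimal, illustrative Streamlit UI (not the exact app from the repository)
import pickle
import streamlit as st

# Load the serialized decision tree model
pickle_in = open("tress_model.pkl", "rb")
classifier = pickle.load(pickle_in)

st.title("Travel Insurance Prediction")

# Collect feature values from the end-user
age = st.number_input("Age", min_value=18, max_value=100, value=30)
income = st.number_input("Annual Income (INR)", min_value=0, value=900000, step=50000)
family = st.number_input("Family Members", min_value=1, max_value=15, value=4)
chronic = st.selectbox("Chronic Diseases", ["No", "Yes"])
flyer = st.selectbox("Frequent Flyer", ["No", "Yes"])
abroad = st.selectbox("Ever Travelled Abroad", ["No", "Yes"])

if st.button("Predict"):
    # NOTE: the features must be encoded and scaled exactly as during training;
    # this simplified row is only a placeholder for that preprocessing step
    row = [[age, income, family,
            1 if chronic == "Yes" else 0,
            1 if flyer == "Yes" else 0,
            1 if abroad == "Yes" else 0]]
    prediction = classifier.predict(row)[0]
    st.success("Will buy travel insurance" if prediction == 1 else "Will not buy travel insurance")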

What is Heroku?

Heroku is a container-based cloud Platform as a Service (PaaS). Developers use Heroku to deploy, manage, and scale modern apps. The platform is elegant, flexible, and easy to use, offering developers the simplest path to getting their apps to market.

8. Time to Make Some Predictions!

Voila! If you made it to this step, I want to thank you for taking the time to read through the entire article! Now you have a fun application where users can try a bunch of feature values and check the result predicted by our model! The entire process can be repeated for any other dataset. I hope you had fun reading this article! Please upvote if you liked it!

Link to WebApp — https://travel-insurance-prediction.herokuapp.com

Citations & References

1. Many techniques used in this article and the notebook have been adapted from the following GitHub repositories:

Owner — AI Skunkworks
Link — https://github.com/aiskunks/Skunks_Skool

Author — Prof. Nik Bear Brown
Link — https://github.com/nikbearbrown/

Author — Krish Naik
Links — https://github.com/krishnaik06/Dockers, https://www.youtube.com/watch?v=5XnHlluw-Eo&t=93s&ab_channel=KrishNaik, https://www.youtube.com/watch?v=IWWu9M-aisA&ab_channel=KrishNaik

2. The methods and parameters of the models and code corrections have been adapted from Stack Overflow.
Link — https://stackoverflow.com

3. Reference has been taken from the Seaborn webpage for charts and visualization.
Link — https://seaborn.pydata.org

4. The methods and parameters of the AutoML model have been adapted from the H2O documentation.
Author — H2O.ai
Link — https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/glm.html

Ankit Goyal
AI Skunks

Graduate Student @ Northeastern University, Boston