Insurance cost prediction using linear regression

Vinayak Shukla
5 min read · Jun 5, 2020

Here we continue with the second week of Jovian.ml's “Deep Learning with PyTorch: Zero to GANs” course, where we are going to use information about a person to predict their yearly medical bills, which is ultimately used to predict insurance cost.

For an introduction to PyTorch and regression, you can visit
Intro to PyTorch and Linear Regression.

Overview
Here we predict the insurance cost using the concept of linear regression, implemented with PyTorch.

The dataset that we are using is taken from here.

First we need to install packages such as numpy, pandas, matplotlib and PyTorch (using conda and pip) so that we can import what we need from them.

# Uncomment the lines below if the packages are not already installed
# !conda install numpy pytorch torchvision cpuonly -c pytorch -y
# !pip install matplotlib --upgrade --quiet
# !pip install pandas
!pip install jovian --upgrade --quiet

Now we import the required modules:

import torch
import jovian
import torchvision
import torch.nn as nn
import pandas as pd
import matplotlib.pyplot as plt
import torch.nn.functional as F
from torchvision.datasets.utils import download_url
from torch.utils.data import DataLoader, TensorDataset, random_split

Download and explore the data

To download the data as a CSV file we use the download_url function from torchvision, and to load the dataset into memory we use the read_csv function from the Pandas library.

DATASET_URL = "https://hub.jovian.ml/wp-content/uploads/2020/05/insurance.csv"
DATA_FILENAME = "insurance.csv"
download_url(DATASET_URL, '.')
dataframe_raw = pd.read_csv(DATA_FILENAME)

head() shows the first 5 records:

dataframe_raw.head()

A DataFrame is the 2-D data structure of Pandas. It stores data in tabular form, as rows and columns, and lets us manipulate that data.
To get the number of rows and columns together we can use the shape attribute:

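# dataframe is the working dataframe prepared from dataframe_raw
# (the preparation step is not shown in this post)
row_cols = dataframe.shape
print(row_cols)
>> (1271, 6)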
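
As a small illustration of reading the shape, the sketch below builds a tiny made-up dataframe (the values are hypothetical and only loosely resemble the insurance data) and unpacks shape into a row count and a column count. The names toy_df, num_rows and num_cols are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate .shape
toy_df = pd.DataFrame({
    "age": [19, 33, 57],
    "bmi": [27.9, 22.7, 23.3],
    "charges": [16884.92, 21984.47, 13232.22],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
num_rows, num_cols = toy_df.shape
print(num_rows, num_cols)  # 3 3
```

A row count obtained this way is what the train/validation split later in the post relies on.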

Prepare the dataset for training

Now we need to convert the data from the Pandas dataframe into PyTorch tensors for training.

To do this, the first step is to convert it to numpy arrays.

# categorical_cols, input_cols and output_cols are lists of column names
# that must be defined before calling this function
def dataframe_to_arrays(dataframe):
    # Make a copy of the original dataframe
    dataframe1 = dataframe.copy(deep=True)
    # Convert non-numeric categorical columns to numbers
    for col in categorical_cols:
        dataframe1[col] = dataframe1[col].astype('category').cat.codes
    # Extract inputs & outputs as numpy arrays
    inputs_array = dataframe1[input_cols].to_numpy()
    targets_array = dataframe1[output_cols].to_numpy()
    return inputs_array, targets_array
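
To see what the categorical conversion does, here is a minimal sketch on a made-up dataframe. The column names sex and smoker are assumptions for illustration. Pandas sorts the distinct values and assigns integer codes in that order.

```python
import pandas as pd

# Hypothetical toy data with two text columns
toy_df = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
})

# Same conversion as in dataframe_to_arrays: text categories -> integer codes
for col in ["sex", "smoker"]:
    toy_df[col] = toy_df[col].astype("category").cat.codes

print(toy_df["sex"].tolist())     # [0, 1, 1, 0]  (female=0, male=1)
print(toy_df["smoker"].tolist())  # [1, 0, 0, 1]  (no=0, yes=1)
```

The codes are arbitrary labels, which is acceptable for a two-valued column but implies an ordering when a column has more than two categories.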

Next we convert the numpy arrays (inputs_array and targets_array) into PyTorch tensors:

inputs_array, targets_array = dataframe_to_arrays(dataframe)
inputs = torch.from_numpy(inputs_array).type(torch.float32)
targets = torch.from_numpy(targets_array).type(torch.float32)
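
The cast to float32 matters because numpy creates float64 arrays by default, while the weights of nn.Linear are float32 by default. A minimal sketch with a made-up array (toy_array and toy_tensor are illustrative names):

```python
import numpy as np
import torch

# Hypothetical toy array; numpy infers float64 for mixed int/float values
toy_array = np.array([[19, 27.9], [33, 22.7]])

# from_numpy keeps the numpy dtype, so we cast to float32 explicitly
toy_tensor = torch.from_numpy(toy_array).type(torch.float32)

print(toy_array.dtype)   # float64
print(toy_tensor.dtype)  # torch.float32
print(toy_tensor.shape)  # torch.Size([2, 2])
```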

Now we create a TensorDataset, since we need PyTorch datasets and data loaders for training.

dataset = TensorDataset(inputs, targets)

Next we decide what fraction of the data will be held out for the validation set, here 0.2 (20%). Then we use random_split to create the training and validation datasets.

num_rows = len(dataset)  # total number of samples
val_percent = 0.2
val_size = int(num_rows * val_percent)
train_size = num_rows - val_size

train_ds, val_ds = random_split(dataset, [train_size, val_size])
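
The split arithmetic can be checked on a small stand-in dataset. This sketch uses 10 random samples with 5 features each (all made up). If the dataset has the 1271 rows shown earlier, the same arithmetic would give 254 validation rows and 1017 training rows.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in data: 10 samples, 5 input features, 1 target
toy_inputs = torch.randn(10, 5)
toy_targets = torch.randn(10, 1)
toy_dataset = TensorDataset(toy_inputs, toy_targets)

val_percent = 0.2
val_size = int(len(toy_dataset) * val_percent)  # int(2.0) -> 2
train_size = len(toy_dataset) - val_size        # 10 - 2 -> 8

# The two lengths must add up to the dataset length
toy_train_ds, toy_val_ds = random_split(toy_dataset, [train_size, val_size])
print(len(toy_train_ds), len(toy_val_ds))  # 8 2
```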

Now we create data loaders for training and validation. For that we need to fix a batch size; here we use a batch_size of 35.

batch_size = 35
train_loader = DataLoader(train_ds, batch_size, shuffle=True)
val_loader = DataLoader(val_ds, batch_size)
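
To illustrate how a data loader splits a dataset into batches, here is a small sketch with made-up data: 10 samples and a batch size of 4, so the last batch is smaller than the others.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data: 10 samples, 5 input features, 1 target
toy_dataset = TensorDataset(torch.randn(10, 5), torch.randn(10, 1))
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

# Each iteration yields one (inputs, targets) batch
batch_sizes = [xb.shape[0] for xb, yb in toy_loader]
print(batch_sizes)      # [4, 4, 2]
print(len(toy_loader))  # 3 batches per epoch
```

shuffle=True changes which samples land in each batch on every epoch, but not the batch sizes.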

Create a Linear Regression Model

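To create the model, which we call the insurance model, we first need to pick a suitable loss function. Cross entropy is not appropriate here, because this is a regression problem rather than classification. Try two or three loss functions and choose the one that works best. You can choose from here.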
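
As a rough comparison of candidate loss functions, the sketch below evaluates three regression losses from torch.nn.functional on the same made-up predictions and targets. The numbers are hypothetical; they are chosen so that one prediction is far off, which shows how strongly the squared error reacts to outliers.

```python
import torch
import torch.nn.functional as F

# Hypothetical predictions and targets; the last prediction is off by 100
preds = torch.tensor([100.0, 200.0, 400.0])
targets = torch.tensor([110.0, 190.0, 300.0])

l1 = F.l1_loss(preds, targets)             # mean of |error|
mse = F.mse_loss(preds, targets)           # mean of error squared
smooth = F.smooth_l1_loss(preds, targets)  # quadratic for small errors, linear otherwise

print(l1.item())      # about 40.0   = (10 + 10 + 100) / 3
print(mse.item())     # about 3400.0 = (100 + 100 + 10000) / 3
print(smooth.item())  # about 39.5   = (9.5 + 9.5 + 99.5) / 3
```

The L1 loss is in the same units as the target, which makes it easy to read as an average error in predicted charges.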

We define a class InsuranceModel for linear regression. It inherits from nn.Module, which is the base class for all neural network modules. Our InsuranceModel contains a single linear layer, and we have chosen l1_loss as the loss function.

# input_size and output_size must be defined beforehand
# (number of input columns and number of output columns)
class InsuranceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)

    def forward(self, xb):
        out = self.linear(xb)
        return out

    def training_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss
        loss = F.l1_loss(out, targets)
        return loss

    def validation_step(self, batch):
        inputs, targets = batch
        # Generate predictions
        out = self(inputs)
        # Calculate loss
        loss = F.l1_loss(out, targets)
        return {'val_loss': loss.detach()}

    def validation_epoch_end(self, outputs):
        # Average the per-batch validation losses
        batch_losses = [x['val_loss'] for x in outputs]
        epoch_loss = torch.stack(batch_losses).mean()
        return {'val_loss': epoch_loss.item()}

    def epoch_end(self, epoch, result, num_epochs):
        # Print result every 20th epoch and after the last epoch
        if (epoch+1) % 20 == 0 or epoch == num_epochs-1:
            print("Epoch [{}], val_loss: {:.4f}".format(epoch+1, result['val_loss']))

To instantiate our class:

model = InsuranceModel()
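
To get a feel for what the linear layer inside the model holds, here is a minimal sketch. The sizes are assumptions: the sample prediction later in this post shows 5 input values and 1 target, so input_size = 5 and output_size = 1 are used here.

```python
import torch
import torch.nn as nn

# Assumed sizes, inferred from the sample prediction shown later in the post
input_size, output_size = 5, 1
layer = nn.Linear(input_size, output_size)

# One weight per input feature, plus one bias per output
print(layer.weight.shape)  # torch.Size([1, 5])
print(layer.bias.shape)    # torch.Size([1])

# A batch of 3 samples produces 3 predictions
batch = torch.randn(3, input_size)
print(layer(batch).shape)  # torch.Size([3, 1])
```

The weights and bias start out random, which is why the validation loss before training is usually large.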

Train the model to fit the data

We now use two functions. The evaluate function calculates the loss on the validation set, and we call it once before training. The fit function trains the model; we call it with different numbers of epochs and different learning rates until we get a good result. If the loss becomes too large, re-initialize the model and try again.

def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimizer = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # Training Phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result, epochs)
        history.append(result)
    return history
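
One pass through the inner training loop (loss, backward, step, zero_grad) can be traced by hand on a single parameter. This is a sketch with made-up numbers, not the insurance model: one weight w, one input x and one target, with the same L1 loss and SGD optimizer.

```python
import torch
import torch.nn.functional as F

# Hypothetical one-parameter "model": prediction = w * x
w = torch.tensor([1.0], requires_grad=True)
x = torch.tensor([2.0])
target = torch.tensor([10.0])
optimizer = torch.optim.SGD([w], lr=0.1)

loss = F.l1_loss(w * x, target)    # |1 * 2 - 10| = 8
loss.backward()                    # d(loss)/dw = sign(2 - 10) * x = -2
grad_before_step = w.grad.clone()  # keep a copy before it is reset
optimizer.step()                   # w <- 1.0 - 0.1 * (-2) = 1.2
optimizer.zero_grad()              # reset so gradients do not accumulate

print(loss.item())              # 8.0
print(grad_before_step.item())  # -2.0
print(w.item())                 # about 1.2
```

With the L1 loss the gradient depends only on the sign of the error, so the step size is governed by the learning rate rather than by how large the error is.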

This prints the validation loss before training:

result = evaluate(model, val_loader) # Use the evaluate function
print(result)

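Call fit several times, changing the number of epochs and the learning rate each time (learning rates from 1e-1 down to 1e-6), and store the result of each run in its own variable (history1, history2 and so on). The final run looks like this:

epochs = 250
lr = 1e-6
history5 = fit(epochs, lr, model, train_loader, val_loader)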

After the last call to fit, we collect the validation losses from all runs, starting with the loss before training, and plot them:

val_loss = [result] + history1 + history2 + history3 + history4 + history5
print(val_loss)
val_loss_list = [vl['val_loss'] for vl in val_loss]

plt.plot(val_loss_list, '-x')

plt.xlabel('epochs')
plt.ylabel('losses')

Make predictions using the trained model

To make a prediction on a single input, we define a function named predict_single, which takes an input and a target (from the validation dataset) and the model as parameters.

def predict_single(input, target, model):
    inputs = input.unsqueeze(0)
    predictions = model(inputs)
    prediction = predictions[0].detach()
    print("Input:", input)
    print("Target:", target)
    print("Prediction:", prediction)

input, target = val_ds[0]
predict_single(input, target, model)
input, target = val_ds[10]
predict_single(input, target, model)
input, target = val_ds[23]
predict_single(input, target, model)

The last call prints:

Input: tensor([57.0000, 0.0000, 23.3415, 0.0000, 0.0000])
Target: tensor([13232.2158])
Prediction: tensor([13943.2725])
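
The unsqueeze(0) call in predict_single turns one sample into a batch of size 1, matching the batched shape the model is trained on. The sketch below traces the shapes, using an untrained nn.Linear(5, 1) as a stand-in for the trained model (an assumption for illustration) and the input values printed above.

```python
import torch
import torch.nn as nn

# Untrained stand-in for the trained model (assumed 5 inputs, 1 output)
toy_model = nn.Linear(5, 1)

sample = torch.tensor([57.0, 0.0, 23.3415, 0.0, 0.0])
batch = sample.unsqueeze(0)  # add a batch dimension at position 0
preds = toy_model(batch)

print(sample.shape)             # torch.Size([5])
print(batch.shape)              # torch.Size([1, 5])
print(preds.shape)              # torch.Size([1, 1])
print(preds[0].detach().shape)  # torch.Size([1])
```

Indexing with [0] removes the batch dimension again, and detach() returns a tensor that no longer tracks gradients, which is all that is needed for printing.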

If you are not satisfied with the model's predictions, try to improve them further.

Reference

Notebook — insurance cost using LR
