Insurance cost prediction — linear regression

Thenu
Jun 11, 2020


Week 2 of Zero to GANs Deep Learning with PyTorch course

Click here for the course & here for my assignment submission

Time flies when you’re having fun. The second week was starting to get tougher, not the homework but the content in general.

I felt annoyed at myself for not putting in enough time to understand the craziness that goes on behind the functions introduced.

When I started putting in the hours though, everything fell into place like 1, 2, 3.

Ok, onto the actual meat: there are 5 steps to building this linear regression model. I’ll take you through them one at a time.

Step 1: Download and explore the data

The customize_dataset() function accepts two inputs: the dataset and a random string (with a length greater than 5).

This function customises the data so that everyone doing the assignment doesn’t end up with the same solution.

dataset.copy(deep=True) — we are deep copying the dataset, which means the copy is fully independent: if the original dataset changes later, the copy stays as it was.

dataset.sample(n, random_state) — this draws n rows from the dataset in a random order.

n is just the number of rows you’d like returned from the dataset. In the assignment, we dropped a few, so n=1271.

random_state works like a seed: if it is set to a fixed integer, the randomised sample stays the same every time. If it’s set to None, then every time dataset.sample() is called, the rows come back in a different order.

Since random_state can be any integer, to randomise this — again, for the sake of not having the same data as other candidates on the course — we give the function a random string. The Unicode code point of the first character of the string is the integer we pass to random_state. To read more about Unicode, click here.

In this function, we’re also changing the values of some of our input and output columns by multiplying them by a factor that’s roughly between 0 and 2 — again, derived from Unicode.
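As a rough illustration, here’s a minimal sketch of what such a customisation function could look like. The exact scaling factors (and the n=1271 figure) are based on my reading of the assignment, so treat this as an approximation rather than the official code:

def customize_dataset(dataset, rand_str):
    data = dataset.copy(deep=True)
    # drop some rows; the seed is the Unicode code point of the first character
    data = data.sample(1271, random_state=int(ord(rand_str[0])))
    # scale an input column and the output column by a factor roughly between 0 and 2
    data.bmi = data.bmi * ord(rand_str[1]) / 100.
    data.charges = data.charges * ord(rand_str[2]) / 100.
    return data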

To summarise, step one isn’t necessary. It was solely to randomise data so plagiarism is easier to spot.

Step 2: Prepare the dataset for training

Firstly, we separate the columns into 3 lists: input_cols, output_cols and categorical_cols.

The purpose of this project is to predict insurance charges based on many factors. That being the case, we know that output_cols = ['charges'].

Following that, input_cols = ['age', 'sex', 'bmi', 'children', 'smoker', 'region'] — basically, the rest.

Categorical columns are the ones whose values aren’t numbers, like sex, which is either female or male. categorical_cols = ['sex', 'smoker', 'region']. What we’re going to do next is look at a few lines of code that change these non-numeric values into digits.

data = dataset.copy(deep=True)
for col in categorical_cols:
    data[col] = data[col].astype('category').cat.codes  # map each category to an integer code

Part of what this code is doing is compressing the list of repetitions into a list of unique elements. For example: ['a', 'b', 'c', 'a', 'b', 'd', 'a', 'b', 'c', 'b', 'b', 'c', 'a', 'a'] becomes ['a', 'b', 'c', 'd']

The next part of the code takes each element in the bigger list and maps it to its index in the list of unique elements, returning a list of indices where the repeated elements take on the same index. Output: [0, 1, 2, 0, 1, 3, 0, 1, 2, 1, 1, 2, 0, 0]
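To see this in action on a toy example (not part of the assignment), pandas does both steps for us:

import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a', 'b', 'd', 'a', 'b', 'c', 'b', 'b', 'c', 'a', 'a'])
print(s.astype('category').cat.categories.tolist())  # ['a', 'b', 'c', 'd']
print(s.astype('category').cat.codes.tolist())       # [0, 1, 2, 0, 1, 3, 0, 1, 2, 1, 1, 2, 0, 0]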

Why do we go through all this trouble? Well, if we mapped the first list to its own indices without compressing it, it would imply that each element is unique. That would be bad, because we wouldn’t be able to pick up on any patterns or trends.

Females may have higher insurance charges, but we wouldn’t be able to check that if we didn’t map every female to the same code.

I hope this makes sense.

Anyway, with that out of the way, let’s look at the next function we create.

dataframe_to_arrays() accepts one input: the dataset. It turns dataframes into numpy arrays.

The numeric arrays we train on can’t contain strings, so the code defined above was essential.

inputs_array = data[input_cols].to_numpy() — that’s how easy it is.

Now to convert them to tensors:

inputs = torch.from_numpy(inputs_array).float() — .float() is optional; I added it to make sure the inputs are of type torch.float32. It’s better to work with float32 because it takes up less memory than float64 and requires less compute time.

Since we’re dealing with a fairly large dataset, we care about compute time as well as accuracy.

Next we split the dataset into two parts: training and validation.

The validation part is usually between 10–20% of the dataset size, so I went down the middle and assigned it 15%.
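Putting step 2 together, here’s a minimal sketch of how the conversion and the split might look. The function name dataframe_to_arrays and the column lists come from the description above; the rest (variable names, the use of TensorDataset and random_split) is my approximation of the assignment code:

import torch
from torch.utils.data import TensorDataset, random_split

def dataframe_to_arrays(dataframe):
    data = dataframe.copy(deep=True)
    # convert categorical columns to numeric codes
    for col in categorical_cols:
        data[col] = data[col].astype('category').cat.codes
    inputs_array = data[input_cols].to_numpy()
    targets_array = data[output_cols].to_numpy()
    return inputs_array, targets_array

inputs_array, targets_array = dataframe_to_arrays(dataset)
inputs = torch.from_numpy(inputs_array).float()    # float32 saves memory and compute
targets = torch.from_numpy(targets_array).float()

full_ds = TensorDataset(inputs, targets)
val_size = int(0.15 * len(full_ds))                # 15% for validation
train_size = len(full_ds) - val_size
train_ds, val_ds = random_split(full_ds, [train_size, val_size])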

Step 3: Create a Linear Regression Model

Uffff, it’ll start to get more interesting from here (I promise).

We’re working with a class here with multiple methods within it.

nn.Linear() applies y = x @ A.t() + b to the incoming data x.
The forward method: out is the result of applying the step above to a batch of inputs xb.

We have a method each for training and validation, to work out the loss for the data in batches (in our case batch_size = 10).

We’re comparing our predictions to our actual targets using the loss function l1_loss.

This loss function averages all the absolute differences between the true values (targets) and the predicted values (out).
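To make that concrete, here’s a tiny made-up example of l1_loss at work (the numbers are purely illustrative):

import torch
import torch.nn.functional as F

out = torch.tensor([[9500.0], [12000.0]])       # hypothetical predictions
targets = torch.tensor([[10000.0], [11000.0]])  # hypothetical true charges
loss = F.l1_loss(out, targets)                  # mean of |out - targets| = (500 + 1000) / 2
print(loss)                                     # tensor(750.)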

Read more about loss functions here.

The validation_epoch_end method averages the losses across the validation batches to return the overall loss.

Finally, epoch_end prints out the average loss at every 20th epoch.
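Putting those methods together, a rough sketch of the model class might look like this. Names such as InsuranceModel, input_size and output_size are my assumptions, and the details may differ slightly from the actual assignment code:

import torch
import torch.nn as nn
import torch.nn.functional as F

input_size = 6    # age, sex, bmi, children, smoker, region
output_size = 1   # charges

class InsuranceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(input_size, output_size)  # y = x @ A.t() + b

    def forward(self, xb):
        return self.linear(xb)                            # out for a batch of inputs xb

    def training_step(self, batch):
        inputs, targets = batch
        out = self(inputs)
        return F.l1_loss(out, targets)                    # loss on a training batch

    def validation_step(self, batch):
        inputs, targets = batch
        out = self(inputs)
        return {'val_loss': F.l1_loss(out, targets).detach()}

    def validation_epoch_end(self, outputs):
        batch_losses = [x['val_loss'] for x in outputs]
        return {'val_loss': torch.stack(batch_losses).mean()}   # average over the batches

    def epoch_end(self, epoch, result, num_epochs):
        if (epoch + 1) % 20 == 0 or epoch == num_epochs - 1:    # print every 20th epoch
            print("Epoch [{}], val_loss: {:.4f}".format(epoch + 1, result['val_loss']))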

That summarises briefly what each of them does. Now to actually apply it to our case.

Step 4: Train the model to fit the data

evaluate() accepts two inputs: model & val_loader.

val_loader is a data loader that serves up the validation data in batches of the size specified.
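For reference, the two loaders can be created like this, using the train_ds and val_ds from step 2 and batch_size = 10 as mentioned above:

from torch.utils.data import DataLoader

batch_size = 10
train_loader = DataLoader(train_ds, batch_size, shuffle=True)  # shuffle training batches each epoch
val_loader = DataLoader(val_ds, batch_size)                    # order doesn't matter for validation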

out = [model.validation_step(batch) for batch in val_loader]
return model.validation_epoch_end(out)

This works out the loss for each batch of 10 and stores the results as a list in ‘out’. Then validation_epoch_end is used to average the losses.

fit() accepts 6 inputs: epochs, lr (learning rate), model, train_loader, val_loader and opt_func=torch.optim.SGD.

All the inputs here are self-explanatory except opt_func. “What the hell is opt_func?” was the first question I asked myself when I looked at fit().

OK. This is what I understand after further researching into it.

During the training process, we tweak and change the parameters (weights) of our model to try and minimise the loss, to make our predictions as accurate and optimised as possible.

Optimisers help us accomplish this by updating the model in response to the output of the loss. They shape and mould your model into its most accurate possible form by playing around with the weights. The loss function is the guide, telling the optimiser when it’s moving in the right or wrong direction.

So if the loss is high, then the optimiser knows that it’s going in the wrong direction. It’s impossible to know what weights to start with for the best optimisation so trial and error is essential.

There are many optimisers we can choose from (find them on the pytorch document here), but the one we use is stochastic gradient descent — the most popular one.

Here’s how it works:

  1. Calculate what a small change in each individual weight would do to the loss function
  2. Adjust each individual weight based on its gradient
  3. Keep doing steps #1 and #2 until the loss gets as low as possible

The effect a small change in a parameter (weight) has on the loss is represented by its gradient. Gradients are partial derivatives; they tell us in which direction, and by how much, we should adjust each weight to lower the loss and thereby make our model more accurate.

We also don’t want to take large steps, because that could overshoot and prevent convergence to the optimum. So we introduce a variable called the learning rate — this is just a very small number, usually something like 0.001, that we multiply the gradients by to scale them down.
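Conceptually, a single plain SGD update does something like this under the hood (a simplified sketch, not how we actually call it in the assignment):

with torch.no_grad():
    for param in model.parameters():
        param -= lr * param.grad   # step each weight in the direction that lowers the loss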

optimiser = opt_func(model.parameters(), lr)
loss.backward()        # computes the gradients
optimiser.step()       # updates the weights (parameters)
optimiser.zero_grad()  # resets the gradients back to 0

The last step is important because PyTorch accumulates gradients on subsequent backward passes. Every time loss.backward() is called, the new gradients are added to the existing ones, so it’s best to clear them on each iteration of the loop.
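Putting it all together, here’s a minimal sketch of fit() along the lines described above (an approximation of the assignment code, using the evaluate() and model methods from earlier):

def fit(epochs, lr, model, train_loader, val_loader, opt_func=torch.optim.SGD):
    history = []
    optimiser = opt_func(model.parameters(), lr)
    for epoch in range(epochs):
        # training phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()        # compute gradients
            optimiser.step()       # update the weights
            optimiser.zero_grad()  # reset gradients for the next batch
        # validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result, epochs)
        history.append(result)
    return history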

Now to make actual predictions.

Step 5: Make predictions using the trained model

predict_single() accepts three inputs: input, target, model.

inputs = input.unsqueeze(0)             # changes the shape from [6] to [1, 6]
predictions = model(inputs)             # run the model on this single example
prediction = predictions[0].detach()    # removes the gradient function
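Wrapped into a function and tried on one example from the validation set (val_ds from step 2), it might look something like this; the print statements are my own additions for illustration:

def predict_single(input, target, model):
    inputs = input.unsqueeze(0)             # [6] -> [1, 6]
    prediction = model(inputs)[0].detach()  # run the model, then drop the gradient function
    print("Input:", input)
    print("Target:", target)
    print("Prediction:", prediction)

input, target = val_ds[0]
predict_single(input, target, model)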

This brings us to the end of the second week’s work. Hopefully, you can make sense of my explanations.

If you guys have any questions or need more clarity on certain parts, drop them down below :)
