# Modeling Insurance Claim Severity

**An illustrative guide to model insurance claim severity using generalized linear models in Python & R**

**Objective**

I am sure most of us started our analytics journey by building a linear regression with a small sample of data, where the response variable follows a beautiful bell-shaped normal distribution. In the real world, rarely do we find such ideal datasets, especially in the Insurance industry often our datasets are highly skewed and fitting an ordinary least square doesn’t make sense. I will discuss multiple variants of generalized linear models to fit this kind of data better.

**Background**

In the previous article, we discussed key challenges and modeling techniques related to claim frequency, in the current article we will discuss the second component of pricing, i.e., claim severity (expected claim amount per claim). As we have seen claim frequencies are highly skewed, it is expected with severity to follow a similar trend as both values are related. The first one is the number of loss occurrence and the other one is loss amount caused by those occurrences.

**What are the various challenges we face while modeling claim severity?**

1. The claim severity is highly skewed (right-skewed), with a sharp peak and a long tail to the right. You can refer to the following histograms.

2. Severity often has an excessive number of zero outcomes causing it difficult to model.

Here is a snapshot of the claim cost distribution, the blue graph includes all the data including zero values and the green graph has only positive values.

You can imagine why this special case of the regression is important to us.

**How to model such skewed data?**

The generalized linear model with gamma distribution is the first choice of techniques among actuaries and analytics professionals while modeling claim severity. Another popular technique is OLS with the log-transformed response variable.

## Let’s discuss gamma distribution before moving to the modeling steps, this will give you an idea of how this distribution might help us in solving our problem.

Gamma distribution is a two-parameter family of continuous probability distributions and arises naturally in processes for which the waiting times between the events are relevant and follow a Poisson process.

There are three different parameterizations in common use:

1. With a shape parameter

kand a scale parameterθ.2. With a shape parameter

α=kand an inverse scale parameterβ= 1/θcalled a rate parameter.3. With a shape parameter

kand a mean parameterμ=kθ=α/β.In each of these three forms, both parameters are positive real numbers.

The first is the k−

θparametrization and perhaps is the more natural one, with p.d.f.

**Now, we can take a look at its probability density function**

**Probability density function**: The waiting time until the *hth* Poisson event with a rate of change λ is

For X ~ Gamma(*k,θ)*, where *k* = *h* and θ = 1 / λ, the gamma probability density function is given by

*Where,*

e is the natural number (e = 2.71828…)

k is the number of occurrences of an event

if k is a positive integer, then Γ(k) = (k − 1)! is the gamma function

θ = 1 / λ is the mean number of events per time unit, where λ is the mean time between events. For example, if the mean time between phone calls is 2 hours, then you would use a gamma distribution with θ=1/2=0.5. If we want to find the mean number of calls in 5 hours, it would be 5*1/2 = 2.5

x is a random variable

## Here are some real-world applications of the gamma distribution

The gamma distribution can be used in a range of disciplines including financial services. Examples of events that may be modeled by gamma distribution include:

1. The amount of rainfall accumulated in a reservoir

2. The size of loan defaults and insurance claim cost

3. The flow of items through manufacturing and distribution processes

4. The load on web servers

## Now, I will walk you through a working example using an insurance dataset

This dataset (dataCar)can be downloaded from an R package called “insuranceData”.

`library(insuranceData)`

data(dataCar)

**Data Description**

This data set is based on one-year vehicle insurance policies taken out in 2004 or 2005. There are 67,856 policies, of which 4624 (6.8% notified claims) filed claims.

**Lets deep dive into the data**

We must analyze all the variables independently and also with respect to the dependent variable, here I am going to discuss only selected variables where some transformation or intervention is required. Though you can find complete EDA in python codes shared along with this article.

As a first step, I have dropped a few duplicate observations.

**Claim Cost**

As we already know we are going to work on a scenario where our response variable is not normally distributed. We can observe the same in the below graph. Only 7% of policies reported claims in our observation, out of that around 40% policies claimed less than $500.

Looking at the above graph and based on one important property of gamma, i.e., all outcomes must be positive, we can not model our data in the current form where we have zero outcomes for 93% of observations.

We will split this dataset into two parts, observations without any loss and observations which generated even a single dollar loss.

In the current article, we will focus on modeling only positive losses, we have already discussed another scenario where we want to know whether a policy has reported any claim or not (same as a policy has caused any loss or not) in my last article using zero-inflated and hurdle models. One part of the model predicted the probability of zero claims and another part predicted various claim counts.

**Claim Cost for the positive subset of the data**

After analyzing the positive subset of the data, one can observe many values beyond 95th percentile, I removed only extreme outliers falling beyond the 99th percentile. We can see two histograms below, one on complete data and the second one after removing extreme outliers.

**Vehicle Value**

We can observe many strange values for this variable, there are some vehicles with zero values and some with extreme values which doesn’t make sense to me. During live projects, we can clarify this kind of suspicious value and accordingly we can decide. In this case, I have removed observations having vehicle value as zero.

Still, I can see too many outliers, we can convert continuous values into categorical to minimize the effect of outliers. I have created 5 different categories for vehicle value.

**Vehicle Body**

There are many categories with the very low frequency which could be a disadvantage in training the feature in lack of enough data points, let’s merge the bottom 4 categories as “others”.

Now, we are done with the necessary data check and transformation. We can move to the modeling steps.

**Its time to explore an appropriate GLM model for this kind of dataset**

As we already discussed in the beginning based on industry experience that GLM with gamma distribution and OLS with log-transformed response might be a better fit for this scenario.

Here is a comparison of the original distribution versus its log-transformed value.

The first (blue) histogram gives an intuition that data is very close to gamma distribution and later one (green)suggests we can try a log-normal model, though it is far from perfectly normal, has two peaks in the beginning.

We have one more challenge here, we have to find out a technique that mathematically replaces claim amount with a claim severity. We wanted to influence the model results without actually including this variable (number of claims) in modeling. This we can achieve by using an offset term in the model and OLS is not a suitable technique for this requirement.

**We will see why GLMs are preferable to accommodate offset feature**

The GLM model is usually set up as below:

The final equation indicates that the offset must be on the same scale as the linear predictor f(x). Therefore, log(claim count) can be used as an offset.

Now, let’s take an example of OLS, the model is set up as below.

Therefore, we can see that using offset will not make sense here.

**An alternate to OLS — Generalized Linear Model with Gaussian Distribution**

As an alternate to OLS, we can use GLM with Gaussian distribution which is equivalent to the OLS model when the “identity” link function is used. But, there is a slight difference when we use the “log” link function. The first approach log transforms observed values, while the second one log transforms the expected value.

Now, we have two GLM options to choose from, Gaussian with log link and another one is Gamma with a log link, we can fit models on both the variants and after assessment and comparison, finalize the better one.

**Modeling using Statsmodels package in python and MASS package in R**

*Target Variable* — We will use the claim cost as the target variable

*Independent Variables* — All these variables are used as categorical — Vehicle Body, Vehicle Age, Gender, Area and Age category, and Vehicle value. Claim indicators and exposure are dropped as they are not required.

*Offset* — Number of claims will be used as an offset term

**Lets model**

As a first step, lets split the final dataset into two subsets training and test

`mask = np.random.rand(len(df_excl_0_99)) < 0.8`

df_train = df_excl_0_99[mask]

df_test = df_excl_0_99[~mask]

print('Training data set length='+str(len(df_train)))

print('Testing data set length='+str(len(df_test)))

Setting up model expression is fairly easy in statsmodels, simply we can write R like formulae.

`expr = """claimcst0 ~ veh_value_cat+veh_body+veh_age+gender+area+agecat"""`

Convert the training and test datasets in dmatrices format which is a required input for this package.

`y_train, X_train = dmatrices(expr, df_train,return_type='dataframe')`

y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')

Let’s have a look at the model expression, for offset terminology varies from one package to another. In R, it’s named as offset and the modeler has to give logged value, here in statsmodels, the model itself takes care of log.

**GLM with Gaussian distribution and log link function**

Python---

gauss_log_model= sm.GLM(y_train,X_train,exposure=df_train.numclaims, family=sm.families.Gaussian(sm.families.links.log))

model1 = gauss_log_model.fit()R---

model_gauss <- glm(claimcst0 ~ veh_value+veh_body+veh_age+gender+area+agecat,data = train,offset = log(numclaims),family=gaussian(link="log"))

**GLM with Gamma distribution and log link function**

Python---

gamma_model = sm.GLM(y_train, X_train, exposure=df_train.numclaims, family=sm.families.Gamma(link=sm.families.links.log))

model2 = gamma_model.fit()R---

model_gamma <- glm(claimcst0 ~ veh_value+veh_body+veh_age+gender +area +agecat, data = train,offset = log(numclaims), amily=Gamma(link="log"))

**Model Assessment and Comparison**

After modeling, we can extract the results using a summary function and observe the goodness of fit for both the cases. We can validate root mean squared error on the test dataset.

We can clearly see either of the models are not perfect, but still, gamma with log link is slightly better than the gaussian. The pattern of spread in residuals against fit is better in gamma and as per QQ plot residuals are very close to the normal distribution. This is a GLM, not the linear regression where normality of residuals and homogeneity of variance is a strict condition, some deviation is expected but substantial deviation from normality indicates an issue with the assumed distribution.

Finally, AIC and root mean squared error is another indicator of model comparison, both the values are lower in the case of gamma which suggests a better fit.

**You can refer complete R Codes below**

library(ggplot2)

library(dplyr)

library(class)

library(MASS)

library(caret)

library(devtools)

library(countreg)

library(forcats)

library(insuranceData)

library(Hmisc)

install.packages("countreg", repos="http://R-Forge.R-project.org")#Attaching data for modeling

data(dataCar)

data1 <- dataCar#Data Cleaning & Pre-processingdata1$veh_value_cat <- as.numeric(cut2(data1$veh_value, g=5))data2 <- unique(data1)

data3 <- data2[data2$veh_value > quantile(data2$veh_value, 0.0001), ]

#data4 <- data3[data3$veh_value < quantile(data3$veh_value, 0.999), ]#Regrouping vehicle categories

top9 <-c('SEDAN','HBACK','STNWG','UTE','TRUCK','HDTOP','COUPE','PANVN','MIBUS')data3$veh_body <- fct_other(data3$veh_body, keep = top9, other_level = 'other')#Converting catagorical variables into factors

names <- c('veh_body' ,'veh_age','gender','area','agecat','veh_value_cat')

data3[,names] <- lapply(data3[,names] , factor)

str(data3)# based on variable values

newdata <- subset(data3, clm ==1)df <- newdata[newdata$claimcst0 < quantile(newdata$claimcst0, 0.99), ]##data partition - original data

data_partition <- createDataPartition(df$claimcst0, times = 1,p = 0.8,list = FALSE)

str(data_partition)

train <- df[data_partition,]

test <- df[-data_partition,]#models - Gaussianmodel_gauss <- glm(claimcst0 ~ veh_value+veh_body+veh_age+gender+area+agecat,

data = train,offset = log(numclaims),family=gaussian(link="log"))

summary(model_gauss)

plot(model_gauss)test$pred <- predict(model_gauss, newdata=test, type="response")

sqrt(mean((test$pred - test$claimcst0)^2))

write.csv(test,"test_gauss.csv")# Models - Gammamodel_gamma <- glm(claimcst0 ~ veh_value+veh_body+veh_age+gender+area+agecat,

data = train,offset = log(numclaims),family=Gamma(link="log"))

summary(model_gamma)

plot(model_gamma)test$pred <- predict(model_gamma, newdata=test, type="response")

sqrt(mean((test$pred - test$claimcst0)^2))

write.csv(test,"test_gamma.csv")

**Python codes can be accessed at following gist path**

**Summary and next steps**

We discussed how to fit a regression model on a highly skewed insurance dataset using GLM techniques, the significance of offset and how gamma distribution is useful in modeling such data.

These are not the only packages available for modeling insurance-specific data, nowadays xgboost like machine learning packages has also introduced objective functions such as Poisson, Gamma, Tweedie, etc, that were earlier limited to only generalized linear models.

Now, we have a frequency model and a severity model, in my next article, we will discuss how we can estimate pure premium using these different models.

Thanks for reading, please feel free to share your queries or suggestions.

**References**