Modeling Insurance Claim Frequency

An illustrative guide to modeling insurance claim frequencies using generalized linear models in R

Ajay Tiwari
The Startup
11 min read · Mar 13, 2020


Do automobile claim counts follow a Poisson, a negative binomial, or some other distribution? Let’s explore.

Objective

Discuss a step-by-step approach to count data modeling with a focus on insurance claim frequencies, get familiar with the relevant diagnostics, explore techniques to overcome the challenges they uncover, and finally select the modeling approach that fits the data best.

Background

The first step in developing any pricing model (predicting the pure premium, also known as the loss cost) is predicting claim frequency, the expected claim count per unit of exposure, which is a rate rather than a simple count. The common assumption is that insurance claim counts follow a Poisson distribution, whose mean and variance are equal. A generalized linear model with a Poisson distribution and a log link function is therefore a natural choice to begin with.

What are the various challenges we face while modeling claim frequency?

As we have seen, the mean and variance should be equal under a Poisson distribution; in many data sets, however, this property is violated because the data are overdispersed. In that case the Poisson distribution underestimates the variance of the observed counts.

Count data often have far more zero outcomes than a Poisson regression expects. For example, the proportion of zero claims in automobile insurance data may be large because people usually tend not to report small claims.

Exposures often vary, so the observations are not directly comparable. For example, a count of 4 claims over 12 months of exposure is not directly comparable to a count of 1 claim over 6 months; only the rates per unit of exposure can be compared.

Let’s look at the techniques to overcome these challenges

Investigating over-dispersion

Run a preliminary Poisson regression and test the null hypothesis of equidispersion against the alternative hypothesis of over-dispersion or under-dispersion.

The output includes an estimate of alpha: if α > 0 there is over-dispersion, and if α < 0 there is under-dispersion. We will walk through this test in the working example in the next section.

# Test for over- or under-dispersion on a fitted Poisson GLM (here called poissonglm)
library(AER)
dispersiontest(poissonglm, trafo = 1)

Offsetting exposure to make observations comparable and enable rate modeling

An offset term can be introduced into the model equation; this lets us model the rate (claim count per unit of exposure) instead of the raw count without compromising the natural distribution of the data. Count models account for differing exposures by moving the exposure variable to the right-hand side of the regression equation and entering its logarithm with the coefficient constrained to one. This logged variable, ln(exposure), or a similarly constructed variable, is often called the offset variable.

Example — the equation below shows how we offset the exposure.

# Poisson GLM with log(exposure) offset: models the claim rate per unit of exposure
poissonglm <- glm(count ~ ., data = training, family = "poisson", offset = log(exposure))

Modeling over-dispersed data

The negative binomial can be tried as the next alternative for such data; compared with the Poisson it allows extra variation. Its variance contains an extra positive term, and as that term tends to 0 the negative binomial tends to the Poisson, with mean equal to variance. The negative binomial thus represents a Poisson with extra dispersion.
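As a quick numerical illustration (a minimal sketch using base R's rnbinom and rpois, with arbitrary assumed parameter values), the negative binomial variance is mu + mu^2/theta, which exceeds the Poisson variance of mu:

# Minimal sketch: NB variance = mu + mu^2/theta, which collapses to the
# Poisson case (variance = mean) as theta grows large
set.seed(123)
mu <- 2; theta <- 0.5                     # theta is the NB "size" (dispersion) parameter
y_nb <- rnbinom(100000, size = theta, mu = mu)
mean(y_nb); var(y_nb)                     # variance ~ 2 + 2^2/0.5 = 10, well above the mean
y_pois <- rpois(100000, lambda = mu)
mean(y_pois); var(y_pois)                 # variance ~ mean ~ 2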

Modeling excessive zero counts

Zero-inflated and hurdle models come to our rescue when we have excessive zero counts.

Zero-inflated models — Zero-inflation is a very specific type of overdispersion. The excess zeros are believed to be generated by a separate process from the count values and can therefore be modeled independently. A zero-inflated model thus has two parts: a Poisson or negative binomial count model, and a logit model for predicting the excess zeros.

Hurdle Models — A hurdle model is a modified count model with two processes, one generating the zeros and one generating the positive values. The idea is that a binomial probability model governs the binary outcome of whether the count is zero or positive; if it is positive, the conditional distribution of the positive values is governed by a zero-truncated count model.

Setting up the equation — As discussed, in both approaches the equation consists of two parts, a count model and a binary model, whose regressors are separated by a vertical bar. The two parts are not constrained to be the same; the explanatory variables can be identical or different depending on their significance. The following example shows a zero-inflated Poisson equation.

zip <- zeroinfl(numclaims ~ veh_value + veh_body | veh_value,
                offset = log(exposure), data = training, dist = "poisson", link = "logit")
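A hurdle model is specified in the same two-part style with the hurdle() function from the pscl package. Below is an illustrative sketch only, reusing the example regressors above; the full models fitted in this article appear in the code section.

# Illustrative hurdle sketch: a zero-truncated negative binomial for the positive counts
# and a binomial hurdle part for zero vs. non-zero, with the same exposure offset
library(pscl)
hurdle_sketch <- hurdle(numclaims ~ veh_value + veh_body | veh_value,
                        offset = log(exposure), data = training,
                        dist = "negbin", zero.dist = "binomial", link = "logit")
summary(hurdle_sketch)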

Model Comparison and selection

Rootogram — We can assess model fit using rootograms, an improved approach to assessing count regression models. The rootogram function is available from the R-Forge repository in the "countreg" R package.

install.packages("countreg", repos="http://R-Forge.R-project.org")

A rootogram is a graphical technique that compares the square roots of the observed frequencies with the fitted probability model. The observed frequencies are displayed as bars and the fitted frequencies as a line. By default, a square-root scale is used to make the smaller frequencies more visible. Please refer to the sources in the references for more detail.
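As a minimal usage sketch (assuming fitted model objects such as poissonglm and nbglm, which are created in the code section later in this article):

# Side-by-side rootograms for two fitted count models, comparing fit up to count 10
library(countreg)
par(mfrow = c(1, 2))
rootogram(poissonglm, max = 10, main = "Poisson")
rootogram(nbglm, max = 10, main = "NB")
par(mfrow = c(1, 1))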

Working Example using a sample insurance data-set

Now, we will explore each of the mentioned techniques one by one and try to find the best approach for modeling claim frequencies.

We will use an open-source insurance dataset (dataCar) which can be downloaded from an R package called “insuranceData”.

library(insuranceData)
data(dataCar)

Data Description

This data set is based on one-year vehicle insurance policies taken out in 2004 or 2005. There are 67,856 policies, of which 4,624 (6.8%) have at least one notified claim.

Description — dataCar

Data Exploration

Before jumping into the modeling techniques, let’s describe the data

We can see that the claim count is highly skewed; around 93% of policies have no claim, possibly due to the non-reporting of small claims.
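A quick way to see this skew (a minimal sketch on the raw dataCar data, before any cleaning):

library(insuranceData)
data(dataCar)
# Share of policies by claim count; roughly 93% have zero claims
round(prop.table(table(dataCar$numclaims)), 3)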

Target Variable — We will use the claim count as the target variable

Independent Variables — Vehicle body, vehicle age, gender, area, and age category are used as categorical variables, and vehicle value as continuous. The claim indicator and claim cost are dropped as they are directly related to the target variable (claim count).

Exposure — An offset lets us model the claim count per unit of exposure.

Let’s get started with modeling

Poisson Regression

As a first step to capturing the relationship between claim frequency and the rating factors, we fit a Poisson regression model and perform model diagnostics such as the dispersion test and goodness-of-fit checks.
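The model is fitted with all rating factors and a log(exposure) offset (the same call appears in the full code section below):

# Poisson GLM for claim counts with a log(exposure) offset
poissonglm <- glm(numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat,
                  data = training, family = "poisson", offset = log(exposure))
summary(poissonglm)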

# Test for dispersion
library("AER")
dispersiontest(poissonglm,trafo=1)
data: poissonglm
z = 3.1228, p-value = 0.0008958
alternative hypothesis: true alpha is greater than 0
sample estimates:
alpha = 0.02459681

Though the null hypothesis is rejected and the test confirms dispersion, the alpha value is very close to zero, which suggests that over-dispersion may not be a serious concern here.

Rootogram also confirms that this model fits the data quite well.

Let’s simulate a scenario of overdispersion in the data

Finding such an ideal data set in the real world is rare. To give you a flavor of overdispersion, we will add extra variability to the target variable by randomly generating negative binomial counts for a small proportion (say 10%) of our observations. All other characteristics relevant to the insurance industry, such as excessive zeros and very few unique counts, are kept the same. Refer to my code and approach in the code section below.

You can use any technique to generate an overdispersed sample, or use any other data set.

Here is the transformed data-set for next steps

We can see that the claim count is still highly skewed, similar to the original data set; around 88% of policies have no claims.

Here is the complete R code used in this whole exercise

# Install countreg from R-Forge (not on CRAN), then load the required packages
install.packages("countreg", repos = "http://R-Forge.R-project.org")
library(ggplot2)
library(dplyr)
library(class)
library(MASS)
library(caret)
library(devtools)
library(countreg)
library(forcats)
library(AER)
library(pscl)
library(insuranceData)   # provides the dataCar data set used below
#Attaching data for modeling
data(dataCar)
data1 <- dataCar
# Data cleaning & pre-processing
data2 <- unique(data1)
data3 <- data2[data2$veh_value > quantile(data2$veh_value, 0.0001),]
data4 <- data3[data3$veh_value < quantile(data3$veh_value, 0.999), ]
#Regrouping vehicle categories
top9 <- c('SEDAN','HBACK','STNWG','UTE','TRUCK','HDTOP','COUPE','PANVN','MIBUS')
data4$veh_body <- fct_other(data4$veh_body, keep = top9, other_level = 'other')
# Converting categorical variables into factors
names <- c('veh_body' ,'veh_age','gender','area','agecat')
data4[,names] <- lapply(data4[,names] , factor)
str(data4)
##data partition - original data
data <- data4
data_partition <- createDataPartition(data$numclaims, times = 1,p = 0.8,list = FALSE)
str(data_partition)
training <- data[data_partition,]
testing <- data[-data_partition,]
# Re-sampling: keep all non-zero claims, keep 90% of the zero-claim records,
# and replace numclaims for the remaining 10% with negative binomial draws
sample1 <- subset(data4, numclaims != 0)
sample2 <- data4[sample(which(data4$numclaims == 0),
                        round(0.9 * length(which(data4$numclaims == 0)))), ]
sample3 <- data4[sample(which(data4$numclaims == 0),
                        round(0.1 * length(which(data4$numclaims == 0)))), ]
y <- rnbinom(n = nrow(sample3), mu = 1, size = 3)  # one draw per row of sample3
sample3$numclaims <- y
df_sample <- rbind(sample1, sample2, sample3)
## Data partition - re-sampled data
data <- df_sample
data_partition <- createDataPartition(data$numclaims, times = 1, p = 0.8, list = FALSE)
str(data_partition)
training <- data[data_partition, ]
testing <- data[-data_partition, ]
# Poisson model with offset
poissonglm <- glm(numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat,
                  data = training, family = "poisson", offset = log(exposure))
summary(poissonglm)
# Test for dispersion
dispersiontest(poissonglm, trafo = 1)
# Quasi-Poisson model with weight
qpoissonglm <- glm(numclaims/exposure ~ veh_value + veh_body + veh_age + gender + area + agecat,
                   data = training, family = "quasipoisson", weights = exposure)
summary(qpoissonglm)
# Negative binomial model with offset
nbglm <- glm.nb(numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat,
                data = training, offset = log(exposure),
                control = glm.control(maxit = 10000))
summary(nbglm)
# Zero-inflated Poisson model with offset
zip <- zeroinfl(numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat |
                  veh_value + veh_body + veh_age + gender + area + agecat,
                offset = log(exposure), data = training, dist = "poisson", link = "logit")
summary(zip)
# Zero-inflated negative binomial model with offset
zinb <- zeroinfl(numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat |
                   veh_value + veh_body + veh_age + gender + area + agecat,
                 offset = log(exposure), data = training, dist = "negbin", link = "logit")
summary(zinb)
# Hurdle negative binomial model with offset
hurdlenb <- hurdle(numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat |
                     veh_value + veh_body + veh_age + gender + area + agecat,
                   offset = log(exposure), data = training,
                   dist = "negbin", zero.dist = "negbin", link = "logit")
summary(hurdlenb)
# Hurdle Poisson model with offset
hurdlepoisson <- hurdle(numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat |
                          veh_value + veh_body + veh_age + gender + area + agecat,
                        offset = log(exposure), data = training,
                        dist = "poisson", zero.dist = "poisson", link = "logit")
summary(hurdlepoisson)
# Save models
save(poissonglm, file = "poissonglm.rda")
save(nbglm, file = "nbglm.rda")
save(zinb, file = "zinb.rda")
save(zip, file = "zip.rda")
save(hurdlepoisson, file = "hurdlepoisson.rda")
save(hurdlenb, file = "hurdlenb.rda")
#Load Models
load("poissonglm.rda")
load("nbglm.rda")
load("zinb.rda")
load("zip.rda")
load("hurdlepoisson.rda")
load("hurdlenb.rda")
# Predicted vs. observed zero claims
zero_counts <- data.frame(round(c(
  "Obs"           = sum(training$numclaims < 1),
  "poissonglm"    = sum(exp(-predict(poissonglm, training, type = "response"))),
  "nbglm"         = sum(dnbinom(0, mu = fitted(nbglm), size = nbglm$theta)),
  "hurdlepoisson" = sum(predict(hurdlepoisson, training, type = "prob")[, 1]),
  "hurdlenb"      = sum(predict(hurdlenb, training, type = "prob")[, 1]),
  "zip"           = sum(predict(zip, training, type = "prob")[, 1]),
  "zinb"          = sum(predict(zinb, training, type = "prob")[, 1]))))
# Installing and running rootogram
install.packages("countreg", repos="http://R-Forge.R-project.org")
library(countreg)
par(mfrow = c(1, 2))
rootogram(poissonglm,max = 10,main="Poisson") # fit up to count 10
rootogram(nbglm,max = 10,main="NB") # fit up to count 10
par(mfrow = c(1, 2))
rootogram(zip,max = 10,main="ZIP") # fit up to count 10
rootogram(zinb,max = 10,main="ZINB") # fit up to count 10
par(mfrow = c(1, 2))
rootogram(hurdlepoisson,max = 10,main="Hurdle-P")# fit up to count 10
rootogram(hurdlenb,max = 10,main="Hurdle-NB") # fit up to count 10
par(mfrow = c(1, 1))
# Log-likelihood for all the models
models <- list("Pois" = poissonglm, "NB" = nbglm, "ZIP-POI" = zip,"ZIP-NB" = zinb,"Hurdle-POI" = hurdlepoisson,"Hurdle-NB" = hurdlenb)
df_log <- data.frame(rbind(logLik = sapply(models, function(x) round(logLik(x), digits = 0)),
Df = sapply(models, function(x) attr(logLik(x), "df"))))

Poisson and negative binomial regression models

First, we fit a Poisson regression and assess the model fit, including the dispersion test.

# Test for dispersion
library("AER")
dispersiontest(poissonglm,trafo=1)
data: poissonglm
z = 6.0823, p-value = 5.923e-10
alternative hypothesis: true alpha is greater than 0
sample estimates:
alpha = 3.148954

The null hypothesis is now rejected with much stronger significance, and the estimated alpha is well above zero. The test therefore confirms that the new sample is overdispersed, so in an attempt at a better fit we will try negative binomial regression.

We can infer from the above graph that Poisson regression is not the ideal choice for fitting this data set; the model does not fit the data well. The bars are left hanging around the x-axis, which shows that the model under-fits the zero count and over-fits counts of 2 and above.

Let’s take a look at the negative binomial results. This looks better than the Poisson: overdispersion is well accounted for, but there is still scope for slight improvement on some counts, with 1 under-fitted and 2 and 4 over-fitted.

Zero-Inflated Models

The negative binomial has a better fit, but counts other than 0 are still not fitted well. This may be due to the excessive zeros, so let’s try models better suited to that condition.

We modeled the data with two variants of zero-inflated models, one with a Poisson distribution and the other with a negative binomial. In both versions the outcomes are similar to the negative binomial: zero counts are well taken care of, with slight deviations for the other counts.

Hurdle Models

As discussed in the section above, a hurdle model has two components: one for predicting zeros and a count model for the positive values only. Let’s fit our data to these models.

We can clearly observe that the hurdle models perform much better than the other techniques and fit the data well. Of the two, the hurdle model with a negative binomial distribution is the winner.

We can also assess our models using their log-likelihoods; the Poisson has the lowest value, whereas the hurdle NB has the highest. This confirms our finding that the hurdle NB model is the best fit.

Log-likelihood and Degree of Freedom for All Models

Finally, let’s compare the observed and predicted zero counts. The Poisson model has the highest deviation, followed by the zero-inflated and negative binomial models. Here as well, the hurdle models show a perfect fit: by design a hurdle model always matches the observed count at the hurdle, in this case 0.

Zero Predicted Counts

Summary

In this article, we discussed count data modeling for insurance claims, the related challenges, the diagnostics that uncover those challenges, and how to choose a suitable model from a range of regression techniques, from the classical Poisson GLM and negative binomial to dual models such as zero-inflated and hurdle models. We also covered the rootogram, a newer approach to assessing goodness of fit, alongside the traditional approaches. There are other techniques you can try yourself for this kind of data, for example building two separate models: a logistic regression to classify zero versus non-zero and a truncated count model for the positive counts, as sketched below.
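Here is a minimal sketch of that two-model alternative. It reuses this article's variable names; zerotrunc() is the zero-truncated count model from the same R-Forge countreg package, and in practice you would also include the log(exposure) offset used throughout this article.

# Sketch of the two-model alternative:
# (1) logistic regression for any-claim vs. no-claim,
# (2) a zero-truncated negative binomial fitted only to policies with at least one claim
library(countreg)
training$has_claim <- as.numeric(training$numclaims > 0)
logit_model <- glm(has_claim ~ veh_value + veh_body + veh_age + gender + area + agecat,
                   data = training, family = binomial(link = "logit"))
trunc_model <- zerotrunc(numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat,
                         data = subset(training, numclaims > 0), dist = "negbin")
summary(logit_model); summary(trunc_model)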

Modeling claim severity is another important component in Insurance pricing which I will discuss in my next article.

References

dispersiontest: Dispersion Test, https://rdrr.io/cran/AER/man/dispersiontest.html

Gavin L. Simpson, Rootograms (2016), https://www.r-bloggers.com/rootograms/

rootogram: Rootograms for Assessing Goodness of Fit of Probability Models, https://rdrr.io/rforge/countreg/man/rootogram.html

Regression Models for Count Data in R (pscl vignette), http://cran.r-project.org/web/packages/pscl/vignettes/countreg.pdf
