Machine learning-based insurance claims modelling: part 1

Sindre Henriksen
Eika Tech
8 min read · Aug 23, 2023


Insurance is at its heart the business of prediction. Policyholders wish (or are in many cases obligated) to protect themselves against certain types of risks, such as a catastrophic car crash leading to a liability lawsuit from a third party. The insurance company agrees to take on that risk in exchange for a premium to be paid by the customer. The “pure” premium is the amount that the insurer expects to pay out in claims on average by taking on a particular policy. The insurance premium paid by the customer is thus this pure premium, plus expenses and profits. For example, if an insurer estimates that a customer will on average incur 500 denarii worth of claims, with 50 denarii worth of expenses, and a desired profit (based on the pure premium) of 20%, then the total premium paid by the customer will be 500 × 1.2 + 50 = 650 denarii.

If the insurance company gets the risk profile wrong, customers who are offered underpriced (cheap) insurance are likely to flock to or stay with the insurer, whereas customers who are offered overpriced (expensive) insurance are likely to go elsewhere. If customers are price-sensitive, insurers face the risk of a cascading anti-selection process: underpriced risks accumulate on the book while overpriced risks leave, so risk quantification errors eat directly into profits. Consequently, insurance companies tend to invest heavily in their pricing function.

Generalised Linear and Additive Models

A common way to estimate the pure premium of a policy is to independently estimate the frequency and severity of future insurance claims based on historical data and relevant risk factors. Generalised Linear Models (GLMs) with a log link are a commonly used technique for this purpose. In large part this is because of the intuitive appeal and perceived transparency of having a model that can be expressed in multiplicative form:

π = exp(w₁x₁ + w₂x₂ + … + wₙxₙ) = exp(w₁x₁) × exp(w₂x₂) × … × exp(wₙxₙ).

Here, π denotes the predicted claim frequency, claim severity, or pure premium, with w denoting weights and x denoting variables (e.g. age of policyholder, vehicle specifications, etc.). The resulting model is a set of factors, one for each variable, that can be multiplied together to obtain an estimate of the claim frequency, claim severity, or pure premium. Practitioners often bin continuous variables in GLMs and effectively model these as categorical variables. The reason for this is that GLMs natively provide few tools to flexibly model continuous effects, and continuous variables such as age often have highly nonlinear relationships with risk.
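To make this concrete, here is a minimal sketch (using statsmodels and synthetic, hypothetical data) of a binned-age Poisson GLM with a log link, where exponentiating the coefficients recovers exactly the kind of multiplicative factors described above:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Synthetic policy data (hypothetical): age, exposure in policy-years,
# and Poisson claim counts where younger drivers are riskier.
rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(18, 80, n)
exposure = rng.uniform(0.5, 1.0, n)
claims = rng.poisson(0.6 * np.exp(-0.02 * (age - 18)) * exposure)

df = pd.DataFrame({
    "age_band": pd.cut(age, bins=[17, 25, 40, 60, 80]).astype(str),
    "claims": claims,
    "exposure": exposure,
})

# Poisson GLM with log link; exposure enters as an offset on the log scale.
model = smf.glm(
    "claims ~ age_band",
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df["exposure"]),
).fit()

# Exponentiating the coefficients yields the multiplicative rating factors
# (the reference age band is absorbed into the intercept).
print(np.exp(model.params))
```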

Generalised Additive Models (GAMs) have become increasingly popular in actuarial applications and extend the GLM framework by modelling continuous effects through smooth nonlinear functions such as splines.
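As an illustrative sketch (assuming the pygam library and synthetic, hypothetical data), a spline term can replace the manual binning used in the GLM above:

```python
import numpy as np
from pygam import PoissonGAM, s

# Synthetic data (hypothetical): continuous age with a nonlinear effect
# on claim frequency, modelled with a spline instead of manual binning.
rng = np.random.default_rng(1)
age = rng.uniform(18, 80, (5000, 1))
claims = rng.poisson(0.1 + 0.5 * np.exp(-0.05 * (age[:, 0] - 18)))

# s(0) fits a smooth spline term on feature 0; the log link is the default.
gam = PoissonGAM(s(0)).fit(age, claims)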

Claim frequency and severity modelling

When modelling claim frequency (i.e., how often a particular policy is likely to result in a claim), practitioners typically assume a Poisson distribution for the observed claim count. In this case, the logarithm of the underlying claim frequency is modelled as a linear combination of the features. For example, a claim frequency of 0.4 would mean that a particular policy results in a claim, on average, once every 2.5 years. However, in most years the number of claims will be zero, and occasionally we will see 1, 2, or even 3 claims. The following graph shows the probability distribution of such counts given a claim rate of 0.4 claims per year (left plot).
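The same probabilities can be computed directly; a minimal sketch using scipy:

```python
from scipy.stats import poisson

# Claim count probabilities at a rate of 0.4 claims per year.
for k in range(4):
    print(f"P({k} claims) = {poisson.pmf(k, 0.4):.3f}")
# P(0) ≈ 0.670, P(1) ≈ 0.268, P(2) ≈ 0.054, P(3) ≈ 0.007
```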

Once we know how often a claim is likely to happen, we also need to know the claim severity, i.e., how large the claims are likely to be when they do occur. Again, GLMs with a log link are used, but now typically assuming a Gamma distribution for the response variable. The right plot above shows a Gamma distribution with an average claim severity of 2000 denarii and a standard deviation of roughly 1414 (shape = 2, scale = 1000, i.e. a rate of 0.001).
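A quick sanity check of that parameterisation (scipy uses the shape/scale form):

```python
from scipy.stats import gamma

# Gamma(shape=2, scale=1000): mean = 2 * 1000 = 2000,
# std = sqrt(2) * 1000 ≈ 1414.
severity = gamma(a=2, scale=1000)
print(severity.mean(), severity.std())
```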

While the Gamma distribution is a useful approximation, it is worth noting that it does not fully reflect the actual claim severity distribution. More specifically, real claims have a fat tail, meaning that we see far more extreme losses than we would expect to see under a Gamma distribution. Specialised techniques have been developed for modelling the tail-end of the severity distribution (Beirlant & Teugels, 1992).

Once we have estimated the expected claim frequency and the expected claim severity, the estimated pure premium is simply their product. In the above case, with a claim frequency of 0.4 and a claim severity of 2000 denarii, the pure premium for this policy would be 0.4 × 2000 = 800 denarii.

The power of the Tweedie distribution

While splitting the pure premium calculation into frequency and severity steps is both simple and powerful, there are also ways of modelling the pure premium directly. The Tweedie distributions are a family of exponential dispersion models that includes the normal, Poisson, and Gamma distributions as special cases. Importantly, the Tweedie family also includes the compound Poisson-Gamma distributions, where the random variable is the sum of k draws from a Gamma distribution, and the number of draws k is itself a Poisson-distributed variable with some underlying rate parameter. This is very similar to the two-step method outlined above, but by modelling this as a compound Poisson-Gamma distribution rather than as two separate processes, we can fit both components in one fell swoop rather than independently. Formally, this is achieved by setting the power parameter of the Tweedie distribution (often denoted p) between 1 and 2, with 1 corresponding to a pure Poisson distribution and 2 to a pure Gamma distribution. Common choices for the Tweedie power parameter include 1.6, 1.67, and 1.7 (Goldburd et al., 2021), and the exact value often does not make a major difference to the model outcome. The figure below shows a Tweedie-distributed variable with μ = 8000, ϕ = 7, p = 1.67. Note that this produces an implausibly small point mass at 0 for insurance applications (for most insurance policies, well over 75% of the probability mass will typically sit at exactly 0), but we use this parameter combination for illustrative purposes.

Compound Poisson-Gamma distributions simultaneously exhibit properties of discrete probability distributions such as the Poisson, and properties of continuous distributions such as the Gamma. For example, they have a positive point mass at zero while being continuous everywhere else. For insurance applications, this slightly odd statistical property is very useful, since it means that we can readily model the fact that most policyholders make no claims in most years, yet when they do, the total claim amount will be some continuously distributed variable.
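This behaviour is easy to see in a simulation; a sketch using the hypothetical frequency and severity parameters from the earlier example:

```python
import numpy as np

# Compound Poisson-Gamma: draw a Poisson claim count per policy, then sum
# that many Gamma-distributed claim amounts.
rng = np.random.default_rng(42)
counts = rng.poisson(0.4, 100_000)                        # frequency 0.4
totals = np.array([rng.gamma(2, 1000, k).sum() for k in counts])

print((totals == 0).mean())  # point mass at zero: exp(-0.4) ≈ 0.67
print(totals.mean())         # ≈ 0.4 * 2000 = 800, the pure premium
```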

Tweedie regression is supported in R using the tweedie function in the statmod library, while in Python it can be achieved using either scikit-learn (through the TweedieRegressor class) or statsmodels (through the Tweedie family).
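For example, a minimal scikit-learn sketch (with synthetic, hypothetical data) might look like this:

```python
import numpy as np
from sklearn.linear_model import TweedieRegressor

# Hypothetical data: risk factors X and total claim cost y
# (mostly zeros, occasionally a large positive amount).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
counts = rng.poisson(np.exp(-1.0 + 0.3 * X[:, 0]))
y = np.array([rng.gamma(2, 1000, k).sum() for k in counts])

# power=1.67 selects a compound Poisson-Gamma distribution; link="log"
# keeps the multiplicative structure of the GLMs above.
model = TweedieRegressor(power=1.67, link="log", alpha=0.0).fit(X, y)
```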

Beyond Generalised Linear and Additive Models: Gradient Boosting Machines

In recent years, Gradient Boosting Machines have become a dominant force in the wider machine learning literature and have come to be recognised as state of the art in building highly performant models on tabular data (https://mlcontests.com/state-of-competitive-machine-learning-2022/). GBMs work by training a set of weak learners (i.e., models with poor predictive power on their own, typically decision trees), where each subsequent learner has a goal of improving on the predictions of the previous ensemble of learners. Given their excellent predictive capabilities, it is not surprising that GBMs have also received attention in the insurance literature, with several recent papers showing that GBMs generally outperform GLMs, GAMs, and neural networks in modelling claim frequency, claim severity, and the pure premium (Guelman, 2012; Fauzan & Murfi, 2018; Ciatto et al., 2022).

GBMs are incredibly powerful techniques, yet they are rarely used to set the actual pure premium component of insurance policies. In most cases, there are no regulatory requirements that the pure premiums be calculated using a GLM or GAM, and so the reason is more likely that GBMs are perceived to be less transparent than GLMs and GAMs, where the pure premium can be easily expressed in multiplicative form (when using a log link).

Using GBMs while assuming a Tweedie-distributed response variable can provide a very useful and flexible way of building pure premium models. In the case where we assume a log link (which is by far the norm in insurance applications), each subsequent tree aims to identify a factor by which to multiply the pure premium suggested by the previous ensemble. In other words, if the first N-1 trees produce some estimated pure premium, the task of the Nth tree is to find a factor that gives a better estimate of the pure premium when multiplied with the prediction of the previous N-1 trees. A better estimate of the pure premium in this context is one that minimises the Tweedie deviance.
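A minimal sketch of such a model (assuming LightGBM, with synthetic, hypothetical data):

```python
import numpy as np
import lightgbm as lgb

# Hypothetical data: total claim cost y generated as compound Poisson-Gamma.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
counts = rng.poisson(np.exp(-1.0 + 0.3 * X[:, 0]))
y = np.array([rng.gamma(2, 1000, k).sum() for k in counts])

# The Tweedie objective with its built-in log link: each tree adds a term on
# the log scale, i.e. multiplies the ensemble's pure premium by a factor.
params = {
    "objective": "tweedie",
    "tweedie_variance_power": 1.67,
    "learning_rate": 0.05,
    "verbosity": -1,
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=300)
```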

GBMs perform extremely well when it comes to rank-ordering risk and can dramatically outperform traditional actuarial methods. However, GBMs lack some desirable properties of GLMs, which means that implementing them in an actuarial context is not always straightforward. For example, while a GLM will have little to no global bias (the average predicted claim cost will equal the average observed claim cost), GBMs can often be off the global average by several percentage points. This is a problem locally as well: the expected claim cost among policies with a predicted pure premium of 1000 is not exactly 1000 (again, it can be off by several percentage points). We will refer to such models as being uncalibrated. The below plot illustrates the calibration property by showing the predicted pure premium on the horizontal axis against the true pure premium on the vertical axis for an uncalibrated (left) and a calibrated (right) model.
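One simple way to diagnose such miscalibration in practice (a hypothetical helper, not the code behind the plot) is to compare average predicted and observed claim cost within prediction bins:

```python
import pandas as pd

def calibration_table(y_pred, y_true, n_bins=10):
    """Average predicted vs observed claim cost per prediction decile.
    A calibrated model has ratios close to 1.0 in every bin."""
    df = pd.DataFrame({"pred": y_pred, "true": y_true})
    df["bin"] = pd.qcut(df["pred"], n_bins, duplicates="drop")
    table = df.groupby("bin", observed=True).agg(
        pred=("pred", "mean"), true=("true", "mean")
    )
    table["ratio"] = table["pred"] / table["true"]
    return table
```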

Ideally, we wish to have a model that simultaneously excels at rank-ordering policy risk profiles while also accurately estimating the expected claim cost (the pure premium). The absence of intrinsic calibration is not specific to GBMs; most high-variance regression models display this behaviour, though GBMs and neural networks are the two classes of models that have been most studied in actuarial applications. Over the past 5 years, several methods have been developed for calibrating complex regression models for actuarial applications (Denuit, Charpentier, & Trufin, 2021; Denuit & Trufin, 2021; Denuit & Trufin, 2022; Wuthrich, 2023; Ciatto et al., 2022). At Eika, we use GBMs in a range of applications and have implemented a method for autocalibrating our GBMs through local polynomial regression (Denuit & Trufin, 2022; Ciatto et al., 2022). We will detail how we have implemented this in Part 2 of this post.
