Strategies for handling imbalances in insurance data

Alexander Björkqvist
If Technology
Mar 15, 2024

When I started my journey at If in 2018, the only business area using machine learning I could name with any certainty was pricing. It soon became clear that multiple customer-facing contact points, with their substantial amounts of data, presented many opportunities for machine learning.

Marketing wants to optimize marketing impact and ad targeting. Onboarding and customer service have a lot to gain from predicting customer needs and potential risks arising from external events such as natural disasters. Risk and pricing have a long history of using generalized linear models (GLMs). Finally, the claims handling department deals with everything from predicting claim types and volumes to more specific tasks like recourse and fraud.

One common denominator for these areas is that the interest usually lies in a small subset of the data rather than the larger majority. House fires, natural disasters, and fraud are rare events when looking at all claims and customers. This takes us to the challenge of modeling with imbalanced data.

Imbalanced data comes in many forms. Most motor claims concern chipped windscreens or cracked bumpers, while cars self-combusting or getting stolen are usually in the small minority. The same can, fortunately, be said about common colds versus serious sicknesses and injuries in the personal insurance portfolio. The class in the data you want to model might be 10 %, 1 %, or an even smaller proportion of your total data.

Another related issue is missing labels. For example, a use-case in fraud detection is to figure out why a large part of fraud goes undetected, and how to model that underlying behavior. This puts additional strain on model evaluation where novel metrics are needed to figure out how well your model is going to perform in production.

So, what is the issue with imbalanced data? A prominent challenge related to traditional machine learning is that the models tend to learn the signals and biases from the majority class. The more samples with relevant combinations of features the model gets to see, the more confident it gets. This also means that the model has difficulty figuring out what makes a minority case a minority case, and what separates them from the majority cases.


Models might be able to deliver acceptable results with slightly skewed data, but when dealing with ratios of 1:100 or 1:1000, the model concludes that the best accuracy is derived from assuming that every case belongs to the majority. While this might result in a high accuracy score on the evaluation, the model fails to solve the business case which is finding the minority cases. Then how should one go about solving this problem?

There are two ways to approach this issue:

  • The data approach, where the goal is to make the training data more balanced. The typical way of going about this is either to under-sample the majority class or over-sample the minority class.
  • The model approach relates to the models themselves: weighting the classes and choosing the best model(s) for the task.

The data approach

If you are not getting the results you were hoping for, it is usually best to get back to the data and see if something can be done. When it comes to working with imbalanced data, the main approaches are either under- or over-sampling.

Under-sampling means that you remove samples from the majority class to get a better ratio between the classes. Models like logistic regression favor ratios closer to 1:1, while newer models like random forest and ensemble models give adequate results with higher ratios.

A thorough analysis of the data is crucial, as randomly removing samples from the majority class might result in losing information that could be beneficial for the modeling. This could lead to bad model performance but also result in low external validity, where the model fails to generalize on the real-world data.

The more popular Python libraries, for example, scikit-learn and its imbalanced friend, imbalanced-learn, have built-in strategies and stratifying options when it comes to sampling data for general purposes and train/test splits. They also include more sophisticated methods of under-sampling, for example using Tomek links, a technique specifically for removing majority cases that are close in feature space to the minority class.
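As a minimal sketch, this is roughly what random under-sampling and Tomek links look like with imbalanced-learn. The synthetic data, class ratio, and sampling_strategy values are purely illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

# Synthetic, roughly 1:100 imbalanced data standing in for real claims data.
X, y = make_classification(
    n_samples=50_000, n_features=10, weights=[0.99, 0.01], random_state=42
)

# Stratify so train and test keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Random under-sampling: keep every minority case and down-sample the majority
# until the minority-to-majority ratio is 0.5 (minority becomes a third of the data).
rus = RandomUnderSampler(sampling_strategy=0.5, random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)

# Tomek links: only remove majority samples that sit right next to minority
# samples in feature space, cleaning the class boundary instead of removing
# points at random.
tl = TomekLinks(sampling_strategy="majority")
X_tl, y_tl = tl.fit_resample(X_train, y_train)
```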

Under-sampling is preferred when there is an abundance of training data. If this is not the case, then over-sampling could be considered. Over-sampling means randomly selecting cases from the minority class and duplicating them. The advantage of over- versus under-sampling is that over-sampling induces no loss of information. The downside is that over-sampling by duplication does not add new information either. A good understanding of the data is crucial, as random over-sampling increases the chances of overfitting.
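A minimal sketch of random over-sampling by duplication with imbalanced-learn's RandomOverSampler, again on illustrative synthetic data:

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(
    n_samples=50_000, n_features=10, weights=[0.99, 0.01], random_state=42
)

# Duplicate randomly chosen minority cases until the classes are balanced.
# No information is lost, but nothing new is added either, so the risk of
# overfitting to the duplicated points grows.
ros = RandomOverSampler(sampling_strategy=1.0, random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)
print(sum(y_ros == 1), sum(y_ros == 0))  # now equal counts
```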

Sampling strategies

An alternative to random over-sampling through duplication is creating synthetic data based on the minority class. A popular approach is called the Synthetic Minority Oversampling Technique, or SMOTE. Simply put, it interpolates synthetic data points between chosen neighboring cases from the minority class. The created cases are thus "unique" and reduce the risk of overfitting.

The downside is that the relation between minority and majority cases in feature space is not considered, so the approach can create overlap between classes, which results in noise and decreased model performance. Variants such as modified SMOTE and adaptive synthetic sampling (ADASYN) address these issues with different sampling strategies.
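A minimal sketch of synthetic over-sampling with SMOTE and ADASYN from imbalanced-learn; the neighbor counts and data below are illustrative defaults rather than tuned choices.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

X, y = make_classification(
    n_samples=50_000, n_features=10, weights=[0.99, 0.01], random_state=42
)

# SMOTE: for each minority case, pick one of its k nearest minority neighbors
# and interpolate a new synthetic point somewhere on the line between them.
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)

# ADASYN: like SMOTE, but generates more synthetic points for minority cases
# that are harder to learn (those surrounded by majority neighbors).
X_ada, y_ada = ADASYN(n_neighbors=5, random_state=42).fit_resample(X, y)
```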

The model approach

When the data is processed and ready for use, the next area to look at is modeling. Choosing the right modeling approach for training is crucial and will be dependent on the outcome of the data prep.

Although multiple approaches should be tried, tree-based models generally tend to perform better on imbalanced data than generalized linear models. Linear models describe the outcome as a linear function of the features, whereas tree-based models use hierarchical splits and can handle complex inputs like categorical features, as well as non-linear relationships between the classes.
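As a rough illustration of the point, the sketch below fits a logistic regression and a random forest on the same illustrative imbalanced data and compares their F1 scores; the dataset and settings are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.99, 0.01], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit one linear and one tree-based model and compare F1 on the minority class.
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=42)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(type(model).__name__, round(f1_score(y_test, pred), 3))
```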

Different modeling approaches

More advanced techniques like ensemble modeling, where the output is aggregated or voted from multiple different trained models, or from similar models trained on different samples of the data, can result in improved predictive power while helping balance the bias-variance tradeoff. This approach really shines when there are both linear and non-linear relationships in the data. These advantages come at the price of usually being more computationally expensive, and the outputs can be more difficult to interpret, which might be an issue in some use cases.
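One simple way to build such an ensemble is a soft-voting classifier that blends a linear and a tree-based model. The sketch below uses scikit-learn's VotingClassifier on illustrative data; the estimator names and settings are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.95, 0.05], random_state=42
)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=200, random_state=42)),
    ],
    voting="soft",  # average predicted probabilities instead of hard votes
)
ensemble.fit(X, y)
proba = ensemble.predict_proba(X)[:, 1]  # blended probability of the minority class
```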

One last approach to modeling is tuning how the model learns through weights and loss functions. The first option to look at is weighting the data so that misclassification of minority cases results in a higher penalty. Another, more novel, approach is using something called focal loss. Similar in practice to weighting, focal loss reduces the impact of easy-to-classify cases on the loss function, thus putting more weight on correctly classifying the difficult cases.
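As a sketch, class weighting is a one-liner in scikit-learn, and binary focal loss can be written in a few lines of NumPy. The gamma and alpha values below are common illustrative defaults, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# "balanced" re-weights samples inversely to class frequency, so a missed
# minority case costs more than a missed majority case.
model = RandomForestClassifier(class_weight="balanced", random_state=42)

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: down-weights easy, well-classified cases so the
    loss is dominated by the hard, typically minority, cases."""
    p_pred = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)    # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # class balancing term
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))
```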

Metrics

As shown, if the data is skewed, looking at accuracy will not work. Fortunately, there are a bunch of useful metrics that can be used to better assess the model performance.

Depending on your use case, the main metrics you will want to look at are precision and recall, and metrics based on them, like the F1 score. Let us go through this with the example of trying to find fraudulent transactions. Precision tells you how many of the cases the model flagged as fraudulent were actually labeled as fraudulent. Recall tells you how many of all the fraudulent cases in the test data the model managed to find.

Precision and recall

If you are training multiple models and want a simple way to compare them, the F1 score, the harmonic mean of precision and recall, condenses both into a single value. The bottom line is that the metrics should be chosen to fit the purpose, and sometimes the best choice is simply looking at a confusion matrix that shows the raw counts of the model's performance on the evaluation data.

Confusion matrix
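A minimal sketch of these metrics with scikit-learn, using a tiny made-up set of fraud labels and predictions:

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]   # 1 = labeled fraudulent
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]   # the model's predictions

print(precision_score(y_true, y_pred))    # 0.50: half of the flagged cases were fraud
print(recall_score(y_true, y_pred))       # 0.67: two of the three fraud cases were found
print(f1_score(y_true, y_pred))           # harmonic mean of the two, ~0.57
print(confusion_matrix(y_true, y_pred))   # raw counts: [[TN, FP], [FN, TP]]
```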

There are also good tools like SHAP that let you get a better understanding of feature importance and contributions, and help explain why you are getting the results you are getting.
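A minimal sketch of how this could look with SHAP's TreeExplainer on a tree-based model; the model and data below are placeholders for illustration.

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(
    n_samples=5_000, n_features=8, weights=[0.95, 0.05], random_state=42
)
model = GradientBoostingClassifier(random_state=42).fit(X, y)

# TreeExplainer works with tree-based models; shap_values holds each
# feature's contribution to each individual prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:500])

# Summary plot: overall feature importance and the direction of each effect.
shap.summary_plot(shap_values, X[:500])
```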

Conclusions

Data science is a lot like brewing beer. It sounds too good to be true until you try it and realize that most of the time goes into cleaning and hygiene factors. Real-world data is usually messy, imbalanced, and difficult to interpret. The best approach is to make sure that every step from use-case and data prep to model evaluation and production is done right before moving over to the next part. There are many interesting tools and approaches, so do not be afraid to experiment and question assumptions.
