How Synthetic Data Helps Insurance Business

Mar 11, 2024

Authors

Mireia Rojo Arribas, Head of Advanced Analytics, MAPFRE Insurance.

Faxi Yuan, PhD in Design, Construction and Planning. Data Scientist, MAPFRE Insurance.

Hamed Farahmand, PhD in Civil Engineering. Data Scientist, MAPFRE Insurance.

As the number one home and auto insurer in Massachusetts, with a presence in 11 states, MAPFRE Insurance is always looking for ways to improve its operations while delivering quality coverage to consumers. Refining the claims process, specifically when investigating fraudulent claims, offers a great opportunity to incorporate cutting-edge Artificial Intelligence (AI) to help identify and combat potential fraud. The insurance industry faces significant losses annually due to fraud. According to 2022 figures from the Coalition Against Insurance Fraud (CAIF), $308.6B is stolen every year, with property-casualty losses making up 10% of cases.

To help curb potential losses, MAPFRE created a process that identifies, flags and refers claims suspected of fraud. The AI prevention system, which applies Machine Learning (ML) and Graph Analytics, analyzes multiple historical data points to detect fraud patterns. The referral process uses both business rules and an AI-driven approach created by MAPFRE’s Technical Claims and Advanced Analytics teams.

When a claim is created, the referral process uses the AI model to “score” the claim. Claims that are scored as more likely to be fraudulent are referred to the Claims team, who together with the Special Investigation Unit (SIU), conduct an investigation and make the final determination as to whether the claim is fraudulent. At no point does the referral process make any decision; it acts as a recommendation system for the Claims team.
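The referral step described above can be pictured as a simple threshold on the model score. The sketch below is illustrative only: the `refer_claims` helper, the claim IDs, the scores and the 0.8 threshold are all hypothetical, and the threshold actually used is confidential.

```python
from typing import Dict, List


def refer_claims(scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the IDs of claims whose fraud score meets the threshold.

    The returned list is only a recommendation: the final determination
    is made by human investigators, not by this function.
    """
    return sorted(claim_id for claim_id, score in scores.items() if score >= threshold)


# Hypothetical scores produced by a fraud model (all values are made up).
example_scores = {"CLM-001": 0.12, "CLM-002": 0.91, "CLM-003": 0.47, "CLM-004": 0.85}
referred = refer_claims(example_scores, threshold=0.8)
print(referred)  # ['CLM-002', 'CLM-004']
```

Keeping the function limited to producing a list mirrors the point made above: the model recommends, and the Claims team and SIU decide.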

The AI-driven fraud detection process has helped the Claims team improve the accuracy and efficiency of fraud detection, contributing to maintaining trust with the policyholders and providing cost savings to MAPFRE’s bottom line. The first AI-driven process was implemented for auto claims, and, considering the positive results shown with the false positive reduction, later for homeowners (HO) claims, which is the focus of this document.

As mentioned above, the AI fraud models are trained using historical data points. For property, when the dataset was generated, the Advanced Analytics team noticed that it was highly imbalanced: historically, fraudulent claims have far fewer instances than non-fraudulent claims. This makes it hard for the AI algorithm to learn from the minority class, “fraudulent claims”, and to develop an accurate HO fraud detection model. To overcome this challenge of limited data size and extremely imbalanced data, the team at MAPFRE proposed the use of synthetic data to augment both data size and diversity.

IBM (2023) described synthetic data as computer-generated information, aiming to enhance AI model performance, protect sensitive data and mitigate bias. Such synthetic data generation belongs to the domain of Generative AI, which is widely used for the generation of unstructured data including text, images and videos. In the experiment, MAPFRE’s Advanced Analytics team, first partnering with a third-party vendor, and later on their own, utilized one of the Generative AI models to generate synthetic tabular data, and eventually augment the AI model performance. The results were successful and the models are currently in production, which implies that there is an AI-driven fraud detection process for HO where the creation of synthetic data has been essential for its development.

1. Synthetic Data

A successful model development process requires collecting and processing high-quality historical data from different sources. To develop the HO fraud detection model, several data sources were selected, including but not limited to: claims and policy information, notes on the claims and policies, graph-powered data on interconnections between claims, geocode data, and weather data. The main challenge in this model, however, was the imbalance in the target variable (fraudulent claims) and the small size of the fraudulent class. This posed a critical challenge for ML and Graph Analytics model development, as the AI algorithm did not have enough data to learn the characteristics of fraudulent claims. In addition, because the minority class is so small, conventional solutions for data imbalance such as under-sampling are of limited use. Therefore, another technique was employed to manage both the data imbalance in the target variable and the small minority class size: the use of synthetic data.

As mentioned in the introduction, the generation of synthetic data belongs to the domain of Generative AI. According to Nvidia (2023), Generative AI is a popular tool for quickly generating new content based on various inputs, where inputs and outputs can be text, images, sounds, animation, 3D models or other types of data. To be specific, neural networks are employed by Generative AI models to detect the patterns and structures of training data to generate new and original data.

Both open-source Generative AI models and commercial vendors are available for generating synthetic data. To address the fraud model challenge, MAPFRE’s Advanced Analytics team started by collaborating with a third-party vendor and using its tool to generate synthetic data. The tool used an ensemble of methods, consisting of random sampling and Generative AI models, and automatically chose the generation method for each variable of the training data. The team then trained its ML fraud models with both the synthetic data generated by the tool and historical data. In the meantime, the team also explored open-source Generative AI models to generate synthetic data, developing a new ML model in parallel. The team compared the ML model performance obtained with synthetic data from the third-party tool against that obtained with synthetic data from the open-source models. It found better results for the ML model trained with historical data plus synthetic data generated by the open-source Generative AI model called Conditional Tabular Generative Adversarial Networks (CTGAN). The section below discusses the GAN and CTGAN models.

1.1 GANs and CTGANs

Among the popular Generative AI models, Generative Adversarial Networks (GANs), proposed by Goodfellow et al. (2014), have been widely used for generating new content such as images. A GAN consists of two models: a generator that captures the data distribution and generates new content, and a discriminator that classifies content as real or generated. Both the generator and the discriminator are neural networks, as illustrated in Figure 1.

The generator takes randomly generated data as input and learns to transform it into new data that follows the patterns and distribution of the real data, while the discriminator takes both generated data and a random selection of real data as input and tries to tell them apart. The generator and discriminator are trained together, which results in a smarter generator that produces better content and a discriminator that is better at distinguishing generated content from real data. Training continues over several iterations until the discriminator can no longer distinguish generated data from real data.
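For reference, this adversarial training is usually written as the minimax objective from Goodfellow et al. (2014), where $G$ is the generator, $D$ is the discriminator, $x$ is a real data sample and $z$ is the random input vector:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator tries to push $D(x)$ toward 1 for real samples and $D(G(z))$ toward 0 for generated ones, while the generator tries to make $D(G(z))$ large. When the discriminator outputs values close to $1/2$ for both kinds of input, it can no longer tell them apart, which is the stopping condition described above.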

**Figure 1**. Process and components of GANs model. Source: MAPFRE Insurance.

However, GANs face two critical challenges when generating tabular data: 1) continuous values are usually non-Gaussian, in contrast to the Gaussian-like distribution of pixel values in image data; and 2) categorical variables often have highly imbalanced distributions, so minority categories are more likely to be ignored in the training process. To overcome these issues, Xu et al. (2019) proposed CTGAN, which adds mode-specific normalization, a conditional generator, and a training-by-sampling strategy to the GAN model.
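The training-by-sampling idea can be illustrated with a small sketch. In Xu et al. (2019), the category used as the condition is sampled with probability proportional to the logarithm of its frequency rather than its raw frequency, so rare categories are seen more often during training. The function name `log_frequency_pmf`, the `+ 1` inside the logarithm and the toy column below are our own assumptions for illustration, not the reference implementation.

```python
import math
from collections import Counter
from typing import Dict, List


def log_frequency_pmf(values: List[str]) -> Dict[str, float]:
    """Sampling probability per category, proportional to log(count + 1)."""
    counts = Counter(values)
    weights = {category: math.log(count + 1) for category, count in counts.items()}
    total = sum(weights.values())
    return {category: weight / total for category, weight in weights.items()}


# Toy categorical column: 990 rows of one city and 10 rows of another.
column = ["Boston"] * 990 + ["Cambridge"] * 10
raw_share = column.count("Cambridge") / len(column)
pmf = log_frequency_pmf(column)

print(f"raw share of the rare category: {raw_share:.3f}")
print(f"log-frequency sampling probability: {pmf['Cambridge']:.3f}")
```

In this toy column the rare category is picked far more often than its 1% raw share would suggest, which is the effect that keeps minority categories from being ignored.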

Based on the GANs model, CTGAN introduced the conditional vector based on one of the categorical variables and inserted it alongside the randomly generated vectors as input to train the generator to learn the data patterns and distribution as illustrated in Figure 2. For instance, if there are two categorical variables, loss city and claimant type, they will be encoded with the one-hot encoding method.

In Figure 2, the team used the claimant type “insured” to create the conditional vector. The conditional vector has a single non-zero value, at the position that corresponds to the claimant type being insured, while the three city values (Boston, Worcester, Cambridge) and the value for a claimant type that is not insured are all zeros. This conditional vector is included with the randomly generated vector as the input to train the generator. Accordingly, the sampled real data row also has the claimant type insured and is used together with the generated data to train the discriminator. As a result, the CTGAN model considers the global distributions of all the variables from the training data.
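A minimal sketch of how such a conditional vector could be built for the two categorical variables in this example. The column names `loss_city` and `claimant_type`, the label "not insured", the helper `build_conditional_vector` and the noise length of 4 are hypothetical choices made for illustration.

```python
import random
from typing import Dict, List

# Categories per discrete column, in a fixed order (following the example above).
CATEGORIES: Dict[str, List[str]] = {
    "loss_city": ["Boston", "Worcester", "Cambridge"],
    "claimant_type": ["insured", "not insured"],
}


def build_conditional_vector(column: str, value: str) -> List[int]:
    """Concatenate the one-hot blocks of all columns, with a single 1
    at the position of the chosen (column, value) pair."""
    vector: List[int] = []
    for name, values in CATEGORIES.items():
        for candidate in values:
            vector.append(1 if (name == column and candidate == value) else 0)
    return vector


cond = build_conditional_vector("claimant_type", "insured")
print(cond)  # [0, 0, 0, 1, 0]

# The generator input is the random noise vector with the condition appended.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
generator_input = noise + cond
print(len(generator_input))  # 9
```

Only one position across all categorical columns is set, so each training step conditions on a single category of a single variable.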

**Figure 2**. Process and components of CTGAN model. Source: MAPFRE Insurance.

1.2 Implementation of Synthetic Data into MAPFRE Project

To improve the performance of the ML models for the HO fraud detection model developed with just historical data, the Advanced Analytics team leveraged the capabilities of both third-party vendor’s models and CTGAN models. This provided the opportunity to compare the potential improvements resulting from each synthetic data generation method and select the optimal method for the project.

Before continuing, it should be noted that the real (historical) datasets were initially split into train and test sets. The train set was used for synthesizer training and later combined with synthetic data for ML model development, while the test set was used only for evaluating ML model performance and never for synthesizer or ML model development. Therefore, any ML model result reported below corresponds to model performance on the test set.
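The split described above can be sketched as follows. The article does not say how the split was made; a stratified split (keeping the fraud ratio similar in both parts) is our assumption, and the dataset, the `is_fraud` column and the 20% test fraction are made up.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, int]


def stratified_split(rows: List[Row], target: str, test_fraction: float,
                     seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Split rows into train and test sets, keeping the class ratio of
    `target` roughly equal in both parts."""
    rng = random.Random(seed)
    by_class: Dict[int, List[Row]] = {}
    for row in rows:
        by_class.setdefault(row[target], []).append(row)

    train: List[Row] = []
    test: List[Row] = []
    for class_rows in by_class.values():
        shuffled = class_rows[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_fraction))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Made-up dataset: 1,000 claims, 10 of them fraudulent (1%).
claims = [{"claim_id": i, "is_fraud": 1 if i < 10 else 0} for i in range(1000)]
train_rows, test_rows = stratified_split(claims, target="is_fraud", test_fraction=0.2)

# Only train_rows would be passed to the synthesizer and to ML model training;
# test_rows are held back for evaluation.
print(len(train_rows), len(test_rows))  # 800 200
```

Keeping the test rows away from the synthesizer matters as much as keeping them away from the ML model, since synthetic rows derived from test data would leak information into training.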

Considering the imbalance in the target variable as the main challenge, MAPFRE’s Advanced Analytics team started using synthetic data to augment the minority class. Specifically, the team first trained a synthetic data generator model (the synthesizer provided by the third-party vendor) with real data. As with any other model development process, training the synthesizer is the most time-consuming step. It requires setting hyperparameters for the model and then performing model training using the input data. After the model is trained, it can generate synthetic data of different sizes and with different ratios of the target variable (or any given feature) by rebalancing the generated data, conditioning on the given variable (target or feature). For this model, the team trained two synthesizer models, one using the third-party vendor’s tool and one using CTGAN, and tuned hyperparameters to improve performance.

It should be noted that both synthesizers were trained using only the selected features used in the final ML model. This is a relevant point, as an initial feature selection needs to be conducted before generating synthetic data, mainly because the synthesizer’s training time increases with the number of features.

The team then used the trained synthesizer models to generate synthetic datasets. In the initial experiment, the team generated synthetic datasets with the same size as the real training dataset, each with a specific target class ratio. Various minority class ratios, including 0.5%, 1%, 5%, and 10%, as well as the synthesizers’ default ratio (no rebalancing), were employed to adjust the distribution of the target variable. Thereafter, the team combined each synthetic dataset with the real training dataset, trained the ML model, and evaluated its performance. After comparing these results with the initial ML model (i.e., the ML model trained without synthetic data), the MAPFRE Advanced Analytics team selected the dataset that yielded the best ML model performance, and the model trained on it became the base model.
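The ratio sweep in this initial experiment can be sketched as below. This is a simplified stand-in: a real synthesizer would typically produce rows for a requested class through conditional sampling, whereas here we simply draw from a pre-generated pool. The `rebalance` helper, the pool, the `is_fraud` column and all sizes are hypothetical.

```python
import random
from typing import Dict, List

Row = Dict[str, int]


def rebalance(pool: List[Row], target: str, size: int, minority_ratio: float,
              seed: int = 0) -> List[Row]:
    """Draw `size` rows from a pool of generated rows so that the share of
    minority-class rows (target == 1) equals `minority_ratio`."""
    rng = random.Random(seed)
    minority = [r for r in pool if r[target] == 1]
    majority = [r for r in pool if r[target] == 0]
    n_minority = round(size * minority_ratio)
    n_majority = size - n_minority
    if n_minority > len(minority) or n_majority > len(majority):
        raise ValueError("pool is too small for the requested size and ratio")
    return rng.sample(minority, n_minority) + rng.sample(majority, n_majority)


# Stand-in for synthesizer output: 5,000 generated rows, 500 flagged as fraud.
synthetic_pool = [{"row_id": i, "is_fraud": 1 if i < 500 else 0} for i in range(5000)]
# Stand-in for the real training set: 800 rows, 8 of them fraudulent.
real_train = [{"row_id": -i - 1, "is_fraud": 1 if i < 8 else 0} for i in range(800)]

for ratio in (0.005, 0.01, 0.05, 0.10):
    # Synthetic dataset of the same size as the real training set.
    synthetic = rebalance(synthetic_pool, "is_fraud", size=len(real_train),
                          minority_ratio=ratio)
    combined = real_train + synthetic  # training set for one experiment
    print(ratio, len(combined), sum(r["is_fraud"] for r in synthetic))
```

Each `combined` set would then be used to train one candidate ML model, and all candidates would be scored on the same held-out test set.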

The team performed three sets of additional experiments using the synthesizer model that performed best in the initial experiments. In each set, one parameter related to synthetic data was changed in a defined range; then for each instance of the parameter, a synthetic dataset was generated and used for ML model development. The three experiments varied in the following ways:

Only using minority target class from synthetic data: Since the main purpose of using synthetic data in this project has been to deal with target class imbalance, the team designed this experiment to evaluate if only using the minority class from synthetic data can help ML model improvement.
Using minority target class plus a portion of majority target class from synthetic data: In this set of experiments, the team tested if adding a portion of majority class can yield better model performance.
Only using synthetic data for training: The team generated synthetic data of different sizes and solely used the synthetic data for training. The difference between this set of experiments with previous experiments is that, in these experiments, real data is only used for synthesizer training and is not used for ML model training. This experiment tried to simulate what would happen if the model could not be trained with real data, but just anonymized data.
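The three ways of composing the training set can be sketched as follows. The function names and toy data are hypothetical, and reading "a portion of majority target class" as a fraction of the synthetic majority rows is our interpretation.

```python
from typing import Dict, List

Row = Dict[str, int]


def minority_only(real: List[Row], synthetic: List[Row], target: str) -> List[Row]:
    """Experiment 1: real data plus only the synthetic minority rows."""
    return real + [r for r in synthetic if r[target] == 1]


def minority_plus_majority_share(real: List[Row], synthetic: List[Row], target: str,
                                 majority_share: float) -> List[Row]:
    """Experiment 2: real data, all synthetic minority rows, and the first
    `majority_share` fraction of the synthetic majority rows."""
    minority = [r for r in synthetic if r[target] == 1]
    majority = [r for r in synthetic if r[target] == 0]
    keep = round(len(majority) * majority_share)
    return real + minority + majority[:keep]


def synthetic_only(synthetic: List[Row]) -> List[Row]:
    """Experiment 3: the real data is left out of ML model training."""
    return list(synthetic)


# Toy data: 200 real rows (2 fraud) and 200 synthetic rows (10 fraud).
real = [{"is_fraud": 1 if i < 2 else 0} for i in range(200)]
synth = [{"is_fraud": 1 if i < 10 else 0} for i in range(200)]

print(len(minority_only(real, synth, "is_fraud")))                      # 210
print(len(minority_plus_majority_share(real, synth, "is_fraud", 0.5)))  # 305
print(len(synthetic_only(synth)))                                       # 200
```

In all three cases the test set stays untouched; only the composition of the training set changes.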

2. Results

Let’s start with initial results of ML model development using different sets of synthetic data. Initial results showed that both the third-party vendor and CTGAN model synthesizers are able to boost the performance of the ML model. Meanwhile, we could see that for this project, the trained CTGAN model had slightly better performance. Therefore, we focus on the results of the performance of CTGAN model. Table 1 shows the initial results of the ML model development using different sets of synthetic data generated by CTGAN model for the HO fraud detection model. The best performing model was achieved in two experiments, where the ML model was trained using real training data combined with synthetic data generated from CTGAN model without rebalancing, and where the ML model was trained using real training data combined with synthetic data generated from CTGAN model with 0.5% of minority class ratio. These models were able to lift precision from 1.38% to 2.11% and recall from 62.5% to 87.5%. Therefore, for the next experiments, we selected the CTGAN model as the synthesizer.
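For readers less familiar with the two metrics, the sketch below shows how precision and recall are computed from confusion counts. The counts are hypothetical, chosen only so that the ratios land on the percentages quoted above; the real test-set counts are not published.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of referred claims that are actually fraudulent."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of fraudulent claims that the model refers."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical confusion counts (NOT the real test-set counts).
before = {"tp": 10, "fp": 715, "fn": 6}   # model trained on real data only
after = {"tp": 14, "fp": 650, "fn": 2}    # model trained with synthetic data added

print(f"{recall(before['tp'], before['fn']):.4f}")     # 0.6250
print(f"{recall(after['tp'], after['fn']):.4f}")       # 0.8750
print(f"{precision(before['tp'], before['fp']):.4f}")  # 0.0138
print(f"{precision(after['tp'], after['fp']):.4f}")    # 0.0211
```

Low precision alongside high recall is common when the positive class is rare: most referred claims turn out to be legitimate, but few fraudulent ones are missed.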

Table 1. ML model performance on different sets of synthetic data for HO fraud detection model (number of features < 100). Note that the threshold is confidential; however, the same threshold is used for both Table 1 and Table 2.

*Altered: Instead of adding all the synthetic data, only the minority class is added.

Next, let us focus on the next sets of experiments using the synthesizer trained using CTGAN model. Table 2 shows a summary of the experiments and the performance of ML models developed using the synthetic data generated for each experiment.

Experiment 1. Only using minority target class from synthetic data.

Experiment set 1 was performed on different ratios of the minority class, including 20%, 10%, 5%, 1%, and 0.5% (refer to the Synth Data Minority Class Ratio column). As can be seen in Table 2, none of the models trained on these synthetic datasets was able to beat the base model (which was the best performing model from the initial experiments).

Experiment 2. Using minority target class plus a portion of majority target class from synthetic data.

Next, we moved to Experiment set 2, where we also included a portion of the majority class from synthetic data in ML model training. We selected two minority class ratios for this experiment set: 1%, which yielded the best performance in Experiment 1, and no rebalancing, which is the minority ratio of the base model. We also selected three ratios of the majority class: 25%, 50%, and 75% (refer to the Majority Class column). As can be seen, the ML model trained on synthetic data with a 1% minority class ratio and 50% of the majority class shows some improvement over the base model. Specifically, precision is 2.26% versus 2.11% in the base model, and recall increased to 93.75% from 87.50% in the base model.

Experiment 3. Only using synthetic data for training.

In experiment set 3, where only the synthetic data is used for training ML model, we used 1% and 0.5% class ratio with different multipliers including one, two and three (refer to Multiplier column). Multiplier shows the size of the synthetic data relative to the size of the real training data. For example, when multiplier equals three, the size of synthetic data in training is three times the real training data and the overall training data is four times the size of real training data. As expected, the use of synthetic data without real data for ML model training did not show any improvement in this project.

Table 2. ML model performance in different experiments for HO fraud detection model (same number of features as in Table 1; the multiplier shows the size of the synthetic data relative to the size of the real training data). Note that the threshold is confidential; however, the same threshold is used for both Table 1 and Table 2.

The improvement obtained by using synthetic data has a positive impact on the results expected from using the model. Below, using hypothetical numbers, we illustrate the potential improvement from using both synthetic and historical data. Recall is calculated as the number of claims correctly identified as fraudulent divided by the total number of fraudulent claims. Assuming a constant 100 fraudulent claims per year, the improvement in recall from 62.5% to 93.75% leads to capturing around 31 more fraudulent exposures per year.

Assuming an average of $10,000 in savings per exposure, this translates into $310,000 in annual savings achieved by using synthetic data in ML model development. While these estimates are not guaranteed, due to the limitations of estimating model performance when data size is small, they provide a high-level view of the benefits of using synthetic data in the ML model development process.
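The arithmetic behind this estimate can be checked in a few lines. The variable names are ours; the inputs are the same hypothetical figures used in the text.

```python
fraud_claims_per_year = 100     # hypothetical figure used in the text
recall_before = 0.625           # model trained on real data only
recall_after = 0.9375           # best model trained with synthetic data added
savings_per_exposure = 10_000   # hypothetical average, in dollars

# Additional fraudulent exposures captured per year.
extra_caught = (recall_after - recall_before) * fraud_claims_per_year
# Round to whole exposures before converting to dollars, as in the text.
annual_savings = round(extra_caught) * savings_per_exposure

print(extra_caught)    # 31.25
print(annual_savings)  # 310000
```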

3. Conclusions

In this article, we discussed the application of synthetic data in the Homeowners Fraud Detection Project to improve the predictive power of the model. The main challenge that synthetic data was meant to address was the severe imbalance in the target variable and the small size of the minority target class. To overcome this challenge, MAPFRE’s Advanced Analytics team used two different synthetic data generation models and designed several experiments to achieve considerable improvement in ML model performance. As discussed, in this case, introducing synthetic data into ML model training led to significant improvements in ML model performance, which is expected to increase the potential business impact of the model. This project can be considered a showcase of the capability of synthetic data to improve the performance of AI models, and it encourages applying synthetic data to other predictive modeling use cases. While improvement is not guaranteed, the use of synthetic data can be recommended when facing data imbalance and small data size issues. Nevertheless, the application of synthetic data is not limited to these cases and can be tested in any model development.

References

Coalition Against Insurance Fraud. (2024). Fraud Stats. https://insurancefraud.org/fraud-stats/, last accessed on Jan 8, 2024.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … & Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.

IBM Research. (2023). What is synthetic data? https://research.ibm.com/blog/what-is-synthetic-data, last accessed on Dec 4, 2023.

Nvidia. (2023). What is Generative AI? https://www.nvidia.com/en-us/glossary/data-science/generative-ai/, last accessed on Dec 4, 2023.

Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling tabular data using conditional GAN. Advances in neural information processing systems, 32.