Towards generating realistic synthetic insurance data

Sam Richardson
Published in gft-engineering
Apr 27, 2021

This blog post details our motivation and methods for generating realistic synthetic insurance data.

Obtaining good quality data is a limiting factor in many data science projects, and industries such as insurance can be very protective of their datasets. The AI team at GFT found that a lack of publicly available data was an issue when creating a demo of Google Cloud Platform (GCP) AI services for insurance applications. We solved this problem by creating our own synthetic datasets, and in this post I’ll describe our data requirements, how we created the data and the challenges we faced.

The demo we prepared was designed for insurers who had recently migrated their processes to the cloud and, in doing so, could now access new AI services and work with their data in a way not previously possible. A key target was to keep the data and modelling approaches realistic so the demo would be relatable to insurance professionals. Two concepts were chosen: automobile and home insurance policy pricing. In both cases, the idea was to export a dataset from an insurance application (GCP hosted instance of Guidewire [1]) into BigQuery, enrich the data using other sources and utilise AI Platform to build and deploy risk models. These models could then be called when making a pricing request through the insurance application. The central elements we wanted to demo were:

  • Integration with Guidewire.
  • Ease of working with big datasets.
  • Ability to easily bring in additional data sources for data enrichment.
  • Value added using GCP AI services with our custom architecture and tools.

To provide a useful signal to the pricing engine, we planned for our models to generate a prediction of the expected frequency and severity of claims for a given customer over their insurance term, which is a common approach to insurance pricing [2]. From this initial plan, we established several requirements for our datasets:

  • Automobile and home insurance data.
  • Large in scale.
  • Rich feature set to allow for data enrichment.
  • Frequency and severity features to model.

We started the work by looking online for public datasets but found limited options available, with none matching our full set of requirements. To get around this issue, we decided to generate new synthetic datasets based upon public datasets but adapted to suit our needs. We were seeking to demonstrate modelling of insurance data, but the demo’s focus was on the tooling, workflows and architecture we had built for modelling rather than on the models themselves. For this situation, synthetic data was perfect for us as we could tailor the dataset to the story we wanted to tell.

For the rest of this post, I’ll focus on the work generating the synthetic automobile dataset. First, we identified public datasets by searching on Google, Google Datasets and Kaggle. For automobile insurance, we found a dataset of American policies [3] and another covering Norwegian policies [4]. The two datasets differed in scope, but between them we had enough data to create a good feature set, including our desired modelling targets: the severity and frequency of claims.

The American data featured a rich selection of features but only 10K rows, which was fewer than we needed to demonstrate the scale of data handling that the platform allows. Additionally, the dataset did not include claim frequency, and the distribution of severity did not follow what we would typically expect for an insurance portfolio. The Norwegian dataset contained 200K rows, which was better for demonstrating scale, and featured more realistic distributions for claim frequency and severity, but it had far fewer features. The difference is shown in Figure 1: the Norwegian data follows what we would expect for a typical portfolio, where only a small number of clients make claims, whereas every example in the American data appears to have made a claim.

Figure 1. Comparison of customers with no claims and those who claimed over the insurance period within the open-source American and Norwegian datasets.

Following our exploration of the data, we chose to base our features upon the American dataset but build our targets to follow distributions seen in the Norwegian data.

When designing synthetic datasets, there are three main points to consider for tabular data:

  • Data schema.
  • Data distributions.
  • Data correlations.

Ideally, a realistic synthetic dataset should mimic all three of these elements when compared to a real dataset.

Fortunately, a lot of work has been done on generating synthetic data for applications such as differential privacy, which provided plenty of literature and tools online to get us up to speed quickly. Matching the data’s schema is the most straightforward aspect of data generation: it simply requires extracting an existing schema, or potentially merging elements from multiple schemas. Many methods exist for extracting a schema from a dataset; for this work, we used the Python package Pandas.
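As a minimal sketch of what schema extraction can look like with Pandas (the small inline dataframe and its column names are hypothetical stand-ins for the real reference data):

```python
import pandas as pd

# Hypothetical slice of the reference data; in practice this would be the
# full public dataset loaded from file.
real_df = pd.DataFrame({
    "customer_id": ["A001", "A002", "A003"],
    "driver_age": [23, 45, 37],
    "vehicle_value": [8000.0, 22000.0, 15000.0],
    "policy_type": ["basic", "premium", "basic"],
})

# The column names and dtypes together give a simple schema that synthetic
# data can be generated against (or merged with columns from other sources).
schema = real_df.dtypes.to_dict()
print(schema)
```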

To generate independent features with representative distributions, you can model the distribution found in the original data and then draw random samples from the fitted distribution to generate new data. For categorical data, a distribution can be modelled from a real dataset by calculating the relative frequency of each observed category. Once this has been calculated, you can draw from this distribution randomly to create new data. An example of how this can be done in Python is shown below, with a full notebook and sample data here: https://github.com/datascience-gft/blog_posts/tree/main/synthetic_insurance_data.
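The linked notebook contains the original implementation; the sketch below illustrates the same idea with pandas and NumPy, using a hypothetical “education” column as a stand-in for a real low-cardinality feature.

```python
import numpy as np
import pandas as pd

# Hypothetical reference data with a low-cardinality categorical column.
real_df = pd.DataFrame({
    "education": ["High School", "Bachelors", "Bachelors", "Masters",
                  "High School", "PhD", "Bachelors", "High School"]
})

# Model the categorical distribution as the observed relative frequencies.
category_probs = real_df["education"].value_counts(normalize=True)

# Draw random samples from the fitted distribution to create synthetic data.
n_synthetic = 1000
synthetic_education = np.random.choice(
    category_probs.index,
    size=n_synthetic,
    p=category_probs.values,
)

synthetic_df = pd.DataFrame({"education": synthetic_education})
print(synthetic_df["education"].value_counts(normalize=True))
```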

Generation of independent categorical features with representative distributions.

Figure 2 shows a visualisation verifying that the synthetic categorical data matches the distributions of the real-world observations.

Figure 2. Comparison of real and synthetic categorical data.

This approach to generating categorical data works well for low-cardinality, non-personally identifiable data such as level of education or insurance policy type. For identifiable categorical data with high cardinality, such as home addresses or names, it is better to generate entirely novel data; this avoids privacy issues and the generation of repeated rows when scaling up synthetic datasets. To generate entirely novel categorical data, we used the Python package Faker [5], which features many generators for creating fake data such as addresses, names and emails. The generators are parametrised, allowing for the creation of the required data content such as UK-based addresses or French male names. An example of generating fake addresses is shown below.
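A minimal sketch of this with Faker, assuming a UK locale:

```python
from faker import Faker

# Use a locale-specific provider to control the content of the fake data,
# e.g. UK-formatted addresses.
fake = Faker("en_GB")

# Generate a batch of entirely novel addresses and names.
synthetic_addresses = [fake.address() for _ in range(5)]
synthetic_names = [fake.name() for _ in range(5)]

for address in synthetic_addresses:
    print(address)
    print("---")
```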

Generation of novel categorical data.

For numerical data, a probability distribution can be fitted based on an understanding of the original data and a goodness-of-fit test. Multiple distributions can be fitted using the maximum likelihood approach; a goodness-of-fit score can then be computed for each to identify the best match. An example of this using the Python package statsmodels [6] is shown below.
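The original post used statsmodels for this step; the sketch below illustrates the same idea with scipy.stats instead, fitting several candidate distributions by maximum likelihood and scoring them with a Kolmogorov–Smirnov test. The gamma-distributed input is a hypothetical placeholder for a real numerical feature.

```python
import numpy as np
from scipy import stats

# Hypothetical positive-valued feature, e.g. vehicle value or annual premium.
rng = np.random.default_rng(42)
real_values = rng.gamma(shape=2.0, scale=5000.0, size=2000)

# Fit several candidate distributions by maximum likelihood and score each
# fit with a Kolmogorov-Smirnov goodness-of-fit test.
candidates = {"gamma": stats.gamma, "lognorm": stats.lognorm, "expon": stats.expon}
fits = {}
for name, dist in candidates.items():
    params = dist.fit(real_values)
    ks_stat, _ = stats.kstest(real_values, name, args=params)
    fits[name] = (ks_stat, params)

# Keep the distribution with the lowest KS statistic (best match).
best_name, (best_stat, best_params) = min(fits.items(), key=lambda kv: kv[1][0])

# Sample from the fitted distribution to generate a synthetic feature.
synthetic_values = candidates[best_name].rvs(*best_params, size=2000)
print(best_name, best_stat)
```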

Generation of independent numerical data based on reference dataset.

As with the categorical data, once the distribution has been modelled, a sample can be drawn to generate a new synthetic measure with a representative distribution.

To create synthetic data which mimics both the distributions and the correlations between features, a more sophisticated method, one which does not consider features in isolation, must be employed. One approach to preserving feature correlations is to fit a Gaussian copula, a multivariate probability distribution, to the data [7]. A copula can be fitted across multiple features and then sampled to generate features that mimic both the distributions and the correlations of the original data.

The Synthetic Data Vault is a Python package that brings together multiple techniques for generating correlated synthetic data, including Gaussian copula and GAN-based generators [8, 9]. The authors of the package have demonstrated that models trained on entirely synthetic datasets can achieve comparable performance to models trained on the original real-world data [10]. The example below demonstrates how to generate synthetic data by fitting a Gaussian copula to the original dataset.
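A minimal sketch using the SDV tabular API available around the time of writing (sdv < 1.0; later versions renamed these classes), with a small hypothetical dataframe standing in for the real reference data:

```python
import pandas as pd
from sdv.tabular import GaussianCopula  # SDV 0.x API; renamed in SDV 1.x

# Hypothetical reference data with mixed numerical and categorical features.
real_df = pd.DataFrame({
    "driver_age": [23, 45, 37, 52, 29, 61, 41, 34],
    "vehicle_value": [8000, 22000, 15000, 30000, 9500, 27000, 18000, 12500],
    "policy_type": ["basic", "premium", "basic", "premium",
                    "basic", "premium", "standard", "standard"],
})

# Fit a Gaussian copula across all features so that both the marginal
# distributions and the correlations between features are modelled.
model = GaussianCopula()
model.fit(real_df)

# Sample new rows; correlations from the original data carry over.
synthetic_df = model.sample(num_rows=1000)
print(synthetic_df.head())
```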

Generation of correlated synthetic features by modelling the data with a Gaussian copula.

Figure 3 shows a correlation plot of the original data and the synthetic data which demonstrates the original correlations preserved in the synthetic dataset. Additionally, a comparison of the real and synthetic distributions is shown in Figure 4.

Figure 3. Comparison of correlations in the original and synthetic data generated from a Gaussian copula fit.
Figure 4. Comparison of distributions in the original and synthetic data generated from a Gaussian copula fit.

For the synthetic data in our insurance demo, we chose to create independent features which matched the schema and distributions of the real features found in the American automobile data. Because the American dataset lacked our desired modelling targets, we had a final step to complete: generating synthetic targets for modelling. Our requirements for these targets were that they follow the distributions observed in the Norwegian dataset and have some correlation with our synthetic features.

To create these features, we followed a similar approach to building a generalised linear model. First, we designed a linear formula and then used a link function to map the synthetic features to the mean of the target. Finally, we sampled from an appropriate distribution to provide noise. For the claim frequency, we used a Poisson distribution with a log link function. An example of how this was achieved practically is shown below:
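A minimal sketch of this approach, with hypothetical feature names and hand-picked coefficients standing in for the real design:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical synthetic features, as produced by the earlier steps.
n = 10_000
features = pd.DataFrame({
    "driver_age": rng.integers(18, 80, size=n),
    "vehicle_value": rng.gamma(2.0, 5000.0, size=n),
    "urban": rng.integers(0, 2, size=n),
})

# Design a linear formula over the features (coefficients chosen by hand to
# give a plausible signal), then apply the inverse of a log link to map the
# linear predictor to the mean claim frequency.
linear_predictor = (
    -2.0
    - 0.02 * (features["driver_age"] - 40)
    + 0.00002 * features["vehicle_value"]
    + 0.5 * features["urban"]
)
expected_frequency = np.exp(linear_predictor)  # inverse of the log link

# Add noise by sampling from a Poisson distribution with that mean, giving a
# claim-count target that is correlated with the synthetic features.
features["claim_count"] = rng.poisson(expected_frequency)
print(features["claim_count"].value_counts().sort_index())
```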

Generation of a synthetic target for the modelling demo.
Figure 5. Comparison of distributions for the original and synthetic target features.

Conclusions

In this post, I have described how a lack of suitable, publicly available insurance data led us to generate synthetic data to complete an insurance modelling demo. I have described the key areas to consider when generating synthetic features and the tools available for doing so. Finally, I have discussed the generation of novel modelling targets which follow realistic distributions and correlate with the synthetic features. The development of machine learning methods and systems is tightly bound to the availability of datasets, which ensures that synthetic data generation will continue to be an important area of applied machine learning work.
