Insurance AI: A Hybrid Approach Towards Feature Selection

Saama Analytics
May 4, 2020

AI is one of the most rapidly growing and promising technologies. It has been swiftly adopted by various industries and is actively used to develop services like driverless transportation, image recognition, and pattern identification. Insurance companies have also started using AI for various purposes; however, it is not yet applied to some of the core insurance processes like product underwriting, claims adjudication, customer retention, and fraud detection. Applying AI solutions in the insurance industry poses several challenges, and one of the most prominent is feature selection for AI/ML algorithms.

Insurance companies deal with some of the most complex data, including personal details (date of birth, name, gender, and address), vehicle details, financial details, and other sensitive information. Even though such standardized and clean data is available, AI-based solutions are still rarely developed and applied.

Challenges in Application of AI Solutions: When Too Much Data Becomes a Problem
After multiple iterations and discussions with insurance experts, I tried to identify the reason and concluded that the problem is simple: there is just too much data. For example, an insurance policy system has over 5,000 data points, including structured and unstructured data, and a large number of date-time fields and IDs. In its raw state, none of this data is directly usable for machine learning or any other AI application.

For any machine learning (ML) or AI-based algorithm to work, the key is feature selection, which involves identifying attributes that drive business insights. But the problem with feature selection is the sheer number of input variables available for processing.

Let’s take an example. In a single policy, there can be multiple vehicles, drivers, and insureds. Each driver has upwards of 20 rating variables like age, gender, marital status, driving history, limits, and coverages, among other parameters. Creating an AI-consumable data set therefore becomes a challenge, so much so that flattening the data for just a single policy can lead to a column count in the thousands.
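To make the flattening problem concrete, here is a minimal sketch in plain Python. The policy structure, field names, and values are hypothetical and chosen only for illustration; a real policy system will look different.

```python
def flatten_policy(policy):
    """Flatten one nested policy record into a single wide row (a dict)."""
    row = {"policy_id": policy["policy_id"]}
    # Every repeated entity (driver, vehicle) adds one column per attribute.
    for entity in ("drivers", "vehicles"):
        for i, record in enumerate(policy.get(entity, []), start=1):
            for attr, value in record.items():
                # entity[:-1] turns "drivers" into "driver", "vehicles" into "vehicle".
                row[f"{entity[:-1]}_{i}_{attr}"] = value
    return row


# Hypothetical policy with 2 drivers and 2 vehicles (all values are made up).
policy = {
    "policy_id": "P-001",
    "drivers": [
        {"age": 42, "gender": "F", "marital_status": "married"},
        {"age": 19, "gender": "M", "marital_status": "single"},
    ],
    "vehicles": [
        {"make": "Toyota", "year": 2015},
        {"make": "Ford", "year": 2019},
    ],
}

row = flatten_policy(policy)
print(len(row))  # 1 + 2*3 + 2*2 = 11 columns
```

Even this toy record produces 11 columns. With 20 rating variables per driver, four drivers alone would contribute 80 columns before vehicles, insureds, and coverages are counted.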

Solving the Feature Selection Problem: The Hybrid Approach
To solve this problem, we decided on a hybrid approach that helps in feature selection. The approach combines SME (subject matter expert) knowledge with data statistics and ML to generate features that can tackle product underwriting, claims adjudication, customer retention, fraud detection, and other similar related challenges.

Step 1: The SME decides which tables or columns hold good data and can help the machine generate the desired output.

This step usually reduces the data set by about 60%. For example, it can bring the number of columns down from 5,000+ to around 2,000.

Step 2: The data itself decides which columns or attributes are essential.

This assessment is carried out by profiling each column on parameters such as uniqueness and completeness, among others.

This step reduces the number of columns by another 50%. Continuing the example from the previous step, the 2,000 columns are reduced to around 1,000.
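The profiling in Step 2 could look something like the sketch below. It uses plain Python and a toy table; the threshold values and column names are assumptions for illustration, not figures from our project.

```python
def profile_column(values):
    """Return completeness, uniqueness, and distinct count for one column."""
    present = [v for v in values if v is not None and v != ""]
    distinct = len(set(present))
    return {
        # Share of rows that actually hold a value.
        "completeness": len(present) / len(values) if values else 0.0,
        # Share of distinct values among the non-missing ones.
        "uniqueness": distinct / len(present) if present else 0.0,
        "distinct": distinct,
    }


def select_columns(table, min_completeness=0.7, max_uniqueness=0.95):
    """Keep columns that are mostly filled, not constant, and not ID-like.

    The thresholds are illustrative assumptions and need tuning per data set.
    """
    kept = []
    for name, values in table.items():
        stats = profile_column(values)
        if (stats["completeness"] >= min_completeness
                and stats["uniqueness"] <= max_uniqueness
                and stats["distinct"] > 1):
            kept.append(name)
    return kept


# Hypothetical toy table: column name -> list of values.
table = {
    "policy_id": ["P1", "P2", "P3", "P4"],    # unique per row, ID-like
    "driver_age": [42, 19, 42, 35],           # informative
    "legacy_code": [None, None, None, "X"],   # mostly missing
    "country": ["US", "US", "US", "US"],      # constant
}
print(select_columns(table))  # ['driver_age']
```

A simple uniqueness cutoff like this would also drop continuous numeric columns whose values rarely repeat, so a real profiler would likely apply it only to identifier-like or categorical fields.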

The number of remaining attributes is still quite large for any deep learning model to be deployed unless you have access to extremely high computing power for training. To resolve this problem, we now introduce the third step.

Step 3: ML decides the feature selection.

Typically, the insurance business is driven by a few core outcomes, such as customer lifetime value, written premium, and loss ratio. These outcomes are derived from all the data collected across various systems.

We ran multiple predictive models (Random Forest, GBM) on the entire data set, using each of these variables as the outcome. This process generated multiple result sets, each with feature importances from a different model. Aggregating these feature importances helped us identify the top variables that can be used to train deep learning models for any outcome.

This step helps us reduce our data set by almost 80%. Assuming we applied this step to 1,000 columns, we are left with about 200 columns.
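One way to aggregate feature importances across models is sketched below. The averaging scheme (mean of normalized importances) is an assumption chosen for illustration, and the importance values are made up. In practice they might come from, for example, the `feature_importances_` attribute of scikit-learn's tree-based models, one result set per model and outcome.

```python
def normalize(importances):
    """Scale one run's importances so they sum to 1."""
    total = sum(importances.values())
    return {feature: score / total for feature, score in importances.items()}


def aggregate_importances(runs):
    """Average normalized importances over all (model, outcome) runs.

    A feature missing from a run counts as 0 for that run.
    """
    totals = {}
    for importances in runs:
        for feature, score in normalize(importances).items():
            totals[feature] = totals.get(feature, 0.0) + score
    return {feature: s / len(runs) for feature, s in totals.items()}


def top_features(runs, keep_fraction=0.2):
    """Keep the top share of features ranked by mean importance."""
    mean_scores = aggregate_importances(runs)
    ranked = sorted(mean_scores, key=mean_scores.get, reverse=True)
    keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:keep]


# Made-up importances from three hypothetical runs, e.g. a Random Forest and
# a GBM on different outcomes. Scales differ, hence the normalization.
runs = [
    {"driver_age": 0.5, "vehicle_year": 0.3, "zip_code": 0.1,
     "agent_id": 0.05, "phone_ext": 0.05},
    {"driver_age": 4.0, "vehicle_year": 3.0, "zip_code": 2.0,
     "agent_id": 0.5, "phone_ext": 0.5},
    {"driver_age": 0.3, "vehicle_year": 0.4, "zip_code": 0.2,
     "agent_id": 0.05, "phone_ext": 0.05},
]
print(top_features(runs, keep_fraction=0.4))  # ['driver_age', 'vehicle_year']
```

With `keep_fraction=0.2`, the same function would keep roughly 200 of 1,000 columns, matching the 80% reduction described above.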

As you can see, this process reduced the data set by about 96%, from 5,000+ columns to about 200, by extracting only the relevant features that machine learning can use to generate useful outcomes.

The hybrid approach uses an SME to give the data the right starting point and direction, after which the ML algorithm takes over. This approach is an excellent example of how AI and human intelligence can work together to solve the challenges an organization faces. It also shows the potential AI holds for the insurance industry and that the industry can move forward in embracing AI technologies.

Author: Varun Chutani
Varun Chutani is an Associate Director at Saama Analytics with more than 12 years of experience in the P&C insurance industry, working on product development, actuarial modelling and AI/ML solutions.
