Exploration of Deep Learning on Commercial Health Insurance Member Attrition

Piero Ferrante
CVS Health Tech Blog
Oct 3, 2022

By: Shuang Men and Yunshen Chen (Shuang and Yunshen are data scientists in the Analytics & Behavior Change department at CVS Health)

Introduction

Member retention plays a key role in marketing, sales, and financial planning and is one of the primary financial indicators for companies with a subscription-based business model. From a company's point of view, maintaining a healthy member retention rate is essential because acquiring new members or customers is often costlier than retaining current ones. This is especially true in the insurance industry, which has higher customer acquisition costs than many other industries, making it a solid strategy to keep clients happy and loyal¹. The insights gained from attrition prediction help companies focus on the members at highest risk of leaving². Given the complexity of health insurance data, deep learning is well suited to attrition prediction because it can perform much of the feature engineering on its own³. In this work, we explored deep learning methodologies for an attrition forecasting use case: we built a conventional Neural Network (NN), a Long Short-Term Memory (LSTM) network, and an LSTM/NN hybrid model to predict Aetna commercial health insurance membership attrition. To build intuition for model selection and the model-building process, we compared the performance and training times of the deep learning models with those of a Gradient Boosting Machine (GBM) model.

Data, Architecture, and Models

Data Description

In this work, classical machine learning and deep learning models were built at the member level on 100 thousand (K) records (125 MB in CSV format) and 1 million (MM) records (900 MB in CSV format). A total of 254 features were selected from Aetna internal databases. Of these, 98 are static features covering demographic, insurance plan, plan sponsor information, etc.; static features generally do not change from month to month. The remaining 156 are time-step features: 12 types of monthly records (such as the number of calls a member received) retrieved across 13 consecutive months (12 × 13 = 156). Training and validation sets were split at a 70/30 ratio in the model-building process, as in the sketch below.
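To illustrate how the two feature groups can be arranged for the models discussed later, here is a minimal pandas/NumPy sketch. The file name, column layout, label name, and month-major ordering are assumptions for illustration, not the actual internal pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical member-level extract: an "attrited" label, 98 static
# columns, then 156 monthly columns (13 months x 12 record types,
# assumed month-major: all 12 types for month 1, then month 2, ...).
df = pd.read_csv("members.csv")  # illustrative file name
y = df.pop("attrited").to_numpy()

X_static = df.iloc[:, :98].to_numpy(dtype="float32")
# Reshape the 156 flat monthly columns into (n_members, 13, 12) sequences.
X_seq = df.iloc[:, 98:254].to_numpy(dtype="float32").reshape(-1, 13, 12)

# 70/30 train/validation split, as used in the article.
Xs_tr, Xs_val, Xq_tr, Xq_val, y_tr, y_val = train_test_split(
    X_static, X_seq, y, test_size=0.3, random_state=42, stratify=y
)
```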

Architecture Comparison

Two computation platforms were used for model training; their hardware specifications are listed below:

1. CPU:

· Architecture: x86_64

· CPU(s): 72

· Thread(s) per core: 2

· Core(s) per socket: 18

· Socket(s): 2

· NUMA node(s): 2

· CPU MHz: 3399.884

2. GPU:

· GPU type: Tesla V100

· Product: GV100GL [Tesla V100 SXM2 32GB]

· Vendor: NVIDIA Corporation

· Version: a1

· Width: 64 bits

· Clock: 33MHz

These measurements are subject to variation because they were taken in a shared computing environment; the results should therefore be interpreted as directional rather than precise.
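Because device visibility can differ from the physical hardware in a shared environment, a quick sanity check before benchmarking is helpful. A minimal sketch, assuming TensorFlow as the training framework:

```python
import tensorflow as tf

# List the CPUs and GPUs visible to TensorFlow in this session; in a
# shared cluster the visible set may not match the full physical specs.
print(tf.config.list_physical_devices("CPU"))
print(tf.config.list_physical_devices("GPU"))
```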

Deep Learning Models

Over the past 10 years, deep learning, as a subset of machine learning, has become increasingly popular for AI problems. One reason is that deep learning has repeatedly demonstrated superior performance on a wide variety of tasks, including speech, natural language, vision, and game playing. In this project, we explored three deep learning model structures that are potentially suitable for our use case and expanded our understanding of deep learning methodology.

Neural networks are computing systems inspired by biological neural networks and designed to perform a variety of tasks on large quantities of data⁴. They are composed of node layers: an input layer, one or more hidden layers, and an output layer⁵, as shown in Figure 1 below. The conventional neural network model used here takes all features equally as input, without distinguishing any time sequence among them; a minimal code sketch follows the figure.

Figure 1. Scheme of a Conventional NN Model
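To make the structure concrete, here is a minimal Keras sketch of such a conventional NN. The layer widths and optimizer are illustrative assumptions, not the configuration used in this work.

```python
from tensorflow import keras

# Flat input: all 254 features (98 static + 156 monthly) treated equally.
inputs = keras.Input(shape=(254,))
x = keras.layers.Dense(128, activation="relu")(inputs)
x = keras.layers.Dense(64, activation="relu")(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)  # attrition propensity

nn_model = keras.Model(inputs, outputs)
nn_model.compile(optimizer="adam", loss="binary_crossentropy",
                 metrics=[keras.metrics.AUC(name="auc")])
```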

LSTM is an advanced Recurrent Neural Network (RNN) architecture designed to model chronological sequences and their long-range dependencies more precisely than conventional RNNs or NNs⁶. The model takes time-step data as input and processes it step by step with dependencies, as shown in Figure 2 below, allowing it to retain important information from every step so that both long-term and short-term patterns are captured; a code sketch follows the figure.

Figure 2. Scheme of a LSTM Model
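A corresponding minimal sketch of the LSTM model, assuming the monthly features are shaped as 13 timesteps of 12 values each (the unit count is an assumption):

```python
from tensorflow import keras

# Sequence input: 13 monthly timesteps, 12 record types per step.
seq_inputs = keras.Input(shape=(13, 12))
h = keras.layers.LSTM(32)(seq_inputs)  # final hidden state summarizes the sequence
outputs = keras.layers.Dense(1, activation="sigmoid")(h)

lstm_model = keras.Model(seq_inputs, outputs)
lstm_model.compile(optimizer="adam", loss="binary_crossentropy",
                   metrics=[keras.metrics.AUC(name="auc")])
```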

The LSTM/NN hybrid model combines an LSTM and a conventional NN so that static features and time-step features are processed in separate branches of the model, as shown in Figure 3, which may achieve better performance than a standard LSTM or conventional NN alone. Such customized multi-branch structures can now be implemented easily with open-source Python packages such as TensorFlow, Keras, or PyTorch; a sketch follows the figure below.

Figure 3. Scheme of a LSTM/NN hybrid model
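The two-branch structure maps directly onto the Keras functional API. The sketch below is illustrative (branch widths are assumptions): an LSTM branch consumes the 13 × 12 sequences, a dense branch consumes the 98 static features, and the two are concatenated before the output layer.

```python
from tensorflow import keras

# Branch 1: LSTM over the 13-month sequences of 12 monthly record types.
seq_in = keras.Input(shape=(13, 12), name="timestep_features")
seq_vec = keras.layers.LSTM(32)(seq_in)

# Branch 2: dense layer over the 98 static features.
static_in = keras.Input(shape=(98,), name="static_features")
static_vec = keras.layers.Dense(64, activation="relu")(static_in)

# Merge the branches and predict attrition propensity.
merged = keras.layers.Concatenate()([seq_vec, static_vec])
merged = keras.layers.Dense(32, activation="relu")(merged)
out = keras.layers.Dense(1, activation="sigmoid")(merged)

hybrid_model = keras.Model([seq_in, static_in], out)
hybrid_model.compile(optimizer="adam", loss="binary_crossentropy",
                     metrics=[keras.metrics.AUC(name="auc")])
```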

Model Performance and Training Time

Model Performance Comparison

In general, deep learning models perform better than classical machine learning models at larger data scales. Often, the best advice for improving a deep learning model is simply to use more data; with classical machine learning algorithms, this quick and easy fix doesn't work nearly as well, and more complex methods are often required⁷. To test this general understanding, we built the NN, LSTM, and LSTM/NN hybrid models and compared their performance with that of a GBM model at the 100 K and 1 MM member-record scales.

The model performance results, measured by Area Under the Receiver Operating Characteristic Curve (AUC), are listed in Table 1. At the 100 K member-record scale, GBM beat all of the tested deep learning models. Among the three deep learning models at this smaller scale, the most complex one, the LSTM/NN hybrid, performed worst. This can be explained by insufficient data to constrain the deep learning models, which likely causes overfitting⁸. At the 1 MM scale, both the GBM and the deep learning models performed better than at the 100 K scale, with the GBM and LSTM/NN hybrid models achieving equivalently the highest AUC. In terms of absolute AUC increase, the NN and LSTM/NN hybrid models gained more than the GBM model, demonstrating the positive impact of a larger training set on model performance, especially for models with more parameters and more complicated structures.

Table 1. Model performance: relative AUC on the validation set (GBM at 100 K used as the performance baseline)
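Since Table 1 reports AUC relative to the GBM baseline at 100 K, the bookkeeping reduces to a ratio of validation-set AUCs. A small sketch using scikit-learn (the helper name and inputs are illustrative):

```python
from sklearn.metrics import roc_auc_score

def relative_auc(y_true, y_score, baseline_auc):
    """Validation AUC divided by the 100 K GBM baseline AUC, as in Table 1."""
    return roc_auc_score(y_true, y_score) / baseline_auc
```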

Training Time on Different Architectures

To understand the computational capability of different model training platforms, we ran the deep learning training processes on both the CPU and GPU architectures. The training times on 1 MM member records are shown in Table 2. The GPU platform delivered the highest computation speed, as demonstrated by its shortest training time for the NN model; using a GPU significantly shortened training time for complex models, especially on large-scale data. GPUs are now considered central to deep learning, handling the extensive graphical and mathematical computations in parallel while freeing up CPU cycles for other jobs⁹.

Table 2. Model training time in minutes on 1 MM member records
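The numbers in Table 2 are simple wall-clock measurements around model training. A sketch of how such timings can be collected (the helper below is hypothetical, and results will vary on shared hardware):

```python
import time

def timed_fit(model, x, y, **fit_kwargs):
    """Fit a Keras model and report wall-clock training time in minutes."""
    start = time.perf_counter()
    history = model.fit(x, y, **fit_kwargs)
    print(f"training took {(time.perf_counter() - start) / 60:.1f} min")
    return history
```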

Summary and Next Steps

· NN, LSTM, hybrid LSTM/NN, and GBM models were built on million-scale member-level data to predict commercial health insurance member attrition propensity with desirable AUC values.

· With million-scale training data, the hybrid LSTM/NN outperformed the standard NN model, showing the benefit of customizing a deep learning model structure to fit different types of input features.

· The larger training set provided more of a performance increase for the deep learning models than for the GBM model, and GPU acceleration made training at that scale practical by significantly shortening training time.

· We will continue to explore member attrition modeling by leveraging unstructured data as input features, as well as other deep learning model structures and combinations of them.

· We will further test existing deep learning interpretation algorithms to better identify and understand the key driving factors in health insurance member attrition propensity prediction.

References:

1. Why retention is so important for insurance agents — Agentero

2. https://towardsdatascience.com/the-data-scientists-guide-to-subscription-businesses-70b1fc4b4493

3. https://becominghuman.ai/deep-learning-and-its-5-advantages-eaeee1f31c86

4. https://www.educba.com/what-is-neural-networks/

5. https://www.ibm.com/cloud/learn/neural-networks

6. https://www.geeksforgeeks.org/understanding-of-lstm-networks/

7. https://towardsdatascience.com/deep-learning-vs-classical-machine-learning-9a42c6d48aa

8. https://machinelearningmastery.com/impact-of-dataset-size-on-deep-learning-model-skill-and-performance-estimates/

9. https://medium.com/@shachishah.ce/do-we-really-need-gpu-for-deep-learning-47042c02efe2
