Health Insurance Plan Prediction

11 min readJan 12, 2023

In this article, I will develop an algorithm in Python for suggesting the most suitable health insurance plans in the user’s area.

Source: https://roundtmc.com/health-insurance-explained/

Predicting health insurance premiums is a complex task that requires careful consideration of many factors. Health insurance premiums can vary by state for several reasons, including differences in the overall cost of living, the state of the healthcare market, and state-specific regulations. Some states may have higher costs for healthcare services, which can drive up insurance premiums. Additionally, some states may have stricter regulations on insurance companies, which can also impact premiums. It is important to note that location may not be the only factor that influences the rate of insurance, also demographic data like age, number of dependents, occupation, and others.

In this article, I will outline the basic steps involved in creating an algorithm for suggesting the most suitable healthcare plans to the user, as well as some of the considerations that need to be taken into account when developing such an algorithm. I will replicate the process followed by an insurance company called eHealth.com using Python libraries. eHealth is a website that provides tools for consumers to compare and purchase health insurance plans. The site likely uses a variety of factors to predict the cost of health insurance for an individual, such as:

Zip Code
Are you a business owner with employees?
Gender
Age
Tobacco use
Family coverage

It is important to note that predictions made by ehealth.com are based on the information provided by the individual and the cost may differ once the application is reviewed by the insurance company.

To replicate the procedure, I will use two datasets for the project. The first one is from Kaggle called US Health Insurance Dataset (1) and the second one is from healthcare.gov called 2023 QHP Landscape Data (2).

US Health Insurance Dataset: This dataset will be used to predict premium charges per user based on similar input factors that eHealth uses.
QHP Landscape Data: This dataset will be used to suggest health insurance plans based on location, county, and category.

The project is divided into four stages as follows:

Data Preparation
Regression
Classification
Suggestion

1. Data Preparation

(1). Starting with dataset (1), as shown in the picture below, there are seven columns in the dataset. Out of these, age, sex, children, and smoker match the input features used by eHealth. For this project, I will use charges as the target variable for regression.

#Importing first dataset
df=pd.read_csv('insurance.csv')
df

EDA After importing the dataset, I will check to see if the dataset contains any rows with NaN values. Furthermore, to make sure our machine learning model does not favor one category over the other, I will check to see if the categorical columns are balanced. Although not included in this article, this part is included in the code sections which are available in the GitHub repository.

To match the input features used by the website, I will add two columns to the dataset:

State: I will assign each row a random state based on the region they are in
Family: If children are present, I will assume the family will need coverage and add a binary column

#printing unique states
df['region'].unique()
Out: array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

#defining states per region
northwest=['OR','WA','ID','MT','WY']
northeast=['CT','ME','MA','NH','NJ','NY','PA','DC']
southeast=['AL','FL','GA','KY','MS','NC','SC','TN']
southwest=['AR','CO','LA','NM','ND','OK','SD','TX','UT']

#populating a new column based on state
regions=df['region']
state=[]
for i in regions:
    if i == 'southwest':
        state.append(random.choice(southwest))
    elif i == 'southeast':
        state.append(random.choice(southeast))
    elif i == 'northwest':
        state.append(random.choice(northwest))
    elif i == 'northeast':
        state.append(random.choice(northeast))
df['state']=state

#adding family column
temp=df['children']
family=[]
for i in temp:
    if i == 0:
        family.append('no')
    else:
        family.append('yes')
df['family']=family

Since I have added a state column, I do not need the region column anymore. I will also remove the bmi column as the algorithm needs to predict health insurance plans based on questions that do not require any form of medical test.

#dropping bmi and region column
df=df.drop(columns='bmi')
df=df.drop(columns='region')

After the data transformation is completed, the final shape of dataset (1) turns out to be:

(2). I will now import dataset (2)

#importing second dataset
df1=pd.read_excel('Individual_Market_Medical.xlsx',header=1)
df1.head()

Since dataset (2) is made up of 109880 rows × 148 columns, I will use the .head() function by Pandas to paste a snapshot of the dataset in this article.

Most of these columns are not required for our project, so I will filter and keep only relevant columns:

State Code
FIPS County Code
Metal Level (Label)
Issue Name
Plan Marketing Name

Furthermore, I will change the plan ‘Expanded Bronze’ to ‘Bronze’. Why? This question will be answered in the next part of the data preparation stage. The resulting shape of the data frame is changed to 109880 rows × 5 columns. Now, I will put this data frame aside for now and come back to it towards the end of the project.

#filtering out relevant columns
df1=df1[['State Code','FIPS County Code','Metal Level','Issuer Name','Plan Marketing Name']]
#renaming value from expanded bronze to bronze
df1['Metal Level'] = df1['Metal Level'].replace(['Expanded Bronze'], 'Bronze')

After studying both datasets, I have decided to use dataset (1) to perform regression and predict the expected medical charges for a user. For dataset (2), I have State Code, FIPS County Code (which I will populate later), to match with dataset (1). However, I still need to compare charges with Metal Level.

In order to match entries in dataset (1) with (2), I need to add a Metal Level category column in dataset (1). To do this, I will generate labels for each state by extracting all the tuples for that state and subtracting the highest value from the lowest. This will give me a range for each state which I will divide into three categories as defined below:

Bronze: 0–35%
Silver: 35–70%
Gold: 70–100%

The results, along with state and charges will be stored in another data frame named: df_classifcation

#creating an empty array of size -> df
labels=[None] * df.shape[0]
states=df['state'].unique()

#for each unique state the loop will run
for state in states:
    #creating a temporary list that includes all the charges present in each state
    value=df[df['state']==state]['charges'].values
    #sort that list in ascending order
    value.sort()
    length=len(value)
    #generating boundaries for each label
    lowest=round(length*.35)
    bronze=value[0:lowest]
    middle=round(length*.70)
    silver=value[lowest:middle]
    gold=value[middle:]
    #getting the id of each row 
    idx=df.index[df['state']==state]
    #the loop will run for each id present for a specific state
    for i in idx:
        #using the id, i am extracting the charges
        charge=df['charges'].values[i]
        #comparing charges to the label ranges to see which range it falls to
        #storing that label in the same index as 'df' in our pre-defined list-> labels
        if charge<=bronze[-1]:
            labels[i]='Bronze'
        elif (charge>bronze[-1]) and (charge<=silver[-1]):
            labels[i]='Silver'
        elif charge>silver[-1]:
            labels[i]='Gold'
        else:
            print('---')
            
#generating a new dataframe
df_classification=pd.DataFrame(labels,columns=['label'])
df_classification['state']=df['state']
df_classification['charges']=df['charges']
df_classification

2. Regression

In this section, I will build and train regression models to predict charges for the testing dataset. First, I will split the data into a training/testing ratio of 66/34 and scale the target variable (charges) so the resulting measures (MAE,MSE,RMSE,R2) would fall in the range between 0–1.

#splitting into train and test
x=df.drop(columns=['charges'])
y=df['charges']
y=np.array(y).reshape(-1,1)
#scaling the y column -> target variable 'charges' to get results in the range 0-1
scaler = StandardScaler()
scaler.fit(y)
y=scaler.transform(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.34, random_state=30)


#label encoding the training variables
labelenc = preprocessing.LabelEncoder()
x['sex']= labelenc.fit_transform(x['sex'])
x['smoker']= labelenc.fit_transform(x['smoker'])
x['state']= labelenc.fit_transform(x['state'])
x['family']= labelenc.fit_transform(x['family'])

#splitting into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.34, random_state=30)

Now, I will apply four regression models using Scikit-Learn’s built-in functions to see which model performs the best on our dataset. The four regression techniques I will be using are:

Linear Regression
2nd Degree Polynomial Regression
Decision Tree Regression
Random Forest Regression

#linear regression
reg = LinearRegression().fit(x_train, y_train)
reg.score(x_train, y_train)
preds=reg.predict(x_test)
results.append([mean_squared_error(y_test,preds),median_absolute_error(y_test,preds),
               np.sqrt(mean_squared_error(y_test,preds)),r2_score(y_test, preds)])

#polynomial
polynomial_features= PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x_train)
x_poly_test=polynomial_features.fit_transform(x_test)
model = LinearRegression()
model.fit(x_poly, y_train)
preds=model.predict(x_poly_test)
results.append([mean_squared_error(y_test,preds),median_absolute_error(y_test,preds),
               np.sqrt(mean_squared_error(y_test,preds)),r2_score(y_test, preds)])
#decision tree regressor
regressor = DecisionTreeRegressor()
regressor.fit(x_train, y_train)
preds=regressor.predict(x_test)
results.append([mean_squared_error(y_test,preds),median_absolute_error(y_test,preds),
               np.sqrt(mean_squared_error(y_test,preds)),r2_score(y_test, preds)])
#random forest regressor
regressor = RandomForestRegressor(max_depth=4, random_state=1)
regressor.fit(x_train, y_train)
preds=regressor.predict(x_test)
results.append([mean_squared_error(y_test,preds),median_absolute_error(y_test,preds),
               np.sqrt(mean_squared_error(y_test,preds)),r2_score(y_test, preds)])

#Results
#create data
data = [["Linear",results[0][0],results[0][1],results[0][2],results[0][3]],
       ["2nd Degree Polynomial",results[1][0],results[1][1],results[1][2],results[1][3]],
       ["Decision Tree",results[2][0],results[2][1],results[2][2],results[2][3]],
       ["Random Forest",results[3][0],results[3][1],results[3][2],results[3][3]]]

  
#define header names
col_names = ["Model","MSE","MAE","RMSE","R2"]
  
#display table
print(tabulate(data, headers=col_names))

Model                       MSE        MAE      RMSE        R2
---------------------  --------  ---------  --------  --------
Linear                 0.266212  0.152315   0.515957  0.747221
2nd Degree Polynomial  0.265491  0.142202   0.515259  0.747905
Decision Tree          0.563809  0.0475985  0.750872  0.46464
Random Forest          0.275982  0.143762   0.525339  0.737944

The results shown above conclude that apart from the decision tree, all the rest give almost similar results. So, moving on, I will perform the regression again on our ideal model (Linear), this time without scaling our target variable to get predicted charges.

#splitting into train and test
x=df.drop(columns=['charges'])
y=df['charges']

#label encoding
labelenc = preprocessing.LabelEncoder()
x['sex']= labelenc.fit_transform(x['sex'])
x['smoker']= labelenc.fit_transform(x['smoker'])
x['state']= labelenc.fit_transform(x['state'])
x['family']= labelenc.fit_transform(x['family'])

#splitting into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.34, random_state=30)


#random forest regressor
regressor = RandomForestRegressor(max_depth=3, random_state=0)
regressor.fit(x_train, y_train)
preds=regressor.predict(x_test)

3. Classification

It is important to note that the website algorithm I am trying to replicate does not ask for charges, it automatically assigns the person labels based on their input factors. Since the dataset does not originally have labels, I generated these labels myself for each state. In order to see how well my charges are predicted, I will now use the ranges defined for each state in the data preparation part and use predicted charges to predict new labels. This will give me the final labels needed for each row, which I will use to filter out health insurance plans while also acting as a validation technique for my regression model.

##### predicting new labels
#adding the predicted charged to my result dataframe -> df_classification
idx=x_test.index
#keeping only the rows present in the testdataset
df_classification=df_classification.loc[idx]
df_classification['charges preds']=preds

#reseting index 
df_classification=df_classification.reset_index(inplace=False,drop=True)
#making an empty list of size df_classification
preds_labels=[None] * df_classification.shape[0]
#getting unique states
states=df['state'].unique()
#repeat the same process that was done earlier, this time on the predicted charges
#run for loop for each state
for state in states:
    #getting all the charges present in the complete dataframe
    value=df[df['state']==state]['charges'].values
    #sorting values in ascending order
    value.sort()
    length=len(value)
    #genrating the same ranges that were generated before
    lowest=round(length*.35)
    bronze=value[0:lowest]
    middle=round(length*.70)
    silver=value[lowest:middle]
    gold=value[middle:]
    #extracting the columns of the state that is being processed from my test dataframe
    idx=df_classification.index[df_classification['state']==state]
    #loop will run for each column
    for i in idx:
        #identify the predicted charge and assign a label
        charge=df_classification['charges preds'].values[i]
        if charge<=bronze[-1]:
            preds_labels[i]='Bronze'
        elif (charge>bronze[-1]) and (charge<=silver[-1]):
            preds_labels[i]='Silver'
        elif charge>silver[-1]:
            preds_labels[i]='Gold'
        else:
            print(state)
            print(charge)
            print(bronze[-1])
            print(silver[0],silver[-1])
            print(gold[0])
            print('---')
df_classification['labels preds']=preds_labels

#results for estimating the classification part
#confusion matrix
y_test=df_classification['label'].values
preds=df_classification['labels preds'].values
cm=confusion_matrix(y_test, preds,labels=["Bronze", "Silver", "Gold"])
cm_df = pd.DataFrame(cm,
                     index = ['Bronze','Silver','Gold'], 
                     columns = ['Bronze','Silver','Gold'])
#Plotting the confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(cm_df, annot=True,cmap="crest",fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actal Values')
plt.xlabel('Predicted Values')
plt.show()

To see how correctly our labels are predicted, I generated a confusion matrix followed by a detailed analysis of how each label performed.

Bronze: 
 TP: 116 | FN: 38 | FP: 18 | TN: 283
Accuracy: 0.8769230769230769
TPR: 0.7532467532467533 | TNR: 0.9401993355481728
Silver: 
 TP: 119 | FN: 49 | FP: 49 | TN: 238
Accuracy: 0.7846153846153846
TPR: 0.7083333333333334 | TNR: 0.8292682926829268
Gold: 
 TP: 109 | FN: 24 | FP: 44 | TN: 278
Accuracy: 0.8505494505494505
TPR: 0.8195488721804511 | TNR: 0.8633540372670807

After running multiple iterations I can conclude:

Silver has the lowest accuracy each time
TPR is weaker than TNR

4. Suggestion

After calculating the predicted labels, I will now generate a final data frame for clarity and use that to match health insurance plans with my dataset (2). I will now remove the original charges and labels from the resulting data frame, df_classification, and add the input features to give the resulting data frame a clear image.

#Adding the feature column to my final dataframe -> df_classification
#This acts as a database, or final testing feature of my project
x_test=x_test.reset_index(inplace=False,drop=True)
df_classification['age']=x_test['age']
df_classification['sex']=x_test['sex']
df_classification['children']=x_test['children']
df_classification['smoker']=x_test['smoker']
#removing the original charges and labels column because we only need the predicted ones
df_classification=df_classification.drop(columns=['label','charges'])
df_classification

To filter out relevant health insurance plans for each user, I will further add a random county code to each test subject based on their state. This is being done because the original dataset did not have a zip code column, and having a zip code further narrows down the results and makes the suggestions more accurate.

Represented below is a random test case extracted from our final dataset df_classification.

#Getting a test case
test_case=df_classification.iloc[404:405,:]
print(test_case)

if test_case.values[0][0] in df1['State Code'].unique():
    #Getting state and class label of that test case
    test_state=test_case['state'].values[0]
    test_label=test_case['labels preds'].values[0]
    #making dataframe to filter out that state and label from the plan dataframe
    result=df1[(df1['State Code']==test_state) & (df1['Metal Level']==test_label)]
    #in order to remove computation time, I will randomly allocate the testcase a random
    # county code from that state
    temp=df1[df1['State Code']==test_state]['FIPS County Code'].unique()
    test_code=random.choice(temp)
    print("County Code allocated:", test_code)
    print('--------')
    result=result[result['FIPS County Code']==test_code]
    #getting unique issuers from that dataframe
    issuer=result['Issuer Name'].unique()
    #printing out all the issuers in that area and the plans
    for i in issuer:
        print("Plans by issuer:", i)
        print(result[result['Issuer Name']==i]['Plan Marketing Name'])
        print("")
else:
    print("This state is not present in the database")

state  charges preds labels preds  age  sex  children  smoker
404    FL    6060.813133       Bronze   30    1         2       0
County Code allocated: 12041
--------
Plans by issuer: Florida Blue (BlueCross BlueShield FL)
7070    BlueOptions Bronze 1419 ($0 Virtual Visits / R...
7073    BlueOptions Bronze 1416 ($0 Virtual Visits / 3...
7077    BlueOptions Bronze (HSA) 1705 (Rewards $$$ / $...
7079    BlueOptions Bronze 1707S ($0 Virtual Visits / ...
7081    BlueOptions Bronze 2119 ($0 Deductible / $50 P...
Name: Plan Marketing Name, dtype: object

Plans by issuer: Ambetter from Sunshine Health
7083                            Ambetter Essential Care 1
7084                        Ambetter Essential Care 2 HSA
7089                            Ambetter Essential Care 5
7090                           Ambetter Essential Care 22
7091    Ambetter Essential Care: $1,500 Medical Deduct...
7092       Ambetter Essential Care: $0 Medical Deductible
7098    Ambetter Essential Care 2 HSA + Vision + Adult...
7100    Ambetter Essential Care 1 + Vision + Adult Dental
7103    Ambetter Essential Care 5 + Vision + Adult Dental
7104    Ambetter Essential Care 22 + Vision + Adult De...
7105    Ambetter Essential Care: $1,500 Medical Deduct...
7106    Ambetter Essential Care: $0 Medical Deductible...
Name: Plan Marketing Name, dtype: object

Plans by issuer: Florida Blue HMO (a BlueCross BlueShield FL company)
7112    BlueCare Bronze 1486 ($0 Virtual Visits / Rewa...
7115    BlueCare Bronze 1483 ($0 Virtual Visits / 3 PC...
7119    BlueCare Bronze (HSA) 1765 (Rewards $$$ / $4 C...
7121    BlueCare Bronze 1767S ($0 Virtual Visits / $40...
7123    BlueCare Bronze 2179 ($0 Deductible / $50 PCP ...
Name: Plan Marketing Name, dtype: object

Discussion:

The scope of the project can be improved by applying the same algorithm to a dataset that instead of the region, already has state-specific and county/zipcode-specific rows. That will remove the process of randomly populating rows with these attributes. Also, if the second dataset had the price of each plan, the results could be filtered using the labels and the predicted prices to sort prices to further focus on the user requirements.

A Python Notebook is available on my GitHub if you would like to follow up on the code step-by-step, and offer insights, recommendations, and code improvement.

Health Insurance Plan Prediction

1. Data Preparation

2. Regression

3. Classification

4. Suggestion

Written by Taimoor Muhammad