Health Insurance Plan Prediction
In this article, I will develop an algorithm in Python for suggesting the most suitable health insurance plans in the user’s area.
Predicting health insurance premiums is a complex task that requires careful consideration of many factors. Health insurance premiums can vary by state for several reasons, including differences in the overall cost of living, the state of the healthcare market, and state-specific regulations. Some states may have higher costs for healthcare services, which can drive up insurance premiums. Additionally, some states may have stricter regulations on insurance companies, which can also impact premiums. It is important to note that location may not be the only factor that influences the rate of insurance, also demographic data like age, number of dependents, occupation, and others.
In this article, I will outline the basic steps involved in creating an algorithm for suggesting the most suitable healthcare plans to the user, as well as some of the considerations that need to be taken into account when developing such an algorithm. I will replicate the process followed by an insurance company called eHealth.com using Python libraries. eHealth is a website that provides tools for consumers to compare and purchase health insurance plans. The site likely uses a variety of factors to predict the cost of health insurance for an individual, such as:
- Zip Code
- Are you a business owner with employees?
- Gender
- Age
- Tobacco use
- Family coverage
It is important to note that predictions made by ehealth.com are based on the information provided by the individual and the cost may differ once the application is reviewed by the insurance company.
To replicate the procedure, I will use two datasets for the project. The first one is from Kaggle called US Health Insurance Dataset (1) and the second one is from healthcare.gov called 2023 QHP Landscape Data (2).
- US Health Insurance Dataset: This dataset will be used to predict premium charges per user based on similar input factors that eHealth uses.
- QHP Landscape Data: This dataset will be used to suggest health insurance plans based on location, county, and category.
The project is divided into four stages as follows:
- Data Preparation
- Regression
- Classification
- Suggestion
1. Data Preparation
(1). Starting with dataset (1), as shown in the picture below, there are seven columns in the dataset. Out of these, age, sex, children, and smoker match the input features used by eHealth. For this project, I will use charges as the target variable for regression.
#Importing first dataset
df=pd.read_csv('insurance.csv')
df
EDA After importing the dataset, I will check to see if the dataset contains any rows with NaN values. Furthermore, to make sure our machine learning model does not favor one category over the other, I will check to see if the categorical columns are balanced. Although not included in this article, this part is included in the code sections which are available in the GitHub repository.
To match the input features used by the website, I will add two columns to the dataset:
- State: I will assign each row a random state based on the region they are in
- Family: If children are present, I will assume the family will need coverage and add a binary column
#printing unique states
df['region'].unique()
Out: array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)
#defining states per region
northwest=['OR','WA','ID','MT','WY']
northeast=['CT','ME','MA','NH','NJ','NY','PA','DC']
southeast=['AL','FL','GA','KY','MS','NC','SC','TN']
southwest=['AR','CO','LA','NM','ND','OK','SD','TX','UT']
#populating a new column based on state
regions=df['region']
state=[]
for i in regions:
if i == 'southwest':
state.append(random.choice(southwest))
elif i == 'southeast':
state.append(random.choice(southeast))
elif i == 'northwest':
state.append(random.choice(northwest))
elif i == 'northeast':
state.append(random.choice(northeast))
df['state']=state
#adding family column
temp=df['children']
family=[]
for i in temp:
if i == 0:
family.append('no')
else:
family.append('yes')
df['family']=family
Since I have added a state column, I do not need the region column anymore. I will also remove the bmi column as the algorithm needs to predict health insurance plans based on questions that do not require any form of medical test.
#dropping bmi and region column
df=df.drop(columns='bmi')
df=df.drop(columns='region')
After the data transformation is completed, the final shape of dataset (1) turns out to be:
(2). I will now import dataset (2)
#importing second dataset
df1=pd.read_excel('Individual_Market_Medical.xlsx',header=1)
df1.head()
Since dataset (2) is made up of 109880 rows × 148 columns, I will use the .head() function by Pandas to paste a snapshot of the dataset in this article.
Most of these columns are not required for our project, so I will filter and keep only relevant columns:
- State Code
- FIPS County Code
- Metal Level (Label)
- Issue Name
- Plan Marketing Name
Furthermore, I will change the plan ‘Expanded Bronze’ to ‘Bronze’. Why? This question will be answered in the next part of the data preparation stage. The resulting shape of the data frame is changed to 109880 rows × 5 columns. Now, I will put this data frame aside for now and come back to it towards the end of the project.
#filtering out relevant columns
df1=df1[['State Code','FIPS County Code','Metal Level','Issuer Name','Plan Marketing Name']]
#renaming value from expanded bronze to bronze
df1['Metal Level'] = df1['Metal Level'].replace(['Expanded Bronze'], 'Bronze')
After studying both datasets, I have decided to use dataset (1) to perform regression and predict the expected medical charges for a user. For dataset (2), I have State Code, FIPS County Code (which I will populate later), to match with dataset (1). However, I still need to compare charges with Metal Level.
In order to match entries in dataset (1) with (2), I need to add a Metal Level category column in dataset (1). To do this, I will generate labels for each state by extracting all the tuples for that state and subtracting the highest value from the lowest. This will give me a range for each state which I will divide into three categories as defined below:
- Bronze: 0–35%
- Silver: 35–70%
- Gold: 70–100%
The results, along with state and charges will be stored in another data frame named: df_classifcation
#creating an empty array of size -> df
labels=[None] * df.shape[0]
states=df['state'].unique()
#for each unique state the loop will run
for state in states:
#creating a temporary list that includes all the charges present in each state
value=df[df['state']==state]['charges'].values
#sort that list in ascending order
value.sort()
length=len(value)
#generating boundaries for each label
lowest=round(length*.35)
bronze=value[0:lowest]
middle=round(length*.70)
silver=value[lowest:middle]
gold=value[middle:]
#getting the id of each row
idx=df.index[df['state']==state]
#the loop will run for each id present for a specific state
for i in idx:
#using the id, i am extracting the charges
charge=df['charges'].values[i]
#comparing charges to the label ranges to see which range it falls to
#storing that label in the same index as 'df' in our pre-defined list-> labels
if charge<=bronze[-1]:
labels[i]='Bronze'
elif (charge>bronze[-1]) and (charge<=silver[-1]):
labels[i]='Silver'
elif charge>silver[-1]:
labels[i]='Gold'
else:
print('---')
#generating a new dataframe
df_classification=pd.DataFrame(labels,columns=['label'])
df_classification['state']=df['state']
df_classification['charges']=df['charges']
df_classification
2. Regression
In this section, I will build and train regression models to predict charges for the testing dataset. First, I will split the data into a training/testing ratio of 66/34 and scale the target variable (charges) so the resulting measures (MAE,MSE,RMSE,R2) would fall in the range between 0–1.
#splitting into train and test
x=df.drop(columns=['charges'])
y=df['charges']
y=np.array(y).reshape(-1,1)
#scaling the y column -> target variable 'charges' to get results in the range 0-1
scaler = StandardScaler()
scaler.fit(y)
y=scaler.transform(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.34, random_state=30)
#label encoding the training variables
labelenc = preprocessing.LabelEncoder()
x['sex']= labelenc.fit_transform(x['sex'])
x['smoker']= labelenc.fit_transform(x['smoker'])
x['state']= labelenc.fit_transform(x['state'])
x['family']= labelenc.fit_transform(x['family'])
#splitting into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.34, random_state=30)
Now, I will apply four regression models using Scikit-Learn’s built-in functions to see which model performs the best on our dataset. The four regression techniques I will be using are:
- Linear Regression
- 2nd Degree Polynomial Regression
- Decision Tree Regression
- Random Forest Regression
#linear regression
reg = LinearRegression().fit(x_train, y_train)
reg.score(x_train, y_train)
preds=reg.predict(x_test)
results.append([mean_squared_error(y_test,preds),median_absolute_error(y_test,preds),
np.sqrt(mean_squared_error(y_test,preds)),r2_score(y_test, preds)])
#polynomial
polynomial_features= PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x_train)
x_poly_test=polynomial_features.fit_transform(x_test)
model = LinearRegression()
model.fit(x_poly, y_train)
preds=model.predict(x_poly_test)
results.append([mean_squared_error(y_test,preds),median_absolute_error(y_test,preds),
np.sqrt(mean_squared_error(y_test,preds)),r2_score(y_test, preds)])
#decision tree regressor
regressor = DecisionTreeRegressor()
regressor.fit(x_train, y_train)
preds=regressor.predict(x_test)
results.append([mean_squared_error(y_test,preds),median_absolute_error(y_test,preds),
np.sqrt(mean_squared_error(y_test,preds)),r2_score(y_test, preds)])
#random forest regressor
regressor = RandomForestRegressor(max_depth=4, random_state=1)
regressor.fit(x_train, y_train)
preds=regressor.predict(x_test)
results.append([mean_squared_error(y_test,preds),median_absolute_error(y_test,preds),
np.sqrt(mean_squared_error(y_test,preds)),r2_score(y_test, preds)])
#Results
#create data
data = [["Linear",results[0][0],results[0][1],results[0][2],results[0][3]],
["2nd Degree Polynomial",results[1][0],results[1][1],results[1][2],results[1][3]],
["Decision Tree",results[2][0],results[2][1],results[2][2],results[2][3]],
["Random Forest",results[3][0],results[3][1],results[3][2],results[3][3]]]
#define header names
col_names = ["Model","MSE","MAE","RMSE","R2"]
#display table
print(tabulate(data, headers=col_names))
Model MSE MAE RMSE R2
--------------------- -------- --------- -------- --------
Linear 0.266212 0.152315 0.515957 0.747221
2nd Degree Polynomial 0.265491 0.142202 0.515259 0.747905
Decision Tree 0.563809 0.0475985 0.750872 0.46464
Random Forest 0.275982 0.143762 0.525339 0.737944
The results shown above conclude that apart from the decision tree, all the rest give almost similar results. So, moving on, I will perform the regression again on our ideal model (Linear), this time without scaling our target variable to get predicted charges.
#splitting into train and test
x=df.drop(columns=['charges'])
y=df['charges']
#label encoding
labelenc = preprocessing.LabelEncoder()
x['sex']= labelenc.fit_transform(x['sex'])
x['smoker']= labelenc.fit_transform(x['smoker'])
x['state']= labelenc.fit_transform(x['state'])
x['family']= labelenc.fit_transform(x['family'])
#splitting into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.34, random_state=30)
#random forest regressor
regressor = RandomForestRegressor(max_depth=3, random_state=0)
regressor.fit(x_train, y_train)
preds=regressor.predict(x_test)
3. Classification
It is important to note that the website algorithm I am trying to replicate does not ask for charges, it automatically assigns the person labels based on their input factors. Since the dataset does not originally have labels, I generated these labels myself for each state. In order to see how well my charges are predicted, I will now use the ranges defined for each state in the data preparation part and use predicted charges to predict new labels. This will give me the final labels needed for each row, which I will use to filter out health insurance plans while also acting as a validation technique for my regression model.
##### predicting new labels
#adding the predicted charged to my result dataframe -> df_classification
idx=x_test.index
#keeping only the rows present in the testdataset
df_classification=df_classification.loc[idx]
df_classification['charges preds']=preds
#reseting index
df_classification=df_classification.reset_index(inplace=False,drop=True)
#making an empty list of size df_classification
preds_labels=[None] * df_classification.shape[0]
#getting unique states
states=df['state'].unique()
#repeat the same process that was done earlier, this time on the predicted charges
#run for loop for each state
for state in states:
#getting all the charges present in the complete dataframe
value=df[df['state']==state]['charges'].values
#sorting values in ascending order
value.sort()
length=len(value)
#genrating the same ranges that were generated before
lowest=round(length*.35)
bronze=value[0:lowest]
middle=round(length*.70)
silver=value[lowest:middle]
gold=value[middle:]
#extracting the columns of the state that is being processed from my test dataframe
idx=df_classification.index[df_classification['state']==state]
#loop will run for each column
for i in idx:
#identify the predicted charge and assign a label
charge=df_classification['charges preds'].values[i]
if charge<=bronze[-1]:
preds_labels[i]='Bronze'
elif (charge>bronze[-1]) and (charge<=silver[-1]):
preds_labels[i]='Silver'
elif charge>silver[-1]:
preds_labels[i]='Gold'
else:
print(state)
print(charge)
print(bronze[-1])
print(silver[0],silver[-1])
print(gold[0])
print('---')
df_classification['labels preds']=preds_labels
#results for estimating the classification part
#confusion matrix
y_test=df_classification['label'].values
preds=df_classification['labels preds'].values
cm=confusion_matrix(y_test, preds,labels=["Bronze", "Silver", "Gold"])
cm_df = pd.DataFrame(cm,
index = ['Bronze','Silver','Gold'],
columns = ['Bronze','Silver','Gold'])
#Plotting the confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(cm_df, annot=True,cmap="crest",fmt='g')
plt.title('Confusion Matrix')
plt.ylabel('Actal Values')
plt.xlabel('Predicted Values')
plt.show()
To see how correctly our labels are predicted, I generated a confusion matrix followed by a detailed analysis of how each label performed.
Bronze:
TP: 116 | FN: 38 | FP: 18 | TN: 283
Accuracy: 0.8769230769230769
TPR: 0.7532467532467533 | TNR: 0.9401993355481728
Silver:
TP: 119 | FN: 49 | FP: 49 | TN: 238
Accuracy: 0.7846153846153846
TPR: 0.7083333333333334 | TNR: 0.8292682926829268
Gold:
TP: 109 | FN: 24 | FP: 44 | TN: 278
Accuracy: 0.8505494505494505
TPR: 0.8195488721804511 | TNR: 0.8633540372670807
After running multiple iterations I can conclude:
- Silver has the lowest accuracy each time
- TPR is weaker than TNR
4. Suggestion
After calculating the predicted labels, I will now generate a final data frame for clarity and use that to match health insurance plans with my dataset (2). I will now remove the original charges and labels from the resulting data frame, df_classification, and add the input features to give the resulting data frame a clear image.
#Adding the feature column to my final dataframe -> df_classification
#This acts as a database, or final testing feature of my project
x_test=x_test.reset_index(inplace=False,drop=True)
df_classification['age']=x_test['age']
df_classification['sex']=x_test['sex']
df_classification['children']=x_test['children']
df_classification['smoker']=x_test['smoker']
#removing the original charges and labels column because we only need the predicted ones
df_classification=df_classification.drop(columns=['label','charges'])
df_classification
To filter out relevant health insurance plans for each user, I will further add a random county code to each test subject based on their state. This is being done because the original dataset did not have a zip code column, and having a zip code further narrows down the results and makes the suggestions more accurate.
Represented below is a random test case extracted from our final dataset df_classification.
#Getting a test case
test_case=df_classification.iloc[404:405,:]
print(test_case)
if test_case.values[0][0] in df1['State Code'].unique():
#Getting state and class label of that test case
test_state=test_case['state'].values[0]
test_label=test_case['labels preds'].values[0]
#making dataframe to filter out that state and label from the plan dataframe
result=df1[(df1['State Code']==test_state) & (df1['Metal Level']==test_label)]
#in order to remove computation time, I will randomly allocate the testcase a random
# county code from that state
temp=df1[df1['State Code']==test_state]['FIPS County Code'].unique()
test_code=random.choice(temp)
print("County Code allocated:", test_code)
print('--------')
result=result[result['FIPS County Code']==test_code]
#getting unique issuers from that dataframe
issuer=result['Issuer Name'].unique()
#printing out all the issuers in that area and the plans
for i in issuer:
print("Plans by issuer:", i)
print(result[result['Issuer Name']==i]['Plan Marketing Name'])
print("")
else:
print("This state is not present in the database")
state charges preds labels preds age sex children smoker
404 FL 6060.813133 Bronze 30 1 2 0
County Code allocated: 12041
--------
Plans by issuer: Florida Blue (BlueCross BlueShield FL)
7070 BlueOptions Bronze 1419 ($0 Virtual Visits / R...
7073 BlueOptions Bronze 1416 ($0 Virtual Visits / 3...
7077 BlueOptions Bronze (HSA) 1705 (Rewards $$$ / $...
7079 BlueOptions Bronze 1707S ($0 Virtual Visits / ...
7081 BlueOptions Bronze 2119 ($0 Deductible / $50 P...
Name: Plan Marketing Name, dtype: object
Plans by issuer: Ambetter from Sunshine Health
7083 Ambetter Essential Care 1
7084 Ambetter Essential Care 2 HSA
7089 Ambetter Essential Care 5
7090 Ambetter Essential Care 22
7091 Ambetter Essential Care: $1,500 Medical Deduct...
7092 Ambetter Essential Care: $0 Medical Deductible
7098 Ambetter Essential Care 2 HSA + Vision + Adult...
7100 Ambetter Essential Care 1 + Vision + Adult Dental
7103 Ambetter Essential Care 5 + Vision + Adult Dental
7104 Ambetter Essential Care 22 + Vision + Adult De...
7105 Ambetter Essential Care: $1,500 Medical Deduct...
7106 Ambetter Essential Care: $0 Medical Deductible...
Name: Plan Marketing Name, dtype: object
Plans by issuer: Florida Blue HMO (a BlueCross BlueShield FL company)
7112 BlueCare Bronze 1486 ($0 Virtual Visits / Rewa...
7115 BlueCare Bronze 1483 ($0 Virtual Visits / 3 PC...
7119 BlueCare Bronze (HSA) 1765 (Rewards $$$ / $4 C...
7121 BlueCare Bronze 1767S ($0 Virtual Visits / $40...
7123 BlueCare Bronze 2179 ($0 Deductible / $50 PCP ...
Name: Plan Marketing Name, dtype: object
Discussion:
The scope of the project can be improved by applying the same algorithm to a dataset that instead of the region, already has state-specific and county/zipcode-specific rows. That will remove the process of randomly populating rows with these attributes. Also, if the second dataset had the price of each plan, the results could be filtered using the labels and the predicted prices to sort prices to further focus on the user requirements.
A Python Notebook is available on my GitHub if you would like to follow up on the code step-by-step, and offer insights, recommendations, and code improvement.