Statistical Analysis of Health Insurance Cost Using R

Adrita G
14 min read · Jul 9, 2024

Image: Reference Link

Introduction

The basis of health insurance is financial protection against the potentially excessive expenses of healthcare. Individuals pay a regular premium, usually once a month or once a year. The rate of premiums varies based on a few factors. For instance, an individual’s age, health history, region, lifestyle choices such as smoking, and plan features such as deductibles and co-payments should ideally affect health insurance costs, rather than the individual’s gender.

Pricing on these factors rather than on gender promotes a more inclusive framework and aims to keep health insurance affordable and accessible for everyone. In short, health insurance acts as a safety net for individuals’ finances, enabling them to receive essential medical care without worrying about the cost. This encourages a community that is healthier and safer.

An analysis of the health insurance dataset has been documented in this article. This dataset contains the insurance details of 1338 US residents.

Dataset: Health-Insurance-Dataset.csv
Variables: Age, BMI, Children, Expenses, Sex, Smoker, Region
R Script: Health-Insurance-Data-Analysis

By predicting likely expenses from individual characteristics, insurers can set appropriate premium rates through data analysis. Because R makes data manipulation, statistical modelling, and visualisation easier, it is a helpful tool for analysing health insurance data.

Visualising the distribution and summarising statistics of each variable

Summarisation of each variable:

The dataset consists of 1338 observations with 7 variables (shown in Table 1). This information helps to investigate relationships and dependencies among variables.

Table 1: Health Insurance Dataset Description

The mean and median of the numerical variables measure central tendency, the quartiles show where the lower, middle and upper parts of each distribution lie, and the standard deviation describes the spread of values around the mean (shown in Table 2). The gender distribution, smoking habits and regional representation provide insights into demographic variations within the dataset. They allow targeted analyses of the various factors influencing health insurance outcomes (shown in Table 3).

Table 2: Summary of Numerical variables
Table 3: Summary of Categorical variables (Based on count of observations)

Visualisation and summary statistics:

Age: The age distribution is diverse, ranging from 18 to 64 years old, with the middle 50% of people (between the first and third quartiles) falling between 27 and 51 years old.
BMI: It ranges from 15.96 to 53.13, with the middle 50% falling between 26.30 and 34.69, showing that the insured persons have varied body mass indices.
Children: With a mean of about 1.095 children, most individuals have between zero and two dependent children.
Expenses: Medical insurance costs range from 1,122 to 63,770, with the middle 50% of values lying between 4,740 and 16,640. This variable also shows the amount that people with insurance must pay for healthcare.
Sex: In this category, there are 676 men and 662 women in the dataset.
Smoker: The distribution of this group, which includes 274 smokers and 1,064 non-smokers, reflects the prevalence of smoking among those with insurance.
Region: Individuals in the Region group are spread out geographically across four regions: Northeast (324), Northwest (325), Southeast (364), and Southwest (325).
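The per-variable summaries above could be produced along the following lines. This is a minimal sketch, not the author's script: the data frame `insurance` and all of its values are simulated for illustration, and the lower-case column names are an assumption based on the variable list.

```r
# Simulated stand-in for the real dataset (values are NOT the article's data)
set.seed(1)
n <- 200
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  bmi      = round(rnorm(n, mean = 30, sd = 6), 1),
  children = sample(0:5, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)

# Numerical variables: range, quartiles, mean and standard deviation
num_vars <- c("age", "bmi", "children")
num_summary <- sapply(insurance[num_vars], function(x) {
  c(min    = min(x),
    q1     = unname(quantile(x, 0.25)),
    median = median(x),
    mean   = mean(x),
    q3     = unname(quantile(x, 0.75)),
    max    = max(x),
    sd     = sd(x))
})
print(round(num_summary, 2))

# Categorical variables: count of observations per category
cat_counts <- lapply(insurance[c("sex", "smoker", "region")], table)
print(cat_counts)
```

On the real data, `summary(insurance)` would give a similar overview in a single call; the explicit version here simply makes each statistic visible.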

The frequency distributions of the numerical variables Age, BMI and Expenses are visualised through both box plots and histograms, and the frequency distributions of Children and the categorical variables Sex, Smoker and Region are visualised through bar graphs (Appendix 1 and Appendix 2).

Note: for visualisation, the ggplot2 package was installed and loaded in RStudio with library(ggplot2), and ggplot() was used to draw the plots.
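As a rough illustration of the three plot types mentioned, here is a dependency-free sketch that uses base R graphics instead of ggplot2, so it is not the code behind the appendices. The vectors `bmi` and `smoker` are simulated, hypothetical data.

```r
# Simulated data for illustration only
set.seed(2)
bmi    <- rnorm(300, mean = 30, sd = 6)
smoker <- factor(sample(c("no", "yes"), 300, replace = TRUE, prob = c(0.8, 0.2)))

pdf(NULL)                      # null device; drop this line to see the plots
op <- par(mfrow = c(1, 3))     # three panels side by side

bp <- boxplot(bmi, main = "BMI (box plot)", ylab = "BMI")     # median, quartiles, outliers
h  <- hist(bmi, main = "BMI (histogram)", xlab = "BMI")       # shape of the distribution
counts <- table(smoker)
barplot(counts, main = "Smoker (bar graph)", ylab = "Count")  # counts per category

par(op)
invisible(dev.off())

# Points flagged as outliers by the box plot (beyond 1.5 * IQR from the box)
print(bp$out)
```

The box plot is the quickest way to spot outliers, while the histogram shows whether a variable looks roughly bell-shaped or skewed.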

Appendix 1: Boxplots of Age, BMI and Expenses showing central tendency (with outliers in the BMI and Expenses variables). The distribution plots also show that only BMI is approximately normally distributed.
Appendix 2: The count of all the categories is shown in the graph

Evaluation:

The dataset offers insightful information about the insured population’s health characteristics, medical expenses, and demographics. It shows a wide range of ages, with the middle half of people falling within the 27–51 age range. The distribution of genders is almost equal between males and females. BMI varies widely across individuals. Non-smokers clearly predominate (1,064 compared with 274 smokers), and most individuals have between zero and two dependent children. People are spread out over four geographical regions.

Testing the assumption of independence among predictor variables using correlation coefficients and association tests

Before proceeding with any statistical analysis, it is important to assess whether the data follow a normal distribution, because the validity of the later analyses depends on it. To check normality, the Shapiro-Wilk normality test was performed.
The assumed independence among predictor variables is assessed using correlation coefficients, scatter plots, and a correlation matrix.

The Spearman correlation, also known as Spearman’s rank correlation coefficient or Spearman’s rho, is a non-parametric measure of correlation between two variables. It is computed on the ranks of the values rather than on the values themselves. Unlike Pearson correlation, which assesses the linear relationship between variables, Spearman correlation measures the strength and direction of the monotonic relationship between variables. Spearman correlation values range from -1 to 1. A positive rho signifies a positive monotonic correlation, while a negative rho indicates a negative monotonic correlation.
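The difference between the two coefficients is easiest to see on a relationship that is monotonic but not linear. The sketch below uses made-up vectors `x` and `y` (not the insurance data) and the base R function cor().

```r
# Made-up example: y always increases with x, but along a curve
x <- 1:20
y <- exp(x / 3)

pearson  <- cor(x, y, method = "pearson")    # measures linear association
spearman <- cor(x, y, method = "spearman")   # measures monotonic association (on ranks)

# Reversing the direction of the relationship flips the sign of rho
spearman_neg <- cor(x, -y, method = "spearman")

print(c(pearson = pearson, spearman = spearman, spearman_neg = spearman_neg))
```

Because the ranks of `x` and `y` agree perfectly, Spearman's rho reaches its maximum, while Pearson's coefficient stays below 1 because the points do not lie on a straight line.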

Scatter plot visualisation enables qualitative analysis of the nature of the association, whilst the correlation matrix provides a quantitative analysis by the comprehensive overview of the pairwise correlations between each predictor variable.

This method helps to explain the relationships between numerical variables such as Age, Children, BMI and Expenses. However, Sex (binary), Smoker (binary) and Region are categorical variables, so these methods are not appropriate for them. Correlation coefficients can be misleading when applied to categorical variables because they are intended for numerical or ordinal data. For categorical variables, the Chi-square association test is used instead.

The Chi-square test assesses whether there is a significant association between each pair of categorical predictor variables in the dataset. Its null hypothesis is “The variables are independent”. For the numerical variables, the normality test determines whether each one is normally distributed, using the following hypotheses.

Null hypothesis (H0): The data follows a normal distribution.
Alternative hypothesis (H1): The data does not follow a normal distribution.

In the Shapiro-Wilk test, the closer the statistic ‘W’ is to 1, the more consistent the data are with a normal distribution. If the p-value is less than the chosen significance level (0.05), there is sufficient evidence to reject the null hypothesis of normality.
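A minimal sketch of how this test behaves, using base R's shapiro.test() on two simulated samples. Neither sample is the article's data: `normal_like` and `right_skewed` are hypothetical vectors chosen to show a roughly normal shape and a strongly right-skewed one.

```r
# Simulated samples for illustration only
set.seed(3)
normal_like  <- rnorm(500, mean = 30, sd = 6)    # roughly bell-shaped
right_skewed <- rexp(500, rate = 1 / 13000)      # long right tail

sw_normal <- shapiro.test(normal_like)
sw_skewed <- shapiro.test(right_skewed)

# W close to 1 is consistent with normality; a p-value below 0.05 rejects H0
results <- data.frame(
  sample  = c("normal_like", "right_skewed"),
  W       = c(sw_normal$statistic, sw_skewed$statistic),
  p_value = c(sw_normal$p.value, sw_skewed$p.value)
)
print(results)
```

Note that shapiro.test() accepts between 3 and 5000 observations, so a dataset of 1338 rows is within its limits. With samples this large, even small departures from normality can produce a small p-value, which is why the histograms are worth inspecting alongside the test.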

The results of the normality test (Table 4) show that none of the numerical variables is normally distributed. Pearson correlation is therefore less suitable here, because it measures linear association and its significance test assumes normally distributed variables. Based on the current data, the Spearman correlation is appropriate. The correlation matrix (Table 5) provides insights into the relationships between the numerical variables in the dataset.

Table 4: Results for Normality tests for each continuous independent variable
Table 5: Spearman correlation matrix diagram for numerical independent variables

Association test interpretation among categorical variables (Table 6)

Sex and Smoker: The chi-squared test between ‘Sex’ and ‘Smoker’ yielded a chi-squared value of 7.3929 with 1 degree of freedom. The associated p-value is 0.006548. Since the p-value is less than the common significance level of 0.05, the null hypothesis has been rejected. This suggests that there is evidence to support an association between ‘Sex’ and ‘Smoker’. In other words, being male or female may be associated with differing smoking habits among the individuals in the dataset.

Sex and Region: The chi-squared test between ‘Sex’ and ‘Region’ resulted in a chi-squared value of 0.43514 with 3 degrees of freedom. The p-value associated with this test is 0.9329, which is greater than 0.05. Therefore, we fail to reject the null hypothesis. This indicates that there is insufficient evidence to suggest an association between ‘Sex’ and ‘Region’ in the dataset. In other words, the distribution of sexes does not significantly differ across the various geographic regions.

Region and Smoker: The chi-squared test between ‘Region’ and ‘Smoker’ yielded a chi-squared value of 7.3435 with 3 degrees of freedom. The associated p-value is 0.06172. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. This suggests that there is not enough evidence to conclude that there is an association between ‘Region’ and ‘Smoker’ in the dataset at the chosen significance level. However, the p-value is relatively close to 0.05, so the result is borderline, and further analysis may be warranted to explore this potential association.

Table 6: Association test result, Chi-square test for categorical independent variables
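The mechanics of the association test can be sketched with base R's chisq.test() on a 2 x 2 contingency table. The article does not list the cell counts, so the four counts below are an assumption: they were chosen only so that the row and column totals agree with the totals reported earlier (662 women, 676 men, 1,064 non-smokers, 274 smokers).

```r
# ASSUMED cell counts, consistent with the reported margins but not taken from the article
sex_smoker <- matrix(
  c(547, 115,    # female: non-smoker, smoker
    517, 159),   # male:   non-smoker, smoker
  nrow = 2, byrow = TRUE,
  dimnames = list(sex = c("female", "male"), smoker = c("no", "yes"))
)

# For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default
res <- chisq.test(sex_smoker)

print(addmargins(sex_smoker))   # table with row and column totals
print(round(res$expected, 1))   # counts expected if sex and smoking were independent
print(res)                      # statistic, degrees of freedom and p-value
```

The degrees of freedom are (rows - 1) x (columns - 1), which gives 1 for Sex against Smoker and 3 for any comparison involving the four regions, matching the values quoted above.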

Evaluation:

The analysis suggests several key findings. Firstly, none of the numerical variables (Age, BMI, Children, Expenses) was found to be normally distributed based on the Shapiro-Wilk tests. Therefore, the Spearman correlation was used instead of the Pearson correlation to examine relationships between variables. A moderate positive correlation was found between Age and Expenses, indicating that older individuals tend to have higher medical insurance expenses. Weak positive correlations were observed between BMI and Expenses, as well as between Children and Expenses. Additionally, a chi-square test revealed an association between Sex and Smoking habits, meaning that smoking rates differ between men and women in this dataset. However, no significant association was found between Sex and Region, nor between Region and Smoking habits. These findings provide valuable insights into potential factors influencing medical insurance expenses and smoking behaviours among individuals in the dataset.

Employing linear regression modelling to describe variables that influence Expenses

A multiple linear regression is applied to model the relationship between the dependent variable (Expenses) and the independent variables (Age, Sex, BMI, Children, Smoker, Region). This approach was selected because it can handle several predictors at once. It quantifies each variable’s effect on the dependent variable by estimating coefficients with the least squares method, which helps to identify important factors. The R-squared statistic evaluates the explanatory power of the model by indicating the proportion of the dependent variable’s variance that the predictors account for. The adequacy of the model is evaluated through residual analysis, which checks whether the assumptions of constant variance (homoscedasticity), independence, and linearity are satisfied. Overall, multiple linear regression provides a framework for quantifying the relationships within a dataset and for predicting the outcome variable from several factors at once.

The multiple linear regression model is used to predict the medical insurance expenses based on several predictor variables: age, sex, BMI, number of children, smoking habits, and region (Appendix 3).

Coefficients descriptions:

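Intercept: The intercept represents the estimated mean value of expenses when all numerical predictors are zero and all categorical predictors are at their reference levels (female, non-smoker, northeast). In this case, it’s estimated to be -11938.5. It has no practical interpretation on its own, because an age and BMI of zero lie outside the range of the data.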
Age: For each one-year increase in age, the expenses are estimated to increase by $256.9.
Sex_male: This coefficient represents the difference in expenses between males and females. It suggests that being male is associated with a decrease of $131.3 in expenses, but it’s not statistically significant (p-value = 0.693).
BMI: For each one-unit increase in BMI, the expenses are estimated to increase by $339.2.
Children: For each additional child, the expenses are estimated to increase by $475.5.
Smoker_yes: This coefficient represents the difference in expenses between smokers and non-smokers. Being a smoker is associated with a significant increase in expenses of $23848.5.
Region northwest, region southeast, region southwest: These coefficients represent the differences in expenses between the reference region (northeast) and the other regions. However, only the coefficient for ‘region southeast’ is statistically significant. It suggests that being in the southeast region is associated with a decrease in expenses by $1035.0 compared to the northeast region.
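To see how the coefficients listed above combine into a prediction, here is a small worked example. The coefficient values are the ones quoted in the list; the helper `predict_expenses()` and the example person are hypothetical. The northwest and southwest coefficients are not quoted in the article, so this sketch only covers the northeast (reference) and southeast regions.

```r
# Coefficients as quoted in the list above
coefs <- c(
  intercept        = -11938.5,
  age              = 256.9,
  sex_male         = -131.3,
  bmi              = 339.2,
  children         = 475.5,
  smoker_yes       = 23848.5,
  region_southeast = -1035.0
)

# Hypothetical helper: adds up the linear predictor for one person.
# Logical arguments act as 0/1 dummy variables (FALSE = reference level).
predict_expenses <- function(age, bmi, children,
                             male = FALSE, smoker = FALSE, southeast = FALSE) {
  unname(
    coefs["intercept"] +
      coefs["age"] * age +
      coefs["bmi"] * bmi +
      coefs["children"] * children +
      coefs["sex_male"] * male +
      coefs["smoker_yes"] * smoker +
      coefs["region_southeast"] * southeast
  )
}

# A 40-year-old woman in the northeast with BMI 30 and two children
as_non_smoker <- predict_expenses(age = 40, bmi = 30, children = 2)
as_smoker     <- predict_expenses(age = 40, bmi = 30, children = 2, smoker = TRUE)

print(c(non_smoker = as_non_smoker, smoker = as_smoker,
        difference = as_smoker - as_non_smoker))
```

The two predictions differ by exactly the Smoker_yes coefficient, which is what "holding the other predictors constant" means in practice.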

Model presentation:

Residuals: The residuals represent the differences between the observed Expenses and the Expenses predicted by the model.
Residual standard error: This is an estimate of the standard deviation of the residuals. In this case, it’s approximately $6062.
Multiple R-squared: This is a measure of how well the model explains the variability in the dependent variable (expenses). It indicates that approximately 75.09% of the variability in expenses can be explained by the predictor variables.
Adjusted R-squared: This is the R-squared value adjusted for the number of predictors in the model. It’s slightly lower than the multiple R-squared.
F-statistic: This is a test statistic used to assess the overall significance of the model. A high F-statistic (500.8) with a very low p-value (< 2.2e-16) indicates that the overall model is statistically significant.
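The quantities described above can all be read from the object returned by summary() on a fitted lm() model. The sketch below fits the same model formula to simulated data; the data frame `sim`, its column names, and the coefficients used to generate it are assumptions made for illustration, so the printed numbers will not match the article's results.

```r
# Simulated data (NOT the article's dataset)
set.seed(4)
n <- 500
sim <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi      = rnorm(n, mean = 30, sd = 6),
  children = sample(0:5, n, replace = TRUE),
  smoker   = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2))),
  region   = factor(sample(c("northeast", "northwest", "southeast", "southwest"),
                           n, replace = TRUE))
)
# Expenses generated from an assumed linear relationship plus random noise
sim$expenses <- -12000 + 250 * sim$age + 340 * sim$bmi + 480 * sim$children +
  24000 * (sim$smoker == "yes") + rnorm(n, sd = 6000)

# Multiple linear regression; factors are expanded into dummy variables,
# with the first level of each (female, no, northeast) as the reference
fit <- lm(expenses ~ age + sex + bmi + children + smoker + region, data = sim)
s   <- summary(fit)

coef_table <- s$coefficients              # estimate, std. error, t value, p-value
print(round(coef_table, 3))
print(c(residual_se   = s$sigma,          # residual standard error
        r_squared     = s$r.squared,      # multiple R-squared
        adj_r_squared = s$adj.r.squared)) # adjusted R-squared
print(s$fstatistic)                       # F value with its degrees of freedom

# Residuals: observed minus fitted values
res <- residuals(fit)
```

For the residual analysis mentioned earlier, `plot(fit)` draws the standard diagnostic plots (residuals against fitted values, normal Q-Q, and so on).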

Overall, the model suggests that age, BMI, number of children, smoking habits, and residence in the southeast region are significant predictors of medical insurance expenses. The effects of sex and of the other regions on expenses are not statistically significant in this model.

Appendix 3: Snapshot of the RStudio output from the regression model summary, showing the coefficients, residual standard error, multiple R-squared, and adjusted R-squared values.

Evaluation:

The multiple linear regression model provides valuable insights into the factors influencing medical insurance expenses. Significant predictors include age, BMI, number of children, smoking habits, and region, while sex and region (excluding southeast) show non-significant effects. Being a smoker is associated with a substantial increase in expenses while residing in the southeast region is linked to a decrease in expenses compared to the northeast. The model’s high multiple R-squared value (75.09%) indicates a good fit, suggesting that approximately three-quarters of the variation in expenses can be explained by the predictor variables. However, it’s important to note the residual standard error to understand the variability in predictions. Overall, this analysis underscores the importance of considering multiple factors when estimating medical insurance expenses, providing valuable insights for insurers, policymakers, and individuals seeking to understand and manage healthcare costs.

Evaluating differences in predictor variables with the assumption that each predictor variable is independent of the others, concerning a newly formed categorical variable, EXPENSE_split

To evaluate differences in predictor variables with the assumption of independence, a new categorical variable, EXPENSE_split, has been created, based on the expenses incurred by individuals in the dataset. This variable divides the data into two groups: one representing individuals with high medical insurance expenses and the other representing those with low expenses. The split uses the median of expenses as the threshold, placing individuals above the median in the “high” group and those below it in the “low” group.
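A median split of this kind takes only a couple of lines. In the sketch below the `expenses` vector is made up, and sending values exactly equal to the median to the "low" group is an assumption, since the article does not say how ties were handled.

```r
# Made-up expense values for illustration
expenses <- c(1200, 3400, 5600, 9100, 12500, 18000, 27000, 41000)

threshold <- median(expenses)

# Above the median -> "high"; otherwise "low" (ties go to "low" by assumption)
expense_split <- factor(ifelse(expenses > threshold, "high", "low"),
                        levels = c("low", "high"))

print(threshold)
print(data.frame(expenses, expense_split))
print(table(expense_split))
```

Splitting at the median gives two groups of (nearly) equal size, which keeps the later group comparisons balanced.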

For numerical variables like Age, BMI and Children, a non-parametric test, the Mann-Whitney U test (also known as the Wilcoxon rank-sum test) is being used to assess whether there is a significant difference between the distributions of two independent groups.

Null hypothesis (H0): There is no difference in the distributions of age/BMI/children between individuals with high expenses and individuals with low expenses groups.
Alternative Hypothesis (H1): There is a difference in the distributions of age/BMI/children between individuals with high expenses and individuals with low expenses groups.
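A minimal sketch of the test itself, using base R's wilcox.test() with its formula interface. The data frame `d` is simulated and hypothetical: the "high" group is deliberately drawn from older ages so that a difference exists to be detected.

```r
# Simulated ages for two expense groups (illustration only)
set.seed(5)
age_low  <- sample(18:50, 150, replace = TRUE)
age_high <- sample(30:64, 150, replace = TRUE)

d <- data.frame(
  age           = c(age_low, age_high),
  expense_split = factor(rep(c("low", "high"), each = 150))
)

# Two-sided Mann-Whitney U / Wilcoxon rank-sum test of age by group
wt <- wilcox.test(age ~ expense_split, data = d)
print(wt)

# Group medians help to interpret the direction of any difference
print(tapply(d$age, d$expense_split, median))
```

The same call with `bmi` or `children` on the left-hand side of the formula would cover the other two numerical variables.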

For categorical variables (sex and smoker status, which have two categories each, and region, which has four), we use chi-square tests. Chi-square tests assess whether there is a significant association between two categorical variables. In this case, we want to determine if there is an association between each categorical variable (sex, smoker status, region) and the high or low-expense groups.

Null hypothesis (H0): There is no association between the categorical variable and the high or low-expense groups.
Alternative Hypothesis (H1): There is an association between the categorical variable and the high or low-expense groups.

Since none of the numerical variables is normally distributed, the Mann-Whitney U test is a more appropriate choice than a t-test for comparing age, BMI and number of children between the two groups. Splitting the expense data at the median is also appropriate here, because the median is not distorted by the skew and outliers in Expenses.

The Wilcoxon rank sum tests (Table 7) were conducted to compare age, BMI, and the number of children between the groups defined by EXPENSE_split. For age, a significant difference was found, indicating that the distribution of ages differs significantly between the two groups. Similarly, BMI showed a significant difference, suggesting differing BMI distributions between groups. However, for the number of children, no significant difference was observed, indicating that the distribution of children does not vary significantly between the groups. All tests were two-sided, with the alternative hypothesis that the true location shift between the groups is not equal to 0 (the comparison is visualised in the boxplots in Appendix 4).

In the Chi-squared tests (Table 7), the sex variable has a p-value of 0.95, so we fail to reject the null hypothesis: there is no evidence of an association between sex and the high or low-expense groups. For the smoker variable, the chi-squared test statistic is 342.05 with p < 2.2e-16, which is strong evidence against the null hypothesis and indicates a significant association between smoking status and the high or low-expense groups. Lastly, for the region variable, the chi-squared test statistic is 4.466 with a p-value of 0.21, which is greater than 0.05, so we again fail to reject the null hypothesis (the associations are visualised in the bar graphs in Appendix 5).

Table 7: Results of statistical tests to assess differences in central tendencies
Appendix 4: Visualisation of the comparison of Age, BMI and Children with the newly created categorical variable EXPENSE_split, which has been split based on the high and low value of the expense variable
Appendix 5: Visualisation of the Association of Smoker, Sex and Region with newly created categorical variable EXPENSE_split, which has been split based on the high and low value of the expense variable.

Evaluation:

The statistical analyses revealed significant differences in Age and BMI distributions between high and low-expense groups, supported by Mann-Whitney U tests. However, the number of Children showed no significant difference. For categorical variables, while smoking habits were strongly associated with expense groups, Sex and region showed no significant associations. These findings suggest that Age, BMI, and Smoking habits are important factors influencing medical expenses, highlighting the need for tailored approaches in healthcare cost estimation and planning.

Conclusion

In conclusion, the analysis of the dataset reveals various factors influencing medical insurance expenses and health behaviours among the insured population. Demographic profiles, including age, gender, BMI, and smoking habits, provide insights into the characteristics of the insured individuals. The multiple linear regression model identifies significant predictors of medical expenses, highlighting the importance of age, BMI, smoking habits, and geographic region. Additionally, the independent predictor analysis emphasizes the impact of age, BMI, and smoking habits on medical expenses. These findings highlight the complexity of healthcare cost estimation and planning, stressing the need for tailored approaches and targeted interventions to address regional disparities in health outcomes. Overall, the dataset offers valuable insights for insurers, policymakers, and individuals to understand and manage healthcare costs effectively.

Note: This story is part of my university assignment report for the statistical data analysis and visualisation module.

Adrita G

I'm a Healthcare data analyst. I love to play with data using statistical approaches in Python and R.