Final Project Probability Course Pacmann: Health Insurance Case Study

Kinanti Nabilah
Dec 3, 2022

INTRODUCTION

Background

Health insurance users regularly pay a fixed amount of money (the premium) to the health insurance company. The insurer calculates the premium so that it can cover users’ health bills. Determining the premium can be a challenge in itself for the insurer, since many factors can raise a user’s risk profile.

Objective

In this project, an analysis is conducted to understand the relationship between several variables describing users’ condition and the health bill each user receives.

DATASET

The dataset contains personal data of the health insurance company’s users. It has 1338 non-null entries and 7 variables:
- Age: age of the primary beneficiary
- Sex: gender of the insurance contractor (female, male)
- BMI: body mass index, an objective index of body weight (kg/m2) computed from weight and height, indicating weights that are relatively high or low relative to height; ideally 18.5 to 24.9
- Children: number of children covered by health insurance / number of dependents
- Smoker: whether the user smokes
- Region: the beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
- Charges: individual medical costs billed by health insurance

Figure 1. Dataset
Figure 2. Body Mass Index Categories (Source: https://www.freepik.com/premium-vector/body-mass-index-man-silhouettes-with-different-obesity-degrees_15808216.htm)

From Figure 2, according to BMI range, there are 5 distinct categories:
1. Underweight for BMI < 18.5
2. Normal for 18.5 ≤ BMI ≤ 24.9
3. Overweight for 25.0 ≤ BMI ≤ 29.9
4. Obese for 30.0 ≤ BMI ≤ 34.9
5. Extremely overweight for BMI ≥ 35.0
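
As a minimal sketch, the five categories can be written as a small lookup function. The function name is hypothetical, and the half-open boundaries (for example, treating a value such as 24.95 as Normal) are an assumption, since the listed ranges leave small gaps between bands.

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value to one of the five categories listed above.

    Assumption: each band is half-open, so values that fall in the gap
    between two listed ranges (e.g. 24.95) go to the lower band.
    """
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25.0:
        return "Normal"
    if bmi < 30.0:
        return "Overweight"
    if bmi < 35.0:
        return "Obese"
    return "Extremely overweight"


# A few illustrative values, not rows from the dataset.
for value in (17.0, 22.4, 27.9, 30.66, 36.1):
    print(value, bmi_category(value))
```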

RESEARCH QUESTIONS

I. Descriptive Statistic Analysis

1. What is the average age of users?
2. Is the average age the same for the female and male categories?
3. What is the average BMI of users who smoke?
4. Which average BMI is higher, male or female?
5. Which average BMI is higher, smoker or nonsmoker?
6. Is the variance of charges the same for smokers and nonsmokers?
7. Which average charge is higher, smoker or nonsmoker?
8. Which average charge is higher, a smoker whose BMI is above 25 or a nonsmoker whose BMI is above 25?

II. Categorical Variable Analysis

1. Which proportion is higher, smoker or nonsmoker?
2. Which gender has the highest charge?
3. What is the probability of a female given the user is a smoker?
4. What is the probability of a male given the user is a smoker?
5. How is the probability distribution of charge in each region?
6. Does each region have users’ proportion equal to each other?

III. Continuous Variable Analysis

  1. Which one is more likely to happen?
    a. A user has BMI above 25 if they receive charges above 16.7k,
    or
    b. A user has BMI below 25 if they receive charges above 16.7k
  2. Which one is more likely to happen?
    a. A user whose BMI below 25 will receive charges above 16.7k,
    or
    b. A user whose BMI below 25 will receive charges below 16.7k
  3. Which one is more likely to happen?
    a. A smoker whose BMI above 25 receives charges above 16.7k,
    or
    b. A nonsmoker whose BMI above 25 receives charges above 16.7k

IV. Variable Correlation Analysis

1. How is the correlation of BMI, age, and number of children with charges?
2. How are charges distributed across the smoker, sex, BMI, and region categories?

V. Hypothesis Testing

Is the hypothesis true for these claims?
1. Smoker users have higher charges than nonsmoker
2. Users whose BMI above 25 have higher charges than users whose BMI below 25
3. Male users have higher charges than female

FINDINGS AND ANALYSIS

The tools used for the analysis are Microsoft Excel for calculations and Tableau for visualizations.

I. Descriptive Statistic Analysis

First, we want to analyze the descriptive statistics of the data. Below is the summary table of the descriptive statistics we have calculated.

Figure 3. Summary Table of Descriptive Statistics

From Figure 3, we can extract the values that answer the research questions, as explained below.

Total users
From the data, we can see that the total number of users is 1338, of whom 662 are female and 676 are male. Of the total, 1064 are nonsmokers and 274 are smokers.

Age
The average age of users is 39 years, with ages ranging from 18 to 64. Comparing the average age of female and male users, as well as smokers and nonsmokers, there is little difference: all are around 38–39 years old.

BMI
The average BMI for all users is 30.66. More specifically, the average BMI is 30.38 for female users and 30.94 for male users, which is not a big difference; likewise, the average BMI is 30.71 for smokers and 30.65 for nonsmokers. According to the BMI categories in Figure 2, the average user, whether male or female, smoker or nonsmoker, falls into the Obese category (BMI between 30.0 and 34.9).

Charges
Looking at the charges billed to users, female users have slightly lower average charges (12,570 USD) than male users (13,957 USD). The difference in average charges between nonsmokers (8,434 USD) and smokers (32,050 USD) is much larger. This is consistent with the belief that the risk profile is more likely to be higher for male users (compared to female users) and for smokers (compared to nonsmokers). If we combine the sex and smoker columns and compare male smokers with female smokers, the average charge is again higher for male smokers (33,042 USD compared to 30,679 USD).

Comparing the spread of charges for smokers and nonsmokers, measured by the standard deviation (the square root of the variance), we see a substantial difference. Nonsmokers have a standard deviation of 5,994 USD, while smokers have a standard deviation of 11,542 USD, meaning that charges in the smoker category are more widely spread, with a maximum charge of 63,770 USD.
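
The group-wise mean, variance, and standard deviation can be sketched in a few lines of Python. The rows below are hypothetical values, not taken from the dataset, and the sketch assumes the sample (n − 1) formulas.

```python
from collections import defaultdict
from statistics import mean, stdev, variance

# Hypothetical (smoker, charges in USD) rows; NOT values from the real dataset.
rows = [
    ("yes", 30000.0), ("yes", 42000.0), ("yes", 24000.0),
    ("no", 8000.0), ("no", 9500.0), ("no", 6500.0),
]


def summarize(rows):
    """Group charges by smoker status and compute basic sample statistics."""
    groups = defaultdict(list)
    for smoker, charges in rows:
        groups[smoker].append(charges)
    return {
        key: {
            "n": len(values),
            "mean": mean(values),
            "variance": variance(values),  # sample variance (n - 1)
            "std": stdev(values),          # square root of the sample variance
        }
        for key, values in groups.items()
    }


summary = summarize(rows)
for key, stats in summary.items():
    print(key, stats)
```

Note that the standard deviation is in USD, while the variance is in squared USD, which is why the standard deviation is the easier of the two to compare against the mean charge.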

Figure 4: Comparison of Average Charge between Sex/Smoker

Then, comparing the average charges of smokers with BMI above 25 and nonsmokers with BMI above 25, there is a large difference: the average charge for smokers is about 4.1 times that of nonsmokers (35,117 USD compared to 8,630 USD).

Key takeaways:
1. The average age of users is 39 years.
2. There is little difference in average age between the female and male categories; both are around 38–39 years old.
3. The average BMI of users who smoke is 30.71.
4. Comparing male and female users, male users have a slightly higher average BMI (30.94) than female users (30.38).
5. Comparing smokers and nonsmokers, smokers have a slightly higher average BMI (30.71) than nonsmokers (30.65).
6. The variance of charges is not the same for smokers and nonsmokers: smokers have a higher standard deviation (11,542 USD) than nonsmokers (5,994 USD).
7. Comparing smokers and nonsmokers, smokers have a higher average charge (32,050 USD) than nonsmokers (8,434 USD).
8. Comparing smokers and nonsmokers whose BMI is above 25, smokers have a higher average charge (35,117 USD), about 4.1 times that of nonsmokers (8,630 USD).

II. Categorical Variable Analysis

Next, we identify the probability of certain conditions that have the potential to increase the charge.

Comparison of Sex Category and Smoker Category

Figure 5. Number of Users based on Smoker Category
Figure 6. Number of Users based on Sex/Smoker Category
Figure 7. Summary Table of Smoker and Nonsmoker Proportions

Of all users, 80% are nonsmokers and 20% are smokers. Next, we want to compare the probability that a smoker is female with the probability that a smoker is male. We calculate this as a conditional probability, using the formula below:

p (A | B) = p (A ∩ B) / p (B)

Then from the data we calculate:

Figure 8. Conditional Probability for Female and Male Given The User Is Smoker

From these calculations, we can conclude that a smoker is more likely to be male than female. In other words, among the 20% of users who smoke, there are more male smokers than female smokers. From the previous table, we know that smokers have higher charges than nonsmokers. So we can infer that, comparing the genders, male users are more likely to pay higher charges than female users. (Even though, in the data, the maximum charge belongs to a female user, we are comparing the gender categories, not the individual user with the maximum charge.)
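
The formula p (A | B) = p (A ∩ B) / p (B) can be sketched directly from counts. The counts below are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the counts behind Figure 8.

```python
from fractions import Fraction


def conditional_probability(count_a_and_b: int, count_b: int, total: int) -> Fraction:
    """Return p(A | B) = p(A and B) / p(B), computed from raw counts."""
    if count_b == 0:
        raise ValueError("p(B) is zero, so p(A | B) is undefined")
    p_a_and_b = Fraction(count_a_and_b, total)
    p_b = Fraction(count_b, total)
    return p_a_and_b / p_b


# Hypothetical counts, NOT the real dataset: 100 users, 20 of them smokers,
# of whom 8 are female and 12 are male.
total_users = 100
smokers = 20
female_smokers = 8
male_smokers = 12

p_female_given_smoker = conditional_probability(female_smokers, smokers, total_users)
p_male_given_smoker = conditional_probability(male_smokers, smokers, total_users)

print(float(p_female_given_smoker))  # 0.4
print(float(p_male_given_smoker))    # 0.6
```

Because the total cancels out, the result equals the count of female (or male) smokers divided by the count of smokers, and the two conditional probabilities sum to 1.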

Comparison among Region

Figure 9. Comparison of Number of Users among Regions
Figure 9. Comparison of Number of Users among Regions based on Smoker Category
Figure 10. Comparison among Regions

Based on the data, the number of users is similar across regions, except for Southeast, which has the most users and the most smokers. The Southeast region also has the highest mean charge of all regions. The number of smokers and the mean charge are probably related: since the Southeast region has a high number of smokers, this probably drives its mean charge higher too.

Key takeaways:

1. Nonsmokers make up a higher proportion of users (80%) than smokers (20%).
2. Male users have a higher probability of paying higher charges.
3. The probability that a user is female given that the user is a smoker is 42%.
4. The probability that a user is male given that the user is a smoker is 58%.
5. In the probability distribution of charges by region, Southeast is the highest (28%), followed by Northeast (25%), and Northwest and Southwest (23% each).
6. The regions have roughly equal proportions of users, except Southeast, which has the highest proportion (27%).

III. Continuous Variable Analysis

In this continuous variable analysis, we make the three comparisons listed in the research questions section above.

First, we want to analyze which one is more likely to happen:
a. A user has BMI above 25 if they receive charges above 16.7k,
or
b. A user has BMI below 25 if they receive charges above 16.7k

We summarize the data by BMI and charges category, which gives the figures below.

Figure 11. Number of Users based on BMI and Charges Category
Figure 12. Number of Users based on BMI/Charges

Then, we calculate the probability that a user has BMI above (or below) 25 given that they receive charges above 16.7k, using conditional probability. Here is the summary of the calculations:

Figure 13. Conditional Probability of a User whose BMI above or below 25 Given the Charges above 16700 USD

From the calculations, the probability that a user has BMI above 25 given that they receive charges above 16,700 USD is 0.85, while the probability that a user has BMI below 25 given the same charges is 0.15. So if a user receives charges above 16,700 USD, it is more likely that the user has BMI above 25 (85%) than BMI below 25 (15%).

For the second question, we want to know which one is more likely to happen:
a. A user whose BMI below 25 will receive charges above 16.7k,
or
b. A user whose BMI below 25 will receive charges below 16.7k

Using the same data from Figure 12, we can calculate using conditional probability. Here is the summary of the calculation:

Figure 14. Conditional Probability of User who has Charges above or below 16700 USD given the user’s BMI below 25

From the calculation above, given that a user’s BMI is below 25, receiving charges below 16,700 USD is more likely (79%) than receiving charges above 16,700 USD (21%).

For the third question, we want to compare which one is more likely to happen between:
a. A smoker whose BMI above 25 receives charges above 16.7k,
or
b. A nonsmoker whose BMI above 25 receives charges above 16.7k

First, we summarize the number of users based on the variable we will be calculating:

Figure 15. Number of Users based on Smoker/BMI/Charge Category

Then, we calculate the probability that a smoker whose BMI is above 25 receives charges above 16.7k and the probability that a nonsmoker whose BMI is above 25 receives charges above 16.7k. Here is the summary of the calculations:

Figure 16. Conditional Probability of a User has Charges above or below 16700 USD given the User is Smoker and has BMI above 25

From the table above, a smoker whose BMI is above 25 is more likely to be charged above 16,700 USD than below it (the probability is 98%).

Figure 17. Conditional Probability of a User has Charges above 16700 USD given the User is a Nonsmoker and has BMI above 25

From the table above, a nonsmoker whose BMI is above 25 is more likely to be charged below 16,700 USD than above it (the probability is 92%).

Figure 18. Conditional Probability of a User has Charges above 16700 USD given the User is Smoker and has BMI above 25 or given the User is Nonsmoker and has BMI above 25

If we compare smokers and nonsmokers who both have BMI above 25, the smoker is far more likely to have charges above 16,700 USD (98%) than the nonsmoker (8%). We can infer that, within the same BMI category (above 25), smoker status has a very strong influence on whether the user’s charges exceed 16,700 USD.
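
The three comparisons in this section share one pattern: restrict the users to those who satisfy a condition, then take the share of them for which the event holds. A minimal sketch of that pattern follows; the `User` record, the helper names, and the sample rows are all hypothetical and are not the data behind Figures 15–18.

```python
from typing import Callable, List, NamedTuple


class User(NamedTuple):
    smoker: bool
    bmi: float
    charges: float


def conditional(users: List[User],
                event: Callable[[User], bool],
                given: Callable[[User], bool]) -> float:
    """Estimate p(event | given) as a relative frequency."""
    matching = [u for u in users if given(u)]
    if not matching:
        raise ValueError("no users satisfy the condition")
    return sum(1 for u in matching if event(u)) / len(matching)


CHARGE_THRESHOLD = 16700.0  # the 16.7k cut-off used in this section


def high_charges(u: User) -> bool:
    return u.charges > CHARGE_THRESHOLD


def smoker_high_bmi(u: User) -> bool:
    return u.smoker and u.bmi > 25


def nonsmoker_high_bmi(u: User) -> bool:
    return (not u.smoker) and u.bmi > 25


# Hypothetical users, NOT rows from the real dataset.
users = [
    User(True, 31.0, 40000.0),
    User(True, 27.5, 21000.0),
    User(True, 22.0, 15000.0),
    User(False, 33.0, 9000.0),
    User(False, 28.0, 18000.0),
    User(False, 26.0, 5000.0),
    User(False, 21.0, 4000.0),
]

p_smoker = conditional(users, high_charges, smoker_high_bmi)
p_nonsmoker = conditional(users, high_charges, nonsmoker_high_bmi)
print(p_smoker, p_nonsmoker)
```

Swapping the roles of `event` and `given` gives the first comparison of this section, the probability of a BMI above 25 given charges above 16.7k.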

Key Takeaways:

  1. Given that a user receives charges above 16.7k, it is more likely that the user has BMI above 25 (85%) than BMI below 25 (15%).
  2. Given that a user’s BMI is below 25, the user is more likely to receive charges below 16.7k (79%) than above 16.7k (21%).
  3. A smoker whose BMI is above 25 is more likely to receive charges above 16.7k (98%) than a nonsmoker whose BMI is above 25 (8%). This means that, for users with BMI above 25, smoker status has a very strong influence on whether their charges exceed 16.7k.

IV. Variable Correlation Analysis

Next, we want to know how BMI, age, and number of children correlate with charges. Below are the correlation graphs and correlation values of these variables with charges.

Figure 19. Correlation Graph and Value between BMI and Charges
Figure 20. Correlation Graph and Value between Age and Charges
Figure 21. Correlation Graph and Value between Number of Children and Charges
Figure 22. Summary Table of Correlation Value

From the results above, we can conclude that there is a positive correlation between BMI and charges, age and charges, and number of children and charges. A positive correlation means that a change in the magnitude of one variable is associated with a change in the magnitude of the other variable in the same direction (when one increases, the other tends to increase, and when one decreases, the other tends to decrease). In other words, as BMI, age, or number of children increases, charges tend to increase as well.

Figure 23. The Interpretation of Correlation Coefficient (Source: https://journals.lww.com/anesthesia-analgesia/fulltext/2018/05000/correlation_coefficients__appropriate_use_and.50.aspx)

However, although the correlation is positive, according to the reference, the correlation between BMI and charges and between age and charges is weak, whereas the correlation between number of children and charges is negligible. This means that, although charges tend to rise as each variable rises, the relationship is weak and unreliable: an increase in one variable is not consistently accompanied by a proportional increase in charges.
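
As a rough sketch, the Pearson correlation coefficient and a verbal label for its strength can be computed as follows. The cut-offs in `strength` (0.10, 0.40, 0.70, 0.90) are an assumption based on a commonly quoted version of the interpretation table in the cited reference; they should be checked against Figure 23 before being relied on. The function names are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def strength(r: float) -> str:
    """Verbal label for |r|; the thresholds are assumed, see the note above."""
    magnitude = abs(r)
    if magnitude < 0.10:
        return "negligible"
    if magnitude < 0.40:
        return "weak"
    if magnitude < 0.70:
        return "moderate"
    if magnitude < 0.90:
        return "strong"
    return "very strong"


# Correlation values reported in this article.
for name, r in (("BMI", 0.198), ("age", 0.298), ("children", 0.067)):
    print(name, r, strength(r))
```

Under these assumed thresholds, the labels agree with the interpretation above: weak for BMI and age, negligible for number of children.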

Next, we want to analyze how charges are distributed across the smoker, sex, BMI, and region categories.

Figure 24. Charge Distribution based on Smoker Category

Based on Figure 24, we can compare the charge distribution by smoker category. Smokers have a higher median, meaning that the charges received by users who smoke are typically higher than those of nonsmokers. Smokers also have a wider spread, meaning that the range of charge values is wider. Both categories have right-skewed distributions. In the nonsmoker category, several data points appear as outliers, lying above the upper whisker (more than 1.5 times the interquartile range above the third quartile).
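
A common box-plot convention flags a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The sketch below applies that rule. The values are hypothetical, and the quartile method used here (`method="inclusive"`) is an assumption that may not match the exact quartile definition Tableau uses.

```python
from statistics import quantiles
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical charges in USD, NOT rows from the real dataset.
charges = [2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 40000]
print(iqr_outliers(charges))
```

In a right-skewed distribution such as the charges here, this rule typically flags points only on the high side, which matches what the box plot shows for nonsmokers.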

Figure 25. Charge Distribution based on Sex Category

Based on Figure 25, we can compare the charge distribution by sex category. Both the female and male categories have right-skewed distributions. The medians of the two categories are similar, with male users having a wider spread of charges.

Figure 26. Charge Distribution based on Region

Based on Figure 26, we can see that every region has a right-skewed distribution. The median is similar across regions. The spread is also similar, except in the Southeast region, where it is the widest.

Key takeaways:

1. The correlation between BMI and charges is 0.198, between age and charges is 0.298, and between number of children and charges is 0.067. The correlations are positive, but the values are so low that they are considered weak (for BMI and charges, and for age and charges) or even negligible (for children and charges).
2. The distribution of charges differs across the smoker, sex, BMI, and region categories. Smokers have a higher median and a higher variance than nonsmokers. Male users have a similar median to female users but a slightly higher variance. The regions have similar medians, but the Southeast region has the highest variance.

V. Hypothesis Testing

Hypothesis testing in statistics is a way for you to test the results of a survey or experiment to see if you have meaningful results. You’re basically testing whether your results are valid by figuring out the odds that your results have happened by chance. If your results may have happened by chance, the experiment won’t be repeatable and so has little use.

In this section, we want to test 3 hypotheses.

First, we want to analyze whether or not smoker users have higher charges than nonsmoker.

Step one, we state the H0 and H1.

H0: μ smoker ≤ μ nonsmoker
H1: μ smoker > μ nonsmoker

Then, using the Data Analysis Tools in Excel, we compare the charges of smokers and nonsmokers. The test we use is the t-test, and since we already know that the variances of the smoker and nonsmoker groups differ, we chose t-Test: Two-Sample Assuming Unequal Variances. Since we are testing whether one mean is greater than the other, we use a one-tailed test. The result table is shown below.

Figure 27. Hypothesis Testing for Case 1

From the table above, we can see that the p-value (2.79E-103) is less than 0.05. Therefore, we reject the null hypothesis: there is sufficient evidence to state that the average charge of smokers is higher than that of nonsmokers.
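
The two-sample t-test assuming unequal variances is commonly known as Welch's t-test. As a sketch of what such a test computes, the code below derives the test statistic and the Welch–Satterthwaite degrees of freedom from two samples. The samples are hypothetical, the function name is made up for illustration, and this is not a reproduction of Excel's internal implementation.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(sample_a: Sequence[float], sample_b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for Welch's t-test.

    A positive t means the mean of sample_a is larger than the mean of sample_b.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two values")
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


# Hypothetical charges in USD, NOT rows from the real dataset.
smoker_charges = [30000.0, 42000.0, 24000.0, 36000.0]
nonsmoker_charges = [8000.0, 9500.0, 6500.0, 7000.0, 9000.0]

t_stat, df = welch_t(smoker_charges, nonsmoker_charges)
print(t_stat, df)
```

The one-tailed p-value is the upper-tail probability of a t distribution with `df` degrees of freedom at `t_stat`. The Python standard library has no t distribution, so this step needs a statistics package; in SciPy, for instance, `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")` runs the whole test.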

Second, we want to analyze whether or not users whose BMI above 25 have higher charges than users whose BMI below 25.

We state the H0 and H1.

H0: μ BMI>25 ≤ μ BMI≤25
H1: μ BMI>25 > μ BMI≤25

Using the same method as in the previous case, we obtain the result table shown below.

Figure 28. Hypothesis Testing for Case 2

From the table above, we can see that the p-value (4.3245E-272) is also less than 0.05. Therefore, we reject the null hypothesis: there is sufficient evidence to state that users whose BMI is above 25 have higher average charges than users whose BMI is below 25.

Last, we want to analyze whether or not male users have higher charges than female.

We state the H0 and H1.

H0: μ male ≤ μ female
H1: μ male > μ female

Using the same method as in the previous cases, we obtain the result table shown below.

Figure 29. Hypothesis Testing for Case 3

From the table above, we can see that the p-value (0.0358) is also less than 0.05. Therefore, we reject the null hypothesis: there is sufficient evidence to state that male users have higher average charges than female users.

Key takeaways:

From the hypothesis testing using Data Analysis Tool in Excel and by using one-tail t-test assuming unequal variances, we can conclude:
1. Smoker users have higher charges than nonsmoker
2. Users whose BMI above 25 have higher charges than users whose BMI below 25
3. Male users have higher charges than female

CONCLUSION

Here are the conclusions we draw from the analysis:

I. Descriptive Statistic Analysis
1. The average age of users is 39 years.
2. There is little difference in average age between the female and male categories; both are around 38–39 years old.
3. The average BMI of users who smoke is 30.71.
4. Comparing male and female users, male users have a slightly higher average BMI (30.94) than female users (30.38).
5. Comparing smokers and nonsmokers, smokers have a slightly higher average BMI (30.71) than nonsmokers (30.65).
6. The variance of charges is not the same for smokers and nonsmokers: smokers have a higher standard deviation (11,542 USD) than nonsmokers (5,994 USD).
7. Comparing smokers and nonsmokers, smokers have a higher average charge (32,050 USD) than nonsmokers (8,434 USD).
8. Comparing smokers and nonsmokers whose BMI is above 25, smokers have a higher average charge (35,117 USD), about 4.1 times that of nonsmokers (8,630 USD).

II. Categorical Variable Analysis
1. Nonsmokers make up a higher proportion of users (80%) than smokers (20%).
2. Male users have a higher probability of paying higher charges.
3. The probability that a user is female given that the user is a smoker is 42%.
4. The probability that a user is male given that the user is a smoker is 58%.
5. In the probability distribution of charges by region, Southeast is the highest (28%), followed by Northeast (25%), and Northwest and Southwest (23% each).
6. The regions have roughly equal proportions of users, except Southeast, which has the highest proportion (27%).

III. Continuous Variable Analysis

  1. Given that a user receives charges above 16.7k, it is more likely that the user has BMI above 25 (85%) than BMI below 25 (15%).
  2. Given that a user’s BMI is below 25, the user is more likely to receive charges below 16.7k (79%) than above 16.7k (21%).
  3. A smoker whose BMI is above 25 is more likely to receive charges above 16.7k (98%) than a nonsmoker whose BMI is above 25 (8%). This means that, for users with BMI above 25, smoker status has a very strong influence on whether their charges exceed 16.7k.

IV. Variable Correlation Analysis

  1. The correlation between BMI and charges is 0.198, between age and charges is 0.298, and between number of children and charges is 0.067. The correlations are positive, but the values are so low that they are considered weak (for BMI and charges, and for age and charges) or even negligible (for children and charges).
  2. The distribution of charges differs across the smoker, sex, BMI, and region categories. Smokers have a higher median and a higher variance than nonsmokers. Male users have a similar median to female users but a slightly higher variance. The regions have similar medians, but the Southeast region has the highest variance.

V. Hypothesis Testing

From the hypothesis testing using Data Analysis Tool in Excel and by using one-tail t-test assuming unequal variances, we can conclude:
1. Smoker users have higher charges than nonsmoker
2. Users whose BMI above 25 have higher charges than users whose BMI below 25
3. Male users have higher charges than female

FURTHER RESEARCH

Further research can be conducted to understand the relationship between charges and other variables that are not included here but could also be important to determine the users’ charges, such as pre-existing illness, medical history, users’ lifestyle, and other similar variables.

REFERENCE

https://statisticsbyjim.com/basics/skewed-distribution/

For this case study, you can access the Excel File here and explore the Tableau dashboard here.

Let’s connect!

LinkedIn: Kinanti Nabilah
