Predictive Analysis on Discharge Cost of Patients in R

In this article, we will look at how to predict the discharge cost of a patient from Age, Gender, and Length of Stay using multiple regression analysis in RStudio.

Shruti Patkar
R-evolution
5 min read · Jan 14, 2023

Photo on Philips.com

For the analysis, I have taken a dataset from Kaggle based on hospital records of inpatients in the state of Wisconsin, USA. The dataset contains 6 variables: Age, Gender, Race of the patient, Length of stay, Discharge cost, and Diagnosis-related group.

The link to the dataset is given here

In this dataset, gender is coded as a binary value: 0 for Male and 1 for Female. For easier reading, I have replaced 0 with ‘Male’ and 1 with ‘Female’ using the Replace function in Excel. The dataset is now ready to be imported into R.
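The same recoding could also be done in R instead of Excel, which keeps the step reproducible. The following is only a sketch: it assumes the column is called GENDER and still holds the original 0/1 codes, and the toy data frame is made up for illustration.

```r
# Toy data standing in for the imported dataset (values are made up)
toy <- data.frame(GENDER = c(0, 1, 1, 0, 1))

# Map 0 -> "Male" and 1 -> "Female", following the coding described above.
# "Female" is listed first so that it becomes the reference level,
# which is what lm() also picks by default for text columns (alphabetical order).
toy$GENDER <- factor(toy$GENDER, levels = c(1, 0), labels = c("Female", "Male"))

table(toy$GENDER)
```

Storing the column as a factor also makes the reference category explicit, which matters later when reading the regression coefficients.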

library(readr)
discharge_cost <- read_csv("discharge cost.csv")
View(discharge_cost)

In the above code, I have loaded the readr package to read the dataset, saved it as discharge_cost, and used the View() function to view it.

After importing the dataset, we will run the following code to understand the dataset and the structure of its variables.

str(discharge_cost)
summary(discharge_cost)
table_gender <- table(discharge_cost$GENDER)
table_gender

Here I have used the str() function to inspect the structure of the data and the summary() function to summarize each variable. I have then created an object, table_gender, to count the Female and Male patients.

After running the above code, we can say the following things about the dataset:

· Patient ages range from 0 to 17 years

· There are 255 female patients and 244 male patients

· Race of the patient is coded as a numeric value from 1 to 6

· The minimum discharge cost is 532 and the maximum discharge cost is 48388

· The minimum LOS is 0 days and the maximum is 41 days
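Figures like the ones above can also be pulled out programmatically rather than read off the summary() printout. This is a rough sketch on simulated data; the column names follow the article, but the values are invented and will not match the real dataset.

```r
# Simulated stand-in for discharge_cost (values are made up)
set.seed(1)
toy <- data.frame(
  AGE    = sample(0:17, 20, replace = TRUE),
  GENDER = sample(c("Female", "Male"), 20, replace = TRUE),
  LOS    = sample(0:10, 20, replace = TRUE),
  TOTCHG = round(runif(20, 500, 20000))
)

# Minimum and maximum of each numeric column
ranges <- sapply(toy[, c("AGE", "LOS", "TOTCHG")], range)
rownames(ranges) <- c("min", "max")
ranges

# Number of missing values per column, worth checking before fitting a model
missing_counts <- colSums(is.na(toy))
missing_counts
```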

Now that we understand the structure of the data, we will fit a multiple linear regression model to the dataset:

model1 <- lm(TOTCHG ~ AGE + GENDER + LOS, data = discharge_cost)
summary(model1)

I have used the lm() function to create a linear regression model in which TOTCHG (discharge cost) is the response variable and Age, Gender, and LOS are the predictor variables.
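To see how lm() turns a text column such as GENDER into something it can fit, it can help to look at the design matrix R builds from the formula. A small sketch with made-up rows (only the column names come from the article):

```r
# Made-up rows using the article's column names
toy <- data.frame(
  AGE    = c(2, 10, 17),
  GENDER = c("Female", "Male", "Male"),
  LOS    = c(1, 3, 7)
)

# model.matrix() shows the numeric matrix built from the right-hand side of the formula.
# GENDER becomes a 0/1 column named GENDERMale; "Female" is the reference level.
X <- model.matrix(~ AGE + GENDER + LOS, data = toy)
X
```

This is why the summary output lists a coefficient called GENDERMale rather than GENDER: it measures the difference between male patients and the female reference group.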

Call:
lm(formula = TOTCHG ~ AGE + GENDER + LOS, data = discharge_cost)

Residuals:
   Min     1Q Median     3Q    Max
 -4363  -1115   -637    142  41639

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -411.60     258.13  -1.595 0.111456
AGE           115.57      19.52   5.921    6e-09 ***
GENDERMale   1022.15     270.77   3.775 0.000179 ***
LOS           742.29      39.20  18.935  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2934 on 495 degrees of freedom
Multiple R-squared: 0.435, Adjusted R-squared: 0.4316
F-statistic: 127.1 on 3 and 495 DF, p-value: < 2.2e-16

From the above summary, we can understand that:

· The p-value of the F-statistic is <2.2e-16, which is less than 0.05, and the t-values of Age, Gender and LOS are all far from 0, which means that there is a significant relationship between the response variable (discharge cost) and the predictor variables (Age, Gender and LOS)

· The Multiple R-squared value is 0.435, which means that only about 43% of the variation in discharge cost is explained by the predictor variables

· The residual standard error of 2934 $ indicates how far the actual discharge cost typically falls from the predicted discharge cost

· Each additional year of age is associated with an increase of 115.57 $ in discharge cost, holding Gender and LOS constant

· A male patient is predicted to cost 1022.15 $ more than a female patient of the same age and LOS

· Each additional day of LOS is associated with an increase of 742.29 $ in discharge cost, holding Age and Gender constant
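The R-squared and residual standard error reported by summary() can be reproduced by hand, which makes their meaning more concrete. The sketch below uses simulated data, so its numbers will not match the output above; the coefficients used to generate the data are arbitrary assumptions.

```r
# Simulated data with the article's column names (all values are made up)
set.seed(42)
n <- 200
sim <- data.frame(
  AGE    = sample(0:17, n, replace = TRUE),
  GENDER = sample(c("Female", "Male"), n, replace = TRUE),
  LOS    = rpois(n, 3)
)
sim$TOTCHG <- 500 + 100 * sim$AGE + 1000 * (sim$GENDER == "Male") +
  700 * sim$LOS + rnorm(n, sd = 3000)

fit <- lm(TOTCHG ~ AGE + GENDER + LOS, data = sim)

rss <- sum(residuals(fit)^2)                    # variation left unexplained by the model
tss <- sum((sim$TOTCHG - mean(sim$TOTCHG))^2)   # total variation in the response

r_squared <- 1 - rss / tss                      # share of variation explained
rse <- sqrt(rss / df.residual(fit))             # typical size of a residual

# Compare the hand-computed values with the ones summary() reports
c(by_hand = r_squared, from_summary = summary(fit)$r.squared)
c(by_hand = rse, from_summary = summary(fit)$sigma)
```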

Now we will use the predict() function to predict the discharge cost of a patient:

predict(model1)

When run without new data, predict() returns the fitted value for each row of the dataset.

Now we will predict the values by giving specific input values.

Here we want our model to predict the discharge cost where Age is 17, the Gender of the Patient is Male and the Length of Stay is 7 days.

Let’s see the result:

predict(model1, newdata = data.frame(AGE = 17, GENDER = "Male", LOS = 7))
       1
7771.362

The model predicts a discharge cost of about 7771 $.
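The prediction is simply the regression equation evaluated at the given inputs. As a check, the same figure can be worked out by hand from the coefficients printed in the summary (rounded to two decimals there, so a small gap is expected):

```r
# Coefficients copied from the summary output above (rounded to two decimals)
intercept <- -411.60
b_age     <- 115.57
b_male    <- 1022.15
b_los     <- 742.29

# Age 17, male (dummy variable = 1), 7 days in hospital
by_hand <- intercept + b_age * 17 + b_male * 1 + b_los * 7
by_hand
```

The result is within a dollar of the 7771.362 returned by predict(); the difference comes from the rounding of the printed coefficients.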

Now, to cross-check the predicted values against the real values in the dataset, we will add the predicted values to our dataset as a new column.

# Store the fitted values as a plain numeric column next to the actual costs
discharge_cost$predicted <- predict(model1)

If we compare the predicted values with the real values, we can see large differences for many patients. As we saw in the summary of the lm() model, the model explains only about 43% of the variation in discharge cost, and the residual standard error of 2934 $ is also high.
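One way to make this comparison concrete is to compute the error for each row and summarize it in a single number. A minimal sketch on simulated data (the column names error and mae are my own, and the values are invented):

```r
# Simulated data (made up); only the column names AGE, LOS, TOTCHG follow the article
set.seed(7)
n <- 100
sim <- data.frame(AGE = sample(0:17, n, replace = TRUE), LOS = rpois(n, 3))
sim$TOTCHG <- 500 + 100 * sim$AGE + 700 * sim$LOS + rnorm(n, sd = 2000)

fit <- lm(TOTCHG ~ AGE + LOS, data = sim)

# Actual value, predicted value, and the gap between them for every row
sim$predicted <- as.numeric(predict(fit))
sim$error     <- sim$TOTCHG - sim$predicted
head(sim[, c("TOTCHG", "predicted", "error")])

# Mean absolute error: the average size of the gap, in the same unit as the cost
mae <- mean(abs(sim$error))
mae
```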

To understand the reason behind the difference between real values and predicted values, we will visualize the data in R using ggplot2. To do so, we will first install the package and load it using the following code.

install.packages("ggplot2")
library(ggplot2)

Now we will draw some visualizations:

ggplot(data = discharge_cost, aes(x = AGE, y = TOTCHG, col = GENDER)) +
  geom_point()

Discharge cost vs Age

ggplot(data = discharge_cost, aes(x = LOS, y = TOTCHG, col = GENDER)) +
  geom_point()

Discharge cost vs LOS

hist(discharge_cost$AGE)

Histogram explaining frequency of Age

hist(discharge_cost$LOS)

Histogram explaining frequency of LOS

hist(discharge_cost$TOTCHG)

Histogram explaining frequency of discharge cost

From the above plots, we can see that although the predictors are statistically significant, the data points do not follow a straight line, which is why the linear model explains less than half of the variation in discharge cost.
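A residuals-versus-fitted plot is another common way to check whether a straight line is the right shape for the data. The sketch below uses base R graphics and simulated data in which cost deliberately grows faster than linearly with LOS; that shape is an assumption for illustration, not a finding from the real dataset.

```r
# Simulated data where cost curves upward with LOS (made up)
set.seed(3)
n <- 150
sim <- data.frame(LOS = rpois(n, 3))
sim$TOTCHG <- 500 + 150 * sim$LOS^2 + rnorm(n, sd = 500)

fit <- lm(TOTCHG ~ LOS, data = sim)

# If the linear model fits well, the points scatter evenly around the dashed line.
# A curved band of points suggests the relationship is not linear.
plot(fitted(fit), residuals(fit),
     xlab = "Fitted discharge cost", ylab = "Residual")
abline(h = 0, lty = 2)
```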

So here we can conclude that:

· To predict the discharge cost more accurately with a linear model, the relationship between the predictors and the discharge cost needs to be approximately linear

· If the given variables do not show a linear relationship, we need to look for other parameters to predict the discharge cost of a patient more effectively
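Whether an extra parameter actually helps can be checked by fitting a second model and comparing the adjusted R-squared of the two. This is only a sketch on simulated data: the DRG column and its three groups are hypothetical stand-ins for a diagnosis-related grouping, and the effect sizes are invented.

```r
# Simulated data (made up); DRG is a hypothetical grouping column
set.seed(11)
n <- 300
sim <- data.frame(
  AGE = sample(0:17, n, replace = TRUE),
  LOS = rpois(n, 3),
  DRG = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
drg_effect <- c(A = 0, B = 3000, C = 8000)   # invented cost differences between groups
sim$TOTCHG <- 500 + 100 * sim$AGE + 700 * sim$LOS +
  unname(drg_effect[as.character(sim$DRG)]) + rnorm(n, sd = 1500)

base_fit     <- lm(TOTCHG ~ AGE + LOS, data = sim)
extended_fit <- lm(TOTCHG ~ AGE + LOS + DRG, data = sim)

# Adjusted R-squared penalizes extra predictors, so a rise suggests real added value
c(base     = summary(base_fit)$adj.r.squared,
  extended = summary(extended_fit)$adj.r.squared)
```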

The complete code for the analysis is available on my GitHub profile. The link to the profile is here

As this was my first project on Predictive Analysis, any suggestions and recommendations are always welcome :)

Thank You
