Predicting Diabetic Patient Resubmission

Rishab Das
Published in The Deep Hub · Apr 19, 2024

The goal of this project is to predict the return (resubmission) of diabetic patients to hospitals. Predicting patient return is extremely helpful in saving lives, and in saving time and money for hospitals around the world. Diabetes is one of the most prominent diseases in the world today and has been for the last century. Diabetic patients have a different set of problems than other patients, requiring more tests and diagnoses, which results in higher spending and more time (time that could be predicted) spent on each patient. The lives of diabetic patients, and of the doctors who treat them, could improve drastically if both knew which treatments led to a lower probability of return.

The dataset I am using is Diabetes 130-US hospitals for years 1999–2008, found on Kaggle.com. The dataset represents 10 years (1999–2008) of clinical care at 130 US hospitals and integrated delivery networks. The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnoses, number of medications, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc. This is the original link for the dataset.

With a large amount of data, there are a few steps to complete:

  • Clean the dataset
  • Reduce the amount of data in the dataset
  • Find an efficient algorithm to predict patient resubmission

Cleaning The Dataset / Preprocessing

Here are the steps to complete:

  • Load in the dataset and display it
import pandas as pd

df = pd.read_csv("diabetic_data.csv")
df.head()
  • Find out any null values
df.isnull().sum()
Null Values in these columns

If we display the head of the dataset, we see that every row is missing weight. We should just drop it; there is no way for us to engineer the weight feature. The null counts above show that max_glu_serum and A1Cresult are also mostly empty, so we drop those two columns as well.

Weight Column has no values
df.drop(["max_glu_serum","A1Cresult", "weight"], axis = 1, inplace = True)

Now that we have a clean dataset, it is time for us to make the data usable. Looking at the data, there are a lot of categorical variables. They are also strings, and we need to change these strings into numbers so that our ML algorithm can work with the dataset. For this, we will use the LabelEncoder from sklearn, which simply encodes the categorical variables with numbers. But in order to do so, we need to find which columns are categorical.

# Categorical Variables have a dtype of "object", so we just find if the dtype of the object is equal to "object"
categorical_cols = [cname for cname in df.columns if df[cname].dtype == "object"]
print(categorical_cols)
Categorical Columns (not all of them)

Now that we have all the categorical columns, we can just apply the LabelEncoder to each column using a for loop.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for col in categorical_cols:
    df[col] = le.fit_transform(df[col])
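One caveat with reusing a single encoder: after the loop, le only remembers the mapping for the last column it encoded. If you want to be able to reverse the encoding later, a small variation (run instead of the loop above, not after it) keeps one fitted encoder per column:

encoders = {}  # one fitted LabelEncoder per column, so mappings can be reversed
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# e.g. encoders["race"].inverse_transform(df["race"]) recovers the original strings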

We can see that the variables are now usable by an ML algorithm.

Newly Encoded Columns
Previously Not Encoded Columns

We have a lot of data and we can reduce it. We want to keep the most important features and drop the features that don't affect anything at all; taking one look at the dataset, we can see some columns that really obviously stick out.

Columns that don’t help

These columns don’t help us whatsoever. So, instead of keeping them, we should just delete them. Erase them from existence.

df.drop(["encounter_id", "patient_nbr", "admission_type_id", "discharge_disposition_id", "admission_source_id"], axis = 1, inplace = True)

Now the data looks like this. This is our final dataset, the one we will run an ML algorithm on.

The Final Dataset

Now we will apply our machine learning algorithm of choice. But before we do, we should find out which features are the most important; we will take the 5 most important ones. How do we do this? We will use LightGBM's feature importance plot, a super helpful and super simple way to figure out the most important features in the dataset. But how does this work? Before I explain, I will show you the code to implement it. We will use the LGBMClassifier, since this is a classification problem.

import lightgbm as lgbm
import matplotlib.pyplot as plt

X = df.drop("readmitted", axis = 1)
y = df["readmitted"]

# A classifier for a classification problem (LGBMClassifier, not LGBMRegressor)
model = lgbm.LGBMClassifier(n_estimators = 500)

model.fit(X, y)

lgbm.plot_importance(model, importance_type="gain")
plt.show()
Feature importance plot

So how does this work, and what does importance_type="gain" mean?

In simple terms, LightGBM calculates feature importance by examining how much each feature contributes to reducing the error when making predictions. It does this by repeatedly splitting the data on different features and evaluating how much each split improves the model's performance. Features that lead to the largest improvements in the model's accuracy or predictive power are considered more important.

What does importance_type="gain" mean? What does n_estimators=500 mean? In LightGBM, importance_type="gain" refers to the method used to calculate feature importance: it scores each feature by the total gain it contributes over all the splits in the trees of the model. Regarding n_estimators=500, this parameter determines the number of decision trees (estimators) used in the LightGBM model. Each tree contributes to the final prediction, and having more trees generally leads to a more accurate model, up to a certain point. So, setting n_estimators=500 means the model will consist of 500 decision trees. Increasing the number of trees further can lead to diminishing returns and an increase in the amount of computing power required.
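If you prefer numbers to a plot, the same gain-based importances can also be pulled straight out of the fitted model (a small sketch using the model trained above):

importances = pd.Series(
    model.booster_.feature_importance(importance_type="gain"),
    index=model.booster_.feature_name(),
).sort_values(ascending=False)  # total gain per feature, largest first
print(importances.head(5))  # the five features with the highest total gain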

We can see from the feature importance plot that the 5 most important features, the ones we are going to use, are number_inpatient, diag_1, num_lab_procedures, diag_2, and diag_3. The number_inpatient feature is the number of inpatient visits the patient made in the year before the hospitalization. It makes sense that this is by far the most important feature: the number of visits over the last year is the biggest predictor of the future.

Machine Learning Algorithm

We will now select our machine learning algorithm. But first we have a few more steps to prepare our data:

  • Selecting those features and saving them in our X variable
X = df[["number_inpatient", "diag_1", "num_lab_procedures", "diag_2", "diag_3"]]
y = df["readmitted"]  # a 1-D Series, so sklearn doesn't warn about column-vector targets
  • Split into training and testing

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Our selected machine learning algorithm is the Gradient Boosting Machine (GBM). GBM is an ensemble learning method built on multiple weak models, usually decision trees. The models are built sequentially, meaning step by step: the model observes the mistakes it has made so far, works out how to fix them, and then tries again. In GBM, each weak learner (a decision tree) corrects the errors of the combined ensemble of previous weak learners, as the sketch below shows.
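To make the sequential idea concrete, here is a minimal toy sketch of boosting (a regression flavour for clarity, with a hypothetical helper name; it is not the exact classifier we use below):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_boost(X, y, n_rounds=50, lr=0.1):
    pred = np.full(len(y), y.mean())        # start from the mean prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                # errors of the ensemble so far
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        pred = pred + lr * tree.predict(X)  # nudge predictions toward the truth
        trees.append(tree)
    return trees, pred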

Some of the benefits of GBM include:

  • Building weak learners (the multiple weak learners build one strong ensemble learner)
  • Fitting the residuals (the model identifies the errors made by the weak learners and focuses on reducing these errors)
  • Sequential learning (it learns step by step, fixing its mistakes from the previous steps)
  • Good with complex data relationships
  • Regularization knobs (such as the learning rate) that help keep overfitting in check
  • Can handle missing data (notably the histogram-based variant we try later)

Some of the negatives of GBM include:

  • Can require high computational costs
  • Requires careful hyper-parameter tuning to avoid overfitting

Let us apply this ensemble method to our data.

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=500)
model.fit(X_train, y_train)

This training took a full minute, and the metrics measuring how well the model performed were not good.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)

print(classification_report(y_test, y_pred))
Performance Metrics

So how do we make this better? We should play with the hyperparameters of the GBM model. We will change the maximum depth of the trees to just 6 and the learning rate to 0.01.

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(n_estimators=500, max_depth=6, learning_rate=0.01)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the model:", accuracy)

We still get a low accuracy:

Accuracy Results
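Before switching models, one option for squeezing more out of GBM is a small grid search instead of hand-picked values. I did not run this; on a dataset this size it is slow, and the grid below is an illustrative sketch rather than a set of tuned values:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [3, 6],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 500],
}

# Cross-validates every combination in the grid and keeps the best one
search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)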

We will now try two different gradient boosting models: the histogram-based gradient booster and the AdaBoost booster. Both rest on the same underlying concepts; they just apply them in different ways.

from sklearn.ensemble import HistGradientBoostingClassifier

X = df[["number_inpatient", "diag_1", "num_lab_procedures", "diag_2", "diag_3"]]
y = df["readmitted"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = HistGradientBoostingClassifier(max_depth=6, learning_rate=0.01)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the model:", accuracy)
from sklearn.ensemble import AdaBoostClassifier

X = df[["number_inpatient", "diag_1", "num_lab_procedures", "diag_2", "diag_3"]]
y = df["readmitted"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = AdaBoostClassifier(n_estimators=500, learning_rate=0.01)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the model:", accuracy)

The accuracy is still not good for either model; both are still hovering around the 50% mark. Why? This dataset is very dense: even though the features we chose have the most influence on the result, the million other factors that contribute to a patient returning are unpredictable. For now, I will end this project with the understanding that while predicting patient resubmission is very hard, it is still possible. There are many, many other algorithms out there to try, and the implementation is exactly the same; while I have not tested them, I have a feeling they would yield similar results.
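One quick sanity check that puts the ~50% figure in context (an addition of mine, using the split from above): the readmitted target actually has three classes (no readmission, readmission within 30 days, readmission after 30 days), so comparing against a classifier that always predicts the most frequent class tells us how much the models actually learned.

from sklearn.dummy import DummyClassifier

# Accuracy achieved by always predicting the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("Baseline accuracy:", baseline.score(X_test, y_test))

print(y.value_counts(normalize=True))  # class distribution of the target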

This project has a moral question too. Do people want to trust an algorithm at all? Do they want to trust it with the possibility of returning to a hospital or not? Do they want their doctors to change treatment based on the results predicted by the algorithm?

But alas, while we did not find the results we wanted, the amount of fun I had diving into the world of patient resubmission and data analysis was unmatched. Thank you!!!
