Published in Analytics Vidhya

Building Machine Learning model to predict if the patient will be readmitted within 30 days

In the USA, around 10% of patients are readmitted to the hospital within 30 days of being discharged. How good would it be if we knew beforehand whether we are going to be one of those 10%? ;)

In this blog, I explain the machine learning model I built to predict exactly that, i.e. given the medical history of a patient, will the patient need readmission or not?

Table of contents :

  1. Machine Learning Problem Formulation

— 1.1 Introduction
— 1.2 Business Problem
— 1.3 Business Constraint
— 1.4 Data Set analysis
— 1.5 Performance Metrics

2. Data Cleaning and Preprocessing

— 2.1 Data Cleaning
— 2.2 Data Preprocessing

3. Exploratory Data Analysis

— 3.1 Univariate analysis of A1C Test Results
— 3.2 Univariate analysis of number of lab procedures
— 3.3 Univariate analysis of number of procedures
— 3.4 Univariate analysis of number of inpatient history
— 3.5 Univariate analysis of number of medications prescribed
— 3.6 Univariate analysis of age
— 3.7 Univariate analysis of gender
— 3.8 Bivariate analysis of age and time spent in hospital
— 3.9 Bivariate analysis of gender and number of diagnosis
— 3.10 Bivariate analysis of race and number of diagnosis
— 3.11 Bivariate analysis of age and gender

4. Feature Engineering

— 4.1 Checking for Data imbalance
— 4.2 Forward feature selection
— 4.3 Implementing SMOTE for oversampling to compensate for data imbalance

5. Implementing ML algorithms and comparing their performance
— 5.1 Implementing Boosting algorithms
— 5.2 Creating custom ensemble model
— 5.3 Comparing all ML models

6. Final model training and saving the same
— 6.1 Model Deployment
— 6.2 Further Scope for Improvement
— 6.3 References

1. Machine Learning Problem Formulation

1.1 Introduction :

Using data collected from 130 hospitals across the USA over a period of 10 years (1999–2008), I tried to build a predictive ML model that estimates how likely a patient is to be readmitted to the hospital. For making this prediction, I have taken into consideration factors such as the number of inpatient visits, number of diagnoses, number of emergency admissions and similar attributes of the patient’s medical history. In addition, I have also considered the medical tests conducted, their results, and the drugs prescribed. Using this data, I was able to build an ML model that predicts whether the patient will be readmitted within 30 days of discharge.

1.2 Business Problem :

As a part of the Centers for Medicare & Medicaid Services (CMS) Hospital Readmissions Reduction Program (HRRP), hospitals are reimbursed for the medical care they provide to patients, taking into consideration hospital readmission rates. That is, hospitals with readmission rates higher than expected (the average) are penalized by cutting down the financial reimbursement given to them.

But for such penalized hospitals, maintaining a low readmission rate is a challenge, as there is no way to determine which patients will be readmitted and which will not. Hospitals penalized under HRRP therefore need a predictive model that would help them determine which patients are likely to be readmitted so that they can provide additional healthcare facilities and modified prescriptions to prevent readmission and thus maintain a low readmission rate. In this case study, using data science, we will create a model that does exactly that, i.e. predicts the likelihood of a patient getting readmitted to the hospital.

1.3 Business Constraints :

  1. Interpretability : The model may need human intervention or manual review in some cases, as the business problem is from the medical domain, where tolerance for error is almost zero.
  2. Class probabilities are needed : Class probabilities will enable us to determine how well the model is able to distinguish between patients who need readmission and those who do not.
  3. No latency requirements : As the model will be run offline and will have ample time to execute and provide results, extremely low latency is not required.

Objective : To determine the probability of a patient getting readmitted to the hospital within the first 30 days of discharge so that doctors can take extra care of the patient and prevent readmission.

1.4 Data set analysis :

Data collected by : Center for Clinical and Translational Research, Virginia Commonwealth University

Description :

Source : This dataset is available here.

The dataset represents 10 years (1999–2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.

(1) It is an inpatient encounter (a hospital admission).

(2) It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.

(3) The length of stay was at least 1 day and at most 14 days.

(4) Laboratory tests were performed during the encounter.

(5) Medications were administered during the encounter.

The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical speciality of the admitting physician, number of lab tests performed, HbA1c test result, diagnoses, number of medications, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.

Features information :

Encounter ID: Unique identifier of an encounter

Patient number: Unique identifier of a patient

Race Values: Caucasian, Asian, African American, Hispanic, and other

Gender Values: male, female, and unknown/invalid

Age: Grouped in 10-year intervals: [0, 10), [10, 20), …, [90, 100)

Weight: Weight in pounds grouped in 25-pound intervals [0–25), [25–50),…,>200

Admission type: Integer identifier corresponding to 8 distinct values, for example, emergency, urgent, elective, newborn, and not available

Discharge disposition: Integer identifier corresponding to 26 distinct values, for example, discharged to home, expired, and not available

Admission source: Integer identifier corresponding to 17 distinct values, for example, physician referral, emergency room, and transfer from a hospital

Time in hospital: Integer number of days between admission and discharge ranging from [1–14]

Payer code : Integer identifier corresponding to 18 distinct values, for example, Blue Cross/Blue Shield, Medicare, and self-pay Medical

Medical speciality: Integer identifier of a speciality of the admitting physician, corresponding to 73 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon

Number of lab procedures: Number of lab tests performed during the encounter

Number of procedures: Number of procedures (other than lab tests) performed during the encounter, ranging in [0–6]

Number of medications: Number of distinct generic names administered during the encounter

Number of outpatient visits: Number of outpatient visits of the patient in the year preceding the encounter

Number of emergency visits: Number of emergency visits of the patient in the year preceding the encounter

Number of inpatient visits: Number of inpatient visits of the patient in the year preceding the encounter

Diagnosis 1: The primary diagnosis (coded as first three digits of ICD9); 717 distinct values

Diagnosis 2: Secondary diagnosis (coded as first three digits of ICD9); 749 distinct values

Diagnosis 3: Additional secondary diagnosis (coded as first three digits of ICD9); 790 distinct values

Number of diagnoses : Number of diagnoses entered to the system ranging in [1–16]

Glucose serum test : result indicates the range of the result or if the test was not taken. Values: “>200,” “>300,” “normal,” and “none” if not measured

A1c test result : Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured.

Change of medications : Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “Ch” and “No”

Diabetes medications : Indicates if there was any diabetic medication prescribed. Values: “yes” and “no”

Medications (24 features, one per generic name) : ‘metformin’, ‘repaglinide’, ‘nateglinide’, ‘chlorpropamide’, ‘glimepiride’, ‘acetohexamide’, ‘glipizide’, ‘glyburide’, ‘tolbutamide’, ‘pioglitazone’, ‘rosiglitazone’, ‘acarbose’, ‘miglitol’, ‘troglitazone’, ‘tolazamide’, ‘examide’, ‘citoglipton’, ‘insulin’, ‘glyburide-metformin’, ‘glipizide-metformin’, ‘glimepiride-pioglitazone’, ‘metformin-rosiglitazone’, ‘metformin-pioglitazone’. Each medication has one column with values ‘up’, ‘steady’, ‘down’ or ‘no’, indicating that the dosage was increased, kept constant, decreased, or that the drug was not prescribed, respectively.

Readmitted: Days to inpatient readmission. Values: “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission

1.5 Performance Metrics :

  1. Primary performance metric : AUC
  2. Secondary performance metrics : confusion matrix and recall

Why AUC?

As our model outputs probabilistic scores, AUC tells us how well the model is able to distinguish between the target labels across all classification thresholds.

Why recall and confusion matrix ?

Our objective is to predict whether the patient will be readmitted or not. It is okay if a patient who will not be readmitted is predicted by the model as one who will be, i.e. a somewhat high false positive (FP) rate is acceptable. However, the false negative (FN) rate must be low: a patient who will be readmitted should not go unnoticed, as this will eventually lead to the hospital getting penalized, and our objective is to prevent exactly that. The recall metric correctly summarizes this objective, as it is based on true positives and false negatives, so a high recall is expected. The confusion matrix, on the other hand, helps us keep a check on the trade-off between the FN rate and the FP rate.

2. Data Cleaning and Preprocessing

2.1 Data Cleaning :

Before using the data, it is necessary to clean it to get rid of redundant records. Features ‘encounter_id’ and ‘patient_nbr’ are the IDs given to the medical encounter and the patient respectively. The same patient might have visited the hospital multiple times, so each encounter ID is unique, but a patient ID can occur multiple times. We are going to get rid of both these columns, as they are merely identifiers of personal information and serve no other purpose. But before that, we will check for rows where patient_nbr is repeated with the same readmission status and drop such rows, because these rows will be the same in most aspects, making them redundant.

checking for redundant rows using encounter_id and patient_nbr
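A minimal pandas sketch of this step, on toy data (the column names `encounter_id`, `patient_nbr` and `readmitted` follow the dataset):

```python
import pandas as pd

# Toy frame standing in for the raw dataset: patient 1 appears twice
# with the same readmission status, so one of those rows is redundant.
df = pd.DataFrame({
    "encounter_id": [10, 11, 12, 13],
    "patient_nbr":  [1, 1, 2, 3],
    "readmitted":   ["NO", "NO", "<30", "NO"],
})

# Drop rows where the same patient reappears with the same readmission
# status, keeping the first encounter only.
df = df.drop_duplicates(subset=["patient_nbr", "readmitted"], keep="first")

# Both ID columns are mere identifiers, so drop them afterwards.
df = df.drop(columns=["encounter_id", "patient_nbr"])
```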

Columns ‘examide’, ‘glimepiride-pioglitazone’ and ‘citoglipton’ have only 1 unique value. That means all the values in those columns are the same, with zero variance, and hence they make no contribution towards finding any meaningful pattern in the data, so we will drop them. On similar lines, columns ‘metformin-rosiglitazone’, ‘metformin-pioglitazone’ and ‘acetohexamide’ have 2 values each, but the minority value has only 1 or 2 rows, which is not sufficient for an ML model to find any useful pattern. Therefore these columns too will be dropped.

checking for value counts of features

Discharge disposition IDs 11, 19, 20 and 21 indicate that the patient has expired. Needless to say, these patients will never be readmitted, and therefore it is better to remove them from the dataset.

The gender feature has 3 values, namely male, female and unknown/invalid, the third having a value count of only 3 rows, so we will drop these 3 rows where gender is unknown or invalid.

Features ‘Discharge_disposition_id’, ‘admission_source_id’ and ‘admission_type_id’ have several different values which actually represent ‘Unavailable’, ‘null’ or ‘invalid’ data, so we will group them together.
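These three cleaning steps might look like this in pandas (toy data; the id meanings for ‘admission_type_id’ follow the dataset’s IDs mapping file, where 5 = Not Available, 6 = NULL and 8 = Not Mapped):

```python
import pandas as pd

# Toy frame with the relevant columns (values illustrative).
df = pd.DataFrame({
    "gender": ["Male", "Female", "Unknown/Invalid", "Female", "Male"],
    "discharge_disposition_id": [1, 11, 6, 19, 3],
    "admission_type_id": [1, 5, 6, 8, 8],
})

# Drop encounters where the patient expired (ids 11, 19, 20, 21).
df = df[~df["discharge_disposition_id"].isin([11, 19, 20, 21])]

# Drop the handful of rows with unknown/invalid gender.
df = df[df["gender"] != "Unknown/Invalid"].copy()

# Fold NULL (6) and Not Mapped (8) into Not Available (5), so all the
# "missing" ids form a single bucket.
df["admission_type_id"] = df["admission_type_id"].replace({6: 5, 8: 5})
```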

2.2 Data Preprocessing :

The feature age is presented as a categorical feature with buckets of size 10. These buckets can be better represented by the median age value of each group.

Distribution of age — grouped in buckets of 10 and represented as categorical feature

The target feature ‘readmitted’ has 3 values, corresponding to patients that were readmitted after 30 days, patients readmitted within 30 days, and those that were not readmitted at all. As our objective is to determine whether the patient was readmitted within 30 days or not, let’s convert it into a binary feature with 2 values: readmission within 30 days or not.

Value counts bar graph of target variable ‘readmitted’
Code implementation of data preprocessing stages
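Both preprocessing steps can be sketched in pandas as follows (toy rows; the bucket strings follow the dataset’s `[lo-hi)` format):

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["[0-10)", "[60-70)", "[70-80)"],
    "readmitted": ["NO", "<30", ">30"],
})

# Replace each 10-year age bucket by its median (midpoint) value.
age_to_median = {f"[{lo}-{lo + 10})": lo + 5 for lo in range(0, 100, 10)}
df["age"] = df["age"].map(age_to_median)

# Collapse the 3-valued target into binary: 1 = readmitted within 30 days.
df["readmitted"] = (df["readmitted"] == "<30").astype(int)
```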

3. Exploratory Data Analysis

3.1 Univariate analysis of A1C Test Results :

A research paper on the A1C test concluded that patients on whom the A1C test was conducted were less likely to be readmitted. Let’s check the same.

Analysing the A1C test result distribution and it’s relation with target variable ‘readmitted’

Query : Is it true that patients whose A1C test was conducted are less susceptible to getting readmitted ?

As seen from the graphs above, the distribution of patients whose test was conducted and those whose test was not conducted is almost the same. Therefore, we can’t conclude that conducting the A1C test implies a lower chance of readmission.

3.2 Univariate analysis of number of lab procedures

Univariate analysis of number of lab procedures

Query : On an average, how many lab procedures does a patient go through ?

As seen from the PDF curve above, the majority of patients underwent 40 to 70 lab procedures, with an average of around 43 and a median of 44. The graphs for readmitted and non-readmitted patients follow a similar, roughly normal distribution of number_lab_procedures.

3.3 Univariate analysis of number of procedures

Query : Is high number of procedures indicative of more susceptibility towards readmission ?

There is no such relation: the distribution of the number of procedures is almost the same for patients that were readmitted and for those that were not.

3.4 Univariate analysis of number of inpatient history

Univariate analysis of number of inpatient visits

Query : Are patients with inpatient encounter history( inpatient != 0 ) more susceptible for hospital readmission as high inpatient frequency denotes poor health ?

Computing percentage of patients that had inpatient history and were readmitted

Yes, from the graphs and percentage calculations above, it is evident that 15% of patients with prior inpatient history (inpatient > 0) were readmitted, compared to 8% of patients with no inpatient history. Thus, we can conclude that patients with inpatient history are more likely to get readmitted.

3.5 Univariate analysis of number of medications prescribed

Query : Does the high number of medications indicate greater chances of patient being readmitted as high number of medications implies poor health ?

As we can see, the distribution of the number of medications is the same for readmitted and non-readmitted patients, mostly ranging from 1 to 30 in both groups. Therefore, we can conclude that a high number of medications does not necessarily imply poor health or a greater chance of readmission.

3.6 Univariate analysis of age

Univariate analysis of age

Query : Is any age group more susceptible to catch disease and get readmitted ?

We can see that patients aged 60–80 are more likely to fall ill and get admitted. However, we cannot draw any conclusion about readmission status, as the distribution of patients who were readmitted and those who were not is the same.

3.7 Univariate analysis of gender

Univariate analysis of gender

Query : Is any gender more susceptible to be readmitted ?

No. As we can see here, the distribution of patients of both genders is almost the same. Further, 10% of males were readmitted, the same rate as for females, of whom 10% were also readmitted.

3.8 Bivariate analysis of age and time spent in hospital

Bivariate analysis of age and time spent in hospital(in days)

Query : Is there any meaningful pattern between age and medical stay ?

For people in the age group 50–70 (represented by ‘55’ and ‘65’), we can clearly see a pattern where patients that stay longer are more likely to be readmitted. This is evident from the median stay value, represented by the white dot inside the box plot within the violin plot.

Similarly, people above 55 years of age that stay longer are more likely to be readmitted.

Query : Does old age patients require more attention and longer treatment ?

As evident from the graph above, patients below age 35 are admitted for hardly 4–5 days, while older patients aged 35 or more tend to need a longer medical stay of 6–7 days, and even up to 12–13 days in rare cases.

3.9 Bivariate analysis of gender and number of diagnosis

Bivariate analysis of gender and number of diagnosis

Query : Is there any relation between gender and the number of diseases patient is suffering from ? If yes, which gender is more healthy ?

No. As seen here, both males and females are diagnosed with almost the same number of diseases/disorders, and there isn’t any particular pattern that would lead us to the conclusion that one gender is healthier than the other. However, we can see that males who were readmitted largely underwent 7–9 diagnoses, while females had 6–9 diagnoses.

The median number of diagnoses for readmitted patients is higher than that of non-readmitted ones for both males and females. Thus, patients with more diagnoses are more likely to be readmitted, regardless of gender.

3.10 Bivariate analysis of race and number of diagnosis

Bivariate analysis of race and number of diagnosis

Query : Is there any relation between race and the number of diseases patient is suffering from ? If yes, which race is more healthy ?

No. As seen here, people of different races are diagnosed with almost the same number of diseases/disorders and there isn’t any particular pattern that would lead us to the conclusion that one race is healthier than the other.

3.11 Bivariate analysis of age and gender

Bivariate analysis of age and gender

Query : From univariate analysis of age, we know that older people are more susceptible to be readmitted. Is it the same for both genders ?

As seen from above box-plot, older females of age greater than 70 are more likely to be readmitted which isn’t the case with males. Thus, we can conclude that older females need more care so as to prevent their readmission.

4. Feature Engineering

Missing values in dataset

For columns ‘diag_1’, ‘diag_2’ and ‘diag_3’, the percentage of missing values is less than 1%, and therefore these values can easily be imputed using mode imputation, as these are categorical features. Further, we can observe that the number of distinct values in these columns is around 700. This will certainly lead to a poorly performing model owing to the curse of dimensionality if the number of rows is not sufficiently large for each combination of values. Here, however, most of the values have a value count of less than 100, many having a single-digit count. Therefore, all values with a count below 100 will be grouped under one category.

Number of unique values in features ‘diag_1’, ‘diag_2’ and ‘diag_3’
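The rare-category grouping can be sketched like this (toy codes; the threshold of 100 follows the text, while the bucket label `other` is illustrative):

```python
import pandas as pd

# Toy diagnosis column: code "250" is frequent, the rest are rare.
diag = pd.Series(["250"] * 150 + ["V57"] * 3 + ["E909"] * 2)

# Any ICD-9 code appearing fewer than 100 times goes into one bucket.
counts = diag.value_counts()
rare = counts[counts < 100].index
diag = diag.where(~diag.isin(rare), "other")
```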

For the feature race, 2% of the values are missing. As ‘Caucasian’ is the largest race group in the USA and in our dataset, let’s perform mode imputation and fill the missing values with it.

Distribution of race of patients

For the feature ‘payer_code’, 41% of the data is missing. Let’s label all missing values ‘UK’, short for ‘unknown’. There may be some pattern behind those missing values; perhaps insurance companies or fund providers preferred not to reveal their details. Feature values ‘OT’, ‘MP’, ‘SI’ and ‘FR’ have counts of less than 100; the ML model won’t be able to find meaningful patterns with such low counts, so let’s replace them with the value ‘MN’, short for ‘minor’.

Distribution of Payer code

For the column ‘weight’, 97% of the data is missing, so we cannot perform mode imputation, model-based imputation or any other conventional missing-value imputation method. The missing data might have some reason behind it. Let’s try to capture that by converting the column into a binary feature: missing values get ‘0’ and present values get ‘1’.

For the feature ‘medical_speciality’, 49% of the data is missing. Let’s impute those missing values by creating a separate category for them, grouped under the label ‘unknown’. A lot of the categorical values of medical speciality have counts of less than 100; let’s change their value to ‘minority’.

For features like number of diagnosis, number of inpatient visits, number of emergency visits, admission source ID and discharge source IDs there are outliers that have very small count. For all those features, let’s group such small count values together and label them as another category.
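A combined sketch of these imputation choices on toy data (column names follow the dataset; the labels `UK` and `MN` follow the text above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "race":       ["Caucasian", np.nan, "Asian", "Caucasian"],
    "payer_code": [np.nan, "MC", "OT", np.nan],
    "weight":     [np.nan, "[75-100)", np.nan, np.nan],
})

# race: ~2% missing, so impute with the mode ('Caucasian' in this dataset).
df["race"] = df["race"].fillna(df["race"].mode()[0])

# payer_code: 41% missing; keep missingness as its own category 'UK',
# and fold the rare codes (count < 100 in the real data) into 'MN'.
df["payer_code"] = df["payer_code"].fillna("UK").replace(
    {"OT": "MN", "MP": "MN", "SI": "MN", "FR": "MN"})

# weight: 97% missing, so just keep a binary "was it recorded?" flag.
df["weight"] = df["weight"].notna().astype(int)
```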

4.1 Checking for Data imbalance

Let’s check for data imbalance.

Value count bar graph of target variable ‘readmitted’

Only 1/10th of the patients were readmitted within 30 days; thus, our data is highly imbalanced. We will deal with the imbalance after the train-test split, as we need to perform oversampling only on the train data and not on the test data.

4.2 Forward feature selection :

There are around 41 columns; however, we are not sure if all of them are useful. So let’s perform forward feature selection and select only those features that actually contribute towards improving model performance.

Code for Forward Feature Selection
Printing the 8 best features obtained along with the highest possible AUC score on validation data

So, we got 8 features that are actually useful and will help in determining the target variable.
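A bare-bones version of forward feature selection with scikit-learn, on synthetic data (the model, metric and stopping rule here are illustrative, not the exact code used above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(400, 5)                           # 5 candidate features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # only 2 are informative
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

selected, best_auc = [], 0.0
remaining = list(range(X.shape[1]))
while remaining:
    # Try adding each remaining feature and keep the best one.
    scores = {}
    for f in remaining:
        cols = selected + [f]
        model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
        scores[f] = roc_auc_score(
            y_val, model.predict_proba(X_val[:, cols])[:, 1])
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_auc:   # stop when AUC no longer improves
        break
    best_auc = scores[f_best]
    selected.append(f_best)
    remaining.remove(f_best)
```

On this toy data the loop picks out the informative features and reports the best validation AUC reached with them.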

4.3 Implementing SMOTE for oversampling to compensate for data imbalance

As we can see, the data is highly imbalanced. For models like logistic regression, decision trees and the random forest classifier, we can use the parameter ‘class_weight’ (provided by sklearn) and set it to ‘balanced’ to deal with the imbalance. However, Naive Bayes and KNN don’t have this parameter, so we will perform oversampling using the SMOTE technique to compensate for the imbalance in the dataset.
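To make the idea concrete, here is a minimal NumPy illustration of what SMOTE does: it synthesizes new minority points by interpolating between a minority sample and one of its k nearest minority neighbours. In practice one would use `imblearn.over_sampling.SMOTE`; this sketch only shows the core mechanism.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic minority samples by interpolating between
    a randomly chosen minority point and one of its k nearest minority
    neighbours (the core idea behind SMOTE)."""
    rng = np.random.RandomState(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randint(len(X_min))
        # Distances from the chosen point to every other minority point.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.rand()  # random point on the segment between the two
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.RandomState(0).randn(20, 3)  # toy minority class
X_new = smote_oversample(X_min, n_new=80)      # balance 20 vs ~100
```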

5. Implementing ML algorithms and comparing their performance

Now that we have prepared the data and selected the best features, let’s implement classical machine learning algorithms that satisfy our business constraints. We will implement Logistic Regression, Decision Tree classifier, Random Forest classifier, Naive Bayes and K-Nearest Neighbours.

Implementing classical ML models
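The training loop might be sketched as follows with scikit-learn, using a synthetic imbalanced dataset (roughly 10% positives) standing in for the readmission data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the readmission data.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

models = {
    "LogReg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "DTree":  DecisionTreeClassifier(class_weight="balanced", max_depth=5),
    "RF":     RandomForestClassifier(class_weight="balanced"),
    "NB":     GaussianNB(),          # no class_weight: handled via SMOTE
    "KNN":    KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    pred = model.predict(X_te)
    results[name] = {
        "auc": roc_auc_score(y_te, proba),
        "recall": recall_score(y_te, pred),
        "cm": confusion_matrix(y_te, pred),
    }
```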

Below are the confusion matrices along with the AUC and recall scores of the trained ML models.

Performance metrics of Logistic Regression

Logistic Regression seems to have performed pretty well, as its number of TPs is higher than that of the rest of the models while, at the same time, its number of FNs is low compared to the other models.

Performance metrics for Decision Tree (left) and K-Nearest Neighbour(right)

The decision tree too seems to perform well. KNN, however, isn’t performing well.

Performance metrics of Naive Bayes and Random Forest classifier models

Random Forest and Naive Bayes have almost the same AUC score but differ in recall; the former seems to perform better from the recall score’s perspective.

5.1 Implementing Boosting algorithms

Algorithms that use boosting tend to perform better than simple ML algorithms. Let’s implement XGBoost Classifier and AdaBoost Classifier to check if they can perform better than other models.

code implementation for XGBoost and AdaBoost classifier
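The boosting experiments follow the same fit/score pattern. To keep the sketch dependency-free, sklearn's GradientBoostingClassifier stands in for XGBoost here (xgboost's `XGBClassifier` exposes the same `fit`/`predict_proba` interface):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

boosters = {
    "AdaBoost":  AdaBoostClassifier(n_estimators=100, random_state=1),
    # Stand-in for xgboost.XGBClassifier, same fit/predict_proba usage.
    "GradBoost": GradientBoostingClassifier(n_estimators=100, random_state=1),
}

# Fit each booster and record its test AUC.
aucs = {name: roc_auc_score(y_te, clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
        for name, clf in boosters.items()}
```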

Let’s check their performance.

Performance metrics for XGBoost Classifier and AdaBoost classifier

AdaBoost didn’t perform well, but XGBoost seems to perform as well as the Logistic Regression model.

5.2 Creating custom ensemble model

Let us go one step further and create a custom ensemble model as follows :

  1. Split dataset into train and test set.
  2. Split train dataset into two equal non-intersecting subsets - D1 and D2
  3. Create k subsamples of D1 and build a base model for each subsampled dataset. Train these base models on corresponding subsampled data of D1.
  4. Feed D2 data as the test data set to make predictions to all the base models. Horizontally concatenate all those predictions of k base models to create a new dataset, say meta-dataset. Use labels of D2 as its target variable.
  5. Train meta-model on this meta-dataset
  6. Check the performance of this meta-model on test dataset by feeding test dataset to base models and then stacking the predictions of base models to create meta-dataset which will be input for meta-model.
  7. This combination of base models and meta-model is our custom ensemble model.
Implementing Custom Ensemble model
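The seven steps above can be sketched as follows (decision trees as base models and logistic regression as the meta-model are assumptions for illustration, on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=2)

# Steps 1-2: train/test split, then split train into D1 and D2.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)
X_d1, X_d2, y_d1, y_d2 = train_test_split(
    X_tr, y_tr, test_size=0.5, stratify=y_tr, random_state=2)

# Step 3: k base models, each trained on a bootstrap subsample of D1.
k, rng = 10, np.random.RandomState(2)
base_models = []
for _ in range(k):
    idx = rng.choice(len(X_d1), size=len(X_d1), replace=True)
    base_models.append(
        DecisionTreeClassifier(max_depth=5).fit(X_d1[idx], y_d1[idx]))

def meta_features(models, X):
    # Steps 4/6: stack the base-model predictions column-wise.
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# Step 5: train the meta-model on D2's stacked predictions.
meta = LogisticRegression().fit(meta_features(base_models, X_d2), y_d2)

# Step 6: evaluate the whole ensemble on the held-out test set.
auc = roc_auc_score(
    y_te, meta.predict_proba(meta_features(base_models, X_te))[:, 1])
```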

Here is the score of custom ensemble model.

Performance metrics for ensemble model

The custom-built ensemble model performs well but fails to exceed the performance of XGBoost or Logistic Regression.

5.3 Comparing all ML models

Comparing all ML models built so far

XGBoost classifier and Logistic Regression are both performing well; let’s select Logistic Regression, as its recall score is significantly higher than that of any other model while its AUC is in the same range as XGBoost’s.

Printing the best parameters along with selected features to train final model

6. Final model training and saving the same

Now that we have finalised the model, i.e. Logistic Regression, and the best selected features, let’s proceed to train our final model, test its performance on the test dataset and save it for future evaluation.
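The train-and-save step might be sketched as follows (the synthetic data and the file name `readmission_model.pkl` are placeholders; `joblib` would work equally well for persistence):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train the final model on the training data (synthetic stand-in here).
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=3)
final_model = LogisticRegression(class_weight="balanced",
                                 max_iter=1000).fit(X, y)

# Persist the fitted model for later use by the deployed service.
with open("readmission_model.pkl", "wb") as f:
    pickle.dump(final_model, f)

# Later (e.g. inside the web app), reload it and predict probabilities.
with open("readmission_model.pkl", "rb") as f:
    model = pickle.load(f)
proba = model.predict_proba(X[:1])[0, 1]
```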

The AUC score given by the Logistic Regression model on the test data is 0.65 and the recall score is 0.61. Let’s proceed to deploy the model.

6.1 Model Deployment

I have deployed the model on an AWS EC2 instance. You can use it here. I have used a Flask API to deploy the model. Currently, it is registered under the free tier, but it can be extended to a paid tier for large-scale production use.

Web application — Readmission Status predictor

How to use the readmission status predictor ?

  1. Enter values for all 8 drop-down fields.
  2. Press the ‘Submit’ button. The page will automatically reload within a fraction of a second.
  3. Press the ‘Get Report’ button. A pop-up box will show you the report.
Output given by Readmission status predictor

6.2 Further Scope for Improvement :

Currently, the model is trained on very limited data, around 50,000 observations. To further improve model performance, we can invest effort in data collection so as to minimize the data imbalance; this will certainly lead to an even better performing model. If we get an ample amount of data, we can also use more advanced deep learning algorithms to further improve the AUC score of the classifier.

6.3 References :

  1. Blog on confusion matrix visualization
  2. Oversampling using SMOTE
  3. Kaggle Data set description
  4. Applied AI Course

Complete code and model implementation is available on my GitHub repository. You can check it out here.

You can connect with me on LinkedIn here.

Rishikesh Fulari

Learning to teach machines to learn ;)