Using Machine Learning to Predict Whether a Patient Will Be Readmitted
We live in an age where machines learn from huge amounts of data to help build a better world, whether that is predicting crime rates in an area with machine learning or conversing with humans using natural language processing. In this blog article, I will take you through a real-world data science problem that I picked from the UCI Machine Learning Repository and demonstrate my way of solving it. The case study works through everything from scratch: data analysis, then feature engineering, and finally model building with both machine learning and deep learning models.
Note: Full code and data files are available on my GitHub repo here.
Problem Statement
It can be hard to know whether a patient will be readmitted to the hospital. A readmission might mean the patient did not get the best treatment on the previous admission, or that the patient was misdiagnosed and treated for a different disease altogether. We cannot tell at first sight whether a patient will be readmitted, but lab reports and details about the type of patient can be very useful in predicting whether readmission will happen within 30 days. The main objective of this case study is to predict whether a patient with diabetes will be readmitted to the hospital within 30 days.
Index:
- Step-1: Mapping the real world problem to a Machine Learning Problem.
- Step-2: Exploratory Data Analysis by performing univariate, bivariate and multivariate analysis on the data.
- Step-3: Feature engineering by adding new features and selecting important features from the data.
- Step-4: Creating machine learning and deep learning models to predict hospital readmission.
Step 1: Mapping the real world problem to a Machine Learning Problem
Type of Machine Learning Problem:
Given a patient's details, including the diagnoses and the medications the patient has taken, we must predict whether the patient will be readmitted within 30 days.
Since the output is simply whether or not the patient will be readmitted within 30 days, this is a binary classification problem.
Evaluation metrics: F1 score and AUC (area under the ROC curve).
Data
Data overview:
Get the data from : https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#
The dataset represents 10 years (1999–2008) of clinical care at 130 US hospitals and integrated delivery networks. It includes over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria.
(1) It is an inpatient encounter (a hospital admission).
(2) It is a diabetic encounter, that is, one during which any kind of diabetes was entered to the system as a diagnosis.
(3) The length of stay was at least 1 day and at most 14 days.
(4) Laboratory tests were performed during the encounter.
(5) Medications were administered during the encounter.
The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnoses, number of medications, diabetic medications, number of outpatient, inpatient, and emergency visits in the year before the hospitalization, etc.
We will be looking into each of the 50 features in detail when we perform exploratory data analysis.
Target variable: Readmitted
We will build various machine learning and deep learning models and see which provides the best result. Now let us start with exploratory data analysis.
Step 2: Exploratory Data Analysis
The very first step in solving any data science case study is to look at and analyze the data properly. It gives valuable insights and information. Statistical tools play a big role in proper visualization of the data. ML engineers spend a large part of their time on a problem analyzing the data they have, and this step is important because it helps us understand the data precisely. Proper EDA reveals interesting properties of the data, which in turn influence our data preprocessing and model selection criteria as well.
Loading the data:
To load the data, we only need the diabetic_data.csv file. We will load this into a pandas dataframe.
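A minimal sketch of the loading step (assuming diabetic_data.csv sits in the working directory; adjust the path to your local copy):

```python
import pandas as pd

# Load the raw CSV into a dataframe.
df = pd.read_csv("diabetic_data.csv")

print(df.shape)  # expected: (101766, 50)
df.head()        # quick look at the first 5 patient records
```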
There are a total of 101,766 patient records with 50 features for each record. The first 5 patient records can be seen above.
Checking multiple inpatient visits:
The data contains multiple inpatient visits for some patients; I have considered only the first encounter for each patient to determine whether or not they were readmitted within 30 days, so the duplicate records are removed.
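A sketch of that step, assuming (as is commonly done with this dataset) that a smaller encounter_id means an earlier encounter:

```python
# Keep only each patient's first encounter; sorting by encounter_id puts the
# earliest encounter first under the assumption above.
df = df.sort_values("encounter_id").drop_duplicates(subset="patient_nbr", keep="first")
```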
Removing patients who are dead or in hospice:
The dataset also contains patients who died or were discharged to hospice. In the IDs_mapping.csv provided at https://www.hindawi.com/journals/bmri/2014/781670/#supplementary-materials we can see that discharge disposition IDs 11, 13, 14, 19, 20 and 21 correspond to death or hospice. We should remove these samples from the data since such patients cannot be readmitted.
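A short sketch of that filtering step:

```python
# Discharge disposition IDs 11, 13, 14, 19, 20 and 21 correspond to death or
# hospice according to IDs_mapping.csv, so those encounters are dropped.
expired_or_hospice = [11, 13, 14, 19, 20, 21]
df = df[~df["discharge_disposition_id"].isin(expired_or_hospice)]
print(len(df))  # number of remaining patients
```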
We are left with 69,973 patients who are not dead and not in hospice.
Checking for null values in the data set:
Missing values are represented as '?' in the dataset. We will replace '?' with NaN and then check the percentage of null values in each feature.
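A minimal sketch of that check:

```python
import numpy as np

# The dataset marks missing values with '?'; convert them to proper NaNs first.
df = df.replace("?", np.nan)

# Percentage of missing values per feature, highest first.
null_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(null_pct[null_pct > 0])
```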
7 features contain null values.
We can observe that weight has the highest proportion of null values at 96%. Medical specialty and payer code have 48% and 43% null values respectively. The weight feature can be dropped since such a high percentage of its values is missing.
The missing values in the payer code and medical specialty columns can be filled using imputation techniques, since more than 50% of the data is available in both cases. We will deal with these features later.
Univariate Analysis:
Race:
The race column has Caucasian, African American, Hispanic, Asian and Other as categories. It contains 2.7% NaN values.
We can see that Caucasian patients dominate, followed by African Americans, while Asians are the fewest in number. The NaN values are filled with the mode of the race feature.
Gender:
The gender column tells us whether the patient is male or female.
3 values are recorded as Unknown/Invalid. We can either fill these values or drop the rows; dropping the rows is better since the gender is explicitly marked as invalid/unknown (see the sketch after the list below).
- The female count is higher than the male count, but the difference is small.
- I have encoded the label of male to 1 and female to 0.
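A sketch of that handling, assuming the raw column uses the values 'Male', 'Female' and 'Unknown/Invalid' as in the UCI file:

```python
# Drop the handful of rows whose gender is Unknown/Invalid, then encode
# male as 1 and female as 0.
df = df[df["gender"] != "Unknown/Invalid"]
df["gender"] = df["gender"].map({"Male": 1, "Female": 0})
```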
Age:
- As expected, patients younger than 40 years are fewer in number than patients older than 40 years.
- The number of patients is highest in the 70–80 years age group.
I will be grouping the age feature into 3 categories as mentioned in the research paper (https://www.hindawi.com/journals/bmri/2014/781670/).
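A sketch of the grouping, assuming the raw ages come as 10-year brackets like "[70-80)"; the band boundaries follow the paper ([0,30), [30,60), [60,100)), while the numeric labels 1/2/3 are just my choice here:

```python
def age_group(bracket):
    # "[70-80)" -> 70, then map the lower bound to one of three bands.
    lower = int(bracket.strip("[)").split("-")[0])
    if lower < 30:
        return 1
    elif lower < 60:
        return 2
    return 3

df["age"] = df["age"].apply(age_group)
```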
The plot after grouping age looks like this,
Admission_type_id:
The mappings can be obtained from the IDs_mapping.csv file provided with the UCI dataset.
Most of the patients are admitted with the Emergency admission type, followed by Elective. For some patients the admission type is Not Available, and Null and Not Mapped categories are also present.
Discharge_disposition_id:
The discharge disposition mapping defines 29 categories of IDs.
- The discharge_disposition_id column contains 21 distinct categories in the data, which are regrouped into 8 categories after careful observation.
- We can observe that most of the patients are discharged to home.
- Patients who passed away or went to hospice are not present, since we have already removed those rows from the data.
admission_source_id:
- The admission_source_id mappings are given in the IDs_mapping.csv provided with the UCI dataset.
- The categories were reduced from 17 to 8.
- We can observe that the most common admission source is the emergency room, followed by referrals.
Time_in_hospital:
- The time in hospital column categorizes the patients stay ranging from 1 day to 14 days.
- The patients on average stay 4 days and most patients stay 3–4 days.
- The patients rarely stay more than 12 days.
- We can observe a positive skew in the plot.
num_lab_procedures:
Refers to number of lab tests performed during the encounter.
- We can observe that on average 43 lab procedures are done during a patient encounter.
- A spike is also found near 0–2 procedures, which suggests that very few lab tests were done on some patients.
Num_procedures:
Refers to number of procedures (other than lab tests) performed during the encounter
Most of the patients do not undergo any procedures other than lab tests. A positive skew is observed.
Num_medications:
Refers to number of distinct generic names administered during the encounter
- Patients are given about 16 medications on average.
- Only 7 patients are given more than 70 medications.
- The plot is positively skewed and otherwise resembles a normal distribution.
Number_outpatient:
Refers to number of outpatient visits of the patient in the year preceding the encounter
- We can observe that most of the patients do not have any outpatient visits.
- Very few patients have more than 15 outpatient visits.
Number_emergency:
- Its distribution is similar to that of number_outpatient.
- We can observe that most of the patients do not have any emergency visits.
Number_inpatient:
- We can observe that most of the patients do not have any inpatient visits.
- It is similar to the other visit distributions seen above.
- We can create a new feature 'visits' as the sum of inpatient, outpatient and emergency visits, since all three are distributed in similar ways.
Diagnosis:
All three diagnosis features contain ICD-9 codes, which can be categorized into one of 9 groups given in the research paper. We map these codes into the 9 categories and use them as the disease group of each diagnosis. This idea has been taken from the research paper.
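A sketch of this mapping for the three diagnosis columns. The ICD-9 ranges below are my reading of the grouping table in the paper, and the returned group names are illustrative labels rather than the exact numeric codes used in my notebook (there, for instance, diabetes becomes category 4 and missing codes become -1):

```python
import pandas as pd

def diag_group(code):
    """Map a raw ICD-9 code (string) to a coarse disease group."""
    if pd.isna(code):
        return "missing"          # later encoded as -1
    if code.startswith(("E", "V")):
        return "other"            # external causes / supplementary codes
    code_int = int(float(code))   # "250.83" -> 250
    if code_int == 250:
        return "diabetes"
    if 390 <= code_int <= 459 or code_int == 785:
        return "circulatory"
    if 460 <= code_int <= 519 or code_int == 786:
        return "respiratory"
    if 520 <= code_int <= 579 or code_int == 787:
        return "digestive"
    if 580 <= code_int <= 629 or code_int == 788:
        return "genitourinary"
    if 140 <= code_int <= 239:
        return "neoplasms"
    if 710 <= code_int <= 739:
        return "musculoskeletal"
    if 800 <= code_int <= 999:
        return "injury"
    return "other"

for col in ["diag_1", "diag_2", "diag_3"]:
    df[col] = df[col].apply(diag_group)
```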
The new categories are analyzed,
- In the second and third diagnoses, more patients fall into category 4, which is diabetes mellitus.
- Most of the patients are diagnosed with respiratory and other disease types.
- The NaN category (represented as -1 in the feature) also increases with the diagnosis number.
Number_diagnoses
Refers to the number of diagnoses entered to the system
- Most patients have undergone 9 diagnoses.
- Having more than 9 diagnoses is rare.
Max_glu_serum
Indicates the range of the result or if the test was not taken. Values: “>200”, “>300”, “normal”, and “none” if not measured.
- Most of the patients don't undergo this test.
- Of the people who undergo this test, about half have a normal result; the other half fall in either the >200 or the >300 category.
- Ordinal encoding is done (see the sketch after the A1Cresult section), since max_glu_serum values above certain thresholds are abnormal for the patient and hence more important in predicting readmission.
A1Cresult
Indicates the range of the result or if the test was not taken. Values: “>8” if the result was greater than 8%, “>7” if the result was greater than 7% but less than 8%, “normal” if the result was less than 7%, and “none” if not measured.
- Most of the patients don't undergo this test.
- Of the people who undergo this test, nearly half have a result of >8; the other half fall in either the >7 or the normal category.
- Ordinal encoding is done, since A1Cresult values above certain thresholds are abnormal for the patient and hence more important.
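A sketch of the ordinal encoding for both test-result features, assuming the raw values are 'None', 'Norm', '>200', '>300' for max_glu_serum and 'None', 'Norm', '>7', '>8' for A1Cresult, as in the UCI file:

```python
# Higher numbers mean a more abnormal test result; 0 means the test was not taken.
df["max_glu_serum"] = df["max_glu_serum"].map({"None": 0, "Norm": 1, ">200": 2, ">300": 3})
df["A1Cresult"] = df["A1Cresult"].map({"None": 0, "Norm": 1, ">7": 2, ">8": 3})
```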
Medications
Values: “up” if the dosage was increased during the encounter, “down” if the dosage was decreased, “steady” if the dosage did not change, and “no” if the drug was not prescribed.
There are a total of 23 medication features, each taking one of the values above. On analysis, it was found that 3 medications were never prescribed for any patient. These 3 features don't help to classify whether the patient was readmitted within 30 days, since all their values are the same, so they are dropped from the dataset.
The medications can also be merged into a single feature so that the number of medications a patient has taken can be calculated. A custom encoding of each medication was done: any change in dosage (up or down) is encoded as 2, steady as 1, and not prescribed as 0.
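A sketch of that custom encoding, assuming the raw medication columns take the values 'No', 'Steady', 'Up' and 'Down'; the column list is detected by that value set rather than written out:

```python
# Pick out the medication columns by their value set.
dose_values = {"No", "Steady", "Up", "Down"}
med_cols = [c for c in df.columns
            if df[c].dtype == object and set(df[c].dropna().unique()) <= dose_values]

# Custom encoding: any change in dosage -> 2, steady -> 1, not prescribed -> 0.
dose_map = {"Up": 2, "Down": 2, "Steady": 1, "No": 0}
df[med_cols] = df[med_cols].replace(dose_map)
```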
Change:
Indicates if there was a change in diabetic medications (either dosage or generic name). Values: “change” and “no change”
- More than 50% of patients had no change in their medication; the rest had their medication changed.
- The change feature was encoded with 0 representing no change and 1 representing change in medication.
DiabetesMed:
Indicates if there was any diabetic medication prescribed. Values: “yes” and “no”.
- Most of the patients were prescribed diabetes medication.
- The diabetesMed feature was encoded with 0 representing not prescribed and 1 representing medicine prescribed.
Readmitted:
This is the variable we must predict.
It refers to days to inpatient readmission. Values: “< 30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission.
- We must predict whether the patient is readmitted within 30 days.
- From the graph we can observe that few people are readmitted within 30 days; most people are either not readmitted or readmitted after 30 days.
- Oversampling/undersampling techniques will be required to balance the data (the binary target used throughout is sketched below).
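A minimal sketch of that target encoding (the raw 'readmitted' column holds the values '<30', '>30' and 'NO'):

```python
# 1 if the patient was readmitted within 30 days, 0 otherwise.
df["readmitted"] = (df["readmitted"] == "<30").astype(int)
print(df["readmitted"].value_counts(normalize=True))  # roughly 9% positives
```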
Payer_code:
As mentioned in the section before, the payer_code feature consists of 43% null values. These null values can be filled with values predicted by model-based imputation techniques. Here I have used KNN and random forest models for imputation.
There are a total of 17 types of payer codes. The payer code feature is encoded and separated from the other columns. The rows for which payer code is not null are used as the training set, and a model is fit on this data to predict the null payer code values. KNN and random forest models were tried, and after hyperparameter tuning the random forest model predicted the null values better than KNN.
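A minimal sketch of the random forest imputation (the same idea is reused for medical_specialty below); the predictor selection and hyperparameters here are placeholders, not the tuned values from the notebook:

```python
from sklearn.ensemble import RandomForestClassifier

# Use the fully-known numeric columns as predictors for payer_code.
predictors = [c for c in df.columns
              if c not in ("payer_code", "medical_specialty", "readmitted")
              and df[c].dtype != object and df[c].notna().all()]

known = df[df["payer_code"].notna()]
missing = df[df["payer_code"].isna()]

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(known[predictors], known["payer_code"])
df.loc[df["payer_code"].isna(), "payer_code"] = rf.predict(missing[predictors])
```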
- We can observe that after imputation, the number of patients for whom the payment was made by Medicare has increased drastically, as expected: it was also the most populated category before imputation.
- Other categories saw at most a 10% increase in count after imputation.
- The plot looks like a case of a Pareto distribution.
medical_specialty:
Integer identifier of a specialty of the admitting physician, corresponding to 84 distinct values, for example, cardiology, internal medicine, family/general practice, and surgeon.
The medical specialty feature consists of 48% null values. These null values can be filled with values predicted by model-based imputation techniques. Here I have used KNN and random forest models for imputation. There are 70 different medical specialty categories in the data.
The medical specialty feature is encoded using scikit-learn's label encoder. Only the values that are not null are encoded, and the features other than medical specialty are used to predict it. The null values in medical specialty are then predicted using the model fit on the other features. KNN and random forest models were tried, and after hyperparameter tuning the random forest model predicted the null values better than KNN. The encoded values are decoded back using the inverse transform of scikit-learn's label encoder.
The plots contain many categories and cannot be shown here; please check the notebook mentioned above if interested.
- The InternalMedicine category has grown threefold after filling the missing values through imputation.
- Other categories have nearly doubled after imputation.
- The InternalMedicine category dominates, followed by family/general practice, cardiology and emergency/trauma.
Conclusion of Univariate analysis:
- The outpatient, inpatient and emergency visits can be merged into a new feature visits.
- Three features from medications are removed since they do not provide any information which might help to predict readmission of patients.
- The medications can be merged into a single feature, and the number of medications a patient has taken can be calculated.
- The diagnosis features have been changed from ICD-9 codes to 10 different categories (the 9 disease groups plus a missing category). The plots indicate that the share of diabetes mellitus diagnoses increases as the diagnosis number increases.
- Model based imputation was applied on the features with missing values.
- The categorical labels can be one hot encoded to convert categorical labels to numerical data.
- The data is highly unbalanced, with only 9% of patients being readmitted within 30 days. Oversampling must be done.
Bivariate and multivariate analysis:
Only the plots that yielded some observations are shown; plots that didn't give any useful information have been discarded.
Age:
- Most of the readmitted patients are from the 60–100 age range, which is category 2 in this grouping.
- Readmission increases with age.
- Patients in age category 1 most often had primary diagnoses of 0 and 4.
- Patients in age category 2 most often had primary diagnoses of 0 and 1.
- Patients in age category 3 most often had a primary diagnosis of 2.
- Patients in category 3 stayed in the hospital longer than those in categories 1 and 2.
- The readmitted patients are mostly diagnosed with category 0, 1 and 4 diseases.
Race :
- Asian, other race and Hispanic patients show similarity when compared among most of the features.
- The mean time in hospital for African American patients is higher than for patients of other races.
- African American patients are mostly admitted under category 1, whereas patients of other races are mostly admitted under category 2.
- Readmitted patients stayed longer than non-readmitted patients in terms of time in hospital, except for Asian patients.
- Readmitted patients of the Other race have a mean admission ID of 1, compared to a mean admission ID of 2 for non-readmitted Other race patients.
Gender:
- Females spend more time in hospital when compared to male patients.
- Readmitted male patients tend to spend more time in hospital than non-readmitted male patients.
- Most of the readmitted male patients were diagnosed with category 1, whereas most of the non-readmitted male patients were diagnosed with category 2.
- Most Males have admission id as 2 whereas most females have admission id as 1.
Admission type id:
- The patients who were admitted under id 7 spent more time in hospital than other admission id patients.
- The patients admitted under ID 8 spent the least time in hospital among all admission IDs.
- For patients with admission ID 7, payment was most often made through payer code SI.
- Fewer patients with admission ID 4 or 7 were readmitted.
- In categories 1, 3, 5 and 6 readmitted patients spent more time in hospital than non-readmitted patients, whereas in category 8 readmitted patients spent less time.
- Most max_glu_serum tests were done when patients were admitted under categories 5, 6 and 7; fewer max_glu_serum tests were done for patients under categories 1, 2 and 3.
discharge_disposition_id:
- Readmitted patients with discharge id 3 spent more time in hospital when compared to non-readmitted patients with id 3.
- Most of the readmitted patients with discharge ID 3 had been diagnosed with a category 2 disease, whereas non-readmitted patients with ID 3 were mostly diagnosed with category 1.
Plotting Correlation matrix:
A correlation matrix is essentially a normalized covariance matrix and is a very good tool for multivariate exploration.
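A sketch of the plot with seaborn (styling differs from the notebook figure):

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.select_dtypes(include="number").corr()
plt.figure(figsize=(14, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation matrix of numeric features")
plt.show()
```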
- From the matrix we can observe that features like num_medications, number_diagnoses and num_lab_procedures tend to have a positive correlation with time in hospital.
- The diagnosis features show very little correlation with the other features.
- Readmitted also shows low correlation with the other features, indicating that a linear relationship with them is not present.
Checking Multi-collinearity with VIF values:
The Variance Inflation Factor (VIF) gives a basic quantitative idea of how strongly the feature variables are correlated with each other.
On checking the VIF of our features, number_diagnoses has a VIF of 15.7 and age a VIF of about 10. After dropping the number_diagnoses feature, the VIF values of all remaining features stay below 10.
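A sketch of the VIF check using statsmodels (an intercept column is added so the values are not artificially inflated by uncentered data):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

X_vif = add_constant(df.select_dtypes(include="number").dropna())
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
).drop("const")
print(vif.sort_values(ascending=False))
```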
Conclusion of bivariate and multivariate analysis:
- Asian, other race and Hispanic patients show similarity when compared among most of the features.
- Readmitted patients stayed longer on average than non-readmitted patients in terms of time in hospital, except for Asian patients.
- Females spend more time in hospital when compared to male patients on average.
- Most Male patients have admission id as 2 whereas most female patients have admission id as 1.
- Most max_glu_serum tests were done when patients were admitted under categories 5, 6 and 7; fewer were done for patients under categories 1, 2 and 3.
- Readmission has little correlation with the other features.
- Time in hospital has a good correlation with the other variables.
- The number_diagnoses feature has a high VIF value and hence is removed. After its removal, the VIF values of the other features remain in the range of 0–10.
Step 3: Feature engineering:
visits feature:
As seen in the univariate analysis, the inpatient, outpatient and emergency visit counts can be combined into a single feature called visits. Since many patients have no prior visits at all, we can make the visits feature binary, indicating whether the patient had any prior visits: patients with at least one visit get a 1, the rest a 0 (see the sketch below).
The inpatient, outpatient and emergency visit features are then dropped.
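A sketch of the visits feature:

```python
# 1 if the patient had any outpatient, emergency or inpatient visit in the
# preceding year, 0 otherwise.
df["visits"] = ((df["number_outpatient"] + df["number_emergency"]
                 + df["number_inpatient"]) > 0).astype(int)
df = df.drop(columns=["number_outpatient", "number_emergency", "number_inpatient"])
```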
Number of steady medications and number of increased/decreased medications per patient
Two new features are derived from the 23 medication features present in the dataset. The first feature is 'steady', which counts how many medications the patient is taking steadily. The second feature is 'up/down', which counts how many of the patient's medications had their dosage increased or decreased.
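A sketch of the two counts, reusing the med_cols list and the 2/1/0 encoding from the medications section above:

```python
# How many medications the patient takes steadily, and how many had a dosage change.
df["steady"] = (df[med_cols] == 1).sum(axis=1)
df["up_down"] = (df[med_cols] == 2).sum(axis=1)
```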
Feature Selection
Before applying any machine learning model, our data must be fed to the models in a proper format. In our problem, most of the columns are categorical in nature, so they have to be converted to a numerical format to extract relevant information. Though there are many ways to handle categorical data, one of the most common is one-hot encoding. Here I have used pandas' get_dummies to obtain one-hot encoded features.
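A sketch of the one-hot encoding step:

```python
# One-hot encode whatever categorical (object-typed) columns remain.
cat_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=cat_cols)
print(df.shape)  # the notebook ends up with 174 columns at this point
```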
We will split our dataset into train and test sets. Since the data is imbalanced, we will apply SMOTE to obtain a balanced dataset by oversampling. We oversample only the train data, not the test data. We fit on the entire (oversampled) train data and use the test part for prediction, and we evaluate which model performs best on the test data based on our evaluation metrics, AUC and F1 score.
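A sketch of the split and the oversampling; the 80/20 split ratio and random seeds are my assumptions here:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

X = df.drop(columns=["readmitted"])
y = df["readmitted"]

# Stratified split so both sets keep the ~9% positive rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample only the training data; the test set stays untouched.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
X_train_bal = pd.DataFrame(X_train_bal, columns=X.columns)  # keep a DataFrame either way
```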
After oversampling we use permutation importance for feature selection, keeping only the features that receive a positive weight. These selected features are then used to train our models. Out of the 174 columns, only 82 were considered important for predicting whether a patient is readmitted within 30 days.
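A sketch of the selection step, here using scikit-learn's permutation_importance rather than the eli5 helper from the Kaggle tutorial linked in the references:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Fit a baseline model on the balanced training data, then measure how much
# randomly shuffling each feature hurts the AUC on the held-out test set.
base = RandomForestClassifier(n_estimators=100, random_state=42)
base.fit(X_train_bal, y_train_bal)

perm = permutation_importance(base, X_test, y_test, scoring="roc_auc",
                              n_repeats=5, random_state=42)

# Keep only the features with a positive mean importance
# (the notebook ends up keeping 82 of the 174 columns).
selected = X.columns[perm.importances_mean > 0]
print(len(selected))
```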
Step 4: Modelling:
Having finished the analysis and cleaning of the data, we did feature engineering and added three new features (visits, steady and up/down) derived from the existing features. We also handled the categorical and numerical features, oversampled the data and selected the best features. We are now ready to apply machine learning algorithms to our prepared data.
I wanted to experiment with both machine learning and deep learning models, so I have built and experimented with 4 machine learning models and 4 deep learning models.
- Logistic regression
- Decision Tree
- Random forest
- Xgboost
- Deep neural network
- CNN based model
- LSTM based model
- Combined CNN and LSTM model (ConvLSTM)
I have hyperparameter-tuned each of the machine learning models. The best hyperparameters found on the training set were then used to predict readmission on the test set, and the AUC and F1 scores of the models were compared.
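As an illustration of the tuning and evaluation loop, here is a hedged sketch for one of the four ML models (random forest); the parameter grid, search budget and cross-validation settings are placeholders, not the exact ones from the notebook:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_auc_score, f1_score

param_dist = {"n_estimators": [100, 200, 500],
              "max_depth": [10, 20, None],
              "min_samples_split": [2, 5, 10]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_dist, n_iter=10, scoring="roc_auc",
                            cv=3, random_state=42, n_jobs=-1)
search.fit(X_train_bal[selected], y_train_bal)

best = search.best_estimator_
proba = best.predict_proba(X_test[selected])[:, 1]
pred = best.predict(X_test[selected])
print("AUC:", roc_auc_score(y_test, proba))
print("F1 :", f1_score(y_test, pred))
```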
Logistic Regression :
AUC score of logistic regression model is : 0.925
F1 score of logistic regression model is : 0.920
Decision Tree :
AUC score of decision tree model is : 0.905
F1 score of decision tree model is : 0.905
Random Forest :
AUC score of random forest model is : 0.946
F1 score of random forest model is : 0.943
XGBoost :
AUC score of xgboost model is : 0.933
F1 score of xgboost model is : 0.929
Deep Neural network :
AUC score of the deep neural network model is : 0.960
F1 score of the deep neural network model is : 0.938
CNN
AUC score of the CNN model is : 0.959
F1 score of the CNN model is : 0.936
LSTM :
AUC score of the LSTM model is : 0.960
F1 score of the LSTM model is : 0.938
ConvLSTM :
AUC score of the ConvLSTM model is : 0.958
F1 score of the ConvLSTM model is : 0.938
Results:
- We can observe that the LSTM (along with the deep neural network) gives the highest AUC score.
- We can observe that the random forest gives the highest F1 score.
Conclusion:
This was my first self case study and my first Medium article; I hope you enjoyed reading through it. I got to learn a lot of techniques while working on this case study. I thank AppliedAI and my mentor, who helped me throughout this case study.
This concludes my work. Thank you for reading!
References:
- https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#
- https://www.kaggle.com/dansbecker/permutation-importance
- http://www.hindawi.com/journals/bmri/2014/781670/
- https://www.appliedaicourse.com/
You can also find and connect with me on LinkedIn and GitHub.