Advanced Imputation of Missing Values
Lessons from the WiDS 2020 competition on what not to do.
The challenge was to create a model that uses data from the first 24 hours of intensive care to predict patient survival. MIT’s GOSSIS community initiative, with privacy certification from the Harvard Privacy Lab, has provided a dataset of more than 130,000 hospital Intensive Care Unit (ICU) visits from patients, spanning a one-year timeframe. This data is part of a growing global effort and consortium spanning Argentina, Australia, New Zealand, Sri Lanka, Brazil, and more than 200 hospitals in the United States.
The data we had was really messy. There were 90 thousand rows and almost 190 columns!
Most columns had up to 80% missing values, which made it very difficult to even begin the analysis. As can be seen below, most columns had non-normal, skewed distributions. On top of that, the classes were imbalanced: only around 8% of the rows belonged to the positive class.
Our EDA began by trying to answer the following questions:
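Both figures quoted here (the share of missing values per column and the share of the positive class) are quick to check with pandas. The sketch below runs on a tiny made-up frame; `hospital_death` is the target named in this post, while `h1_lactate_max` and the values themselves are only illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data (values are made up for illustration).
df = pd.DataFrame({
    "age": [65, 71, np.nan, 54, 80],
    "h1_lactate_max": [np.nan, np.nan, 2.1, np.nan, np.nan],  # illustrative column name
    "hospital_death": [0, 0, 0, 1, 0],
})

# Fraction of missing values per column, largest first.
missing_fraction = df.isna().mean().sort_values(ascending=False)

# Share of each class in the target.
class_share = df["hospital_death"].value_counts(normalize=True)

print(missing_fraction)
print(class_share)
```

On the real data, sorting `missing_fraction` is a convenient way to see which columns are candidates for dropping before any imputation is attempted.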
- What features influence the survival of the patients the most?
- How does age affect patient survival?
- Are older patients at higher risk irrespective of the condition?
- Is there any strong correlation between disease/condition and the number of fatalities?
- Is there a correlation between disease/condition and patient admissions and re-admissions?
We tried to follow a line of thought: what type of data do we have at hand, who are the patients, what are the demographics, and what are the main conditions for which patients are being admitted?
To understand whose data we actually have, we try to drill down along demographic lines.
We use Ethnicity and Age to drill into the data. From the Ethnicity Pie Chart, we see that most of the individuals in our dataset are Caucasians followed by African Americans.
If we add another dimension, Age, to the mix, we see an interesting pattern in the bar chart. For African Americans in particular, the proportion of patients declines with age. Young African American patients account for around 25–30% of all young patients, but as we go upward in the chart to the older population, the proportion drops to an average of around 10%.
Having seen that most people end up in the ICU because of an accident or emergency, we would like to see what kind of cases are usually seen there. For that we make a tree map of the diagnosis column. As it turns out, most of the cases are cardiovascular, such as heart attacks. Neurological cases (which might be from car crashes or other accidents), respiratory cases, and sepsis also account for a large chunk of cases.
If we further split the data based on different conditions, we see that most of the patients had diabetes. Comparing fatal cases to non-fatal cases, we see an interesting pattern: the diseases appear in different proportions in the two groups. Even though diabetes accounts for almost 80% of the non-fatal cases, it comprises only 60% of the fatal ones. The proportion of hepatic failure almost doubles, from 5% to around 10%, and immunosuppression goes from around 10% to 15%.
Using the graph, we can compare the range and distribution of various attributes for survived patients (hospital_death=0) and not survived patients (hospital_death=1).
From the above box plot, we can see that the median of apache_4a_icu_death_prob differs between survived and not survived patients, which suggests that hospital death may depend on apache_4a_icu_death_prob.
From the above box plot, we can see that the median of apache_4a_hospital_death_prob also differs between survived and not survived patients, which suggests that hospital death may depend on apache_4a_hospital_death_prob.
Most of the missing values fell in the first-hour observations (h1), and these values were largely explained by the corresponding values in the first-day observations (d1). We therefore dropped the h1 variables that were more than 50% missing.
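A minimal sketch of that filtering step, assuming the first-hour and first-day variables can be recognised by `h1_` and `d1_` name prefixes. The helper name `drop_sparse_h1_columns` and the toy columns are hypothetical, not taken from our notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_h1_columns(df, threshold=0.5):
    """Drop 'h1_' columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = df.isna().mean()
    is_h1 = df.columns.str.startswith("h1_")
    too_sparse = (missing_fraction > threshold).to_numpy()
    to_drop = df.columns[is_h1 & too_sparse]
    return df.drop(columns=to_drop)

df = pd.DataFrame({
    "h1_heartrate_max": [90, np.nan, 85, 100],         # 25% missing: kept
    "h1_lactate_max": [np.nan, np.nan, np.nan, 2.0],   # 75% missing: dropped
    "d1_lactate_max": [np.nan, np.nan, np.nan, 2.4],   # not an h1 column: kept
})

reduced = drop_sparse_h1_columns(df)
print(list(reduced.columns))
```

Only h1 columns are considered for dropping here, so a sparse d1 column survives and can still be imputed later.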
We tried a number of approaches for imputing missing values.
The first idea was to cluster the observations and then impute each missing value from the observations that belonged to the same cluster.
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=50).fit(ndf)
But this code fails: Sklearn's KMeans implementation works entirely in memory, and it needed 13.7 GB of space, which was prohibitive. We therefore tried a number of custom implementations using Dask (parallelization), Numba (compilation) and KD trees (approximation). There is also MiniBatchKMeans, which we deferred to a later time. Instead, we tried neighbor-based imputation using KNN and KD trees.
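For reference, the cluster-then-impute idea can be sketched with MiniBatchKMeans, which fits on small batches rather than the full matrix at once. This is not what we ran in the competition. The helper `cluster_impute`, the initial column-mean fill used to assign clusters, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def cluster_impute(X, n_clusters=50, random_state=0):
    """Fill NaNs with the column mean of observed values in the same cluster."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # KMeans cannot handle NaNs, so use a column-mean fill just to assign clusters.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)
    model = MiniBatchKMeans(n_clusters=n_clusters, n_init=3, random_state=random_state)
    labels = model.fit_predict(filled)

    out = filled.copy()
    for c in np.unique(labels):
        rows = labels == c
        block = X[rows]
        observed = ~np.isnan(block)
        counts = observed.sum(axis=0)
        sums = np.where(observed, block, 0.0).sum(axis=0)
        # Fall back to the global column mean if the cluster has no observed value.
        cluster_means = np.where(counts > 0, sums / np.maximum(counts, 1), col_means)
        out[rows] = np.where(np.isnan(block), cluster_means, block)
    return out

rng = np.random.default_rng(0)
X_missing = rng.normal(size=(500, 5))
X_missing[rng.random(X_missing.shape) < 0.2] = np.nan   # knock out about 20% of cells

X_imputed = cluster_impute(X_missing, n_clusters=5)
print(np.isnan(X_imputed).sum())
```

The sketch assumes every column has at least one observed value; a column that is entirely missing would need to be dropped first.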
Neighbor based Imputation
Following a similar line of thought as with clustering, we tried to implement k-NN imputation ourselves, using Dask and Numba.
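To make the idea concrete before the custom versions, here is what neighbor-based imputation looks like with scikit-learn's stock KNNImputer. It is shown only to illustrate the technique on a toy matrix, not as the implementation we wrote; the numbers are made up.

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],   # third feature is missing for this row
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [8.0, 9.0, 20.0],     # far away, so it should not be used as a donor
])

# Each missing value is replaced by the mean of that feature over the
# n_neighbors closest rows that have the feature observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[0])
```

The two closest rows to the first one are the second and third, so the missing cell is filled with the mean of 3.0 and 5.0.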
A. Dask Based
Dask figures out how to break up large computations and route parts of them efficiently onto distributed hardware. Dask is routinely run on thousand-machine clusters to process hundreds of terabytes of data efficiently within secure environments. More importantly, Dask can empower analysts to manipulate 100GB+ datasets on their laptop or 1TB+ datasets on a workstation without bothering with a cluster at all. Dask can enable efficient parallel computations on single machines by leveraging their multi-core CPUs and streaming data efficiently from disk.
Dask uses existing Python APIs and data structures to make it easy to switch from NumPy, pandas and scikit-learn to their Dask-powered equivalents.
In our application of Dask, the idea was to let Dask handle the partitioning and apply a pairwise_distance function to each chunk.
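A rough sketch of that pattern, assuming a complete numeric matrix and a plain NumPy `pairwise_distance`. The helper `nearest_in_block`, the chunk size and the synthetic data are illustrative assumptions, not our competition code.

```python
import numpy as np
import dask.array as da

def pairwise_distance(block, reference):
    """Euclidean distances from every row of `block` to every row of `reference`."""
    diff = block[:, None, :] - reference[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def nearest_in_block(block, reference, k):
    """Indices of the k closest reference rows for each row of one chunk."""
    dist = pairwise_distance(block, reference)
    return np.argsort(dist, axis=1)[:, :k]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))   # stand-in for an already complete feature matrix
k = 5

# Chunk along rows only, so every task sees whole rows.
dX = da.from_array(X, chunks=(250, X.shape[1]))

# Dask applies the function chunk by chunk; only a (rows, k) index array comes back,
# so the full 1000 x 1000 distance matrix is never held in memory at once.
neighbors = dX.map_blocks(
    nearest_in_block,
    reference=X,
    k=k,
    dtype=np.int64,
    meta=np.empty((0, 0), dtype=np.int64),
    chunks=(dX.chunks[0], (k,)),
).compute()

print(neighbors.shape)
```

Each row's first neighbor is the row itself (distance zero), so an imputation step would skip column 0 and average the remaining donors.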
B. Numba Based
Numba is a just-in-time compiler for Python that works best on code that uses NumPy arrays, NumPy functions, and loops. The most common way to use Numba is through its collection of decorators, which can be applied to your functions to instruct Numba to compile them. When a call is made to a Numba-decorated function, it is compiled to machine code "just-in-time" for execution, and all or part of your code can subsequently run at native machine code speed!
In this approach we hoped to speed up the per-row computation by compiling it with Numba. We also tried to speed up neighbor finding with a KD-tree, accepting approximate neighbors, which was good enough for imputation purposes.
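One way to get approximate neighbors out of a KD-tree is the `eps` argument of SciPy's `cKDTree.query`. The sketch below uses it on synthetic data; whether this matches the tree we built is an assumption, and the array sizes are arbitrary.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
donors = rng.normal(size=(2000, 5))    # complete rows that can donate values
queries = rng.normal(size=(10, 5))     # rows whose neighbors we want

tree = cKDTree(donors)

# Exact search (the default, eps=0).
dist_exact, idx_exact = tree.query(queries, k=5)

# Approximate search: the k-th returned neighbor is at most (1 + eps) times
# farther away than the true k-th nearest neighbor, in exchange for less work.
dist_approx, idx_approx = tree.query(queries, k=5, eps=0.5)

print(idx_approx.shape)
```

For imputation, the donor values at `idx_approx` would then be averaged to fill the missing cells of each query row.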
We had started out with simple mean imputation before trying the advanced approaches described here to improve our model's ROC-AUC score.
Well at the end of the day let’s just say it wasn’t worth it. I will directly quote Sklearn’s page on performance
The optimizations we tried didn't really work, and the performance gains were only marginal. Now you know three ways not to handle missing values.
In the end we resorted to Random Forest imputation and contrasted it with simple mean and median imputation. The overall model performance in terms of ROC-AUC wasn't much different.
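A comparison of this kind can be sketched with scikit-learn. The sketch assumes Random Forest imputation is done through IterativeImputer with a RandomForestRegressor as the estimator, which is one common way to do it and not necessarily ours. It also scores reconstruction error on synthetic data, whereas our comparison was on downstream ROC-AUC.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))
X_true[:, 3] = X_true[:, 0] + 0.1 * rng.normal(size=200)   # one correlated column

X = X_true.copy()
mask = rng.random(X.shape) < 0.2    # hide about 20% of the cells
X[mask] = np.nan

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "random_forest": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=0),
        max_iter=3,
        random_state=0,
    ),
}

filled = {name: imputer.fit_transform(X) for name, imputer in imputers.items()}

# Error on the hidden cells only, since observed cells are left untouched.
for name, X_filled in filled.items():
    rmse = np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2))
    print(name, round(float(rmse), 3))
```

On real data the hidden truth is not available, so the fairer test is the one we used: train the same model on each imputed dataset and compare validation ROC-AUC.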
Simple things sometimes do work out in the end.
If you have any suggestions or feedback, contact me on LinkedIn.