How we applied AI to prevent sepsis in preterm babies
A case study on using XGBoost for time series forecasting to predict the onset of sepsis in preterm infants within a 12-hour prediction horizon.
Improving the chances of preterm-born infants
About 20% of preterm-born infants admitted to the neonatal intensive care unit (NICU) develop sepsis, which is associated with higher mortality and adverse long-term effects. In the AI for Health — Sepsis Prevention project we applied machine learning to accurately predict whether a preterm baby is going to develop sepsis. Sepsis is a reaction to an infection and can be life-threatening. Early prediction of its onset gives doctors the time they need to apply preventative measures.
Healthcare professionals meet data scientists
Our team of 4 data scientists worked for 20 weeks to address the problem, in close collaboration with the hospital. Intensive care is an extremely sensitive area of medicine, so we first had to build a solid basic understanding of the application domain. Since we were all located in the Netherlands, we even got the chance to visit the NICU in person for some hands-on experience.
We decided to split the team and work on two solutions. One half tried to improve the existing logistic regression model that UMC Utrecht had developed. The other half set out to create a new XGBoost model that would outperform the improved existing model.
Existing logistic regression model got an upgrade
We started with improving the already existing logistic regression model UMC Utrecht used. We applied two methods:
- hyperparameter optimization
- feature engineering
For the hyperparameter optimization we used random search and grid search with cross-validation. Due to computational limitations we ran this on the original set of features for only a portion of the patients. We still managed to improve the ROC AUC score from 0.56 to 0.67.
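For illustration, here is a minimal sketch of how such a two-stage search can be set up with scikit-learn. The synthetic data and the parameter ranges are placeholders, not the exact ones we used:

```python
# Sketch: hyperparameter search for the logistic regression model.
# The synthetic X, y below stand in for the real patient features and
# sepsis labels; the parameter ranges are illustrative only.
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=20, weights=[0.8, 0.2],
                           random_state=0)
log_reg = LogisticRegression(max_iter=1000, solver="liblinear")

# Broad randomized search first ...
random_search = RandomizedSearchCV(
    log_reg,
    param_distributions={
        "C": loguniform(1e-3, 1e3),           # inverse regularization strength
        "penalty": ["l1", "l2"],
        "class_weight": [None, "balanced"],
    },
    n_iter=50, scoring="roc_auc", cv=5, random_state=42, n_jobs=-1,
)
random_search.fit(X, y)

# ... then a finer grid search around the best randomized result.
best_c = random_search.best_params_["C"]
grid_search = GridSearchCV(
    log_reg,
    param_grid={
        "C": np.geomspace(best_c / 3, best_c * 3, 7),
        "penalty": [random_search.best_params_["penalty"]],
        "class_weight": [random_search.best_params_["class_weight"]],
    },
    scoring="roc_auc", cv=5, n_jobs=-1,
)
grid_search.fit(X, y)
print(grid_search.best_params_, round(grid_search.best_score_, 3))
```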
The goal of the second approach was to create new features based on the minimum, mean, variance, peaks, and drops of the original features that had few missing values. We then measured each feature's influence on the model outcome by computing feature importances with a simple regression model and a gradient boosting model.
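A minimal sketch of that ranking step, assuming the engineered features live in a `features_df` DataFrame (a placeholder name) with sepsis labels `y`:

```python
# Sketch: ranking engineered features with a simple linear model and a
# gradient boosting model. `features_df` (the engineered feature table) and
# `y` (sepsis labels) are placeholders for the real data.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def rank_features(features_df: pd.DataFrame, y) -> pd.DataFrame:
    X_scaled = StandardScaler().fit_transform(features_df)
    lin = LogisticRegression(max_iter=1000).fit(X_scaled, y)
    gbm = GradientBoostingClassifier(random_state=42).fit(features_df, y)
    return pd.DataFrame({
        "feature": features_df.columns,
        "abs_coefficient": abs(lin.coef_[0]),        # linear influence
        "gbm_importance": gbm.feature_importances_,  # tree-based influence
    }).sort_values("gbm_importance", ascending=False)
```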
We also computed a correlation matrix, which let us manually explore possible feature combinations for the logistic regression model. Finally, we used a sequential feature selector to find the best 5 to 8 features for our model (a code sketch of this step follows the list below). We found that the best combination of features was:
- HF mean (2h) — mean heart frequency over a 2-hour interval;
- SpO2 drops (2h) — how many oxygen saturation drops occur in a 2-hour interval;
- HF variance (2h) — heart frequency variance over a 2-hour interval;
- Bradycardia (2h) — how often an abnormally slow heart rate occurs in a 2-hour interval;
- AdemF mean — mean breathing (respiratory) frequency.
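As referenced above, the selection itself can be sketched with scikit-learn's SequentialFeatureSelector; `features_df` and `y` are again placeholder names for the engineered feature table and the sepsis labels:

```python
# Sketch: forward sequential feature selection for the logistic regression
# model. `features_df` and `y` are placeholders for the engineered feature
# table and the sepsis labels.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,        # we searched the 5 to 8 feature range
    direction="forward",
    scoring="roc_auc",
    cv=5,
)
selector.fit(features_df, y)
print(features_df.columns[selector.get_support()])
```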
Despite improving the ROC AUC score from 0.56 to 0.67, we were still operating at a near-chance level. The change had to be more fundamental to get a usable outcome, so we had to reformulate the problem. We realized quite early that:
- The logistic regression model didn't really fit the problem: it only flagged an off-the-chart value (or combination of values) when something was already happening. That might not give the medical staff the time window they need to maximize the probability of a successful intervention.
- We would have to move away from real-time analysis of a data stream towards predicting future events from chunks of historical data. A classic time series forecasting problem!
“We needed a model to predict which patient will develop sepsis within the next 12 hours. A traffic police officer that doesn’t just direct & flag real-time data traffic, but one that knows on which intersection important stuff will happen. Show me what traffic crosses the intersection right now and I’ll tell you what happens in the next 12 hours of time — that’s the goal. ” — Kamal Elsayed, AI for Health engineer
Developing a new XGBoost time series forecasting model
The goal here was to create an XGBoost time series forecasting classification model that predicts the onset of sepsis in preterm infants within a 12-hour prediction horizon.
The model was trained on these features:
- arterial blood pressure diastole
- arterial blood pressure systole
- incubator measured temperature
- monitor temperature
- heart rate pulse
- heart rate pleth
- monitor heart rate
- respiratory rate
- O2 saturation
- gestation age
- gender
Data Pre-processing
The aforementioned features are minute-by-minute time series streams recorded by a set of invasive and non-invasive medical instruments in the incubator. Each feature stream could extend over multiple days. An event feature indicates per timestamp whether a notable medical intervention or administrative event occurred; notable events include admission, discharge, death, and negative or positive blood cultures. A positive blood culture confirms a sepsis case, and the corresponding timestamp is marked as the sepsis onset (t_sepsis).
For every patient, each of the 10 physiological markers was subset to the 12 hours of data that directly preceded the sepsis timestamp in a case patient, or a control timestamp in a control patient. To keep the dataset balanced, the number of positive patients had to equal the number of negative ones, so 398 control patients were pseudo-randomly drawn from the pool of 2196 controls. The selection was constrained to match the distribution of gestation age.
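A sketch of how such gestation-age-matched sampling can be done with pandas is shown below; the DataFrame names, column name, and bin edges are illustrative assumptions, not the exact implementation:

```python
# Sketch: pseudo-randomly drawing control patients so that their
# gestation-age distribution matches the sepsis cases. `cases` and
# `controls` are placeholder DataFrames with one row per patient and a
# `gestation_age` column (in weeks); the bin edges are illustrative.
import numpy as np
import pandas as pd

def draw_matched_controls(cases: pd.DataFrame, controls: pd.DataFrame,
                          seed: int = 42) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    bins = np.arange(24, 42, 2)                       # illustrative GA bins
    case_bins = pd.cut(cases["gestation_age"], bins)
    control_bins = pd.cut(controls["gestation_age"], bins)

    matched = []
    for ga_bin, case_group in cases.groupby(case_bins, observed=True):
        pool = controls[control_bins == ga_bin]
        # Draw as many controls as there are cases in this gestation-age bin.
        idx = rng.choice(pool.index, size=min(len(case_group), len(pool)),
                         replace=False)
        matched.append(controls.loc[idx])
    return pd.concat(matched)
```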
Over this extracted 12-hour segment, a sliding window of 3 hours was run over every physiological marker to aggregate a set of 8 statistical features. This created a total of 320 training features per patient (10 physiological markers × 8 statistics × 4 three-hour intervals). Additionally, we added gestation age and gender as features.
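The windowing step can be sketched roughly as follows; the eight statistics shown are an illustrative choice, not necessarily the exact set we used:

```python
# Sketch: aggregating a 12-hour, minute-by-minute segment into per-window
# statistics. `segment` is a placeholder DataFrame with 720 rows (minutes)
# and one column per physiological marker; the eight statistics below are
# illustrative.
import pandas as pd

def window_features(segment: pd.DataFrame, window_minutes: int = 180) -> pd.Series:
    stats = {}
    n_windows = len(segment) // window_minutes        # 720 / 180 = 4 windows
    for i in range(n_windows):
        window = segment.iloc[i * window_minutes:(i + 1) * window_minutes]
        for marker in window.columns:
            values = window[marker]
            for name, value in {
                "mean": values.mean(), "std": values.std(),
                "min": values.min(), "max": values.max(),
                "median": values.median(), "skew": values.skew(),
                "kurtosis": values.kurtosis(),
                "slope": values.diff().mean(),
            }.items():
                stats[f"{marker}_{name}_int{i + 1}"] = value
    return pd.Series(stats)   # 10 markers x 8 statistics x 4 intervals = 320
```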
The targets used in model training were derived from the event feature: a 12-hour segment was assigned the positive class if it directly preceded a sepsis timestamp, and the negative class if no sepsis event followed.
Model Validation
The model was trained and evaluated using a repeated nested cross-validation procedure, which searches for the optimal parameters and evaluates the test scores at the same time. Both the inner and outer cross-validation loops used k = 4, and each loop was repeated 10 times. The inner loop performed a random search over a set of parameter probability distributions.
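In scikit-learn terms, the procedure can be sketched like this; the synthetic data and the searched parameter distributions are placeholders, not the exact ones we used:

```python
# Sketch: repeated nested cross-validation for the XGBoost classifier.
# The synthetic X, y stand in for the 322 training features and the
# sepsis-within-12-hours labels; the searched distributions are illustrative.
import numpy as np
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     cross_val_score)
from xgboost import XGBClassifier

X, y = make_classification(n_samples=796, n_features=322, random_state=0)

param_distributions = {
    "n_estimators": randint(100, 600),
    "max_depth": randint(2, 8),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

outer_scores = []
for repeat in range(10):                                  # 10 repetitions
    inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=repeat)
    outer_cv = StratifiedKFold(n_splits=4, shuffle=True,
                               random_state=100 + repeat)
    search = RandomizedSearchCV(                          # inner loop
        XGBClassifier(eval_metric="logloss"),
        param_distributions=param_distributions,
        n_iter=30, scoring="average_precision", cv=inner_cv,
        random_state=repeat,
    )
    # Outer loop: evaluate the tuned model on held-out folds.
    outer_scores.extend(cross_val_score(search, X, y, cv=outer_cv,
                                        scoring="average_precision"))

print(f"average precision: {np.mean(outer_scores):.2f} "
      f"+/- {np.std(outer_scores):.2f}")
```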
The model reached an average precision of 0.90 on the test set, as seen in the graph below. This shows promise for an actual implementation of the model. Further testing on new, unseen data is still needed to prove generalizability and clinical feasibility.
Explainable AI in action
We developed a prediction interpretability analysis of the XGBoost model using SHAP (SHapley Additive exPlanations), a model-agnostic technique that explains predictions and models the decision process using so-called Shapley values.
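A minimal sketch of that analysis with the shap library, where `model` (the fitted XGBoost classifier) and `X_test` (a held-out feature table) are placeholders:

```python
# Sketch: explaining the XGBoost predictions with SHAP. `model` is a
# placeholder for the fitted XGBClassifier and `X_test` for a DataFrame of
# held-out window features.
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global view: features ranked by mean absolute SHAP value.
shap.summary_plot(shap_values, X_test, plot_type="bar", max_display=6)

# Local view: contribution of each feature to one patient's prediction.
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0])
```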
Top 6 features, ranked by average absolute SHAP value
The most important findings from this model were:
- The minimal incubator measured temperature (interval 3) had the highest average absolute SHAP value.
- Both the mean and the median heart frequency appeared among the 20 most impactful features.
- Aggregated over all statistical filters, the most prominent features were incubator measured temperature, arterial blood pressure systole and heart frequency.
- Unlike in the original logistic regression model, O2 saturation was significantly less dominant in our XGBoost model; it wasn't present in the top 10 features.
Because the two models used different pre-processing of the dataset, we couldn't compare their performance and the influence of the measured features one-to-one. The XGBoost model's superiority lies in its prediction capability: it gives the hospital enough advance time to pay attention to specific patients. Our suggestion to the hospital team was to further validate the model's accuracy on patient data in clinical practice.
Missing data held back more advanced techniques
Our team also tried to include a more advanced filtering technique, the fast Fourier transform, to build new features from the heart rate variable. This proved difficult to implement due to the high amount of missing data. Dealing with missing data and working in a virtual environment with limited memory were the biggest hurdles throughout the Challenge.
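To illustrate the idea we were after (not a working part of the final pipeline), a spectral-feature sketch on a hypothetical gap-free heart-rate segment could look like this; the frequency bands are illustrative:

```python
# Sketch of the idea we attempted: spectral features from a heart-rate
# segment via the fast Fourier transform. `heart_rate` is a placeholder for
# a gap-free, minute-by-minute heart-rate array; in practice the data had
# too many missing values for this to work reliably.
import numpy as np

def fft_band_power(heart_rate: np.ndarray, sample_rate_hz: float = 1 / 60) -> dict:
    detrended = heart_rate - heart_rate.mean()
    spectrum = np.abs(np.fft.rfft(detrended)) ** 2           # power spectrum
    freqs = np.fft.rfftfreq(len(detrended), d=1 / sample_rate_hz)
    return {
        "total_power": spectrum.sum(),
        "low_freq_power": spectrum[freqs < 0.001].sum(),      # illustrative band
        "high_freq_power": spectrum[freqs >= 0.001].sum(),    # illustrative band
        "dominant_freq_hz": freqs[np.argmax(spectrum[1:]) + 1],
    }
```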
We know that it is extremely difficult to take measurements on preterm babies consistently, but it would be of great benefit: measuring patient data without interruption would keep the number of missing values low for crucial features like heart rate and O2 saturation, improving the prediction models immensely.
What I learnt about applying ML in real life (and about the need for Explainable AI)
What surprised me the most in this Challenge was how many factors come into play when you design a machine learning solution for a particular purpose. It is not just about which algorithm performs best. In this case it was very important that the outcome of the algorithm was explainable: doctors have to make medical decisions based on these outputs and therefore can't just trust them blindly.
UMC Utrecht considered our results a success and is already planning similar initiatives to deploy AI for clinical purposes. Both sides learned a lot from each other; our team of data scientists got seasoned in medical AI and the hospital got valuable machine learning models as well as a blueprint for similar projects.
I'd love to give a shout-out to the entire AI against Sepsis team. We did good work and learned along the way. By improving the existing model and creating a new one, we hope that sepsis in more preterm babies can be signalled early on. When babies receive their treatment earlier, severe consequences can be prevented, and this might even save some lives.
Laura Didden
AI for Health Engineer
AI for Health — Predicting Sepsis Team: Kamal Elsayed, Simon Sukup, Simona Stoyanova, Laura Didden