Can AI detect the risk of heart failure from ECGs?
Electrocardiogram data was subjected to a sweeping array of machine learning and deep learning models. Is it as good a predictor of heart failure risk as blood tests?
Can AI help predict heart failure? Machine learning is not magic, but it can be very helpful in recognizing patterns, and healthcare issues aren’t an exception. We, a group of AI for Health engineers from all over the world, teamed up with physicians from Catharina Hospital in Eindhoven, Netherlands, for the AI for Health: Heart Failure Detection Challenge. Imagine a bunch of passionate people with different backgrounds (data scientists, biomedical engineers, computer scientists and software engineers) in the same virtual meeting, working on a common goal.
We were ready to jump into an adventure.
Linking ECGs to blood test values
Previous scientific research has linked higher blood levels of NT-proBNP (N-terminal pro-B-type natriuretic peptide) to left ventricular dysfunction causing heart failure (Bhalla, V., Willis, S., & Maisel, A. S., 2004). We therefore treated higher levels of NT-proBNP as a key indicator of heart failure risk. The physicians of Catharina Hospital highlighted that NT-proBNP blood tests are less accessible than ECG measurements in some places. Our goal was to figure out whether we could estimate NT-proBNP values from electrocardiograms using machine learning.
We had the privilege of working with well-prepared datasets, so we could start analyzing the data and dive into modeling work right away. What was at our disposal?
- 143,392 records of 12-channel ECGs from real Catharina Hospital patients. Each record held 10 seconds of anonymized data, labeled with the ‘age’ and ‘gender’ of the patient and the timestamp of the measurement.
- The (anonymized) NT-proBNP (N-terminal pro-B-type natriuretic peptide) blood levels of the patients and the timestamps of those measurements.
Timestamps were used to select the training data. The three subteams experimented with different time spans between the blood measurement and the ECG recording to figure out what worked best for model performance. The consensus ended up being a 30-day window in which both measurements were taken.
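To make the 30-day window concrete, here is a minimal sketch of how ECG records could be paired with blood tests by timestamp. It is not the team's actual pipeline: the tuple layout, the function name `pair_ecg_with_blood_test`, the choice to keep the closest blood test, and the reading of "window" as "at most 30 days apart in either direction" are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumption: "30-day window" means the two measurements lie at most
# 30 days apart, in either direction.
MAX_GAP = timedelta(days=30)


def pair_ecg_with_blood_test(ecgs, blood_tests, max_gap=MAX_GAP):
    """Pair each ECG with the same patient's closest blood test within max_gap.

    ecgs:        list of (patient_id, ecg_time) tuples (hypothetical layout)
    blood_tests: list of (patient_id, test_time, nt_probnp) tuples
    Returns a list of (patient_id, ecg_time, nt_probnp) tuples.
    """
    tests_by_patient = {}
    for patient_id, test_time, value in blood_tests:
        tests_by_patient.setdefault(patient_id, []).append((test_time, value))

    pairs = []
    for patient_id, ecg_time in ecgs:
        candidates = [
            (abs(test_time - ecg_time), value)
            for test_time, value in tests_by_patient.get(patient_id, [])
            if abs(test_time - ecg_time) <= max_gap
        ]
        if candidates:
            # Keep the blood test taken closest in time to the ECG.
            _, value = min(candidates, key=lambda c: c[0])
            pairs.append((patient_id, ecg_time, value))
    return pairs


# Made-up example records, for illustration only.
ecgs = [("p1", datetime(2020, 1, 10)), ("p2", datetime(2020, 3, 1))]
blood_tests = [
    ("p1", datetime(2020, 1, 25), 310.0),  # 15 days after the ECG: kept
    ("p2", datetime(2020, 5, 1), 90.0),    # 61 days after the ECG: dropped
]
pairs = pair_ecg_with_blood_test(ecgs, blood_tests)
```

Widening or narrowing `max_gap` is the knob the subteams would have been experimenting with: a longer window yields more training pairs, but the blood value may no longer reflect the heart's state at the time of the ECG.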
Straight into modeling with well-prepared patient data
Two different classification approaches were taken:
- binary classification: models distinguishing between high and low risk levels
- 3-class classification: models trained to distinguish between low, medium and high risk according to NT-proBNP levels.
NT-proBNP range for classification models.
A normal level of NT-proBNP based on Cleveland Clinic’s Reference Range is:
- Less than 125 pg/mL for patients aged 0–74 years
- Less than 450 pg/mL for patients aged 75–99 years
The following NT-proBNP levels could mean your heart function is unstable:
- Higher than 450 pg/mL for patients under age 50
- Higher than 900 pg/mL for patients age 50 and older
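As a rough sketch, turning a measured NT-proBNP value into a class label could look like the code below. The cutoffs of 125 and 450 are the decision boundaries mentioned later in this article; how values exactly on a boundary are handled, the label names, and the function names are assumptions. The sketch also ignores the age-specific ranges listed above.

```python
def three_class_label(nt_probnp):
    """Map an NT-proBNP value (pg/mL) to 'low', 'medium' or 'high' risk.

    Assumption: a value exactly on a cutoff falls into the higher class.
    """
    if nt_probnp < 125:
        return "low"
    if nt_probnp < 450:
        return "medium"
    return "high"


def binary_label(nt_probnp, cutoff=125):
    """Map an NT-proBNP value (pg/mL) to 'low' or 'high' risk."""
    return "high" if nt_probnp >= cutoff else "low"


labels = [three_class_label(v) for v in (80, 125, 300, 450, 2000)]
```

Keeping the cutoff as a parameter makes it easy to try other binary splits, such as the value of 200 used for the LSTM model.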
Classification with machine learning techniques
Unsurprisingly, classical machine learning approaches with no feature extraction or data processing were unsuccessful in regression modeling. However, they achieved some remarkable results in classification.
We tried several algorithms:
- Logistic Regression
- Support Vector Machine
- AdaBoost
- Random Forest
- Random Under-Sampler Boost Classifier
- XGBoostClassifier
The best-performing binary classification experiments used XGBoostClassifier. On the test set (80% majority class: high risk; 20% minority class: low risk), we reached 83% accuracy, with 94% recall for the relevant class (high risk) and a 50% F1-score for the minority class. This used only one measurement per patient, and no patient in the test set also appeared in the training or validation set.
Figure 1: Confusion matrix and Precision, Recall, F1-Score and Accuracy for binary classification with XGBoostClassifier
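For readers less familiar with these metrics, here is a small sketch of how accuracy, precision, recall and F1-score follow from the four cells of a binary confusion matrix. The counts in the example are made up purely for illustration and are not the values behind Figure 1; the function name is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard metrics for the positive class from confusion counts.

    tp/fp/fn/tn: true positives, false positives, false negatives,
    true negatives. Here "positive" would mean "high risk".
    """
    precision = tp / (tp + fp)          # of predicted positives, share correct
    recall = tp / (tp + fn)             # of actual positives, share found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only, NOT the team's confusion matrix.
metrics = binary_metrics(tp=80, fp=10, fn=20, tn=40)
```

With an imbalanced test set, accuracy alone can look good even when the minority class is poorly recognized, which is why the per-class recall and F1-score are reported next to it.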
As for the 3-class model, the algorithm with the best performance was AdaBoost, with 72% accuracy, an 83% F1-score for the highest risk class and a 63% F1-score for the lowest risk class.
Figure 2: Confusion matrix and Precision, Recall, F1-Score and Accuracy for 3-class classification with AdaBoost
ResNet: Deep neural network for a regression problem
ResNet models were trained to predict the NT-proBNP values using regression. As the values of NT-proBNP seem to live on a log scale, we first took the logarithm of the values and then normalized them to lie between 0 and 1.
We trained a ResNet with a hyperparameter search to find the best fit. The first input of the model was the stacked ECG signal. After the ResNet blocks, the age and gender of the patient were concatenated to their output. This was fed through several feed-forward layers, and at the end we used a sigmoid activation function to make sure the predicted value was between 0 and 1. Finally, we used early stopping on a validation split to find a suitable model.
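A minimal sketch of such a target transformation is shown below. The article does not say which logarithm or which normalization was used, so the natural logarithm and min-max scaling are assumptions, as are the function names and the example values.

```python
import math


def fit_log_minmax(values):
    """Return (log_min, log_max) of the log-transformed training targets."""
    logs = [math.log(v) for v in values]  # assumes all values are > 0
    return min(logs), max(logs)


def to_target(value, log_min, log_max):
    """Map an NT-proBNP value to the [0, 1] regression target."""
    return (math.log(value) - log_min) / (log_max - log_min)


def from_target(target, log_min, log_max):
    """Invert to_target: map a model output in [0, 1] back to NT-proBNP."""
    return math.exp(target * (log_max - log_min) + log_min)


# Made-up training values (pg/mL), for illustration only.
train_values = [50.0, 125.0, 450.0, 5000.0]
log_min, log_max = fit_log_minmax(train_values)
targets = [to_target(v, log_min, log_max) for v in train_values]
```

Fitting `log_min` and `log_max` on the training set only, and reusing them for validation and test data, keeps the transformation consistent. A target in [0, 1] also matches a network that ends in a sigmoid.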
Figure 3: Deep Learning architecture of ResNet (above) and the Resblocks (below)
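The early stopping step can be sketched independently of any deep learning framework. The helper below is illustrative: the class name, the `patience` parameter and the validation losses in the example are assumptions, not the team's actual settings.

```python
class EarlyStopping:
    """Stop training once the validation loss has stopped improving."""

    def __init__(self, patience=5):
        self.patience = patience  # epochs to wait for an improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def update(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Made-up validation losses, one per epoch.
val_losses = [0.30, 0.25, 0.24, 0.26, 0.27, 0.28]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break
```

In a real training loop, one would save the model weights whenever `best_epoch` changes and restore them after stopping, so the final model is the one with the lowest validation loss.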
The resulting model predicted the values on the validation set fairly well. To make sure that the final model generalizes well, we split the training set into 10 parts of equal size and trained 10 different models. For each model, a different part was held out as the validation set. These models were then combined into an ensemble. The predicted values from the ensemble can be seen in Figure 4. The red points are points that would be wrongly classified when using the three classes described above. As you can see, the model predicts the NT-proBNP values fairly well, but it would still make a lot of classification errors.
Figure 4: Results of an Ensemble of Deep Learning ResNet regression models
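The split-and-ensemble idea can be sketched as follows. The article does not say how the 10 models were combined, so averaging their predictions is an assumption, and the stand-in "models" in the example are plain functions rather than trained ResNets.

```python
def k_fold_splits(n_samples, k=10):
    """Yield (train_indices, val_indices), holding each fold out once."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when n_samples % k != 0.
        stop = n_samples if fold == k - 1 else start + fold_size
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, val


def ensemble_predict(models, x):
    """Average the predictions of all models (averaging is an assumption)."""
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)


splits = list(k_fold_splits(100, k=10))

# Stand-in models that return a fixed normalized prediction.
toy_models = [lambda x: 0.4, lambda x: 0.6]
combined = ensemble_predict(toy_models, x=None)
```

Because every sample serves as validation data for exactly one model, each of the 10 models is stopped against a different held-out part, and the ensemble is less tied to one particular validation split.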
LSTM: Recurrent neural network for a classification problem
We started out by creating a basic LSTM network that could serve as a baseline architecture for both multi-class (3-class) and binary classification of NT-proBNP. The network consists of 2 LSTM layers and 2 dense layers. The input to the network was the 12-lead ECG signal, and we also used two static features: age and gender. We applied min-max normalization to the input features and a Finite Impulse Response (FIR) filter to the 12-lead ECG input. We used the Adam optimizer for both models. In the multi-class model, we used the categorical cross-entropy loss and a batch size of 32. We also used L2 kernel and recurrent regularizers to prevent overfitting.
In the binary model, we defined the classes as BNP <= 200 (good values) and BNP > 200 (bad values). We chose 200 as the cutoff to try to balance the data. To handle the remaining imbalance, we used a bias initialization as suggested by other works.
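The two preprocessing steps can be illustrated on a single lead with plain Python. This is only a sketch: the article does not give the filter design, so the 4-tap moving average used here is a stand-in FIR filter, and applying the filter before the normalization is an assumed order.

```python
def fir_filter(signal, taps):
    """Apply a causal FIR filter: y[n] = sum_k taps[k] * x[n - k]."""
    output = []
    for n in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:  # samples before the start are treated as 0
                acc += tap * signal[n - k]
        output.append(acc)
    return output


def min_max_normalize(signal):
    """Rescale a signal linearly so its values lie in [0, 1]."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0 for _ in signal]  # constant signal: avoid dividing by 0
    return [(x - lo) / (hi - lo) for x in signal]


# Made-up samples of one ECG lead, for illustration only.
lead = [0, 4, 8, 4, 0, 4, 8, 4]
taps = [0.25, 0.25, 0.25, 0.25]  # moving average as a simple low-pass FIR
filtered = fir_filter(lead, taps)
normalized = min_max_normalize(filtered)
```

For a 12-lead recording, the same two functions would be applied to each lead separately. In practice the taps would come from a proper filter design that targets the noise to be removed.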
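One common form of bias initialization for imbalanced binary problems sets the bias of the final sigmoid unit to the log-odds of the positive class, so that the untrained network already predicts the class prior. Whether the team used exactly this form is an assumption; the class counts below are made up.

```python
import math


def initial_output_bias(n_positive, n_negative):
    """Log-odds of the positive class, used as the output layer's bias."""
    return math.log(n_positive / n_negative)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical class counts: 800 samples above the cutoff, 200 below.
bias = initial_output_bias(n_positive=800, n_negative=200)
initial_prediction = sigmoid(bias)  # equals the positive-class share
```

Starting from the class prior means the first training epochs are not spent just learning that one class is more frequent, which tends to give a lower initial loss.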
Which model won the game? Did any?
Classical machine learning did only a bit better than a majority-class baseline (Figure 1): the majority class made up 80% of the test set, while the model reached 82% accuracy. The regression results from the ResNet ensemble show why the models probably “fail” to do much better than that. A lot of points lie around the decision boundaries (125 and 450), so it’s logical that some confusion arises for those points. Since there are no big differences between the points near a boundary, binary classification, with only one boundary, seems to fare better. To make the model practical, we’d suggest using the following classification: BNP < 125, BNP >= 125, or uncertain.
After weeks of experimentation with both classical machine learning and complex convolutional neural networks, we realized that even the models that fared well for the high-risk group delivered worse outcomes for lower-risk estimates. This could be improved by adding more training data from healthy patients to balance the dataset. But the decision has to be made by the physicians using the models.
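One way such a three-way output could work is to call a prediction "uncertain" whenever the predicted value lies close to the cutoff. The sketch below is hypothetical: the function name, the label names, and the margin of 0.2 in log10 units are illustrative choices, not something the team evaluated.

```python
import math


def classify_with_uncertainty(predicted_nt_probnp, cutoff=125.0, margin=0.2):
    """Return 'below', 'above' or 'uncertain' for a predicted value (pg/mL).

    margin is the half-width of the uncertain band around the cutoff,
    measured in log10 units (a hypothetical value).
    """
    distance = math.log10(predicted_nt_probnp) - math.log10(cutoff)
    if abs(distance) < margin:
        return "uncertain"
    return "above" if distance > 0 else "below"


results = [classify_with_uncertainty(v) for v in (60.0, 110.0, 400.0)]
```

The band is measured on a log scale because the NT-proBNP values themselves seem to live on one. A wider margin trades coverage for fewer wrong calls near the boundary.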
The doctors might be interested in having more certainty about the BNP >= 125 class (high risk) and not care too much about the lower class, because the risk of heart failure is lower there. For instance, if we use the binary classification of classical machine learning (Figure 1) and someone is predicted as high risk, we know that the chance of their BNP being >= 125 is very high indeed (87%). If the model classifies someone as BNP < 125, the chance that the classification is correct is only 62%, so it might be a false negative. We would suggest using the model for positive cases only. But again, this decision has to be made by the physicians.
We would love to see the project developed further in the future. The models could be improved with more useful data, tested in practice and adjusted based on the results. We hope we inspired the hospital to develop more AI-based solutions to the challenges they’re facing.
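The "positive cases only" suggestion can be written down as a simple rule: report a prediction only if the model is usually right when it predicts that class. The two precision values below are the ones quoted above; the minimum precision of 0.80 and the function name are hypothetical and would be for the physicians to decide.

```python
# Share of correct predictions per predicted class, as quoted in the text.
CLASS_PRECISION = {"BNP >= 125": 0.87, "BNP < 125": 0.62}


def actionable_prediction(predicted_class, min_precision=0.80):
    """Return the predicted class if it is reliable enough, else None."""
    if CLASS_PRECISION[predicted_class] >= min_precision:
        return predicted_class
    return None  # no conclusion: fall back to the regular blood test


positive_case = actionable_prediction("BNP >= 125")
negative_case = actionable_prediction("BNP < 125")
```

With these numbers, only a high-risk prediction is passed on, while a low-risk prediction leads to no conclusion rather than false reassurance.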
A surprising effect of a real-world AI project
The non-technical aspects of this machine learning project were the most surprising. In the real world, it’s not only about the highest possible model accuracy. It’s important to make sure that the model can be used in practice and to involve the users (doctors, in our case). Using data from the real world, experimenting with it and drawing often unexpected conclusions from it was a new experience for many of us.
There was a ‘first time’ for everyone. A few participants with some experience from AI Bootcamps had never faced a Challenge before. Others had experience in deep learning, but had never explored biomedical signals. Some had theoretical knowledge of biomedical signal processing and programming skills, but no experience with real data. The AI for Heart Failure Detection Challenge was a unique learning opportunity for all of us. We are grateful for the chance to share findings, drawbacks, excitement (not gonna lie, even frustration) with such a diverse group, professionally and culturally.
Andrea Faúndez, Gerson Foks, Laura Didden, Sri Aravind
AI for Health engineers
AI for Heart Failure Detection Team: Alessio Nespoli, Nasia Athanasiadou, Sri Aravind, Tessa van Beers, Yastika Joshi, Andrea Faúndez, Gerson Foks, Ruben Cuervo, Akshay Kalra, Aryan Ashar, Bruhanth Mallik, Feyisayo Olalere, Janet Zheng, Laura Didden, Simon Penninga