Predicting Heart Failure Using Machine Learning, Part 1

Andrew A Borkowski
Published in Analytics Vidhya · Sep 26, 2020 · 4 min read

Random Forest vs XGBoost vs fastai Neural Network

Photo by Robina Weermeijer on Unsplash

Every year, cardiovascular diseases kill millions of people worldwide. They are disorders of the heart and blood vessels and include heart attacks (caused by occlusion of the heart’s vessels), strokes (caused by occlusion or rupture of the brain’s vessels), and heart failure (caused by the inability of the heart to pump enough blood to the body). Since severe heart failure can lead to the patient’s death, it is very important to predict it in advance from the patient’s clinical and laboratory data.

To assess whether I could make such predictions, I used Kaggle’s heart failure dataset, originally published with the following BMC journal article:

Chicco, D. and Jurman, G., BMC Medical Informatics and Decision Making (2020). https://doi.org/10.1186/s12911-020-1023-5

The table above describes the clinical and laboratory features provided in the dataset. The dependent variable (target) is “death event,” coded 1 if the patient died during the follow-up period and 0 if the patient survived.

Let’s load the dataset into a Pandas DataFrame and look at the data.

I dropped the “time” variable, since the length of the follow-up period does not really represent the patient’s clinical or laboratory data. The categorical variables in the dataset are already preprocessed, so I did not have to do that myself.
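A minimal sketch of this loading step looks like the following; the CSV filename follows the Kaggle download, so adjust the path to your own environment:

```python
import pandas as pd

# Load the Kaggle heart failure dataset (filename assumed from the Kaggle download)
df = pd.read_csv('heart_failure_clinical_records_dataset.csv')
df.head()

# Drop the follow-up "time" column, which is not a clinical or laboratory feature
df = df.drop(columns=['time'])
```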

Since the target is binary (death or survival), I trained machine learning classification models to make the predictions. I started with a fastai neural network, then used a Random Forest, and finished with an XGBoost classifier. In the end, I ensembled the predictions of all three models into one. The Jupyter notebooks for this post can be found here.

1. Neural Network

In short, to prepare the data for the neural network, I declared the continuous variables, the categorical variables, and the dependent variable. I split the data into training and validation sets using the fastai RandomSplitter class and preprocessed the data with the Categorify, FillMissing, and Normalize classes.

Next, I created a TabularPandas dataset object, which I passed to the fastai dataloaders. Finally, I created a machine learning model with the default settings of two deep linear layers of 200 and 100 neurons, respectively.
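For readers without the notebook handy, here is a minimal sketch of that pipeline. The split of columns into categorical and continuous follows the Kaggle CSV’s column names, and the 20% validation fraction and batch size are assumptions:

```python
from fastai.tabular.all import *

# Declare categorical and continuous variables and the dependent variable
cat_names = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking']
cont_names = ['age', 'creatinine_phosphokinase', 'ejection_fraction',
              'platelets', 'serum_creatinine', 'serum_sodium']
dep_var = 'DEATH_EVENT'

# Random train/validation split and the standard preprocessing steps
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))
procs = [Categorify, FillMissing, Normalize]

# Build the TabularPandas object, the dataloaders, and the learner
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names=dep_var, y_block=CategoryBlock(), splits=splits)
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls, layers=[200, 100], metrics=accuracy)
```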

Using lr_find, I picked 0.001 as the learning rate. After training for only 10 epochs, the model achieved 78% validation accuracy. If you want to dive deeper into fastai tabular neural networks, I strongly recommend Jeremy Howard’s 2020 Deep Learning for Coders course.
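The training call might look roughly like this; fit_one_cycle is my choice here, since the post does not say which fit method was used:

```python
# Plot the learning rate finder, then train for 10 epochs at lr=1e-3
learn.lr_find()
learn.fit_one_cycle(10, 1e-3)

# Check loss and accuracy on the validation set
learn.validate()
```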

2. Random Forest

The fastai library has a nice way of creating training and validation datasets and labels from the TabularPandas dataset object. The random forest classifier was fit to the training data with default settings and validated on the validation data. The random forest model achieved 75% accuracy.
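A sketch of that step, using the xs/ys accessors of the TabularPandas object to get the already-preprocessed data (the fixed random_state is an assumption for reproducibility):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Extract preprocessed training and validation data from the TabularPandas object
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_valid, y_valid = to.valid.xs, to.valid.ys.values.ravel()

# Fit a random forest with default settings and measure validation accuracy
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
accuracy_score(y_valid, rf.predict(X_valid))
```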

3. XGBoost

I used the same training and validation datasets and labels for the XGBoost classifier. The XGBoost classifier was fit to the training data with default settings. Validating it on the validation data gave 66% accuracy. The XGBoost classifier has many parameters to set and frequently does not do well with the defaults; some consider tuning them an art in itself. To keep this article’s length reasonable, I plan to write about optimizing the XGBoost classifier in a follow-up article.
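With the same arrays, the XGBoost step is a near drop-in replacement (sketch, default settings only):

```python
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Fit an XGBoost classifier with default settings on the same training data
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
accuracy_score(y_valid, xgb.predict(X_valid))
```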

4. Ensembling All Three Models

There are many ways to combine the predictions of machine learning models. The simplest is to average them, and that is what I did. The ensemble of the three models achieved 73% accuracy.
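The post does not show exactly how the averaging was done; one common way is to average each model’s predicted probability of the positive class and threshold at 0.5, roughly like this:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Predicted probabilities of death (class 1) from each model on the validation set
nn_probs = learn.get_preds()[0][:, 1].numpy()   # fastai returns (probs, targets)
rf_probs = rf.predict_proba(X_valid)[:, 1]
xgb_probs = xgb.predict_proba(X_valid)[:, 1]

# Average the three probability vectors and threshold at 0.5
ens_preds = ((nn_probs + rf_probs + xgb_probs) / 3 > 0.5).astype(int)
accuracy_score(y_valid, ens_preds)
```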

Conclusion

There are many machine learning models for tabular data. Random Forests are popular, easy to train, and require very little preprocessing. Gradient boosting models like XGBoost are harder to train because of their complicated hyperparameter setup, but they can be more accurate than Random Forests; because of that, they are very popular in Kaggle competitions. Neural networks can take longer to train and require preprocessing for both training and inference. With careful hyperparameter optimization and avoidance of overfitting, neural networks can provide good accuracy and perform well on unseen data. Since different machine learning models prioritize different features, ensembling them can lead to better results than using any one model individually.

Thank you for taking the time to read this post.

Best wishes in these difficult times,

Andrew

@tampapath
