Predicting Left Ventricular End Diastolic Volume using Machine Learning

Andrew Bergman
Published in Analytics Vidhya · Sep 2, 2019

I worked on a lot of different projects during my data science immersive at General Assembly, from predicting the price of homes in Ames, IA to classifying handwritten numbers with neural networks.

Data is a passion of mine, but my goal is to make data work for everybody. I had the human element in mind for this project: I wanted an idea that could impact people in a positive way, but also an idea that I could find data for.

Luckily, I was able to get my hands on a dataset of cardiac MRI data from patients who had had heart attacks. Don’t worry, the data was de-identified before I received it.

Of the 48 features in the data, the one that stuck out the most was the left ventricular end diastolic volume (EDV from here on), which is the volume of the left ventricle when it has finished filling with blood. The left ventricle (LV) is the chamber of the heart responsible for pushing oxygenated blood out into systemic circulation.

GIF illustration of the heart beating. The heart is mirrored so the LV is on the right side.

The Problem

The LV cannot expand like a balloon: to allow for increased volume, the muscular wall has to thin. Once the wall thins, the LV cannot pump efficiently, which causes a whole host of problems. The LV’s function is also an indicator of overall cardiac function. If a model can predict the EDV accurately, it could help cardiologists determine who needs help the most & improve efficiency in the healthcare system.

Preprocessing

Before I was able to jump right into modeling, I had to process the data. Luckily for me, the data was very clean: only four columns had significant missing values. I spent a lot of time reading about data imputation because it is a very touchy subject: if you do it incorrectly, you can dramatically skew your data (a topic for a future blog post). I ended up using fancyimpute’s KNN imputer because the missing data was discrete and ordinal.
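
A rough sketch of that step, using hypothetical file and column names (these are not the actual fields in my data):

import pandas as pd
from fancyimpute import KNN

df = pd.read_csv('heart_mri.csv')  # hypothetical file name

# Columns with significant missing values (placeholder names)
impute_cols = ['mitral_reg', 'aortic_reg', 'nyha_class', 'smoking_status']

# KNN fills each missing value based on the k most similar complete rows
df[impute_cols] = KNN(k=5).fit_transform(df[impute_cols])

# The features are discrete & ordinal, so round back to whole numbers
df[impute_cols] = df[impute_cols].round()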

Apart from imputing data, I just had to make sure ordinal text was on a numeric scale.
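
That step is just a dictionary mapping; something like this, with made-up category labels:

import pandas as pd

# Illustrative ordinal text column mapped onto a numeric scale
df = pd.DataFrame({'aortic_reg': ['none', 'mild', 'moderate', 'severe', 'mild']})
severity_map = {'none': 0, 'mild': 1, 'moderate': 2, 'severe': 3}
df['aortic_reg'] = df['aortic_reg'].map(severity_map)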

My data has 48 features including the target variable, the EDV. 34 of those features represent either scarring or ischemia (reduced blood flow) in each of the 17 segments of the heart.

Case courtesy of Dr Craig Hacking, Radiopaedia.org, rID: 68467

Because there are so many features, I wanted to see if I could reduce their number through feature engineering. Using a diagram like the one above, I planned interaction columns based on the ring layers & numbered sections. However, my ability to create interaction columns was hampered because the vast majority of my values are 0, indicating no damage. That being said, I was able to create three “summary” columns consisting of the sums of the basal, mid, and apical regions (the top, middle, and bottom of the ventricle, respectively). Once I did that, I was able to start modeling.
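
Here is a sketch of how those summary columns can be built, assuming the segment columns follow a region-prefixed naming pattern (the real column names may differ):

import pandas as pd

df = pd.read_csv('heart_mri.csv')  # hypothetical file name, as above

# Hypothetical naming: basal_anterior_scar, mid_inferior_ischemia, etc.
for region in ['basal', 'mid', 'apical']:
    region_cols = [c for c in df.columns if c.startswith(region + '_')]
    df[region + '_sum'] = df[region_cols].sum(axis=1)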

Modeling

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from math import sqrt
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.linear_model import RidgeCV, ElasticNetCV
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

My imports are fairly standard & I included three model types: linear, tree, & boosting models.

Because of the way I engineered my features I had to create two sets of data: one with the original features & a second with the engineered features. When I modeled, each model was run on both sets of data. In terms of model evaluation, I compared model performance on each set of features.
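
The setup looked roughly like this; the target and summary column names are placeholders, and the way the engineered set is composed here is one plausible choice rather than the definitive one:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('heart_mri.csv')            # hypothetical file name
summary_cols = ['basal_sum', 'mid_sum', 'apical_sum']

y = df['lvedv']                               # placeholder name for the target (EDV)
X_original = df.drop(columns=['lvedv'] + summary_cols)
X_engineered = df.drop(columns=['lvedv'])     # original features + summaries

# Same random_state so the two splits contain the same patients
Xo_train, Xo_test, y_train, y_test = train_test_split(X_original, y, random_state=42)
Xe_train, Xe_test, _, _ = train_test_split(X_engineered, y, random_state=42)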

I started with a linear regression because that is the simplest regression model. I knew that if the model did poorly, I would have to switch to a different model type. However, the linear regression did surprisingly well.
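
A minimal version of that baseline, reusing the splits from the sketch above:

from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Xo_train, Xo_test, y_train, y_test come from the train_test_split sketch above
lr = LinearRegression()
lr.fit(Xo_train, y_train)
preds = lr.predict(Xo_test)

print('R2:  ', r2_score(y_test, preds))
print('RMSE:', sqrt(mean_squared_error(y_test, preds)))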

From the linear model, I decided that my next step was to try regularized models. The benefit of regularized models is that they shrink the coefficients of features the algorithm determines are unimportant towards 0, or, in the case of Lasso & ElasticNet, all the way to 0. I ran three regularized models: Ridge, Lasso, & ElasticNet. Surprisingly, all of them performed worse than the linear regression.
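
A sketch of those three models, each wrapped in a pipeline with a scaler because regularization is sensitive to feature scale (the alpha grid and CV settings here are illustrative):

from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

models = {
    'ridge': RidgeCV(alphas=[0.1, 1.0, 10.0]),
    'lasso': LassoCV(cv=5),
    'enet': ElasticNetCV(cv=5),
}

for name, reg in models.items():
    pipe = Pipeline([('scale', StandardScaler()), ('model', reg)])
    pipe.fit(Xo_train, y_train)               # splits from the sketch above
    print(name, pipe.score(Xo_test, y_test))  # R^2 on the hold-out set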

The yellow points are most easily seen because virtually all of the predictions are stacked on top of each other.

It is easy to see where the linear models fail: the model with the original features over-predicts low values and its variance increases as the actual values increase, whereas the models with the engineered features under-predict at both the low and high ends.

For that reason, I decided to move on to the other two model types.

A random forest is a tree model: it uses decision trees to predict values. However, it is an improvement over a single decision tree because it incorporates two levels of randomness: it bootstraps rows (random selection with replacement) & then chooses a random subset of features. I chose this model over other tree types because of that random choice of features: there are 48 features in the data set with the original features. The one downside to the way I set up the random forest is that I ran a GridSearch to try many combinations of hyperparameters (parameters I set), which made extracting feature importances less straightforward & hurt interpretability.
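
A sketch of that setup; the hyperparameter grid below is illustrative, not the exact one I searched:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 3, 5],
}

rf_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
)
rf_search.fit(Xe_train, y_train)                          # engineered-feature split from above
print(rf_search.best_params_)
print(rf_search.best_estimator_.score(Xe_test, y_test))   # R^2 on the hold-out set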

An XGBoost regression model, or Extreme Gradient Boosting, is a boosting model: it fits an initial weak learner and then iteratively fits further weak learners to the residuals. However, XGBoost also incorporates regularization, with both L1 and L2 penalties, to help minimize overfitting. I chose not to run a GridSearch on the XGBoost model because of how the two models performed.
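
A minimal sketch of the XGBoost model with its default tree booster; reg_alpha & reg_lambda are the L1 & L2 penalties, and these particular values are illustrative:

import xgboost as xgb

xgb_reg = xgb.XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,
    reg_alpha=0.1,    # L1 penalty
    reg_lambda=1.0,   # L2 penalty
    random_state=42,
)
xgb_reg.fit(Xe_train, y_train)            # engineered-feature split from above
print(xgb_reg.score(Xe_test, y_test))     # R^2 on the hold-out set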

It is readily apparent that these two models are significantly better than any of the linear models. Furthermore, they performed better with the engineered features.

My best model was the random forest regression because it had the best metric scores, though the XGBoost model was a close second. Additionally, I was able to extract feature importances, which gives an idea of which features matter most in the data (a quick sketch of that step follows the list below). The five strongest were:

  • lvesv_log is the logged left ventricular end systolic volume (the volume at the end of contraction);
  • lvef is the left ventricular ejection fraction, the fraction of blood pumped out with each contraction;
  • sex is the subject’s sex;
  • aortic_reg is a measure of blood flow back across the aortic valve (aortic regurgitation);
  • apical_ischemia is a measure of reduced blood flow in the apical region of the heart.
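
One way to pull those rankings out (a sketch, assuming the fitted GridSearchCV object from the random forest sketch above, which refits the best estimator on the full training set by default):

import pandas as pd

best_rf = rf_search.best_estimator_       # refit RandomForestRegressor
importances = pd.Series(best_rf.feature_importances_, index=Xe_train.columns)
print(importances.sort_values(ascending=False).head())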

The systolic volume has a near 1:1 relationship with the diastolic volume, so the size of its importance is not surprising. Additionally, the ejection fraction is derived directly from the two volumes (EF = (EDV − ESV) / EDV), so it isn’t surprising that it is the second strongest. What is surprising is how insignificant the next three most important features are: they aren’t even visible on the graph. This is especially surprising to me because the XGBoost model with engineered features has more features with stronger importances.

Conclusions & Looking Forward

The non-linear models are by far the best, but there is still more work to be done. This is health-related data, so I am still not entirely satisfied with how well the models predicted, but that will come with further tuning of the models. On the same note, the feature engineering needs to be fleshed out further, because it improved the performance of the models.

Another facet I want to improve on is the data itself. It is all well and good to predict the EDV from MRI data, but to really improve efficiency in the system I need to predict it from patient chart data. That being said, HIPAA will make it difficult to get that kind of data.

I hope I didn’t ramble on for too long about this project! I loved working on it because it was meaningful to me & has real world applications.

The repository for this project can be found here.

I can also be reached on LinkedIn.
