Coronary Artery Disease Prediction
Heart disease is the leading cause of death. Each year, around 659,000 people in the US[1] and 77,000 people in Canada die from heart disease. Heart disease costs the United States about $363 billion annually[2] and Canada about 22 billion annually.
1 Background
The 10-year cardiovascular risk calculator of the American College of Cardiology and American Heart Association (ACC/AHA) has been challenged for its accuracy by several analyses (Lancet 2013; 382:1762 and JAMA Intern Med 2014; 174:1964). Researchers using data from the Multi-Ethnic Study of Atherosclerosis (MESA) showed that Framingham-based risk scoring systems and the ACC/AHA risk equation substantially overestimated actual 5-year risk in adults without diabetes, both overall and across sociodemographic subgroups[3]. Since the calculator is used to select patients for statin therapy, the implications of inaccuracy are substantial.
Machine learning therefore has significant potential to improve the prediction of cardiovascular disease and to support better medical decisions.
2 Data and methods
2.1 Data source
The data used in this study is the Cleveland Heart Disease dataset from the UCI Machine Learning Repository.
2.2 Methods
We applied various machine learning methods to uncover the relationship between certain attributes and heart disease. The machine learning algorithms we used include:
- Naive Bayes
- KNN
- Decision Tree
- SVM
- XGB
- VotingClassifier
- Logistic regression
- Random Forest
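As a rough illustration of how such a comparison can be set up, the sketch below fits most of the listed classifiers with scikit-learn and reports test accuracy. It is not the code used in this study: the data is a synthetic stand-in generated with `make_classification`, the train/test split and all hyperparameters are assumptions, and XGB is left out only to keep the example free of extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (assumption: 300 rows, binary label).
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def scaled(model):
    """Standardize features first; KNN, SVM and logistic regression are scale-sensitive."""
    return make_pipeline(StandardScaler(), model)


models = {
    "naive_bayes": GaussianNB(),
    "knn": scaled(KNeighborsClassifier(n_neighbors=5)),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": scaled(SVC(probability=True, random_state=0)),
    "logistic_regression": scaled(LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Soft voting averages the predicted class probabilities of the base models.
# The list is built before the new key is assigned, so it holds only the six base models.
models["voting"] = VotingClassifier(
    estimators=[(name, model) for name, model in models.items()],
    voting="soft",
)

accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)  # fraction of correct test predictions

for name, acc in sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {acc:.3f}")
```

Because the data here is synthetic, the printed accuracies say nothing about the real dataset; the point is only the structure of the comparison.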
The process of our study is shown in the following flow chart:
3 Results and feature importance
3.1 Machine learning model accuracy
A few models show decent accuracy. With tuned hyperparameters, logistic regression and random forest reach 88% and 84% accuracy on the test dataset, respectively.
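Hyperparameter tuning of this kind is commonly done with a cross-validated grid search. The sketch below shows one way to tune logistic regression with scikit-learn's `GridSearchCV`. The synthetic data, the grid of `C` values, and the 5-fold setting are all assumptions chosen for illustration, not the settings behind the numbers reported above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), split into train and held-out test sets.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is re-fit on each cross-validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_C = search.best_params_["clf__C"]
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit best model
print(f"best C = {best_C}, test accuracy = {test_accuracy:.3f}")
```

The test set is touched only once, after the search has finished, so the reported accuracy is not biased by the tuning itself.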
3.2 Feature importance
Feature importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.
According to the random forest (RF) model, out of 30 variables, the top 5 most important features are:
- Cp0: typical angina, chest pain related to decreased blood supply to the heart
- Oldpeak: ST depression induced by exercise relative to rest. It reflects the stress on the heart during exercise; an unhealthy heart is stressed more.
- Exang1: exercise induced angina (true)
- thalach: maximum heart rate achieved
- exang0: exercise induced angina (false)
The least useful variables include:
- Thal_0: thallium stress test result
- Ca_4: missing value for ca
- Fbs_0: fasting blood sugar > 120 mg/dl (false); a level above 126 mg/dL signals diabetes
- Fbs_1: fasting blood sugar > 120 mg/dl (true); a level above 126 mg/dL signals diabetes
- Restecg_2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV).
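Variable names such as Exang1 or Thal_0 suggest that the categorical attributes were one-hot encoded before training, although that step is an assumption here. The sketch below shows how a ranking like the one above could be produced with pandas and scikit-learn: categorical columns are expanded with `get_dummies`, a random forest is fitted, and its impurity-based `feature_importances_` are sorted. The data and the label rule are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in data: column names follow the appendix, values are made up.
df = pd.DataFrame({
    "cp": rng.integers(0, 4, size=n),       # chest pain type, coded 0-3
    "exang": rng.integers(0, 2, size=n),    # exercise induced angina, coded 0/1
    "thalach": rng.normal(150, 20, size=n),
    "oldpeak": rng.exponential(1.0, size=n),
})

# Hypothetical label that depends on oldpeak and exang, so the ranking is not pure noise.
y = ((df["oldpeak"] + df["exang"] + rng.normal(0, 0.5, size=n)) > 1.5).astype(int)

# One-hot encode the categorical columns: cp -> cp_0..cp_3, exang -> exang_0, exang_1.
X = pd.get_dummies(df, columns=["cp", "exang"], dtype=float)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one score per encoded column, normalized to sum to 1.
importance = pd.Series(forest.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance.head(5))
```

Note that one-hot encoding splits the importance of a single attribute across several columns, which is why both values of exang can appear separately in a ranking.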
4 Conclusion
Kardiolabs is developing artificial intelligence based solutions for automated reporting of CT coronary angiograms for patients suffering from coronary artery disease. For this study, experienced cardiologists on our team advised on the machine learning methods. As a next step, more features and records will be introduced to further improve the prediction.
Appendix:
- age: age in years
- sex: sex (1 = male; 0 = female)
- cp: chest pain type
  - Value 0: typical angina
  - Value 1: atypical angina
  - Value 2: non-anginal pain
  - Value 3: asymptomatic
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- restecg: resting electrocardiographic results
  - Value 0: normal
  - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  - Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
- thalach: maximum heart rate achieved
- exang: exercise induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
  - Value 0: upsloping
  - Value 1: flat
  - Value 2: downsloping
- ca: number of major vessels (0–3) colored by fluoroscopy; 4 = missing value (NaN)
- thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
- condition (the label): 0 = no disease, 1 = disease
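The coding scheme listed above can be captured as a small lookup table, which is handy when inspecting individual records. The sketch below is one possible way to do this in plain Python. The `CODES` table and the `decode` helper are hypothetical names introduced for illustration, the shortened labels are paraphrases of the list above, and the example patient record is made up.

```python
# Value labels taken from the attribute list above (shortened where needed).
CODES = {
    "sex": {0: "female", 1: "male"},
    "cp": {
        0: "typical angina",
        1: "atypical angina",
        2: "non-anginal pain",
        3: "asymptomatic",
    },
    "fbs": {0: "not above 120 mg/dl", 1: "above 120 mg/dl"},
    "restecg": {
        0: "normal",
        1: "ST-T wave abnormality",
        2: "probable or definite left ventricular hypertrophy",
    },
    "exang": {0: "no", 1: "yes"},
    "slope": {0: "upsloping", 1: "flat", 2: "downsloping"},
    "thal": {0: "normal", 1: "fixed defect", 2: "reversible defect"},
    "condition": {0: "no disease", 1: "disease"},
}


def decode(record):
    """Return a copy of `record` with coded categorical values replaced by labels.

    Numeric attributes (age, chol, oldpeak, ...) and unknown codes pass through unchanged.
    """
    return {
        name: CODES.get(name, {}).get(value, value)
        for name, value in record.items()
    }


# Made-up example record, not taken from the dataset.
patient = {"age": 54, "sex": 1, "cp": 0, "exang": 1, "oldpeak": 1.4, "condition": 1}
decoded = decode(patient)
print(decoded)
```

Keeping the codes in one place also makes it easy to spot unexpected values, since any code missing from the table is left as a raw number in the output.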
Blog by Mia
References:
[1] Centers for Disease Control and Prevention. Underlying Cause of Death, 1999–2018. CDC WONDER Online Database. Atlanta, GA: Centers for Disease Control and Prevention; 2018. Accessed March 12, 2020.
[2] Virani SS, Alonso A, Aparicio HJ, Benjamin EJ, Bittencourt MS, Callaway CW, et al. Heart disease and stroke statistics - 2021 update: a report from the American Heart Association. Circulation. 2021;143:e254–e743.