From Data to Decisions: Transforming Cardiovascular Care through Predictive Analytics
Research Question
How do objective, examination, and subjective features contribute to the prediction of cardiovascular disease, and what patterns can be identified to improve early detection and prevention strategies?
Goals, Objectives, and Deliverables
The execution closely mirrored the Agile methodology, with iterative development cycles allowing for continuous refinement of predictive models based on the data-driven insights obtained during each sprint phase.
Goal:
- Analyze how objective, examination, and subjective data contribute to predicting cardiovascular disease.
Objectives:
- Identify the most predictive features for cardiovascular disease.
- Determine the patterns and relationships among the features that can improve early detection.
Deliverables:
- A predictive model identifying key features and patterns associated with cardiovascular disease risk.
- A comprehensive report detailing the analysis, findings, and recommendations for preventive strategies.
Data cleaning/Preparation
Exploring Data
Distribution of the discrete data columns
- Age: Most individuals are between 50 and 55 years old.
- Gender: There is a significant imbalance, with one gender category being much more frequent.
- Height: The majority of heights are clustered between 150 cm and 200 cm.
- Weight: Weight distribution peaks between 60 kg and 100 kg, with few outliers above 150 kg.
- Systolic Blood Pressure: Data is heavily skewed towards very low values, indicating possible data quality issues.
- Diastolic Blood Pressure: Similarly, values are clustered at the low end, suggesting potential inaccuracies.
- Cholesterol: Most individuals have a cholesterol level coded as 1, with fewer individuals in higher categories.
- Glucose: Similar to cholesterol, the majority of individuals have a glucose level coded as 1, with some in higher categories.
- Smoking: The smoking status shows a high number of non-smokers (coded as 0) compared to smokers (coded as 1).
- Alcohol Consumption: Most individuals do not consume alcohol (coded as 0), with a small number indicating alcohol consumption (coded as 1).
- Physical Activity: A large portion of individuals are physically active (coded as 1), with fewer individuals not being active (coded as 0).
- Cardiovascular Disease: The dataset indicates a balanced distribution between individuals with and without cardiovascular disease.
Distribution of the continuous data columns
- Age: Median is around 55 years, with a few outliers below 35 years.
- Height: Median is about 165 cm, with many outliers both above and below, indicating possible data entry errors.
- Weight: Median is around 75 kg, with numerous high-end outliers exceeding 150 kg, suggesting significant variance in the data.
- Systolic Blood Pressure: Most values are clustered at the lower end, with numerous outliers extending up to 16000, indicating possible data entry errors or extreme measurements.
- Diastolic Blood Pressure: Similar to systolic, most values are concentrated at the lower end, with many outliers up to 10000, suggesting data anomalies.
- Cholesterol: The median value is around 1.5, with the interquartile range spanning from 1 to 2.5, showing a more typical distribution without extreme outliers.
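The extreme blood pressure, height, and weight values noted above could be flagged before modeling. The following is a minimal sketch, assuming hypothetical plausibility bounds; the thresholds and the `flag_implausible` helper are illustrative assumptions, not the cleaning rules used in this analysis, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical plausibility bounds (assumptions for illustration only)
PLAUSIBLE_RANGES = {
    "height": (120, 220),               # cm
    "weight": (30, 200),                # kg
    "systolic_b_pressure": (70, 250),   # mmHg
    "diastolic_b_pressure": (40, 150),  # mmHg
}

def flag_implausible(df, ranges=PLAUSIBLE_RANGES):
    """Return a boolean Series that is True for rows with any out-of-range value."""
    mask = pd.Series(False, index=df.index)
    for column, (low, high) in ranges.items():
        if column in df.columns:
            # between() is inclusive; missing values are also flagged
            mask |= ~df[column].between(low, high)
    return mask

# Tiny made-up sample, not taken from the dataset
sample = pd.DataFrame({
    "height": [165, 170, 55],
    "weight": [70, 82, 64],
    "systolic_b_pressure": [120, 16000, 130],
    "diastolic_b_pressure": [80, 90, 85],
})
flags = flag_implausible(sample)
cleaned = sample[~flags]
print(flags.tolist())  # [False, True, True]
```

Whether flagged rows should be dropped, corrected, or capped is a separate decision that depends on how the measurements were recorded.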
Correlation matrix of the columns
Here are the key highlights:
- Age shows a positive correlation with cardiovascular disease (0.24) and a weak positive correlation with systolic blood pressure (0.15).
- Gender has a moderate positive correlation with height (0.50) and smoking (0.34).
- Height is moderately correlated with weight (0.29).
- Systolic and Diastolic Blood Pressure are highly correlated with each other (0.99), indicating a strong relationship between these two measures.
- Cholesterol shows a moderate correlation with glucose (0.45).
- Smoking is positively correlated with gender (0.34) and alcohol consumption (0.34).
- Cardiovascular Disease is positively correlated with age (0.24), systolic blood pressure (0.22), and glucose levels (0.22).
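Coefficients like those listed above can be obtained from a pairwise Pearson correlation matrix. The sketch below uses synthetic stand-in data (an assumption; the values it prints are unrelated to the reported results) and shows how to rank features by their correlation with the target column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the real analysis would use the full dataset
rng = np.random.default_rng(0)
n = 500
age = rng.normal(53, 7, n)
systolic = 100 + 0.5 * age + rng.normal(0, 10, n)
score = 0.05 * age + 0.03 * systolic + rng.normal(0, 1, n)
cardio = (score > np.median(score)).astype(int)

df = pd.DataFrame({
    "age": age,
    "systolic_b_pressure": systolic,
    "cardio_disease": cardio,
})

# Pearson correlation is the pandas default
corr = df.corr()

# Rank features by correlation with the target
target_corr = corr["cardio_disease"].drop("cardio_disease").sort_values(ascending=False)
print(target_corr.round(2))
```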
Prediction
# Imports required for preprocessing and splitting
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Splitting the dataset into features and target variable
X = data.drop(columns=['id', 'cardio_disease'])  # excluding 'id' as it is not a feature
y = data['cardio_disease']
# Standardization (the scaler is fit on all rows, before the train/test split)
scaler_std = StandardScaler()
X_standardized = scaler_std.fit_transform(X)
# Keep the column names for later reference
feature_names = X.columns.tolist()
# Convert standardized data back to DataFrame for interpretability
X_standardized_df = pd.DataFrame(X_standardized, columns=feature_names)
# Splitting the standardized data into training and testing sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X_standardized_df, y, test_size=0.2, random_state=0)
Split the data with scaled-features
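The code above standardizes every row before splitting, so the scaler's mean and variance are computed with the test rows included. A common alternative, shown here only as a sketch and not as the method used in this analysis, is to split first and fit the scaler on the training rows alone by wrapping it in a scikit-learn pipeline. The data below is synthetic and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 11 features, matching the number of predictors described
X_demo, y_demo = make_classification(n_samples=1000, n_features=11, random_state=0)

# Split first so the test rows never influence the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The pipeline fits StandardScaler on the training rows only,
# then applies the same transformation when scoring the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

For tree-based models such as Random Forest and Gradient Boosting, scaling has little effect on the fitted splits, so this choice matters mainly for Logistic Regression.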
Model test and selection
For predicting cardiovascular disease from a dataset that mixes objective, examination, and subjective features, the following three models were considered because they handle varied data types and can capture complex relationships:
- Random Forest Classifier: This ensemble model handles a mix of numerical and categorical features well. It suits classification tasks and can handle high-dimensional data and feature interactions without extensive data preprocessing.
- Gradient Boosting Classifier: Another powerful ensemble method, gradient boosting improves prediction accuracy by sequentially adding weak learners that correct the errors of the combined ensemble. It captures complex patterns in the data and can be adapted to imbalanced datasets (for example through sample weighting), although the target here is roughly balanced.
- Logistic Regression: As a fundamental statistical model for binary classification, logistic regression is valuable for its interpretability. Its coefficients give insight into how each input feature changes the odds of having cardiovascular disease.
# Import, instantiate, and train the model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
# Display the train and test scores
print('Train Accuracy: ', rf.score(X_train,y_train))
print('Test Accuracy: ', rf.score(X_test,y_test))
Train Accuracy: 0.9997857142857143
Test Accuracy: 0.7120714285714286
# Import, instantiate, and train the model
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(X_train,y_train)
# Display the train and test scores
print('Train Accuracy: ',gbc.score(X_train,y_train))
print('Test Accuracy: ', gbc.score(X_test,y_test))
Train Accuracy: 0.7396785714285714
Test Accuracy: 0.7349285714285714
# Import, instantiate, and train the model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train,y_train)
# Display the train and test scores
print('Train Accuracy: ',lr.score(X_train,y_train))
print('Test Accuracy: ', lr.score(X_test,y_test))
Train Accuracy: 0.723625
Test Accuracy: 0.7215714285714285
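A single train/test split gives one accuracy estimate per model. Cross-validation, which is mentioned later for Logistic Regression, averages over several splits and also shows how much the score varies. The sketch below runs on synthetic data with reduced tree counts for speed; these settings are assumptions and the printed scores are unrelated to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 11 features
X_demo, y_demo = make_classification(n_samples=600, n_features=11, random_state=0)

# Reduced n_estimators keeps the sketch fast; not the settings used in the analysis
models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

cv_summary = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each candidate model
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```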
Create an Evaluation Function and split the features into categories
# Metrics used by the evaluation function
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

# Defining the feature categories
objective_features = ['age', 'height', 'weight', 'gender']
examination_features = ['systolic_b_pressure', 'diastolic_b_pressure', 'cholesterol', 'glucose']
subjective_features = ['smoke', 'alcohol', 'physically_active']

# Function to evaluate a model on one set of features
def evaluate_model(features, model):
    """
    Evaluate the performance of a machine learning model on a specified set of features.

    Parameters:
        features (list): A list of column names from the dataset to be used as features for the model.
        model (model object): The machine learning model to be evaluated, instantiated outside this function.

    The function trains the model on a subset of the dataset defined by the specified features and then
    evaluates its performance on a separate test set.

    Returns:
        dict: A dictionary containing the following key-value pairs representing the model's performance metrics:
            - 'accuracy': The accuracy of the model on the test set.
            - 'precision': The precision of the model on the test set.
            - 'recall': The recall of the model on the test set.
            - 'f1': The F1 score of the model on the test set.
            - 'auc': The area under the ROC curve for the model on the test set.
    """
    # Refit the supplied model using only the selected columns
    model.fit(X_train[features], y_train)
    predictions = model.predict(X_test[features])
    # Probability scores of the positive class
    probabilities = model.predict_proba(X_test[features])[:, 1]
    return {
        'accuracy': accuracy_score(y_test, predictions),
        'precision': precision_score(y_test, predictions),
        'recall': recall_score(y_test, predictions),
        'f1': f1_score(y_test, predictions),
        'auc': roc_auc_score(y_test, probabilities),
    }
# Function to print the scores for every feature category
def print_score(model):
    """
    This function evaluates the given model on different sets of features: objective, examination,
    subjective, and a combination of all. It then prints out the performance metrics for each feature set.

    Parameters:
        model (model object): The machine learning model to be evaluated. It should already be instantiated
            and capable of fitting data and making predictions.

    The function calls `evaluate_model` for each set of features and prints the results, which include
    accuracy, precision, recall, F1 score, and AUC metrics.

    Returns:
        None: This function does not return anything but directly prints the evaluation results.
    """
    # Evaluating models based on different feature sets
    objective_results = evaluate_model(objective_features, model)
    examination_results = evaluate_model(examination_features, model)
    subjective_results = evaluate_model(subjective_features, model)
    combined_results = evaluate_model(objective_features + examination_features + subjective_features, model)
    # Display the scores
    print(f"Objective_results:\n{objective_results}\n")
    print(f"Subjective_results:\n{subjective_results}\n")
    print(f"Examination_results:\n{examination_results}\n")
    print(f"Combined_results:\n{combined_results}")
Grid Search
A grid search was used to tune the hyperparameters of the Random Forest and Gradient Boosting classifiers. Cardiovascular risk factors often show nonlinear relationships and interactions between features, so models that can capture such structure were prioritized. Among the initial untuned models, Random Forest reached a training accuracy of nearly 1.00 but a test accuracy of only 0.71, indicating overfitting.
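A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The grid below includes the best Gradient Boosting parameters reported in this section, but the other candidate values, the `roc_auc` scoring choice, the 3-fold setting, and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; kept small so the search finishes quickly
X_demo, y_demo = make_classification(n_samples=400, n_features=11, random_state=0)

# Hypothetical candidate values; only (0.1, 3, 200) is taken from the reported best parameters
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [3],
    "n_estimators": [100, 200],
}

# Every parameter combination is scored with 3-fold cross-validated ROC AUC
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all supplied rows with the winning parameters, which could then be passed to an evaluation helper.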
Random Forest Model
Best Parameters: ‘max_depth’: 10, ‘n_estimators’: 100
results:
Objective_results: ‘accuracy’: 0.6215, ‘precision’: 0.6143818334735072, ‘recall’: 0.6323762804790074, ‘f1’: 0.623249200142197, ‘auc’: 0.66849463679522
Subjective_results: ‘accuracy’: 0.5197857142857143, ‘precision’: 0.5337881741390513, ‘recall’: 0.23705093060164478, ‘f1’: 0.328304525926666, ‘auc’: 0.5200202309452965
Examination_results: ‘accuracy’: 0.7272142857142857, ‘precision’: 0.7521062864549579, ‘recall’: 0.6697446255951522, ‘f1’: 0.7085400290009922, ‘auc’: 0.773885968797907
Combined_results: ‘accuracy’: 0.7352142857142857, ‘precision’: 0.7600838980316231, ‘recall’: 0.6796998990044727, ‘f1’: 0.7176479549089801, ‘auc’: 0.8017637795378445
Gradient Boost Classifier
Best Parameters: ‘learning_rate’: 0.1, ‘max_depth’: 3, ‘n_estimators’: 200
results:
Objective_results: ‘accuracy’: 0.6236428571428572, ‘precision’: 0.6180062482249361, ‘recall’: 0.6279036214110518, ‘f1’: 0.6229156229871896, ‘auc’: 0.6715617613376679
Subjective_results: ‘accuracy’: 0.5197857142857143, ‘precision’: 0.5337881741390513, ‘recall’: 0.23705093060164478, ‘f1’: 0.328304525926666, ‘auc’: 0.5200202309452965
Examination_results: ‘accuracy’: 0.7273571428571428, ‘precision’: 0.7552459016393442, ‘recall’: 0.6646948492281056, ‘f1’: 0.7070831095080959, ‘auc’: 0.7745590239900658
Combined_results: ‘accuracy’: 0.7357142857142858, ‘precision’: 0.7528564720613554, ‘recall’: 0.6939835521569759, ‘f1’: 0.7222222222222222, ‘auc’: 0.8026271083196471
After progressively testing Random Forest, Gradient Boosting Classifier, and Logistic Regression (with and without cross-validation), and running a grid search to tune the parameters of both Random Forest and Gradient Boosting Classifier, the chosen model was the Gradient Boosting Classifier (GBC), which achieved the highest AUC (0.8026) and F1 score (0.7222) on the combined feature set.
Feature Importance
The feature importance plot highlights the most significant predictors of cardiovascular disease in the dataset:
- Systolic Blood Pressure: The most influential feature.
- Cholesterol: The second most important predictor.
- Age: Also significantly contributes to predicting cardiovascular disease.
- Diastolic Blood Pressure: Has a smaller but notable impact.
- Other features such as weight, physical activity, smoking, height, gender, alcohol consumption, and glucose have lesser importance.
These insights help identify which factors are most critical for the model in predicting cardiovascular disease.
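A ranking like the one above can be read from the `feature_importances_` attribute of a fitted tree ensemble. The sketch below reuses the feature names from this report but trains on synthetic data in which the target is deliberately driven by two columns, so the ranking it prints is illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

feature_names = [
    "age", "height", "weight", "gender",
    "systolic_b_pressure", "diastolic_b_pressure", "cholesterol", "glucose",
    "smoke", "alcohol", "physically_active",
]

# Synthetic data: the target depends mostly on systolic pressure, then age (an assumption)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, len(feature_names))), columns=feature_names)
signal = X_demo["systolic_b_pressure"] + 0.5 * X_demo["age"]
y_demo = (signal + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are non-negative and sum to 1 across features
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances.head())
```

Impurity-based importances can be skewed by correlated features, such as the two blood pressure measures here, so permutation importance is a common cross-check.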
Area Under the Curve
The Gradient Boosting Classifier demonstrated a more balanced performance, with closer training and test accuracies. Logistic Regression provided a baseline comparison, showing the need for more finely tuned methods to capture the nuanced patterns of cardiovascular risk factors. Subsequently, the fine-tuning of the Gradient Boosting Classifier aimed to retain its predictive power while mitigating overfitting and improving generalization to unseen data.
The performance metrics chosen for evaluation, AUC and F1 Score, were critical in providing a comprehensive assessment of each model’s ability to accurately classify individuals in terms of their cardiovascular disease risk. These metrics were specifically selected to balance the importance of both precision and recall in medical predictions.
Benchmark for Success:
- AUC: The goal is ≥ 0.75, reflecting the model’s ability to discriminate between individuals with and without cardiovascular disease.
- F1 Score: A target of ≥ 0.70, indicating an effective balance between precision and recall in classification.
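The two thresholds can be checked programmatically once predictions are available. In the sketch below, the `meets_benchmark` helper and the six example labels and probabilities are hypothetical; only the 0.75 and 0.70 targets come from the benchmark above.

```python
from sklearn.metrics import f1_score, roc_auc_score

AUC_TARGET = 0.75  # benchmark stated above
F1_TARGET = 0.70   # benchmark stated above

def meets_benchmark(y_true, y_pred, y_prob):
    """Compute AUC and F1 and report whether both reach their targets."""
    auc = roc_auc_score(y_true, y_prob)   # uses predicted probabilities
    f1 = f1_score(y_true, y_pred)         # uses hard class predictions
    passed = bool(auc >= AUC_TARGET and f1 >= F1_TARGET)
    return {"auc": auc, "f1": f1, "passed": passed}

# Made-up example: six individuals with true labels and predicted probabilities
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2]
y_pred = [int(p >= 0.5) for p in y_prob]  # threshold at 0.5

result = meets_benchmark(y_true, y_pred, y_prob)
print(result)  # auc = 8/9 (about 0.889), f1 = 0.8, passed = True
```

In this example the classifier finds two of the three positives with no false positives, giving precision 1.0 and recall 2/3, hence F1 = 0.8. Eight of the nine positive-negative pairs are ranked correctly, giving AUC = 8/9.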
The fine-tuned model is well-tailored for predicting cardiovascular diseases and adept at navigating the nonlinear relationships and complex interactions typical of medical data.
The evaluation of these models using AUC and F1 Score metrics offers a thorough analysis of their classification accuracy concerning cardiovascular disease risk. These metrics are crucial, as they encapsulate both precision and recall, providing a balanced view of model performance in medical diagnostics.
Results
The results from the Gradient Boosting Classifier underscore this suitability. For the combined feature set, the model achieved an AUC of 0.8026 and an F1 Score of 0.7222, indicative of its strong predictive capability and balanced precision-recall trade-off. Similarly, examination features alone resulted in an AUC of 0.7746 and an F1 Score of 0.7071, further affirming the model’s effectiveness.
In contrast, objective and subjective features yielded lower performance, with AUCs of 0.6716 and 0.5200 respectively, highlighting the increased predictive power when leveraging a comprehensive set of features. These benchmarks validate the model’s efficacy, particularly when utilizing a holistic approach that integrates various data types, leading to superior prediction accuracy.
Thus, the fine-tuned Gradient Boosting Classifier meets both benchmarks on the combined feature set and promises reliable, actionable insights for cardiovascular disease prediction and management.
Practical Significance
Clinical Impact: The model’s practical significance will be evaluated by its ability to enhance early detection of cardiovascular diseases, thereby facilitating timely medical interventions. A significant reduction in late-stage diagnosis rates of CVD among the screened population would demonstrate the model’s practical value.
For example, if the model is integrated into routine health check-ups, its effectiveness can be measured by the increased rate of early-stage CVD detection and the corresponding improvement in patient management and treatment outcomes.
Healthcare Cost Reduction: Another crucial aspect of practical significance is the model’s impact on healthcare costs. By preventing advanced stages of cardiovascular diseases through early intervention, the model should lead to a noticeable decrease in the financial burden associated with CVD treatments, such as hospital admissions, surgeries, and long-term care.
This cost reduction can be quantified by comparing the healthcare expenses incurred before and after implementing the predictive model in clinical practice.