Prediction Model of Heart Disease With Logistic Regression.
1. Introduction
The World Health Organization (WHO) has estimated that 12 million deaths occur worldwide due to heart disease. In Brazil, more than 289 thousand people died of cardiovascular diseases in 2019, according to the Cardiometer platform of the Brazilian Society of Cardiology (SBC).
1.1 Problem
With the arrival of the Covid-19 pandemic, hospitals are overcrowded with people infected with Covid-19, in addition to patients with other diseases who were already present in large numbers. Those hospitalized with Covid-19 include the risk groups: diabetics, people with hypertension, and people with heart problems.
2. Solution
In view of the above, I decided to develop a machine learning model that uses logistic regression to predict whether a patient has a 10-year risk of future cardiovascular disease.
Logistic regression is a regression analysis model widely used in mathematics and statistics to predict the outcome of a categorical dependent variable from a set of predictor (independent) variables. In binary logistic regression, the form used here, the dependent variable takes one of two values. Logistic regression is used mainly for prediction and to estimate the probability of success (the positive outcome).
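To make the idea concrete, here is a minimal sketch of how a fitted logistic regression turns predictor values into a probability and then into a binary class. The function names, the weights, and the 0.5 threshold are illustrative assumptions, not values taken from the model in this article.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value between 0 and 1 (numerically stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(features, weights, bias):
    """Probability of the positive class for one observation."""
    # Linear combination of the predictors, then squashed by the sigmoid.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_class(features, weights, bias, threshold=0.5):
    """Binary decision: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


# Hypothetical weights for two predictors, for illustration only.
print(predict_probability([1.0, 2.0], weights=[0.5, -0.25], bias=0.0))  # 0.5
```

In the usage line, the linear term is 0.5 * 1.0 - 0.25 * 2.0 = 0, and the sigmoid of 0 is exactly 0.5.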
2.1 Data Preparation
The data set used in the sample is publicly available on the Kaggle website and comes from a study of residents of the city of Framingham, Massachusetts.
For ease of use and reuse, I put it in a repository on my GitHub.
https://github.com/carlospy98/datasets-tests/blob/master/framingham.csv
2.2 Dataset
Selection of dependent and independent variables.
dependent variable: TenYearCHD
independent variables: age, Sex_male, cigsPerDay, totChol, sysBP, glucose.
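As a rough sketch, the selection above could be expressed with pandas as follows. The helper name split_features_target and the three example rows are made up for illustration; only the column names come from the lists above.

```python
import pandas as pd

# Column names as listed above.
FEATURES = ["age", "Sex_male", "cigsPerDay", "totChol", "sysBP", "glucose"]
TARGET = "TenYearCHD"


def split_features_target(df: pd.DataFrame):
    """Return the independent variables (X) and the dependent variable (y)."""
    X = df[FEATURES]
    y = df[TARGET]
    return X, y


# Made-up rows for illustration only; these are not taken from the real data set.
example = pd.DataFrame({
    "age": [39, 46, 61],
    "Sex_male": [1, 0, 1],
    "cigsPerDay": [0, 0, 30],
    "totChol": [195, 250, 225],
    "sysBP": [106.0, 121.0, 150.0],
    "glucose": [77, 76, 103],
    "BMI": [26.97, 28.73, 28.58],  # present in the data, but not selected
    "TenYearCHD": [0, 0, 1],
})

X, y = split_features_target(example)
print(X.shape)  # (3, 6)
```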
Below is the number of columns and the sample data dictionary.
Sex_male: male or female
age: age of the patient
education: level of education
currentSmoker: whether or not the patient is a current smoker
cigsPerDay: the number of cigarettes the person smoked on average in one day
BPMeds: whether or not the patient was on blood pressure medication
prevalentStroke: whether or not the patient had previously had a stroke
prevalentHyp: whether or not the patient was hypertensive
diabetes: whether or not the patient had diabetes
totChol: total cholesterol level
sysBP: systolic blood pressure
diaBP: diastolic blood pressure
BMI: Body Mass Index
heartRate: heart rate
glucose: glucose level
TenYearCHD: target variable to predict; 1 means “Yes”, 0 means “No”
2.3 Data cleaning
First, I checked if there was null data in the sample.
Then I checked the percentage of null data in the sample. Since it was only 14%, I decided to remove the null data.
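A minimal sketch of these two checks with pandas is shown below. It assumes the percentage is measured as the share of rows with at least one missing value, which the text does not specify, and it uses a small made-up table rather than the real data. The helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def null_report(df: pd.DataFrame) -> pd.Series:
    """Number of missing values per column, largest first."""
    return df.isnull().sum().sort_values(ascending=False)


def percent_rows_with_nulls(df: pd.DataFrame) -> float:
    """Share of rows (in percent) that contain at least one missing value."""
    return 100.0 * df.isnull().any(axis=1).mean()


def drop_null_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove every row that has a missing value and renumber the index."""
    return df.dropna(axis=0).reset_index(drop=True)


# Made-up values for illustration only.
toy = pd.DataFrame({
    "glucose": [77.0, np.nan, 70.0, 103.0],
    "totChol": [195.0, 250.0, np.nan, 225.0],
    "TenYearCHD": [0, 0, 0, 1],
})

print(null_report(toy))
print(percent_rows_with_nulls(toy))  # 50.0
clean = drop_null_rows(toy)
```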
2.4 EDA and Data Normalization
I plotted the values to check whether the data followed a normal distribution.
Using the boxcox function from scipy, I normalized age so that it approximates a Gaussian distribution.
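The Box-Cox step can be sketched as below. Box-Cox is a power transform that requires strictly positive input, and scipy.stats.boxcox returns both the transformed values and the fitted lambda. The ages here are synthetic, right-skewed values generated for illustration, standing in for the real age column.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed, strictly positive values standing in for the age column.
rng = np.random.default_rng(0)
ages = rng.lognormal(mean=3.8, sigma=0.25, size=500)

# With no lambda given, boxcox estimates it by maximum likelihood.
transformed, fitted_lambda = stats.boxcox(ages)

# Skewness close to 0 indicates a more symmetric, more Gaussian-like shape.
print("skew before:", stats.skew(ages))
print("skew after: ", stats.skew(transformed))
print("lambda:     ", fitted_lambda)
```

Comparing the skewness before and after is one simple way to check that the transformation moved the variable toward a Gaussian shape; a histogram or Q-Q plot would serve the same purpose visually.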
3. Create Model
Using the logistic regression model, I obtained an accuracy of 86%.
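For readers who want to reproduce the general workflow, here is a hedged sketch of training and scoring a logistic regression with sklearn. The data is synthetic, and the 80/20 split, the random_state values, and max_iter are assumptions, so the printed accuracy will not match the 86% reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for two predictors (age and systolic blood pressure).
rng = np.random.default_rng(42)
n = 1000
age = rng.normal(50, 8, n)
sys_bp = rng.normal(130, 20, n)
X = np.column_stack([age, sys_bp])

# Synthetic labels: the risk rises with both predictors.
logit = 0.08 * (age - 50) + 0.03 * (sys_bp - 130) - 1.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Hold out 20% of the rows for testing (an assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")
```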
3.1 Confusion Matrix
In machine learning, and specifically in statistical classification, a confusion matrix (also known as an error matrix) is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. It allows the visualization of the performance of an algorithm.
It also allows easy identification of confusion between classes, e.g., one class being commonly mislabeled as the other. Most performance measures are computed from the confusion matrix.
source: https://www.geeksforgeeks.org/confusion-matrix-machine-learning/
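As a small illustration, sklearn's confusion_matrix can build this table from true and predicted labels. The labels below are made up for the example and are not the predictions of the model in this article.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = disease, 0 = no disease.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0]

# Rows are the true classes, columns are the predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)  # tn=4, fp=1, fn=2, tp=1
```

With labels=[0, 1], the top-left cell counts true negatives and the bottom-right cell counts true positives.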
Result of the confusion matrix:
4. Conclusion
Men seem to be more susceptible to heart disease than women. Increasing age, number of cigarettes smoked per day and systolic blood pressure also show increasing chances of heart disease.
The attributes kept after the selection process of the dependent and independent variables showed p-values below 5%, suggesting that they play a significant role in the prediction of heart disease.
The model reached an accuracy of 0.86. It is more specific than sensitive; that is, it identifies patients without the disease more reliably than patients with it.
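Sensitivity and specificity are both derived from the four cells of a binary confusion matrix. The sketch below uses made-up counts, chosen only to show a case where specificity is higher than sensitivity; they are not the counts obtained in this article.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of actual positives that the model detects (true positive rate)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of actual negatives that the model correctly rejects (true negative rate)."""
    return tn / (tn + fp)


def accuracy(tn: int, fp: int, fn: int, tp: int) -> float:
    """Share of all predictions that are correct."""
    return (tn + tp) / (tn + fp + fn + tp)


# Hypothetical counts for illustration only.
tn, fp, fn, tp = 80, 20, 30, 20

print(f"sensitivity: {sensitivity(tp, fn):.2f}")  # 0.40
print(f"specificity: {specificity(tn, fp):.2f}")  # 0.80
print(f"accuracy:    {accuracy(tn, fp, fn, tp):.2f}")  # 0.67
```

When positives are much rarer than negatives, accuracy can stay high even with low sensitivity, which is why it helps to report the two rates alongside it.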
The overall model could be improved with more data.
5. Technology
Technologies and libraries: Python 3, pandas, scipy, seaborn, sklearn, numpy.
I hope you enjoyed reading, and thank you. 😊
6. Contact
LinkedIn: https://www.linkedin.com/in/carlos-barbosa-046a9716b/
GitHub: https://github.com/carlospy98
E-mail: carlosdspy@gmail.com
Instagram: @carlosb.py