Building a Predictive Model for Chronic Heart Disease Using Logistic Regression

“Some things are so unexpected that no one is prepared for them.” - Leo Rosten, in Rome Wasn’t Burned in a Day

Sheik Jamil Ahmed
DataDuniya
6 min read · Jun 22, 2023


Photo by National Cancer Institute on Unsplash

According to the World Health Organization, cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year. CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age.

This global emergency is the motivation for this article, which predicts heart disease using logistic regression in a step-by-step approach.

  1. Dataset

The dataset used in this article is taken from Kaggle. Dataset Link- https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

Click on the link and download the Heart Disease Dataset.

2. Google Colab

Open a new tab in the browser, search for “Google Colab” and create a “New Notebook”.

Next, upload the dataset to the notebook and follow the steps below to build your own predictive heart disease model.

3. Importing Libraries

The project uses the following libraries: numpy and pandas, and from sklearn the LogisticRegression class and the train_test_split and accuracy_score functions. The code is given below

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

4. Reading the Dataset

The uploaded dataset is read into a pandas dataframe with the following code

#reading the csv file in the pandas dataframe
hdata=pd.read_csv('/content/heart_disease_data.csv')

5. Printing the Dataset

The variable hdata now holds the dataframe. Let us view the first five rows of the dataset in order to see the columns and the data they contain.

#Printing the first 5 rows of the dataset
hdata.head()

The column “target” is the label, where the value 0 indicates a Normal Heart and 1 indicates a Diseased Heart.

The dataset has 13 features: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca and thal.

The Attribute Information is provided below
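The attribute table itself is not reproduced in this text. As a stand-in, the sketch below records the column meanings as they are commonly documented for the UCI heart disease data that this kind of file is derived from. The dictionary name attribute_info is hypothetical, and the descriptions are an assumption worth checking against the Kaggle dataset page.

```python
# Column descriptions as commonly documented for the UCI heart disease data.
# Assumption: the uploaded CSV follows the same conventions.
attribute_info = {
    "age": "age in years",
    "sex": "sex (1 = male, 0 = female)",
    "cp": "chest pain type (categorical code)",
    "trestbps": "resting blood pressure (mm Hg)",
    "chol": "serum cholesterol (mg/dl)",
    "fbs": "fasting blood sugar > 120 mg/dl (1 = true, 0 = false)",
    "restecg": "resting electrocardiographic results (categorical code)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise-induced angina (1 = yes, 0 = no)",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment",
    "ca": "number of major vessels colored by fluoroscopy",
    "thal": "thal result code (normal, fixed defect or reversible defect)",
}

# Print the names and descriptions as a simple two-column listing
for name, description in attribute_info.items():
    print(f"{name:10s}{description}")
```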

Similarly, let us look at the last five rows of the dataset. The code to view them is given below

#Printing the last 5 rows of the dataset
hdata.tail()

Finding the Shape of the Dataset

Now let’s check the shape of the dataset, that is, its number of rows and columns.

#Finding the shape of the dataset
hdata.shape

The output is (303, 14), that is, 303 rows (samples) and 14 columns (variables).

6. Dataset Information

Let’s dig deeper and get more information about the dataset. This is achieved by the following code

#Getting Information of the dataset
hdata.info()

The above output can be interpreted as :

  • There are 303 rows, indexed from 0 to 302.
  • There are 14 data columns in total.
  • Every column has 303 non-null entries, which means that there is “no missing value” in the dataset.
  • The data type of each column is listed. Here 13 columns are of type int64 and 1 column is of type float64.
  • The memory usage of the dataframe is also reported.

There is an alternative way to check for missing values. The code is provided below

#Checking the missing values
hdata.isnull().sum()

The output confirms that there are no missing values in any column of the dataset.
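To see what this check looks like when something is missing, here is a minimal sketch on a toy dataframe. The values and the name toy are made up for illustration and are not taken from the heart disease file.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one missing cholesterol reading
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233.0, np.nan, 204.0],
    "target": [1, 1, 0],
})

# Count of missing values per column, the same call used on hdata above
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# A single boolean that answers "is anything missing at all?"
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

A nonzero count in any column would mean those rows need to be dropped or filled before training.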

Let us review the summary statistics of the dataset using the following code.

#statistical information about the dataset
hdata.describe()

The output lists the count, mean, standard deviation, minimum, first, second and third quartiles (25%, 50%, 75%) and maximum of every column of the dataset. This gives a good first insight into the data.

Example- The output shows that the minimum age is 29 years and the maximum is 77 years. The 50% row (the median) is 55, so about half of the patients are 55 or younger. The mean age is 54.36 years, and the standard deviation and quartiles show how the ages are spread.
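As a small illustration of how to read the 50% row, the sketch below runs describe() on a made-up list of seven ages (not the real column) and compares the 50% entry with the median computed directly.

```python
import pandas as pd

# Seven made-up ages, chosen only to make the quartiles easy to follow
ages = pd.Series([29, 41, 50, 55, 60, 66, 77], name="age")

summary = ages.describe()
print(summary)

# The "50%" row of describe() is the median of the column
median_age = ages.quantile(0.5)
print(median_age)
```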

7. Checking for Class Distribution

We also need to check the distribution of the classes, so that one class does not dominate the dataset. In other words, we check that this is not a “class imbalanced” dataset.

#checking the distribution of the Target variable
hdata['target'].value_counts()

The output shows 165 samples of Diseased Heart (target 1) and 138 samples of Normal Heart (target 0), so the two classes are reasonably balanced.
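The counts are easier to judge as proportions. The sketch below rebuilds a label column with the same counts reported above (it does not read the real file) and uses value_counts(normalize=True) to get the share of each class.

```python
import pandas as pd

# Stand-in label column with the counts reported above: 165 ones, 138 zeros
target = pd.Series([1] * 165 + [0] * 138, name="target")

counts = target.value_counts()
shares = target.value_counts(normalize=True)

print(counts)
print(shares.round(3))  # roughly 0.545 for class 1 and 0.455 for class 0
```

A split of about 54% to 46% is close enough to even that no resampling is needed for this walkthrough.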

8. Splitting the Dataset into Features and Label

The dataset is partitioned into X and Y, where X holds the feature columns and Y holds the class label.

X=hdata.drop(columns='target')
Y=hdata['target']

In the above code :

  • X contains all features except the column “target”, which is removed with drop(). Passing columns='target' already tells pandas to drop a column, so axis=1 is not needed.
  • Y is assigned the column “target”.

Let's see the values of X and Y

X
X contains all Features except the target
Y
Y contains the target

9. Splitting X and y in Training and Testing Dataset

X and Y are split into X_train, X_test, Y_train and Y_test using the train_test_split() function. Here, 80% of the dataset is used for training and 20% for testing, which is set with test_size=0.2. The argument stratify=Y keeps the ratio of the two classes the same in both parts, and random_state=2 makes the split reproducible.

X_train,X_test,Y_train,Y_test=train_test_split(X,Y, test_size=0.2,stratify=Y,random_state=2)

The shape of training and testing data is displayed as follows

print(X_train.shape,X_test.shape)

So, 242 samples are used for training and 61 samples for testing.
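To check what stratification does, the following sketch repeats the split on synthetic data of the same size: 303 rows, 13 random features and 165 positive labels. The names X_demo and y_demo are placeholders, and the feature values are random noise rather than patient data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape and class counts as the dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = np.array([1] * 165 + [0] * 138)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# Sizes of the two parts
print(X_tr.shape, X_te.shape)

# Share of class 1 overall, in the training part and in the test part
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

With stratify set, the three printed shares stay close to each other. Without it, a small test set can end up with a noticeably different class mix.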

10. Model Training using Logistic Regression

Now it is time to create a logistic regression model and fit it to the training data. The code is given below:

model=LogisticRegression()
model.fit(X_train,Y_train)
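The features are on very different scales (cholesterol is in the hundreds while oldpeak is a single-digit number), and with the default settings scikit-learn may print a ConvergenceWarning on unscaled data. One common variation, which is not what this article does, is to standardize the features inside a pipeline and raise max_iter. The sketch below shows that variation on synthetic data from make_classification; the names scaled_model and X_demo are hypothetical, and the accuracy it prints has nothing to do with the heart disease results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with the same shape as the dataset
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# Scaling is learned from the training part only, then applied to the test part
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

test_accuracy = scaled_model.score(X_te, y_te)
print(test_accuracy)
```

Putting the scaler in the pipeline means the test data never influences the scaling, which keeps the evaluation honest.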

11. Model Prediction

Now it is time to test the model on data it has not seen during training. The model.predict() function returns the predicted class for each row of X_test. The code is given below

y_predict=model.predict(X_test)
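Behind predict(), a binary logistic regression computes a weighted sum of the features, passes it through the sigmoid function to get a probability, and assigns class 1 when that probability is above 0.5. The sketch below reproduces this by hand on synthetic data and compares it with scikit-learn's own predict_proba() and predict(). The variable names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary problem, used only to inspect the fitted model
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b for every row
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# Sigmoid turns the score into the probability of class 1
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = clf.predict_proba(X_demo)[:, 1]

# Thresholding the probability at 0.5 gives the predicted label
labels_manual = (p_manual > 0.5).astype(int)

print(np.allclose(p_manual, p_sklearn))
print(np.array_equal(labels_manual, clf.predict(X_demo)))
```

On the heart disease model, model.predict_proba(X_test) would give the same kind of probabilities, which can be more informative than the bare 0 or 1.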

12. Model Evaluation- Accuracy

With the predictions in hand, it is time to see how accurately the model has performed its task. The accuracy is computed using the code below

accuracy=accuracy_score(Y_test,y_predict)
accuracy

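The output is an accuracy of 81.96%, that is, 50 of the 61 test samples classified correctly, which is fairly good for a simple baseline. As a rough rule of thumb, anything above 75% is often considered a good model, although accuracy alone does not show which kinds of errors the model makes.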
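Accuracy counts every mistake the same way, but for a disease model a missed patient and a false alarm are different errors. A confusion matrix separates them. The sketch below uses ten made-up labels, not the article's actual predictions, to show how the matrix, the accuracy and a per-class report relate.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels for illustration only (1 = Diseased, 0 = Normal)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Accuracy is the share of correct predictions: (tp + tn) / total
acc = accuracy_score(y_true, y_pred)
print(acc)

# Precision and recall for each class
print(classification_report(y_true, y_pred, target_names=["Normal", "Diseased"]))
```

On the real model the same calls would take Y_test and y_predict. The false negatives (fn) are the diseased patients the model missed.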

Note- One friendly piece of advice is to practise the code provided as you go through this article.

Building a Predictive System

Before we come to the end of this project, the real challenge is to make a prediction when new data is provided.

In the following code, 13 feature values are provided. Let's see what the model predicts

# 13 feature values, in the same order as the columns of X
inputData=(75,0,2,145,233,1,0,150,0,2.3,0,0,1)
input_array_data=np.asarray(inputData)
# reshape to one row (1 sample, 13 features), as predict() expects a 2D array
input_data_reshaped=input_array_data.reshape(1,-1)

prediction=model.predict(input_data_reshaped)

if (prediction[0]==1):
    print('The Person has a Heart Disease')
else:
    print('The Person does not have Heart Disease')
The Person has a Heart Disease

The input inputData=(75,0,2,145,233,1,0,150,0,2.3,0,0,1) holds the values of the 13 features, and the model predicts that “The Person has a Heart Disease”.

You can provide your own data in inputData and check what the model predicts.
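When a model is fitted on a dataframe and then given a bare numpy array, recent scikit-learn versions warn that the input has no feature names. One way around this, sketched below, is to wrap the new values in a one-row dataframe with the same column names. The model here (demo_model) is trained on random synthetic data purely so the example runs on its own, so its prediction has no medical meaning; with the real model you would pass the dataframe to model.predict() instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

# Synthetic training data, only to have a fitted model to call
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 13)), columns=feature_names)
y_demo = (X_demo["age"] + X_demo["chol"] > 0).astype(int)
demo_model = LogisticRegression().fit(X_demo, y_demo)

# One new sample as a one-row dataframe with matching column names
new_patient = pd.DataFrame(
    [(75, 0, 2, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1)], columns=feature_names
)

prediction = demo_model.predict(new_patient)
probability = demo_model.predict_proba(new_patient)[0, 1]
print(prediction[0], probability)
```

Naming the columns also guards against passing the 13 values in the wrong order.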

Follow me for more such articles on Machine Learning and Deep Learning.


I write about Python, Machine Learning, Deep Learning, NLP, Image Processing and related technical topics.