HEART DISEASE PREDICTION

Abhinav Chintale
Published in The Startup · Sep 16, 2020

ABSTRACT:

This is a data analysis project on heart disease prediction. The project takes raw data in the form of a .csv file and transforms it through data analysis. It is an attempt to analyze heart disease prediction with the help of data science and data analytics in Python. Heart disease is one of the biggest causes of morbidity and mortality in the world's population, and prediction of cardiovascular disease is regarded as one of the most important subjects in clinical data analysis. The amount of data in the healthcare industry is huge; data mining turns this large collection of raw healthcare data into information that can help make informed decisions and predictions.

Coronary Heart Disease (CHD) is the most common type of heart disease, killing over 370,000 people annually. Every year, about 735,000 Americans have a heart attack. Of these, 525,000 are first heart attacks and 210,000 happen in people who have already had one. This makes heart disease a major concern to be dealt with. But it is difficult to identify heart disease because of several risk factors such as diabetes, high blood pressure, high cholesterol, abnormal pulse rate, and many others. Because of these factors, scientists have turned to modern approaches like data mining and machine learning for predicting the disease.

In this article, I will apply data analytics as well as one machine learning approach to classify whether a person is suffering from heart disease or not, using one of the most widely used datasets: the Cleveland Heart Disease dataset from the UCI Repository.

Importing Libraries:

Imported Libraries
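Taken together, the imports discussed below would make up a cell like the following sketch (the `filterwarnings` line is an assumption; the article imports warnings but does not show how it is configured):

```python
import numpy as np                 # arrays, linear algebra
import pandas as pd                # data loading and manipulation
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # statistical graphics on top of matplotlib
import os                          # operating-system utilities
import warnings

# Assumption: notebooks commonly silence warnings this way; the article
# does not show its warnings configuration.
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
```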

import numpy as np

NumPy is a Python library for working with arrays. It also has functions for working in the domain of linear algebra, Fourier transforms, and matrices. It was created in 2005 by Travis Oliphant, and it is an open-source project that you can use freely. NumPy stands for Numerical Python. In Python, we have lists that serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists. Arrays are used very frequently in data science, where speed and resources are important.

import pandas as pd

Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use data structures and data analysis tools. Python with Pandas is used in a wide range of fields, including academic and commercial domains such as finance, economics, statistics, and analytics. Before Pandas, Python was used mainly for data munging and preparation and contributed little to data analysis itself; Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of its origin: load, prepare, manipulate, model, and analyze.

import matplotlib.pyplot as plt

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. Pyplot is a Matplotlib module that provides a MATLAB-like interface. Matplotlib is designed to be as usable as MATLAB, with the ability to use Python and the advantage of being free and open-source. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+.

import seaborn as sns

Seaborn is a library for making statistical graphics in Python. It is built on top of matplotlib and closely integrated with pandas data structures. Seaborn aims to make visualization a central part of exploring and understanding data. Its dataset-oriented plotting functions operate on data frames and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.

import os

The OS module in Python makes it possible to perform many operating-system tasks automatically. It provides functions for creating and removing a directory (folder), fetching its contents, and changing and identifying the current directory.

import warnings

Warning messages are typically issued in situations where it is useful to alert the user of some condition in a program, where that condition (normally) doesn’t warrant raising an exception and terminating the program. For example, one might want to issue a warning when a program uses an obsolete module. Warning messages are normally written to sys.stderr, but their disposition can be changed flexibly, from ignoring all warnings to turning them into exceptions. The disposition of warnings can vary based on the warning category, the text of the warning message, and the source location where it is issued. Repetitions of a particular warning for the same source location are typically suppressed.

from sklearn.model_selection import train_test_split

model_selection is Scikit-learn's module for setting up a blueprint to analyze data and then using it to measure new data. Selecting a proper model allows you to generate accurate results when making a prediction. To do that, you need to train your model on a specific dataset and then test it against another dataset. If you have only one dataset, you'll need to split it using Sklearn's train_test_split function first.

train_test_split is a function in Sklearn model selection for splitting data arrays into two subsets: for training data and for testing data. With this function, you don’t need to divide the dataset manually. By default, Sklearn train_test_split will make random partitions for the two subsets. However, you can also specify a random state for the operation.

from sklearn.metrics import accuracy_score

The accuracy_score function computes the accuracy classification score. In multilabel classification, it computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
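As a small illustration with hypothetical labels (not from the article's data):

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]   # hypothetical actual labels
y_pred = [0, 1, 0, 0, 1]   # hypothetical predicted labels

# 4 of the 5 predictions match the true labels, so accuracy is 4/5
print(accuracy_score(y_true, y_pred))  # → 0.8
```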

from sklearn.linear_model import LogisticRegression

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.

That covers all the libraries that were imported and the basics of what each one is used for.

Now as we have imported the required Libraries, let us dive deep into the fun stuff.

Loading dataset and checking the first five rows.

hearts = pd.read_csv("hearts.csv")

read_csv is a built-in function of the Pandas library that reads .csv files; here its result is assigned to the hearts variable.

hearts.head()

The head() function is used to get the first n rows. This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it. By default, n=5.

Output after describe function is used.

hearts.describe()

The describe() function computes summary statistics for the DataFrame's columns. It gives the mean, std, and quartile (IQR) values, excludes character columns, and summarizes only the numeric columns.

Output after .info() is used.

hearts.info()

The info() function prints a concise summary of a DataFrame, including the index dtype and column dtypes, non-null counts, and memory usage. The output above clearly shows that there are no missing values.

For a better understanding of the given data, the above code helps us analyze the columns in a better way.
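The code behind this step appears only as a screenshot in the original post; a sketch that prints what each column represents (column names per the UCI Cleveland dataset, descriptions worded by me) might look like:

```python
# Column meanings for the Cleveland Heart Disease dataset (UCI).
# The article's exact code is not shown; this is an illustrative sketch.
column_info = {
    "age":      "age in years",
    "sex":      "1 = male, 0 = female",
    "cp":       "chest pain type (0-3)",
    "trestbps": "resting blood pressure (mm Hg)",
    "chol":     "serum cholesterol (mg/dl)",
    "fbs":      "fasting blood sugar > 120 mg/dl (1 = true, 0 = false)",
    "restecg":  "resting electrocardiographic results (0-2)",
    "thalach":  "maximum heart rate achieved",
    "exang":    "exercise-induced angina (1 = yes, 0 = no)",
    "oldpeak":  "ST depression induced by exercise relative to rest",
    "slope":    "slope of the peak exercise ST segment (0-2)",
    "ca":       "number of major vessels colored by fluoroscopy (0-4)",
    "thal":     "thalassemia type (0-3)",
    "target":   "1 = heart disease, 0 = no heart disease",
}

for name, meaning in column_info.items():
    print(f"{name:>8}: {meaning}")
```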

EXPLORATORY DATA ANALYSIS (EDA)

tar = hearts["target"]

Here the tar variable is assigned the target column.

hearts["target"].unique()

Here we use the unique() function, which returns all the unique values in the given column of the dataset.

target_temp = hearts.target.value_counts()

The above code counts the occurrences of each value in the target column.

print(target_temp)

And this code prints the values.

Here “1” is the number of people suffering from heart disease and “0” is the number of people who are not suffering from heart disease. Hence the number of people suffering from heart disease is “165” and the number of people not suffering from heart disease is “138”.

Clearly, from this, we can see that this is a classification problem, with the target variable taking the values “0” and “1”.

sns.countplot(tar)

Countplot shows the counts of observations in each categorical bin using bars.

Calculating percentages.

Here we find the percentages of people who are suffering and who are not suffering from heart disease: 54.45% and 45.54% respectively.
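The percentage calculation itself appears only as a screenshot; it can be reproduced from the counts printed above (165 with heart disease, 138 without), for example:

```python
# Counts taken from value_counts() on the target column above
have_disease = 165   # target == 1
no_disease = 138     # target == 0
total = have_disease + no_disease

# 138/303 ≈ 45.54% and 165/303 ≈ 54.46%
print("Not suffering: {:.2f}%".format(no_disease / total * 100))
print("Suffering:     {:.2f}%".format(have_disease / total * 100))
```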

hearts["sex"].unique()

Again we use the unique() function, which returns all the unique values of the sex column.

The output of the above code.

We have two values.

Here “1” denotes male and “0” denotes female.

Countplot to show the number of males vs the number of females.

sns.countplot(hearts["sex"])

Countplot shows the counts of observations in each categorical bin using bars. From this count plot, we can see that the number of females is smaller than the number of males.

Men suffering from heart disease vs. women suffering from heart disease

sns.barplot(hearts["sex"], tar)

A bar plot represents an estimate of central tendency for a numeric variable with the height of each rectangle and provides some indication of the uncertainty around that estimate using error bars.

From the above barplot, we can easily see that the proportion of females suffering from heart disease is more than that of males.

Unique features of cp.

hearts["cp"].unique()

We have four values.

Here type “0” is typical angina, type “1” is atypical angina, type “2” is non-anginal pain, and type “3” is asymptomatic.

Countplot to show the different types of cp.

sns.countplot(hearts["cp"])

Countplot shows the counts of observations in each categorical bin using bars. From this count plot, we can see that most of the patients have typical angina chest pain, whereas very few patients suffer from the asymptomatic type of chest pain.

Comparing the Vulnerability of heart disease from different types of chest pains

From the above barplot, we can easily see that people having typical angina pain are much less likely to have heart problems as compared to the other three types.

The above code gives us the two values of fasting blood sugar (fbs): for “1” it is >120 mg/dl and for “0” it is <120 mg/dl.

count plot for fbs

sns.countplot(hearts["fbs"])

Here we see that the number of people having fbs < 120 mg/dl, i.e. “0”, is very high compared to the number of people having fbs > 120 mg/dl.

fbs effect on the heart problem

From the above barplot, we can clearly see that fbs does not have much effect on heart problems.

Total features of “restecg”

We have three values,

They are type “0”, type “1” and type “2”.

count plot for restecg

sns.countplot(hearts["restecg"])

Here we can clearly see that the numbers of people with type “0” and type “1” are almost the same, whereas the number with type “2” is extremely low compared to types “0” and “1”.

restecg effect on the heart problem

From the above barplot, we can easily see that people having type “2” are much less likely to have heart problems as compared to type “0” and type “1”.

Features for “exang”

exang is exercise-induced angina; we have two values here. “0” is for people not having exang and “1” is for people having exang.

count plot for exang

Here we can clearly see that there are more people of type “0” than of type “1”.

effect of “exang” on heart problems

From the above barplot, we can easily see that people having type “1” are much less likely to have heart problems as compared to type “0”.

Features for “slope”

We have three values

They are slope “0”, slope “1” and slope “2”.

count plot for slope

Here we can clearly see that people with slope “1” and slope “2” far outnumber those with slope “0”.

effect of slope on heart problems

From the above barplot, we can easily see that people having slope “2” have much more heart problems as compared to slope “0” and slope “1”.

Features for “ca”

We have five values

They are type “0”, type “1”, type “2”, type “3”, and type “4”.

count plot for ca

From the above countplot we can see that people having ca=0 are extremely high in number as compared to the rest of the ca’s.

effect of ca on heart problems

Here we see that people with ca=4 have a very high rate of heart problems compared to the rest of the people.

Features of “thal”

We have four values

They are type “0”, type “1”, type “2”, and type “3”.

count plot for “thal”

From the above count plot, we can see that the number of people with thal type “2” is very high compared to the rest of the group.

effect of thal on heart problems

From the above barplot, we can clearly see that type “0” has a high chance of having a heart problem.

With the EDA complete, it is time to move on to prediction using machine learning models.

The above code is used for the machine learning step. We split the data into train and test sets, keeping in mind that the approach should neither overfit nor underfit the given data. 20% of the data is used for testing, while 80% is used for training the model.

Here we print the shape of the train and test set i.e. the dimension of train and test set.
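The split itself appears only as a screenshot; a minimal sketch, using a tiny synthetic frame in place of the real hearts.csv so it runs on its own (variable names are assumptions), might look like:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for hearts = pd.read_csv("hearts.csv");
# only two of the real columns are mimicked here.
hearts = pd.DataFrame({
    "age":    [63, 37, 41, 56, 57, 57, 56, 44, 52, 57],
    "chol":   [233, 250, 204, 236, 354, 192, 294, 263, 199, 168],
    "target": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
})

X = hearts.drop("target", axis=1)   # predictors
Y = hearts["target"]                # label

# 80/20 split as described in the article; random_state is an assumption
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=0)

print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
```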

Here we apply Logistic Regression to the training set, and then, using the .predict method, we calculate y_pred, i.e. the prediction of y from x_test.

Finally, comparing y_pred with y_test, we find that the accuracy of Logistic Regression is about 85.25%.
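The modeling code appears only as screenshots; a sketch of the fit/predict/score steps, run here on synthetic data (so the printed accuracy will not match the article's 85.25% figure on the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in sized like the Cleveland data (303 rows, 13 features)
X, Y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.20, random_state=0)

lr = LogisticRegression(max_iter=1000)   # max_iter raised to ensure convergence
lr.fit(X_train, Y_train)                 # train on the training set
Y_pred = lr.predict(X_test)              # predict on the held-out test set

score = accuracy_score(Y_test, Y_pred) * 100
print("Accuracy of Logistic Regression: {:.2f}%".format(score))
```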

CONCLUSION

Heart disease is one of the major concerns of society; the number of people affected is increasing day by day, and it is important to find a solution to this problem.

It is difficult to manually determine the odds of getting heart disease based on risk factors, but with the help of data analytics and machine learning models we can predict these diseases and have a better chance of treating them.
