Use of Machine Learning in Predicting Cardiovascular Disease

Ronok Ghosal
Mar 19, 2023



Westlake High School, Austin, TX

Basic Understandings:

Entropy is a mathematical concept that plays a crucial role in many areas of computer science, including information theory, coding theory, and machine learning. It is most commonly applied in contexts that revolve around making predictions. In information theory, entropy is a measure of the amount of uncertainty or randomness in a message. Entropy (commonly denoted H) is defined as the expected value of the self-information, which is the negative logarithm of the probability of a message.

The higher the entropy, the more uncertainty there is in the message. In coding theory, entropy measures the efficiency of a code: a code is efficient if it can represent a message with a low number of bits. In machine learning, the lower the entropy, the purer the collection of samples, with an entropy of 0 indicating a completely homogeneous, fully categorized group.

Entropy can also be used to make predictions. In weather forecasting, for example, entropy can quantify the uncertainty of different weather events occurring; in financial forecasting, it can gauge the likelihood of varying stock price movements. In short, entropy is used to measure uncertainty, the efficiency of codes, and the impurity of sets of samples, and to make predictions in a variety of contexts.
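As a concrete illustration of the definition above, here is a minimal Python sketch of Shannon entropy over a set of class labels; the label names are made up for the example:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    # Sum -p * log2(p) over the proportion p of each distinct label.
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

# A perfectly pure set has entropy 0; a 50/50 split has entropy 1.
print(entropy(["sick", "sick", "sick"]))  # 0.0
print(entropy(["sick", "healthy"]))       # 1.0
```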

In my project, I used entropy to design an ID3 decision-tree model that predicts whether a patient suffers from a cardiovascular illness, given their symptoms. To understand how entropy was used in the project, it helps to understand how ID3 works. The ID3 algorithm revolves around the tree data structure. At each node, ID3 chooses the feature that produces the largest reduction in entropy: the feature that splits the data so that the resulting subsets are as pure (as close to homogeneous in their class labels) as possible.

The reduction in entropy achieved by splitting a set is known as information gain: IG(S, A) = H(S) - H(S|A), where S is the set of all instances in the dataset, A is an attribute of the instances, and H(S|A) is the conditional entropy of S given the attribute A. Equivalently:

IG = H(parent set) - (weighted average of H(child sets))

The reason ID3 maximizes information gain is that purer subsets lead to more confident decisions at the leaf nodes of the tree, and therefore to more accurate predictions from the ID3 algorithm overall.
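The two formulas above can be sketched together in a few lines of Python; the helper names and the toy split are illustrative, not code from the project:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups):
    """IG = H(parent) - weighted average of H(children)."""
    total = len(parent_labels)
    weighted = sum(len(g) / total * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

parent = ["sick", "sick", "healthy", "healthy"]
# A split on a hypothetical attribute that separates the classes perfectly
# removes all uncertainty, so the gain equals the parent's full entropy:
children = [["sick", "sick"], ["healthy", "healthy"]]
print(information_gain(parent, children))  # 1.0
```

ID3 evaluates this quantity for every candidate attribute at a node and splits on the attribute with the highest gain.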

The Project:

Training

The first step of the project was to train the ID3 model. From a publicly released dataset, I obtained the health metrics of 70,000 patients, consisting of characteristics such as height, weight, age, gender, cholesterol levels, glucose levels, systolic BP, and diastolic BP, to train the ID3 tree. The dataset was labeled, meaning that alongside each patient's symptoms it recorded whether or not that patient had a cardiovascular disease. By repeatedly choosing the splits with the greatest entropy reduction, the ID3 algorithm built a decision tree that predicted, from a patient's symptoms, whether or not the patient had a disease.
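The training step can be sketched with scikit-learn's DecisionTreeClassifier, whose criterion="entropy" option selects splits by information gain, as ID3 does. This is a stand-in for the project's own implementation, and the tiny inline dataset and column names are invented for illustration; the real dataset had 70,000 patients:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny illustrative dataset (made-up values and column names).
df = pd.DataFrame({
    "age":         [52, 60, 45, 70, 38, 65],
    "systolic":    [120, 160, 115, 170, 110, 155],
    "cholesterol": [1, 3, 1, 3, 1, 2],
    "cardio":      [0, 1, 0, 1, 0, 1],   # label: 1 = has cardiovascular disease
})
X, y = df.drop(columns="cardio"), df["cardio"]

# criterion="entropy" makes the tree choose splits by information gain,
# the same criterion ID3 uses.
model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X, y)
```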

Testing

To test the accuracy and legitimacy of my ID3 decision tree, I used another labeled dataset of 100,000 patients with the same variables used in training (height, weight, age, gender, cholesterol levels, glucose levels, systolic BP, diastolic BP, etc.). My algorithm returned around 95,700 accurate predictions of whether or not a patient was suffering from cardiovascular disease, based on their symptoms. From this dataset I moved on to larger ones, consisting of nearly 300,000 labeled subjects, and again the ID3 algorithm produced a decision tree with a nearly 96% accuracy rate.
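The accuracy figures above amount to the fraction of test predictions that match the true labels; a minimal sketch, with made-up predictions:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# 4 of these 5 toy predictions match their labels:
print(accuracy([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))  # 0.8
```

In the project's terms, roughly 95,700 correct predictions out of 100,000 test patients gives an accuracy of about 0.957.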

Key Takeaways:

This was my first ML project, and ID3 was the first ML algorithm I learned. I was very surprised that such a structured and organized, yet simple-to-understand process returned such accurate output. In this project, I learned the power of basic data structures: trees, used for ID3 prediction; matrices, used for reading and splitting datasets; and linked lists, used in implementing the trees.
